Metagenomic data stratified using artificial intelligence

Jakob Nybo Nissen

Research output: Book/Report, Ph.D. thesis

Metagenomics is the research field pertaining to the analysis of genetic material taken directly from an environment, as opposed to the tissue of a single organism. Metagenomics has a plethora of use cases. Analysis of the human gut microbiome is a metagenomic task that has received increasing attention in recent years. Medical and biotech companies perform metagenomic analyses of environments to discover microorganisms with industrially relevant genes. At the time of writing, the global 2020 coronavirus pandemic is raging, caused by a virus whose origin seems to have been uncovered by a metagenomic study. As in many other sprawling fields, a great number of analytical tools are available to metagenomic researchers. One of these tools is metagenomic binning, a process by which genetic sequences from an environment are grouped, or binned, such that each resulting bin is presumed to correspond to the genome of a single organism in the environment. Binning has been used in a number of high-profile articles in recent years. Despite the quick pace of progress within the field, binning remains an error-prone process whose results leave much room for improvement.
The work presented in this thesis concerns method development within metagenomics, in particular metagenomic binning. The thesis is composed primarily of two articles written during my Ph.D. scholarship, each describing a specific contribution to the metagenomic toolkit. Besides the articles, the thesis consists of introduction and discussion sections, which put the articles in context and contain relevant results that could not be included in the articles due to space constraints. The contributions of this thesis can be summarized briefly:

1. In the first article we present Vamb, a new method for binning, as well as the software implementing the method. Vamb uses variational autoencoders to represent metagenomic sequences before the representation is clustered using a novel homemade algorithm. We use Vamb to group a collection of synthetic metagenomes and thus demonstrate that Vamb creates more accurate bins than comparable software. By binning a large natural dataset with 1,000 human feces samples and almost 6 million contigs, we show that Vamb can handle larger datasets than other binners. We also show that Vamb can recreate bacterial strains with high phylogenetic resolution.

2. The second article is a comparison between the domain-specific language Seq and BioJulia, a package for bioinformatic data analysis. The comparison refutes central claims in an article published in 2019 by reproducing the results of the original article and showing that these results do not support the article's claims. This article illustrates the possibilities of BioJulia, a package whose development I have contributed to.

The tools behind these articles are meant for different parts of a complete metagenomic workflow. Vamb is a specific tool for one single part of the workflow, and may directly substitute for other binning tools. In contrast, BioJulia is of a more general nature, and creates a solid foundation for metagenomic tool development. Together, these articles represent a small contribution to metagenomic methods.
Original language: English
Publisher: DTU Health Technology
Number of pages: 130
Publication status: Published - 2020

