Evergreen methods for phylogeny

Judit Szarvas*

*Corresponding author for this work

Research output: Book/Report › Ph.D. thesis › Research



The emergence and adaptation of new DNA sequencing technologies in the last decade have led to the broader application of sequencing. Whole-genome sequencing (WGS) of pathogens for food safety and for human and animal health has gained traction, and authorities in various countries have adopted the technology and made it part of their routine procedures. Genomic epidemiology, the use of genomic data in outbreak investigations and disease surveillance, has increased the number of solved outbreaks and reduced cluster sizes, improving public health at a national level. Furthermore, the deluge of WGS data published in public repositories could be utilized as well. However, new software solutions were needed to provide an overview of these data and, going beyond that, to compare and analyse samples of the same subtype to detect potential outbreak clusters. My PhD comprises projects aimed at developing scalable and continuous, i.e. "evergreen", computational methods for genomic epidemiology. This thesis first introduces the basics of subtyping in microbiology, DNA sequencing technologies and phylogenomics, as background for the projects in genomic epidemiology. The first two included manuscripts centre on a pipeline for automated phylogenomic analysis of bacterial isolates (PAPABAC). As WGS data of bacterial isolates from clinical and environmental samples collected worldwide are being released by the thousands, comparing these isolates for outbreak detection has become possible with a scalable and autonomous method. In Paper I, PAPABAC was shown to accurately cluster the outbreak isolates in Escherichia coli, Campylobacter jejuni and Listeria monocytogenes benchmarking datasets, even among a thousand non-outbreak-related samples. It was therefore used in Evergreen Online, a platform for surveillance of foodborne outbreaks.
Its clustering performance was compared to that of the NCBI Pathogen Detection pipeline and was found to be similar. In Paper II, we analysed WGS data from 5,655 methicillin-resistant Staphylococcus aureus (MRSA) and 2,572 Enterococcus faecium patient isolates collected in Denmark over a 5-year period. PAPABAC was upgraded to handle unknown and missing bases more flexibly. The SNP-based trees were in concordance with those obtained with cgMLST. For MRSA, epidemiological data were also available, and the PAPABAC clusters contained known nosocomial outbreaks and other epidemiologically linked isolates. The concept behind PAPABAC could also be applied to other sequence data, not just WGS data derived from bacteria. Paper III features Krummholz, a modified version of PAPABAC suited to subtyping and clustering consensus sequences obtained from viruses. To demonstrate the feasibility of the method, Krummholz trees were generated for norovirus VP1 sequences downloaded from GenBank. At an 85% similarity threshold, the pipeline separated the VP1 sequences along their genogroups, creating an overview of the genetic diversity of the sequences deposited in GenBank. Displayed in Microreact, the Krummholz trees of the genogroups GII.2, GII.3, GII.4 and GII.17 exhibit the temporal and spatial spread of samples collected during human gastroenteritis outbreaks. Thus, Krummholz was shown to aid and simplify the surveillance of viral pathogens, which is of high importance for vaccine and diagnostics development. The final project, described in Paper IV, analysed whole-genome sequenced fungal patient isolates collected for a point prevalence survey from Danish Clinical Microbiology Laboratories in 2018. The 51 samples were typed, and thereafter thirty Candida albicans and eight Candida glabrata samples were subjected to phylogenetic analysis. No evident correlation was detected between phylogenetic placement and the sampling institute or source of the Danish samples.
The samples were also tested for antifungal susceptibility: all C. albicans samples were phenotypically susceptible to all tested antifungal agents, and only two C. glabrata samples were phenotypically resistant to azoles. The phylogenetic analysis was performed with a pipeline written for a popular workflow manager, ensuring repeatability and easing the analysis of additional samples. We were therefore also able to include samples collected worldwide, which resulted in phylogenies for both pathogens in which the Danish samples were placed in clades of globally prevalent subtypes. The thesis concludes with a chapter that recapitulates the results of the projects and discusses aspects of the work that could be improved in the future.
Original language: English
Place of Publication: Kgs. Lyngby
Publisher: Technical University of Denmark
Number of pages: 98
Publication status: Published - 2021

