Whole Genome Epidemiological Typing of Escherichia coli

Research output: Book/Report › Ph.D. thesis



Escherichia coli (E. coli) is of huge importance in global health, both as a commensal organism living within its host and as a pathogen causing millions of infections each year. Infections occur both sporadically and as outbreaks, sometimes with up to thousands of infected people. To limit the number of infections, it is important to monitor pathogenic E. coli in order to detect outbreaks as quickly as possible and find their source. The effectiveness of monitoring and tracking pathogens depends heavily on the typing methods employed. Classical typing methods for E. coli are in general expensive and to some extent unreliable. Next-generation sequencing has quickly become widely available and has enabled even smaller laboratories to do whole genome sequencing (WGS). Having the entire genome available provides the opportunity to create the ultimate typing method. This PhD thesis attempts to take the first steps toward such a method.
In Kaas I, all publicly available sequenced E. coli genomes (186) are analyzed. 1,702 core genes were found in all genomes, and 3,051 genes were found in 95% of the genomes. The pan-genome was found to consist of 16,373 genes. The overall phylogeny was inferred from the core genome and also placed in the context of the Escherichia genus. The variance within each gene cluster was calculated in order to compare the variance between genes and possibly identify typing targets for further study. The calculated variance scores were also used to compare the three MLST schemes that exist for E. coli.
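As an illustration of how core and pan-genome counts of this kind can be derived, the following is a minimal sketch and not the pipeline used in Kaas I. It assumes genes have already been grouped into clusters and that each genome is represented by the set of cluster identifiers it contains; the function name, the `soft_core_fraction` parameter, and the reading of "95% of the genomes" as "at least 95%" are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def summarize_gene_clusters(
    genomes: Dict[str, Set[str]],
    soft_core_fraction: float = 0.95,
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (core, soft_core, pan) sets of gene-cluster identifiers.

    genomes maps a genome name to the set of gene clusters found in it
    (hypothetical input format, chosen for illustration).
    """
    n_genomes = len(genomes)

    # Count in how many genomes each gene cluster occurs.
    counts: Dict[str, int] = {}
    for clusters in genomes.values():
        for cluster in clusters:
            counts[cluster] = counts.get(cluster, 0) + 1

    pan = set(counts)  # every cluster seen in any genome
    core = {c for c, k in counts.items() if k == n_genomes}
    soft_core = {
        c for c, k in counts.items() if k >= soft_core_fraction * n_genomes
    }
    return core, soft_core, pan


# Toy data, not taken from the thesis.
toy_genomes = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneB", "geneC"},
    "strain_4": {"geneA", "geneC"},
}
core, soft_core, pan = summarize_gene_clusters(toy_genomes, soft_core_fraction=0.75)
```

With the toy data, only geneA is present in every genome, geneB and geneC join it once the threshold is relaxed to 75%, and the pan-genome contains all four clusters.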
It quickly became clear that single nucleotide polymorphism (SNP) analysis was becoming the method of choice for inferring the phylogeny of bacterial outbreaks. However, the method remained unavailable to many people due to technical obstacles. In Kaas II we describe the SNP method and the validation behind a web server that we set up to overcome some of these obstacles and thereby make the method more widely available. Briefly, the method calls SNPs against a specified reference sequence, creates an alignment (pseudosequence) of all the SNPs, and uses the maximum likelihood (ML) method to create a tree. The most important detail in the method is the assumption made about "missing" SNPs, meaning SNPs called in one strain but not in another. It was assumed that when no SNP was found at a position, the nucleotide at that position was identical to the one in the reference sequence. The assumption is in general valid if all the genomes compared are closely related and the sequencing data is of good quality.
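The assumption about "missing" SNPs can be made concrete with a small sketch. This is not the code behind the Kaas II web server; it only illustrates the idea of building a pseudosequence per strain in which any position without a SNP call is filled with the reference nucleotide. The function name, the 0-based positions, and the input layout are assumptions made for the example, and the tree-building step is left out.

```python
from typing import Dict, List, Tuple


def build_snp_alignment(
    reference: str,
    snp_calls: Dict[str, Dict[int, str]],
) -> Tuple[List[int], Dict[str, str]]:
    """Build one pseudosequence per strain from per-strain SNP calls.

    snp_calls maps strain name -> {0-based reference position: called base}
    (hypothetical input format, chosen for illustration).
    """
    # Every position at which at least one strain has a SNP call.
    positions = sorted({pos for calls in snp_calls.values() for pos in calls})

    alignment: Dict[str, str] = {}
    for strain, calls in snp_calls.items():
        # Where a strain has no call, assume it matches the reference.
        alignment[strain] = "".join(
            calls.get(pos, reference[pos]) for pos in positions
        )
    return positions, alignment


# Toy data, not taken from the thesis.
toy_reference = "ACGTACGT"
toy_calls = {
    "strain_A": {1: "T", 5: "A"},
    "strain_B": {5: "A"},
    "strain_C": {},
}
positions, alignment = build_snp_alignment(toy_reference, toy_calls)
```

Here strain_C has no calls at all, so its pseudosequence is simply the reference bases at the two variable positions. If strain_C were in fact poorly covered at those positions, the alignment would be wrong without any warning, which is why the assumption only holds for closely related genomes and good-quality data.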
In Kaas III we sought to overcome the assumption mentioned above, but most importantly we wanted to create a method that could handle sequence data obtained from different sequencing technologies. The method from Kaas II was completely rewritten, and a new web server (CSI Phylogeny) was published that could handle sequence data of all kinds and no longer made assumptions about missing SNPs. Very briefly, the method differs from Kaas II mainly by validating, in every genome, all the positions at which a SNP has been called in any genome. In parallel with the development of the new SNP method, another method was developed that relies on counting nucleotide differences (ND) between each pair of genomes. It also validates each position analyzed and ignores the positions that cannot be validated. The resulting distance matrix is used as input to a UPGMA method that creates the final phylogeny. The ND method was also implemented as a web server and published.
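The ND idea of counting pairwise differences while skipping unvalidated positions, followed by UPGMA clustering, can be sketched as follows. This is an illustrative toy and not the published ND web server: it assumes the genomes are already available as equal-length strings over the same reference coordinates, that an unvalidated position is marked with "N", and that a list of merges with their heights is an acceptable stand-in for a tree. All function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def nd_distance(seq_a: str, seq_b: str, unvalidated: str = "N") -> int:
    """Count differing positions, ignoring any position unvalidated in either genome."""
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if unvalidated not in (a, b) and a != b
    )


def nd_matrix(genomes: Dict[str, str]) -> Dict[FrozenSet[str], float]:
    """Pairwise ND distances keyed by frozenset({name_a, name_b})."""
    return {
        frozenset((a, b)): float(nd_distance(genomes[a], genomes[b]))
        for a, b in combinations(sorted(genomes), 2)
    }


def upgma(dist: Dict[FrozenSet[str], float]) -> List[Tuple[FrozenSet[str], float]]:
    """Return the UPGMA merges as (leaves in the new cluster, node height)."""
    clusters = {frozenset([name]) for pair in dist for name in pair}
    d = {frozenset(frozenset([n]) for n in pair): v for pair, v in dist.items()}
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = pair
        merged = a | b
        height = d[pair] / 2
        clusters -= {a, b}
        new_d = {p: v for p, v in d.items() if a not in p and b not in p}
        for c in clusters:
            # Size-weighted average of the distances from a and b to c.
            new_d[frozenset((merged, c))] = (
                len(a) * d[frozenset((a, c))] + len(b) * d[frozenset((b, c))]
            ) / len(merged)
        clusters.add(merged)
        d = new_d
        merges.append((merged, height))
    return merges


# Toy data, not taken from the thesis.
toy = {"g1": "ACGTACGT", "g2": "ACGTACGA", "g3": "TCGNTCGA"}
merges = upgma(nd_matrix(toy))
```

In the toy data, the fourth position of g3 is unvalidated, so it is left out of every comparison involving g3 rather than being counted as a match or a mismatch.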
If whole genome sequencing is to be used for routine monitoring and tracking of E. coli pathogens, it is crucial to know how large the difference is between isolates from the same outbreak compared with the difference to non-outbreak isolates, in order to distinguish them reliably. In Kaas IV we analyzed ten different outbreaks. Seven of the outbreaks were sequenced for the study and three were obtained from published studies. Several background isolates that resembled the outbreak isolates were also sequenced. Five different bioinformatic methods were evaluated against the ten outbreaks. The five methods were based on SNP, ND, core genes, k-mers, and average nucleotide identity (ANI). The ANI method was the only one that was not able to cluster all outbreaks correctly. The pairwise distances between all isolates were also calculated by each method and compared. Most methods showed lower distances between isolates in the same outbreak than to the background strains, but only the SNP method was able to set one common threshold for outbreak isolates versus non-outbreak isolates for the entire dataset.
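What it means for a method to allow "one common threshold" can be illustrated with a short sketch. It assumes that the pairwise distances have already been split into within-outbreak pairs and outbreak-versus-background pairs; a single threshold exists only if the largest within-outbreak distance is smaller than the smallest distance to a background isolate. The function name and the numbers are hypothetical and are not results from Kaas IV.

```python
from typing import Optional, Sequence


def common_threshold(
    within_outbreak: Sequence[float],
    to_background: Sequence[float],
) -> Optional[float]:
    """Return a threshold separating the two groups, or None if they overlap."""
    max_within = max(within_outbreak)
    min_background = min(to_background)
    if max_within < min_background:
        # Any value in the gap works; the midpoint is one simple choice.
        return (max_within + min_background) / 2
    return None


# Made-up distances for illustration only.
separable = common_threshold([0, 2, 5], [40, 120])
overlapping = common_threshold([0, 30], [20, 100])
```

In the first call the two groups are cleanly separated and a threshold in the gap is returned; in the second the ranges overlap, so no single cut-off can classify every pair correctly.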
Whole genome sequencing is a powerful but also a rather new tool. This PhD thesis has hopefully shed some light on how we can continue development of whole genome sequence typing and also made WGS more available to a broader audience.
Original language: English
Place of Publication: Kgs. Lyngby
Publisher: Technical University of Denmark
Number of pages: 126
Publication status: Published - 2014

