Decomposition of the pangenome matrix reveals a structure in gene distribution in the Escherichia coli species

Siddharth M. Chauhan, Omid Ardalani, Jason C. Hyun, Jonathan M. Monk, Patrick V. Phaneuf, Bernhard O. Palsson

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

The thousands of complete genome sequences now available for strains of a single species enable pangenome analytics to advance to a new level of sophistication. We collected 2,377 publicly available complete genomes of Escherichia coli for detailed pangenome analysis. The core and accessory genomes consisted of 2,398 and 5,182 genes, respectively. We developed a machine learning approach to define the accessory genes characterizing the major phylogroups of E. coli plus Shigella: A, B1, B2, C, D, E, F, G, and Shigella. The analysis resulted in a detailed structure of the genetic basis of the phylogroups' differential traits. This pangenome structure was largely consistent with a housekeeping-gene-based MLST distribution, sequence-based Mash distance, and the Clermont quadruplex classification. The rare genome (consisting of genes found in

Original language: English
Article number: e0053224
Journal: mSphere
Volume: 10
Issue number: 1
ISSN: 1535-9778
Publication status: Published - 2025

Keywords

  • Escherichia coli
  • Shigella
  • Computational biology
  • Genome analysis
  • Genomics
  • Typing
