TY - JOUR
T1 - Decomposition of the pangenome matrix reveals a structure in gene distribution in the Escherichia coli species
AU - Chauhan, Siddharth M.
AU - Ardalani, Omid
AU - Hyun, Jason C.
AU - Monk, Jonathan M.
AU - Phaneuf, Patrick V.
AU - Palsson, Bernhard O.
PY - 2025
Y1 - 2025
AB - Thousands of complete genome sequences for strains of a species that are now available enable the advancement of pangenome analytics to a new level of sophistication. We collected 2,377 publicly available complete genomes of Escherichia coli for detailed pangenome analysis. The core genome and accessory genomes consisted of 2,398 and 5,182 genes, respectively. We developed a machine learning approach to define the accessory genes characterizing the major phylogroups of E. coli plus Shigella: A, B1, B2, C, D, E, F, G, and Shigella. The analysis resulted in a detailed structure of the genetic basis of the phylogroups' differential traits. This pangenome structure was largely consistent with a housekeeping-gene-based MLST distribution, sequence-based Mash distance, and the Clermont quadruplex classification. The rare genome (consisting of genes found in
KW - Escherichia coli
KW - Shigella
KW - Computational biology
KW - Genome analysis
KW - Genomics
KW - Typing
U2 - 10.1128/msphere.00532-24
DO - 10.1128/msphere.00532-24
M3 - Journal article
C2 - 39745367
SN - 1535-9778
VL - 10
JO - mSphere
JF - mSphere
IS - 1
M1 - e0053224
ER -