SourceFinder: a Machine-Learning-Based Tool for Identification of Chromosomal, Plasmid, and Bacteriophage Sequences from Assemblies

Derya Aytan-Aktug*, Vladislav Grigorjev, Judit Szarvas, Philip T. L. C. Clausen, Patrick Munk, Marcus Nguyen, James J. Davis, Frank M. Aarestrup, Ole Lund

*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review


Abstract

High-throughput genome sequencing technologies enable the investigation of complex genetic interactions, including the horizontal gene transfer of plasmids and bacteriophages. However, identifying these elements from assembled reads remains challenging due to genome sequence plasticity and the difficulty in assembling complete sequences. In this study, we developed a random forest classifier to identify whether sequences originated from bacterial chromosomes, plasmids, or bacteriophages. The classifier was trained on a diverse collection of 23,211 chromosomal, plasmid, and bacteriophage sequences from hundreds of bacterial species. To adapt the classifier to incomplete sequences, each complete sequence was subsampled into 5,000-nucleotide fragments and further subdivided into k-mers. This three-class classifier succeeded in identifying chromosomes, plasmids, and bacteriophages using k-mer distributions of complete and partial genome sequences, including simulated metagenomic scaffolds, with a minimum performance of 0.939 area under the receiver operating characteristic curve (AUC). This classifier, implemented as SourceFinder, has been made available as an online web service to help the community predict the chromosomal, plasmid, and bacteriophage sources of assembled bacterial sequence data (https://cge.food.dtu.dk/services/SourceFinder/).

IMPORTANCE Extra-chromosomal genes encoding antimicrobial resistance, metal resistance, and virulence provide selective advantages for bacterial survival under stress conditions and pose serious threats to human and animal health. These accessory genes can impact the composition of microbiomes by providing selective advantages to their hosts. Accurately identifying extra-chromosomal elements in genome sequence data is critical for understanding gene dissemination trajectories and taking preventative measures. Therefore, in this study, we developed a random forest classifier for identifying the source of bacterial chromosomal, plasmid, and bacteriophage sequences.
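
The feature-preparation steps named in the abstract (cutting sequences into 5,000-nucleotide fragments and describing each fragment by its k-mer distribution) can be illustrated with a minimal Python sketch. This is not the SourceFinder implementation: the k-mer size `K = 4`, the non-overlapping fragmenting scheme, and the function names `fragment` and `kmer_frequencies` are assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from itertools import product

FRAGMENT_LENGTH = 5000  # fragment size stated in the abstract
K = 4                   # assumption: the abstract does not state the k-mer size


def fragment(sequence, length=FRAGMENT_LENGTH):
    """Split a sequence into consecutive, non-overlapping fragments.

    Assumption: the abstract does not say how fragments were sampled;
    a trailing piece shorter than `length` is dropped here.
    """
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]


def kmer_frequencies(seq, k=K):
    """Return relative frequencies of all 4**k DNA k-mers in lexicographic order.

    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[m] / total for m in all_kmers]
```

Each fragment then yields one fixed-length feature vector, for example `[kmer_frequencies(f) for f in fragment(genome)]` for a hypothetical `genome` string. Vectors of this kind, labeled as chromosome, plasmid, or bacteriophage, are the sort of input a random forest classifier can be trained on.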

Original language: English
Journal: Microbiology Spectrum
Volume: 10
Issue number: 6
Number of pages: 12
ISSN: 2165-0497
Publication status: Published - 2022

Keywords

  • Assembly identification
  • Bacteriophage
  • Chromosome
  • Machine learning
  • Plasmid
  • Source identification
