Abstract
Microorganisms produce a wide array of natural products, including antibiotics, pharmaceuticals, and other bioactive compounds. These small molecules are synthesized through biosynthetic pathways encoded in the producers' genomes. Also known as secondary or specialized metabolites, these compounds play important ecological roles for the producing organisms and hold significant potential for sustainable biotechnological applications.
Advances in genome sequencing technologies have led to the development of extensive genomic datasets and knowledge bases containing information on many proteins and pathways encoded in microbial genomes. However, navigating and utilizing this vast information to discover new natural products remains challenging.
This PhD thesis aimed to address these challenges by exploring and developing data-driven approaches for processing and analyzing large-scale genomic datasets in order to accelerate the discovery of ecologically relevant and potentially novel natural products from bacteria.
The thesis first assesses the use of long-read sequencing technology to mine secondary metabolite biosynthetic gene clusters (smBGCs) from complex environmental samples, such as those found in wastewater treatment plants. A hybrid assembly method incorporating long-read sequencing was shown to increase the yield of high-quality metagenome-assembled genomes (MAGs), enabling the recovery of complete smBGCs that were previously difficult to identify using short-read sequencing alone.
As sequencing technology becomes more accessible and the volume of genomic data continues to grow, a comprehensive workflow called BGCFlow was developed to integrate important bioinformatic tools for natural product genome mining and enable large-scale pangenome studies. A case study on the genus Saccharopolyspora demonstrated the workflow’s capability to improve smBGC dereplication and generate novel insights by adding multiple layers of knowledge bases on top of gene cluster family (GCF) sequence similarity networks. The workflow’s scalability was further demonstrated through the analysis of 1034 Actinomycetota genomes, showcasing its potential to handle genome mining for large-scale datasets.
Finally, the application of retrieval-augmented generation (RAG) using large language models (LLMs) was explored to simplify domain-specific information retrieval from complex smBGC knowledge bases such as antiSMASH. By automating the creation of a local antiSMASH SQL database from a user-defined collection of genomes and developing a training dataset of question-SQL pairs, the experiment demonstrated how RAG-LLM systems could make sophisticated antiSMASH-related queries more accessible to non-experts.
In conclusion, this PhD thesis presents a series of studies, including novel tools and data-driven approaches, to advance natural product genome mining from large-scale genomic datasets. By leveraging long-read sequencing, automated workflows, and state-of-the-art data analytics powered by artificial intelligence (AI), the work in this thesis offers new pathways for the efficient discovery of ecologically and industrially relevant natural products from bacterial (meta)genomes.
Original language | English |
---|---|
Publisher | Technical University of Denmark |
Number of pages | 172 |
Publication status | Published - 2024 |
Thesis title: Advancing Natural Products Genome Mining through Large Scale Data-Driven Approaches
Projects
- Data-Driven Natural Products Discovery in Soil Microbial Interaction (status: Finished)
Nuhamunada, M. (PhD Student), Weber, T. (Main Supervisor), Mohite, O. S. (Supervisor), Arumugam, M. (Supervisor), Pupin, M. (Examiner) & Udwary, D. (Examiner)
01/09/2021 → 14/01/2025
Project: PhD