NetSolP: predicting protein solubility in E. coli using language models

Vineet Thumuluri, Hannah-Marie Martiny, Jose J. Almagro Armenteros, Jesper Salomon, Henrik Nielsen, Alexander R Johansen*, Alfonso Valencia (Editor)

*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

Motivation
Solubility and expression levels of proteins can be limiting factors for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased.

Results
In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model, NetSolP, is based on deep learning protein language models built on the transformer architecture, and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. Because we find that current methods are built on biased datasets, we curate existing datasets using strict sequence-identity partitioning and ensure that bias in the sequences is minimal.

Availability
The predictor and data are available at https://services.healthtech.dtu.dk/service.php?NetSolP and the open-sourced code is available at https://github.com/tvinet/NetSolP-1.0

Original language: English
Journal: Bioinformatics
Volume: 38
Issue number: 4
Pages (from-to): 941–946
Number of pages: 6
ISSN: 1367-4803
Publication status: Published - 2022
