Systematic Investigation of the Data Set Dependency of Protein Stability Predictors

Octav Caldararu, Rukmankesh Mehra, Tom L Blundell, Kasper Planeta Kepp*

*Corresponding author for this work

Research output: Contribution to journal; Journal article; Research; peer-review

Abstract

Prediction of protein stability changes caused by mutation is of major importance to protein engineering and for understanding protein misfolding diseases and protein evolution. The major limitation to these applications is the fact that different prediction methods vary substantially in terms of performance for specific proteins; i.e., performance is not transferable from one type of mutation or protein to another. In this study, we investigated the performance and transferability of eight widely used methods. We first constructed a new data set composed of 2647 mutations using strict selection criteria for the experimental data and then defined a variety of subdata sets that are unbiased with respect to various aspects such as mutation type, stabilization extent, structure type, and solvent exposure. Benchmarking the methods against these subdata sets enabled us to systematically investigate how data set biases affect predictor performance. In particular, we use a reduced amino acid alphabet to quantify the bias toward mutation type, which we identify as the major bias in current approaches. Our results show that all prediction methods exhibit large biases, stemming not from failures of the models applied but mostly from the selection biases of experimental data used for training or parametrization. Our identification of these biases and the construction of new mutation-type-balanced data should lead to the development of more balanced and transferable prediction methods in the future.

Original language: English
Journal: Journal of Chemical Information and Modeling
Volume: 60
Issue number: 10
Pages (from-to): 4772–4784
ISSN: 1549-9596
Publication status: Published - 2020
