Accurate prediction of protein stability upon mutation enables rational engineering of new proteins and provides insight into protein evolution and monogenetic diseases caused by single-point amino acid substitutions. Many tools have been developed to this end, ranging from energy-based models to machine-learning methods that use large amounts of experimental data. However, as the methods become more complex, the chemistry underlying the protein stability effects becomes harder to interpret. It is thus of interest to identify the simplest prediction model that retains complete amino acid-specific interpretation; for a given number of input descriptors, we expect such a model to be almost universal. In this study, we identify such a limiting model, SimBa, a simple multilinear regression model trained on a substitution-type-balanced experimental data set. The model accounts only for the solvent accessibility of the site and for the volume difference and polarity difference caused by the mutation. Our results show that this very simple and directly applicable model performs comparably to other, much more complex and widely used protein stability prediction methods. This suggests that a hard limit of ∼1 kcal/mol numerical accuracy and an R ∼ 0.5 trend accuracy exists and that new features, such as accounting for unfolded states, water colocalization, and amino acid correlations, are required to improve accuracy to, e.g., 0.5 kcal/mol.
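To make the model class concrete, the following is a minimal sketch of a three-descriptor multilinear regression of the kind described above, fitted by ordinary least squares. It is not the published SimBa parametrization: the function names, the coefficient values, the descriptor ranges, and the noise level are all hypothetical, and the data are synthetic placeholders rather than experimental measurements. The two reported quantities correspond to what the abstract calls numerical accuracy (here taken as mean absolute error) and trend accuracy (Pearson R).

```python
import numpy as np


def design_matrix(rsa, dvol, dpol):
    """Stack an intercept column with the three descriptors."""
    return np.column_stack([np.ones_like(rsa), rsa, dvol, dpol])


def fit_multilinear(rsa, dvol, dpol, ddg):
    """Least-squares fit of ddG ~ c0 + c1*RSA + c2*dVolume + c3*dPolarity."""
    coef, *_ = np.linalg.lstsq(design_matrix(rsa, dvol, dpol), ddg, rcond=None)
    return coef


def predict(coef, rsa, dvol, dpol):
    """Evaluate the linear model for the given descriptors."""
    return design_matrix(rsa, dvol, dpol) @ coef


# Synthetic placeholder data (NOT experimental values); all ranges are assumptions.
rng = np.random.default_rng(0)
n = 200
rsa = rng.uniform(0.0, 1.0, n)        # relative solvent accessibility of the site
dvol = rng.uniform(-100.0, 100.0, n)  # volume difference upon mutation
dpol = rng.uniform(-5.0, 5.0, n)      # polarity difference upon mutation

true_coef = np.array([-1.0, 1.5, 0.01, -0.2])  # arbitrary illustrative coefficients
ddg = predict(true_coef, rsa, dvol, dpol) + rng.normal(0.0, 1.0, n)  # arbitrary noise

coef = fit_multilinear(rsa, dvol, dpol, ddg)
pred = predict(coef, rsa, dvol, dpol)

mae = float(np.mean(np.abs(pred - ddg)))  # numerical accuracy
r = float(np.corrcoef(pred, ddg)[0, 1])   # trend accuracy (Pearson R)
print("coefficients:", coef)
print("MAE:", mae, "Pearson R:", r)
```

Because every fitted coefficient multiplies a single physical descriptor, its sign and magnitude can be read directly, which is the interpretability argument made in the text. In practice the descriptors would come from a structure and an amino acid property table, and the fit would be evaluated on held-out mutations rather than on the training set as done here.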