Data set and fitting dependencies when estimating protein mutant stability: Toward simple, balanced, and interpretable models

Kristoffer T. Bæk, Kasper P. Kepp*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review


Abstract

Accurate prediction of protein stability changes upon mutation (ΔΔG) is increasingly important to evolution studies, protein engineering, and screening of disease-causing gene variants, but it is challenged by biases in training data. We investigated 45 linear regression models trained on data sets that account systematically for destabilization bias and mutation-type bias (BM). The models were externally validated on three test data sets probing different pathologies and were checked for internal consistency (symmetry and neutrality). Model structure and performance depended substantially on the training data and even on the fitting method. We developed two final models: SimBa-IB for typical natural mutations and SimBa-SYM for situations where stabilizing and destabilizing mutations occur to a similar extent. SimBa-SYM, despite its simplicity, is essentially non-biased (vs. the Ssym data set) while still performing well for all data sets (R ~ 0.46-0.54, MAE = 1.16-1.24 kcal/mol). The simple models offer advantages in interpretability, ease of use, and future improvement, and are freely available on GitHub.
Original language: English
Journal: Journal of Computational Chemistry
Volume: 43
Issue number: 8
Pages (from-to): 504-518
Number of pages: 15
ISSN: 0192-8651
DOIs
Publication status: Published - 2022

Keywords

  • Computer models
  • Data set bias
  • Linear regression
  • Mutation
  • Protein stability
