High performance, large chemical coverage or both: DanishQSAR and hierarchies of post-hoc ensemble models optimized for sensitivity, specificity or balanced accuracy

N. G. Nikolov*, E. B. Wedebye

*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

The trade-off between applicability domain size and prediction accuracy is a well-known phenomenon in QSAR. We have developed a modelling approach in which multiple models with different applicability domain sizes and different prediction accuracies are selected instead of a single best model. This approach is implemented in DanishQSAR, a new software package for binary classification QSAR modelling that integrates descriptor calculation, descriptor selection, model development, validation and application. The methods and options available in the software are automatically tested and efficiently combined during model development using a version of cross-validation-based grid search and post-hoc ensemble modelling. The resulting large and diverse pool of model candidates is then analysed to generate three hierarchies of models, optimized for sensitivity, specificity and balanced accuracy, respectively, each spanning minimum to maximum coverage levels. When predicting a query compound, the system returns predictions from all models in the three hierarchies, at all coverage levels in user-defined steps, together with the predictive performance of each model, producing a prediction profile rather than one prediction from a single model. Twenty data sets from the Danish (Q)SAR Database (https://qsar.food.dtu.dk) are used to demonstrate the performance. The resulting binary classification models are highly accurate in both cross-validation and external validation.
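
The idea of a model hierarchy over coverage levels can be illustrated with a minimal sketch. It is not the DanishQSAR implementation: the `Candidate` class, the `build_hierarchy` function, the selection rule (best metric among candidates whose coverage is at least the requested level) and all numbers are hypothetical, chosen only to show how one pool of candidates can yield three different hierarchies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical model candidate with cross-validated statistics."""
    name: str
    coverage: float     # fraction of compounds inside the applicability domain
    sensitivity: float
    specificity: float

    @property
    def balanced_accuracy(self) -> float:
        return (self.sensitivity + self.specificity) / 2


def build_hierarchy(pool, metric, levels):
    """For each coverage level, keep the candidate with the highest `metric`
    among those whose coverage is at least that level (assumed rule)."""
    hierarchy = {}
    for level in levels:
        eligible = [c for c in pool if c.coverage >= level]
        if eligible:
            hierarchy[level] = max(eligible, key=lambda c: getattr(c, metric))
    return hierarchy


# Illustrative numbers only; they are not results from the paper.
pool = [
    Candidate("A", 0.40, 0.95, 0.90),
    Candidate("B", 0.70, 0.85, 0.88),
    Candidate("C", 0.95, 0.75, 0.80),
    Candidate("D", 0.75, 0.80, 0.92),
]
levels = [0.4, 0.6, 0.8]

hierarchies = {
    metric: build_hierarchy(pool, metric, levels)
    for metric in ("sensitivity", "specificity", "balanced_accuracy")
}

for metric, hierarchy in hierarchies.items():
    chosen = {level: model.name for level, model in hierarchy.items()}
    print(metric, chosen)
```

In this toy pool the three hierarchies differ at the same coverage level, which is why a query compound would receive a profile of predictions rather than a single one.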
Original language: English
Journal: SAR and QSAR in Environmental Research
ISSN: 1062-936X
Publication status: Accepted/In press - 2025

Keywords

  • QSAR
  • Post-hoc ensemble models
  • Hierarchical modelling
  • Cross-validation
  • DanishQSAR
