TY - JOUR
T1 - High performance, large chemical coverage or both
T2 - DanishQSAR and hierarchies of post-hoc ensemble models optimized for sensitivity, specificity or balanced accuracy
AU - Nikolov, N. G.
AU - Wedebye, E. B.
PY - 2025
Y1 - 2025
N2 - The trade-off between applicability domain size and prediction accuracy is a well-known phenomenon in QSAR. We have developed a modelling approach where multiple models with different applicability domain sizes and with different prediction accuracy are selected instead of a single best model. This approach is implemented in DanishQSAR, a new software for binary classification QSAR modelling, integrating descriptor calculation, descriptor selection, model development, validation and application. The various methods and options available in the software are automatically tested and efficiently combined during model development using a version of cross-validation-based grid search and post-hoc ensemble modelling. The resulting large and diverse pool of model candidates is then analysed to generate three hierarchies of models, optimized for sensitivity, specificity or balanced accuracy, respectively, for minimum to maximum coverage levels. When predicting a query compound, the system provides predictions from all models in the three hierarchies, at all coverage levels with user-defined steps, together with the individual model predictivity performances, producing a prediction profile rather than one prediction from a single model. Twenty data sets from the Danish (Q)SAR Database (https://qsar.food.dtu.dk) are used to demonstrate the performance. The developed binary classification models are highly accurate by cross- and external validation.
KW - QSAR
KW - Post-hoc ensemble models
KW - Hierarchical modelling
KW - Cross-validation
KW - DanishQSAR
U2 - 10.1080/1062936X.2025.2510964
DO - 10.1080/1062936X.2025.2510964
M3 - Journal article
C2 - 40462635
SN - 1062-936X
JO - SAR and QSAR in Environmental Research
JF - SAR and QSAR in Environmental Research
ER -