The effect of rational selection of training sets from an imbalanced AhR activation dataset on QSAR models accuracy and applicability domain coverage for a large set of REACH substances

Kyrylo Oleksandrovych Klimenko, Sine Abildgaard Rosenberg, Marianne Dybdahl, Eva Bay Wedebye, Nikolai Georgiev Nikolov

Research output: Contribution to conference › Conference abstract for conference › Research › peer-review

Abstract

The aryl hydrocarbon receptor (AhR) is a ligand-dependent transcription factor that regulates the expression of multiple genes important for, among other things, organ development, the immune system, and the metabolism of exogenous and endogenous small molecules. AhR activation by industrial chemical substances may lead to increased turnover of endogenous estrogen and thyroid hormones, possibly resulting in adverse outcomes.
A PubChem experimental data set on AhR activation, comprising 324,858 chemical substances and heavily skewed towards inactives, was used to develop QSAR models with a stepwise rational training set selection approach. Initial models were built from randomly selected equal proportions of actives and inactives. Their predictions for large external selection sets of inactives were then used to rationally select inactives to add to the training sets, and this process was iterated to produce the final models. Two approaches were taken to select additional training set compounds: in the first, substances were added that were either incorrectly predicted as positives or fell outside the structural or probability applicability domain; in the second, substances were added with a more focused scope, to optimize the applicability domain for REACH substances. Final models resulting from both approaches were used to predict approximately 80,000 REACH industrial chemical substances. The advantages and applicability of each approach to predicting potential endocrine disruptors are discussed.
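The first selection approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid "model", the probability thresholds, the batch size, and every function name are assumptions chosen only to make the loop concrete, and the structural applicability domain check mentioned in the abstract is omitted.

```python
import random
from statistics import mean


def fit_centroids(actives, inactives):
    """Toy stand-in for a QSAR model (assumption): one centroid per class."""
    dim = len(actives[0])
    active_c = [mean(x[i] for x in actives) for i in range(dim)]
    inactive_c = [mean(x[i] for x in inactives) for i in range(dim)]
    return active_c, inactive_c


def distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def predict_active_prob(model, x):
    """Pseudo-probability of activity from relative centroid distances."""
    active_c, inactive_c = model
    d_act, d_inact = distance(x, active_c), distance(x, inactive_c)
    if d_act + d_inact == 0:
        return 0.5
    return d_inact / (d_act + d_inact)


def in_probability_domain(prob, low=0.3, high=0.7):
    """Hypothetical probability applicability domain: confident calls only."""
    return prob <= low or prob >= high


def rational_selection(actives, inactive_pool, rounds=3, batch=50, seed=0):
    """Iteratively add 'difficult' inactives to a balanced training set."""
    rng = random.Random(seed)
    pool = list(inactive_pool)
    rng.shuffle(pool)
    # Initial model: equal numbers of actives and randomly chosen inactives.
    train_inactives = pool[:len(actives)]
    pool = pool[len(actives):]
    model = fit_centroids(actives, train_inactives)
    for _ in range(rounds):
        flagged, rest = [], []
        for x in pool:
            p = predict_active_prob(model, x)
            # Known inactive predicted active (false positive),
            # or prediction outside the probability domain.
            if p >= 0.5 or not in_probability_domain(p):
                flagged.append(x)
            else:
                rest.append(x)
        if not flagged:
            break
        train_inactives += flagged[:batch]
        pool = flagged[batch:] + rest
        model = fit_centroids(actives, train_inactives)
    return model, train_inactives, pool


if __name__ == "__main__":
    # Synthetic two-descriptor data, for illustration only.
    rng = random.Random(42)
    actives = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(20)]
    inactives = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
    model, train_inactives, pool = rational_selection(actives, inactives)
    print(len(train_inactives), "inactives in the final training set")
```

In a real setting the toy model would be replaced by an actual QSAR learner with descriptor-based structural domain checks. The second approach would presumably restrict the flagged compounds to those relevant to the REACH chemical space, a detail the abstract does not specify.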
Original language: English
Publication date: 2018
Number of pages: 1
Publication status: Published - 2018
Event: QSAR2018: 18th International Conference on QSAR in Environmental and Health Sciences - Rikli balance hotel, Bled, Slovenia
Duration: 11 Jun 2018 to 15 Jun 2018
Conference number: 18th
http://www.qsar2018.com/

Conference

Conference: QSAR2018
Number: 18th
Location: Rikli balance hotel
Country: Slovenia
City: Bled
Period: 11/06/2018 to 15/06/2018
Internet address: http://www.qsar2018.com/

Cite this

@conference{91c050b92db74d17bdf3843200e457c0,
title = "The effect of rational selection of training sets from an imbalanced AhR activation dataset on QSAR models accuracy and applicability domain coverage for a large set of REACH substances",
abstract = "The aryl hydrocarbon receptor (AhR) is a ligand-dependent transcription factor that regulates the expression of multiple genes of importance for among other things organ development, the immune system and the metabolism of exogenous and endogenous small molecules. AhR activation by industrial chemical substances may lead to increased turnover of the endogenous estrogen and thyroid hormones, possibly resulting in adverse outcomes. A PubChem experimental data set on AhR activation with 324,858 chemical substances which is heavily skewed towards inactives was used to develop QSAR models using a stepwise rational training set selection approach. After randomly selecting equal proportions of actives and inactives to make initial models, predictions of large external inactive selection sets were made and used to rationally select and add inactives to the training sets. This was done in an iterative process to produce final models. Two approaches were taken to select additional training set compounds: in the first approach substances were added that were either predicted incorrectly as positives or were out of structural or probability applicability domain, and in the second approach substances were added with a more focused scope to optimize the applicability domain for REACH substances. Final models resulting from both approaches were used to predict approximately 80,000 REACH industrial chemical substances. The advantages and applicability of each approach to predicting potential endocrine disruptors are discussed.",
author = "Klimenko, {Kyrylo Oleksandrovych} and {Abildgaard Rosenberg}, Sine and Marianne Dybdahl and Wedebye, {Eva Bay} and Nikolov, {Nikolai Georgiev}",
year = "2018",
language = "English",
note = "QSAR2018 : 18th International Conference on QSAR in Environmental and Health Sciences, QSAR2018 ; Conference date: 11-06-2018 Through 15-06-2018",
url = "http://www.qsar2018.com/",

}

The effect of rational selection of training sets from an imbalanced AhR activation dataset on QSAR models accuracy and applicability domain coverage for a large set of REACH substances. / Klimenko, Kyrylo Oleksandrovych; Abildgaard Rosenberg, Sine; Dybdahl, Marianne; Wedebye, Eva Bay; Nikolov, Nikolai Georgiev.

2018. Abstract from QSAR2018, Bled, Slovenia.

TY - ABST

T1 - The effect of rational selection of training sets from an imbalanced AhR activation dataset on QSAR models accuracy and applicability domain coverage for a large set of REACH substances

AU - Klimenko, Kyrylo Oleksandrovych

AU - Abildgaard Rosenberg, Sine

AU - Dybdahl, Marianne

AU - Wedebye, Eva Bay

AU - Nikolov, Nikolai Georgiev

PY - 2018

Y1 - 2018

N2 - The aryl hydrocarbon receptor (AhR) is a ligand-dependent transcription factor that regulates the expression of multiple genes of importance for among other things organ development, the immune system and the metabolism of exogenous and endogenous small molecules. AhR activation by industrial chemical substances may lead to increased turnover of the endogenous estrogen and thyroid hormones, possibly resulting in adverse outcomes. A PubChem experimental data set on AhR activation with 324,858 chemical substances which is heavily skewed towards inactives was used to develop QSAR models using a stepwise rational training set selection approach. After randomly selecting equal proportions of actives and inactives to make initial models, predictions of large external inactive selection sets were made and used to rationally select and add inactives to the training sets. This was done in an iterative process to produce final models. Two approaches were taken to select additional training set compounds: in the first approach substances were added that were either predicted incorrectly as positives or were out of structural or probability applicability domain, and in the second approach substances were added with a more focused scope to optimize the applicability domain for REACH substances. Final models resulting from both approaches were used to predict approximately 80,000 REACH industrial chemical substances. The advantages and applicability of each approach to predicting potential endocrine disruptors are discussed.

M3 - Conference abstract for conference

ER -