Machine Learning and Deep Learning Applications for New and Improved Property Predictions

Research output: Book/Report › Ph.D. thesis

Abstract

Molecules and chemical compounds are vital to any chemical engineering application and are essential building blocks for the sustainability of our society. Chemical products are found in a wide range of fields and applications, including agriculture, medicine, food, sanitation, and various industrial processes. Given their widespread use, it is no surprise that they play an integral part in creating a more sustainable future. Identifying chemical products that can provide a desired functionality falls under the field of chemical product design. Whether a molecule or chemical compound can satisfy a given set of requirements depends directly on its properties, which fall into categories such as thermal, physical, toxicological, transport, safety, and environmental. These properties are usually determined experimentally. However, given the sheer size of the chemical design space, a purely experimental approach is impractical, so it is vital to devise alternative approaches to infer these properties.

Quantitative structure-property relationships (QSPRs) are mathematical models that take a numerical translation of a molecule's structural information as input and infer the value of a given property. The in-silico quantification of molecular properties enables a more directed and focused experimental approach in which only the most promising candidates are considered. Current QSPR models used in the chemical engineering field are largely based on the concept of linear additive group-contribution (GC) models. These models use a predefined set of rules to segment a molecule into its constituent functional groups and represent the molecule in terms of how often each group occurs. GC models have a long list of advantages: they are simple, interpretable, “accurate”, computationally fast and inexpensive, partially rooted in a fundamental understanding of the nature of the property, and have been applied to a large span of properties. However, GC models also have inherent flaws. Their simple nature prevents them from describing more complex phenomena with non-linear trends, and they do not consider proximity effects between adjacent groups. In addition, the contribution of a group can only be determined if the group is present in the data, so their extrapolative ability is limited and largely untested during model development.
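The linear additive form described above can be illustrated with a minimal sketch. It is not the GC model used in the thesis: the group names, contribution values, and intercept are hypothetical numbers chosen only for illustration, and `predict_property` is an assumed helper name. The sketch also shows the limitation noted above, namely that a group without a fitted contribution cannot be handled.

```python
from collections import Counter

# Hypothetical contribution values, for illustration only (not fitted to any data).
CONTRIBUTIONS = {"CH3": 1.5, "CH2": 0.8, "OH": 4.2}
INTERCEPT = 10.0  # hypothetical constant term


def predict_property(groups):
    """Linear additive GC estimate: intercept + sum(count_i * contribution_i)."""
    counts = Counter(groups)
    unknown = set(counts) - set(CONTRIBUTIONS)
    if unknown:
        # A group that never appeared in the training data has no contribution.
        raise KeyError(f"no contribution available for groups: {sorted(unknown)}")
    return INTERCEPT + sum(n * CONTRIBUTIONS[g] for g, n in counts.items())


# 1-propanol segmented into groups as CH3-CH2-CH2-OH
propanol = ["CH3", "CH2", "CH2", "OH"]
estimate = predict_property(propanol)
print(f"estimated property value: {estimate:.2f}")
```

Because the estimate depends only on group counts, two isomers with the same groups receive the same value, which is one way to see why proximity effects are not captured.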

Improving QSPRs is largely related to three elements: data, representation, and correlation. Most attention has been directed towards improving the structural representation and the underlying models that relate it to the target properties. This has been largely supported by recent advancements in the fields of machine learning (ML) and artificial intelligence (AI), and more specifically by advances in feature extraction and functional approximation. Graph neural networks (GNNs) offer an end-to-end framework (thanks to the backpropagation algorithm) that is capable of “learning” a molecular representation from an abstraction of the molecule as a graph, in which nodes represent atoms and edges represent bonds. Both nodes and edges are embedded and attributed with a feature vector that contains information about the atoms and chemical bonds, respectively. While these models eliminate the need for defining groups and are capable of approximating any function (universal approximation theorem), they also come with a set of drawbacks: they are highly parameterized models, they are black-box models, the uncertainty quantification of the model output is largely taken for granted, and their application to properties of interest in the chemical engineering field is still unexplored. This work aims to address some of these points.
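The graph abstraction described above can be sketched in a few lines. This is a simplified illustration, not the architecture developed in the thesis: the feature choices are hypothetical, the update has no learned weights, and the edge features are stored but not used in the aggregation step.

```python
# Ethanol, heavy atoms only: C0 - C1 - O2
# Node features (hypothetical choice): [is_carbon, is_oxygen, attached_hydrogens]
node_features = [
    [1.0, 0.0, 3.0],  # C0 (CH3)
    [1.0, 0.0, 2.0],  # C1 (CH2)
    [0.0, 1.0, 1.0],  # O2 (OH)
]
# Edge features (hypothetical choice): [is_single_bond, is_double_bond]
# Stored to show the data structure; unused by the simplified update below.
edges = {
    (0, 1): [1.0, 0.0],
    (1, 2): [1.0, 0.0],
}


def neighbours(atom, edges):
    """Return indices of atoms bonded to `atom` (bonds are undirected)."""
    result = []
    for a, b in edges:
        if a == atom:
            result.append(b)
        elif b == atom:
            result.append(a)
    return result


def message_passing_step(node_features, edges):
    """Update each atom by adding the feature vectors of its bonded neighbours."""
    updated = []
    for i, h in enumerate(node_features):
        new_h = list(h)
        for j in neighbours(i, edges):
            new_h = [x + y for x, y in zip(new_h, node_features[j])]
        updated.append(new_h)
    return updated


def readout(node_features):
    """Sum the atom vectors into a single molecule-level vector."""
    return [sum(column) for column in zip(*node_features)]


updated = message_passing_step(node_features, edges)
molecule_vector = readout(updated)
print(molecule_vector)
```

In a trained GNN the sum would be combined with learned weight matrices and non-linearities, and the molecule-level vector would feed a regression head that predicts the property.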

The first research question addresses the extent to which current property modeling can be improved through outlier treatment and through alternative functional approximation techniques, in which the linear additive part of the model is replaced with ML-based models in the form of Gaussian processes and artificial neural networks. The study showed that outlier treatment comes with a drawback: it can eliminate groups and, consequently, their contributions. While ML-based approaches to functional approximation do provide better correlations, they are ultimately limited by the nature of the representation used.

The second research question addresses the gap between models “rooted” in a fundamental understanding of the properties, in the form of GC models, and the data-driven approach. This is done by adding an aspect of interpretability to GNN models, combining the attention mechanism with prior knowledge in the form of the functional groups defined as part of the GC models. The developed models (GroupGAT and attentive GC) provided insights that are, in most cases, in agreement with those from GC models. Furthermore, the models achieved accuracy competitive with state-of-the-art GNN-based models in the literature.

The third research question addresses the different approaches to quantifying the uncertainty in the predictions of GNN-based models. Here, we highlight three techniques: ensembling, last-layer Monte Carlo dropout, and bootstrapping. The techniques produced different results and varying degrees of confidence in their predictions, which depend strongly on the data used and the weight initialization. The study concludes that ensembling and bootstrapping are the most widely adopted techniques.
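As a rough illustration of the bootstrap idea mentioned above, the following sketch refits a model on resampled copies of the data and uses the spread of the predictions as an uncertainty estimate. It is a sketch under stated assumptions, not the procedure used in the thesis: a one-variable least-squares line stands in for the GNN, the data are synthetic, and `fit_line` and `bootstrap_predict` are assumed helper names.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def bootstrap_predict(xs, ys, x_new, n_models=200, seed=0):
    """Return the mean and standard deviation of predictions at x_new."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    while len(preds) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: a line cannot be fitted
        by = [ys[i] for i in idx]
        slope, intercept = fit_line(bx, by)
        preds.append(slope * x_new + intercept)
    return statistics.fmean(preds), statistics.stdev(preds)


# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
data_rng = random.Random(42)
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 0.5) for x in xs]

mean_in, std_in = bootstrap_predict(xs, ys, x_new=4.5)    # inside the data range
mean_out, std_out = bootstrap_predict(xs, ys, x_new=30.0)  # far outside it
print(f"x=4.5:  {mean_in:.2f} +/- {std_in:.2f}")
print(f"x=30.0: {mean_out:.2f} +/- {std_out:.2f}")
```

An ensemble differs mainly in what varies between members: instead of resampling the data, each member is trained on the same data from a different random weight initialization.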

The fourth (and last) research question addresses the extent to which the developed GNN models can model the properties of interest in the chemical engineering field. The study covered 30 properties with different applications and showed that GNN models are a “one model fits all” approach that provides state-of-the-art accuracy compared to the GC models widely used for chemical product design.

Considering the intensity and pace of GNN model development, there is a growing lack of empirical rigor, which might limit the understanding of how such data-driven models operate. Finally, we therefore address the future direction of property modeling, with the aim of further extending the models' domain of applicability and accuracy.
Original language: English
Place of Publication: Kgs. Lyngby
Publisher: Technical University of Denmark
Number of pages: 197
Publication status: Published - 2023
