Abstract
Modern applications, in both business and science contexts, involve numerous operational data storage systems and large amounts of heterogeneous data being collected. At the same time, data generation is generally error prone: the data entry process will always produce dirty data to some extent, whether caused by human or system failure. Data integration comprises the task of cleaning dirty data and reconciling the different data sources into one homogeneous data set, a crucial step toward developing big data applications such as machine learning models that rely on vast amounts of data. However, the pace of data creation has by far exceeded the capability of current data integration approaches, as these often rely on domain experts. As a consequence, a large fraction of valuable data is never utilized for analysis, leaving unused potential in every business field. This thesis therefore seeks to develop data integration methods for big data applications by employing machine learning algorithms to ease the data integration problem in domain-specific contexts, which is covered by a threefold contribution.

First, this thesis provides research contributions that investigate algorithms in the areas of data fusion and relational machine learning, with a focus on their suitability for solving data integration challenges. A case study of an enzyme producer in the process industry involving multiple data sources showed that, before the actual task of developing a machine learning solution, the data sources require careful data integration steps, which are inherently hard. Many of these challenges are complex because they require relational knowledge about the data at hand that often goes beyond the data source, or is not contained in the data itself at all, which frequently necessitates involving domain experts who possess the required knowledge. To accommodate this requirement of understanding relational knowledge, the thesis covers three research contributions in the field of relational machine learning. These include the Bayesian Cut, a specialized model for community detection in graphs, as well as two contributions that investigate knowledge graph embedding models in greater detail. In particular, a knowledge graph embedding model framework was developed that enables composability and reproducibility, accompanied by a large-scale benchmarking study that made the contribution of individual components transparent, so that the necessary design decisions become more comprehensible and accessible. Furthermore, a new measure was proposed that alleviates a conceptually flawed way of measuring the performance of knowledge graph embedding models, which is vital when relying on such models in big data applications.

Second, since knowledge graph embedding models are suitable knowledge integrators that can predict new knowledge based on relations between objects, this thesis proposes a data integration framework based purely on machine learning, which combines other machine learning approaches with knowledge graph embedding models acting as an artificial domain expert to solve the above-mentioned data integration challenges. In the proof-of-concept experiments covered in this thesis, the approach proved very suitable and showed promising merits: it scales well and can accommodate various existing data profiling approaches as well as various types of information, all handled through the knowledge graph embedding model in a relational fashion.

Third, all research contributions and experiments are based on research software created as part of this thesis, which is openly available, accompanied by journal publications, and thoroughly documented. It thus fosters future research in the covered areas and beyond, as well as applications that build on it, such as the data integration framework proposed in this thesis.
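To make the abstract's central idea concrete, the following is a minimal illustrative sketch (not the thesis's actual models or software) of how a knowledge graph embedding model can score and rank candidate facts. It assumes a TransE-style scoring function, where a triple (head, relation, tail) is considered plausible when head + relation lies close to tail in the embedding space; all names, dimensions, and the random toy embeddings are hypothetical.

```python
import numpy as np

# Toy setup: random embeddings stand in for a trained model.
rng = np.random.default_rng(0)
n_entities, dim = 5, 4
E = rng.normal(size=(n_entities, dim))  # entity embeddings
r = rng.normal(size=dim)                # embedding of one relation


def score(h: int, t: int) -> float:
    """TransE-style plausibility: higher is better
    (negative L2 distance between h + r and t)."""
    return -float(np.linalg.norm(E[h] + r - E[t]))


# Link prediction: for head entity 0, rank every entity as candidate tail.
scores = np.array([score(0, t) for t in range(n_entities)])
ranking = np.argsort(-scores)  # best-scoring candidate first

# Rank-based evaluation: find the position of a known true tail (say
# entity 3) among all candidates. Averaging this rank over a test set
# yields the mean rank; its dependence on the number of candidates is
# one motivation for chance-corrected, adjusted evaluation measures.
true_tail = 3
rank = int(np.where(ranking == true_tail)[0][0]) + 1
print(rank)  # a value between 1 and n_entities
```

In a data integration setting, the same mechanism lets the embedding model act as an artificial domain expert: candidate correspondences between records or columns can be expressed as triples and ranked by plausibility.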
Original language: English
Publisher: Technical University of Denmark
Number of pages: 186
Publication status: Published - 2022
Fingerprint
Dive into the research topics of 'Data Integration for Industrial Big Data Applications'. Together they form a unique fingerprint.

Projects
1 Finished
Big Data Analytics with special emphasis on Food Supply Chain Data
Vermue, L., Ersbøll, B. K., Hansen, L. K., Madsen, K. H., Assent, I. & Abedjan, Z.
15/03/2017 → 08/12/2021
Project: PhD