Data Integration for Industrial Big Data Applications

Laurent Vermue

Research output: Book/Report > Ph.D. thesis > Research

Abstract

Modern applications in both business and science involve several operational data storage systems and collect large amounts of heterogeneous data. At the same time, data generation is generally error-prone: the data entry process will always produce dirty data to some extent, caused by either human or system failure. Data integration comprises cleaning dirty data and reconciling the different data sources into one homogeneous data set, a crucial step toward developing big data applications such as machine learning models that rely on vast amounts of data. However, the pace of data creation has far exceeded the capability of current data integration approaches, which often rely on domain experts. As a consequence, a large fraction of valuable data is never used for analysis, leaving potential untapped in every business field, a problem that needs to be addressed. To this end, this thesis seeks to develop data integration methods for big data applications that employ machine learning algorithms to ease the data integration problem in domain-specific contexts. Its contribution is threefold.

First, this thesis provides research contributions that investigate algorithms in data fusion and relational machine learning, with a focus on their suitability for solving data integration challenges. A case study of an enzyme producer in the process industry, involving multiple data sources, showed that these sources require careful data integration steps, which are inherently hard, before the actual task of developing a machine learning solution can begin. Many data integration challenges are complex because they require relational knowledge of the data at hand that often goes beyond the data source or is not even contained in the data itself, which often makes it necessary to involve domain experts who possess the required knowledge. To accommodate this need for relational knowledge, this thesis covers three research contributions in the field of relational machine learning. These include the Bayesian Cut, a specialized model for community detection in graphs, and two contributions that investigate knowledge graph embedding models in greater detail. In particular, a framework for knowledge graph embedding models was developed that enables composability and reproducibility. It was accompanied by a large-scale benchmarking study that made the contribution of individual components transparent, so that the necessary design decisions are more comprehensible and accessible. Furthermore, a new measure was proposed that alleviates the conceptual flaws in the current way of measuring the performance of knowledge graph embedding models, which is vital when relying on such models in big data applications.

Second, because knowledge graph embedding models are suitable knowledge integrators that can predict new knowledge from relations between objects, this thesis proposes a data integration framework based purely on machine learning. The framework combines other machine learning approaches with knowledge graph embedding models, which act as an artificial domain expert, to solve the above-mentioned data integration challenges. In the proof-of-concept experiments covered in this thesis, this approach proved suitable and showed promising merits: it is highly scalable while accommodating various existing data profiling approaches and various types of information, which the knowledge graph embedding model handles in a relational fashion.

Third, all research contributions and experiments are based on research software created as part of this thesis. The software is openly available, thoroughly documented, and accompanied by journal publications, and thus fosters future research in the covered areas and beyond, as well as applications that build on it, such as the data integration framework proposed in this thesis.

Original language: English
Publisher: Technical University of Denmark
Number of pages: 186
Publication status: Published - 2022
