Optimizing Semantic Joinability in Heterogeneous Data: A Triplet-Based Approach with Pre-trained Deep Learning Models

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

Abstract

This paper presents a novel approach to optimizing semantic joinability in heterogeneous data, using embedding techniques and deep learning in big data environments. We propose two distinct embedding strategies: a text-based approach using DistilBERT and an image-based approach using ResNet50. Both are fine-tuned on a triplet data structure with a circle loss function to improve joinability prediction in data lakes. By transforming tabular data into semantic embeddings, our method facilitates more effective integration of large, diverse datasets. Experiments conducted on large-scale datasets show that the fine-tuned models significantly outperform baseline approaches in accuracy and robustness. Furthermore, incorporating computer vision techniques via the image-based method demonstrates the versatility of embedding strategies across different data types. The results suggest that pre-trained models, when fine-tuned for specific joinability tasks, can provide a scalable and efficient solution for extensive data integration. Future work will explore extending this approach to additional data modalities and optimizing model performance in large-scale applications.
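The abstract names fine-tuning on triplets with a circle loss but does not give the formulation. As a rough illustration only, the sketch below applies the commonly published circle loss to a single (anchor, joinable, non-joinable) triplet of column embeddings. The function names, the `margin` and `gamma` values, and the toy two-dimensional vectors are all assumptions made for this example; they are not taken from the paper, and in practice the embeddings would come from the fine-tuned DistilBERT or ResNet50 encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def circle_loss(sim_pos, sim_neg, margin=0.25, gamma=32.0):
    """Circle loss over lists of positive and negative similarities.

    margin and gamma are illustrative values, not the paper's settings.
    """
    # Each similarity gets its own weight: pairs far from their optimum
    # (1 + margin for positives, -margin for negatives) are weighted more.
    logits_pos = [
        -gamma * max(0.0, 1.0 + margin - s) * (s - (1.0 - margin))
        for s in sim_pos
    ]
    logits_neg = [
        gamma * max(0.0, s + margin) * (s - margin)
        for s in sim_neg
    ]
    z = _logsumexp(logits_pos) + _logsumexp(logits_neg)
    # Stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def triplet_circle_loss(anchor, positive, negative, margin=0.25, gamma=32.0):
    """Circle loss for one (anchor, joinable, non-joinable) embedding triplet."""
    return circle_loss(
        [cosine(anchor, positive)],
        [cosine(anchor, negative)],
        margin=margin,
        gamma=gamma,
    )


if __name__ == "__main__":
    # Toy embeddings (hypothetical): the joinable column points the same way
    # as the anchor, the non-joinable column is orthogonal to it.
    anchor, joinable, non_joinable = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
    print(triplet_circle_loss(anchor, joinable, non_joinable))
    # Swapping the roles simulates a badly embedded triplet.
    print(triplet_circle_loss(anchor, non_joinable, joinable))
```

A well-separated triplet yields a loss near zero, while a triplet whose joinable column sits farther from the anchor than the non-joinable one is penalized heavily, which is the signal that would drive fine-tuning under these assumptions.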
Original language: English
Title of host publication: Proceedings of the 2024 IEEE International Conference on Big Data (BigData)
Publisher: IEEE
Publication date: 2025
Pages: 6092-6100
ISBN (Print): 979-8-3503-6249-7
ISBN (Electronic): 979-8-3503-6248-0
Publication status: Published - 2025
Event: 2024 IEEE International Conference on Big Data, Washington DC, United States
Duration: 15 Dec 2024 to 18 Dec 2024

Conference

Conference: 2024 IEEE International Conference on Big Data
Country/Territory: United States
City: Washington DC
Period: 15/12/2024 to 18/12/2024

Keywords

  • Big data
  • Deep learning
  • Fine-tuning
  • Tabular data
  • Feature representation learning
