Abstract
This paper presents a novel approach to optimizing semantic joinability in heterogeneous data, leveraging embedding techniques and deep learning in the context of big data environments. We propose two distinct embedding strategies: a text-based approach using DistilBERT and an image-based approach utilizing ResNet50, both fine-tuned using a triplet data structure and circle loss functions to enhance joinability predictions from data lakes. By transforming tabular data into semantic embeddings, our method facilitates more effective integration of large, diverse datasets. Experiments conducted on large-scale datasets show that the fine-tuned models significantly outperform baseline approaches in accuracy and robustness. Furthermore, incorporating computer vision techniques via the image-based method demonstrates the versatility of embedding strategies across different data types. The results suggest that pre-trained models, when fine-tuned for specific joinability tasks, can provide a scalable and efficient solution for extensive data integration. Future work will explore expanding this approach to additional data modalities and optimizing model performance in large-scale applications.
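The abstract names a triplet data structure and a circle loss but does not give the formulation. As a rough illustration only, the sketch below applies the standard circle loss to a single (anchor, positive, negative) triplet of embedding vectors. The function names, the use of cosine similarity, and the `margin` and `gamma` values are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a non-empty list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def circle_loss(sim_pos, sim_neg, margin=0.25, gamma=64.0):
    """Standard circle loss over lists of positive and negative similarities.

    margin and gamma are illustrative defaults (assumed, not from the paper).
    """
    o_p, o_n = 1.0 + margin, -margin   # optima for positive / negative scores
    d_p, d_n = 1.0 - margin, margin    # decision margins
    # Each score is weighted by how far it still is from its optimum.
    logits_p = [-gamma * max(o_p - s, 0.0) * (s - d_p) for s in sim_pos]
    logits_n = [gamma * max(s - o_n, 0.0) * (s - d_n) for s in sim_neg]
    # log(1 + sum(exp(logits_n)) * sum(exp(logits_p))), written as a softplus.
    z = _logsumexp(logits_n) + _logsumexp(logits_p)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def triplet_circle_loss(anchor, positive, negative, margin=0.25, gamma=64.0):
    """Circle loss for one triplet: one positive and one negative similarity."""
    return circle_loss(
        [cosine(anchor, positive)],
        [cosine(anchor, negative)],
        margin,
        gamma,
    )


# Hypothetical 2-d column embeddings: a joinable pair and an unrelated column.
anchor, joinable, unrelated = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
good = triplet_circle_loss(anchor, joinable, unrelated)   # roles assigned correctly
bad = triplet_circle_loss(anchor, unrelated, joinable)    # roles swapped
print(good < bad)
```

In a fine-tuning setup, the vectors would presumably come from the encoder (DistilBERT or ResNet50 in the abstract) and the loss would be computed with an autodiff framework over batches. The pure-Python version here only shows the shape of the objective: it is near zero when the joinable column is close to the anchor and the unrelated one is far, and large when the roles are swapped.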
Original language | English |
---|---|
Title of host publication | Proceedings of the 2024 IEEE International Conference on Big Data (BigData) |
Publisher | IEEE |
Publication date | 2025 |
Pages | 6092-6100 |
ISBN (Print) | 979-8-3503-6249-7 |
ISBN (Electronic) | 979-8-3503-6248-0 |
Publication status | Published - 2025 |
Event | 2024 IEEE International Conference on Big Data, Washington DC, United States. Duration: 15 Dec 2024 → 18 Dec 2024 |
Conference
Conference | 2024 IEEE International Conference on Big Data |
---|---|
Country/Territory | United States |
City | Washington DC |
Period | 15/12/2024 → 18/12/2024 |
Keywords
- Big data
- Deep learning
- Fine-tuning
- Tabular data
- Feature representation learning