Abstract
Semantic representations of text, i.e., representations of natural language that capture meaning geometrically, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of the semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts, while most texts (the so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, we can reduce both error rate and hubness using hubness reduction methods. We identify a combination of two methods as yielding the best reduction. For example, on one of the tested pretrained models, this combined method reduces hubness by about 75% and error rate by about 9%. We therefore argue that mitigating hubness in the embedding space provides better semantic representations of text.
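The abstract does not specify which hubness score the paper uses; a common choice in the literature is the skewness of the k-occurrence distribution N_k, where N_k(x) counts how often a text x appears among the k nearest neighbours of the other texts. The sketch below illustrates that measure under assumptions not taken from the paper: the value of k and the random vectors standing in for real Sentence-BERT embeddings are illustrative only.

```python
# A minimal sketch of one common hubness score: the skewness of the
# k-occurrence distribution. k and the stand-in data are assumptions,
# not details taken from the paper.
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

def k_occurrence_skewness(embeddings: np.ndarray, k: int = 10) -> float:
    """Return the skewness of the k-occurrence distribution N_k.

    N_k(x) counts how often point x appears among the k nearest
    neighbours of the other points; a strongly right-skewed N_k
    indicates hubness (a few hubs, many anti-hubs).
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, indices = nn.kneighbors(embeddings)
    # Drop the first column: each point is its own nearest neighbour.
    neighbour_ids = indices[:, 1:].ravel()
    n_k = np.bincount(neighbour_ids, minlength=len(embeddings))
    return float(skew(n_k))

# Random high-dimensional vectors standing in for sentence embeddings
# (e.g. 768-dimensional, as produced by typical Sentence-BERT models).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
print(f"k-occurrence skewness: {k_occurrence_skewness(X):.2f}")
```

Hubness reduction methods such as mutual proximity or local scaling rescale distances so that neighbourhood relations become more symmetric; the abstract does not name which two methods the paper combines, so the above shows only the measurement side.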
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 5th Northern Lights Deep Learning Conference (NLDL) |
| Volume | 233 |
| Publisher | Proceedings of Machine Learning Research |
| Publication date | 2024 |
| Pages | 181-204 |
| Publication status | Published - 2024 |
| Event | 5th Northern Lights Deep Learning Conference, Tromsø, Norway, 9 Jan 2024 → 11 Jan 2024 (Conference number: 5) |
Conference
| Conference | 5th Northern Lights Deep Learning Conference |
|---|---|
| Number | 5 |
| Country/Territory | Norway |
| City | Tromsø |
| Period | 09/01/2024 → 11/01/2024 |