Towards a universal representation for audio information retrieval and analysis

Bjørn Sand Jensen, Rasmus Troelsgaard, Jan Larsen, Lars Kai Hansen

Research output: Contribution to journal › Conference article › Research › peer-review

Abstract

A fundamental and general representation of audio and music which integrates multi-modal data sources is important for both application and basic research purposes. In this paper we address this challenge by proposing a multi-modal version of the Latent Dirichlet Allocation model which provides a joint latent representation. We evaluate this representation on the Million Song Dataset by integrating three fundamentally different modalities, namely tags, lyrics, and audio features. We show how the resulting representation is aligned with common 'cognitive' variables such as tags, and provide some evidence for the common assumption that genres form an acceptable categorization when evaluating latent representations of music. We furthermore quantify the model by its predictive performance in terms of genre and style, providing benchmark results for the Million Song Dataset.
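The model described above couples the modalities through a shared per-song topic distribution, while each modality (tags, lyrics, audio features) keeps its own topic-specific word distributions. As a rough, non-authoritative sketch of that idea — a collapsed Gibbs sampler over toy integer tokens, not the authors' implementation or their actual inference scheme — the joint latent representation can be illustrated like this:

```python
import random

def multimodal_lda(docs, n_topics, vocab_sizes, iters=200,
                   alpha=0.1, beta=0.1, seed=0):
    """Toy collapsed Gibbs sampler for a multi-modal LDA.

    docs: list of documents; each is a dict mapping a modality name
          (e.g. "tags", "lyrics") to a list of integer token ids.
    vocab_sizes: dict mapping modality name to vocabulary size.
    Returns per-document topic proportions -- the joint latent
    representation shared by all modalities.
    """
    rng = random.Random(seed)
    modalities = list(vocab_sizes)
    D = len(docs)
    # Counts: topic usage per document (shared across modalities),
    # and token counts per topic, kept separately per modality.
    nd = [[0] * n_topics for _ in range(D)]
    nw = {m: [[0] * vocab_sizes[m] for _ in range(n_topics)]
          for m in modalities}
    nz = {m: [0] * n_topics for m in modalities}
    # Random initialisation of topic assignments.
    z = []
    for d, doc in enumerate(docs):
        zd = {}
        for m in modalities:
            zd[m] = []
            for w in doc.get(m, []):
                k = rng.randrange(n_topics)
                zd[m].append(k)
                nd[d][k] += 1
                nw[m][k][w] += 1
                nz[m][k] += 1
        z.append(zd)
    # Gibbs sweeps: resample each token's topic given all others.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for m in modalities:
                V = vocab_sizes[m]
                for i, w in enumerate(doc.get(m, [])):
                    k = z[d][m][i]
                    nd[d][k] -= 1; nw[m][k][w] -= 1; nz[m][k] -= 1
                    # Shared document-topic factor times the
                    # modality-specific topic-word factor.
                    weights = [(nd[d][t] + alpha)
                               * (nw[m][t][w] + beta) / (nz[m][t] + V * beta)
                               for t in range(n_topics)]
                    k = rng.choices(range(n_topics), weights=weights)[0]
                    z[d][m][i] = k
                    nd[d][k] += 1; nw[m][k][w] += 1; nz[m][k] += 1
    # Smoothed per-document topic proportions.
    return [[(nd[d][t] + alpha) / (sum(nd[d]) + n_topics * alpha)
             for t in range(n_topics)] for d in range(D)]
```

For example, two toy songs with "tags" and "lyrics" modalities would be passed as `docs = [{"tags": [0, 0, 1], "lyrics": [2, 2]}, {"tags": [3, 3], "lyrics": [0, 1]}]`, and the returned rows are the shared topic mixtures one could then align with cognitive variables such as tags or use for genre prediction.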
Original language: English
Journal: IEEE International Conference on Acoustics, Speech and Signal Processing. Proceedings
Pages (from-to): 3168–3172
ISSN: 1520-6149
Publication status: Published - 2013
Event: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013) - Vancouver, Canada
Duration: 26 May 2013 – 31 May 2013
http://www.icassp2013.com/

Conference

Conference: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013)
Country: Canada
City: Vancouver
Period: 26/05/2013 – 31/05/2013

Bibliographical note

This work was supported in part by the Danish Council for Strategic Research of the Danish Agency for Science, Technology and Innovation under the CoSound project, case number 11-115328.

Keywords

  • Signal Processing and Analysis
