On scaling contrastive representations for low-resource speech recognition

Lasse Borgholt, Tycho M.S. Tax, Jakob D. Havtorn, Lars Maaløe, Christian Igel

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review


Abstract

Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
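The abstract reports that decorrelating the feature dimensions of the fixed representations stabilizes recognizer training. A minimal sketch of one standard way to decorrelate features is PCA whitening; the paper's exact procedure may differ, and the array shapes and epsilon below are illustrative assumptions.

```python
import numpy as np

def decorrelate(features, eps=1e-5):
    """PCA-whiten a (time, dim) matrix of speech representations.

    Generic sketch of feature decorrelation, not the paper's exact
    method: project onto the eigenvectors of the feature covariance
    and rescale each direction to unit variance.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Covariance across the representation dimensions.
    cov = centered.T @ centered / (len(centered) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eps guards against near-zero eigenvalues, which would dominate
    # the rescaling if the features live in a low-dimensional subspace
    # (as the abstract observes for wav2vec 2.0).
    return centered @ eigvecs / np.sqrt(eigvals + eps)

# Toy usage: 100 frames of 8-dimensional, artificially correlated features.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))
z = decorrelate(x)
# After whitening, the empirical covariance is close to the identity.
cov_z = z.T @ z / (len(z) - 1)
```

After this transform the feature dimensions are uncorrelated with roughly unit variance, which conditions the input to the downstream recognizer more evenly than the raw representations.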
Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
Volume: 2021-
Publisher: IEEE
Publication date: 2021
Pages: 3885-3889
DOIs
Publication status: Published - 2021
Event: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing - Virtual event, Toronto, Canada
Duration: 6 Jun 2021 - 11 Jun 2021
Conference number: 46
https://www.2021.ieeeicassp.org/2021.ieeeicassp.org/index.html

Conference

Conference: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
Number: 46
Location: Virtual event
Country/Territory: Canada
City: Toronto
Period: 06/06/2021 - 11/06/2021
Internet address
Series: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN: 1520-6149

Keywords

  • Automatic speech recognition
  • Representation learning
  • Self-supervised learning
  • Semi-supervised learning
  • Unsupervised learning
