Abstract
Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive, since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find that performance decreases without fine-tuning and that, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
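The abstract's decorrelation finding can be made concrete. The paper's own code is not reproduced here, so the following is a minimal NumPy sketch, assuming PCA whitening as the decorrelation method and synthetic low-rank features standing in for fixed wav2vec 2.0 representations; the function name `pca_whiten` and all shapes and parameters are illustrative, not the authors' implementation.

```python
import numpy as np


def pca_whiten(features, eps=1e-5):
    """Decorrelate a (num_frames, dim) feature matrix via PCA whitening.

    After whitening, the feature covariance is approximately the identity,
    i.e. individual dimensions are decorrelated with unit variance.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Covariance of the fixed representations.
    cov = centered.T @ centered / (len(centered) - 1)
    # Eigendecomposition; eigh is appropriate since cov is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Project onto the eigenbasis and rescale each direction to unit variance.
    return centered @ eigvecs / np.sqrt(eigvals + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for fixed wav2vec 2.0 features: 1000 frames of
    # 768-dim vectors drawn from a low-rank factor model, so the features
    # occupy a low-dimensional subspace as the abstract describes.
    latent = rng.normal(size=(1000, 32))
    mixing = rng.normal(size=(32, 768))
    feats = latent @ mixing + 0.01 * rng.normal(size=(1000, 768))

    # Variance captured by the top components: close to 1 with only a few
    # components when the representations live in a low-dim subspace.
    eigvals = np.linalg.eigvalsh(np.cov(feats.T))[::-1]
    print("variance in top 32 components:", eigvals[:32].sum() / eigvals.sum())

    whitened = pca_whiten(feats)
    # Off-diagonal covariance entries should now be near zero.
    cov_w = np.cov(whitened.T)
    print("max |off-diagonal| after whitening:",
          np.max(np.abs(cov_w - np.diag(np.diag(cov_w)))))
```

In the low-resource setting the abstract describes, one would presumably estimate such a transform on the extracted features once and apply it before the recognizer's input layer, rather than recomputing it per batch.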
| Original language | English |
|---|---|
| Title of host publication | ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing |
| Volume | 2021- |
| Publisher | IEEE |
| Publication date | 2021 |
| Pages | 3885-3889 |
| DOIs | |
| Publication status | Published - 2021 |
| Event | 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Virtual event, Toronto, Canada. Duration: 6 Jun 2021 → 11 Jun 2021. Conference number: 46. https://www.2021.ieeeicassp.org/index.html |
Conference

| Conference | 2021 IEEE International Conference on Acoustics, Speech and Signal Processing |
|---|---|
| Number | 46 |
| Location | Virtual event |
| Country/Territory | Canada |
| City | Toronto |
| Period | 06/06/2021 → 11/06/2021 |
| Internet address | https://www.2021.ieeeicassp.org/index.html |
| Series | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
|---|---|
| ISSN | 1520-6149 |
Keywords
- Automatic speech recognition
- Representation learning
- Self-supervised learning
- Semi-supervised learning
- Unsupervised learning