How Redundant Is the Transformer Stack in Speech Representation Models?

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models have revealed high redundancy between layers and the potential for significant pruning, which we investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant layer redundancy. We demonstrate that transformer-based speech representation models can be pruned effectively without post-training, achieving up to a 40% reduction in transformer layers while maintaining over 95% of the model's predictive capacity. Furthermore, we employ a knowledge distillation method to substitute the entire transformer stack with mimicking layers, reducing the network size by 95-98% and the inference time by up to 94%. This substantial decrease in computational load comes without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models.
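As context for the layer-similarity analysis described above, the sketch below shows linear centered kernel alignment (CKA), one of the three metrics named in the abstract, implemented in NumPy. This is an illustrative sketch only, not the authors' code; the function name linear_cka and the hidden_states variable in the usage comment are hypothetical.

    import numpy as np

    def linear_cka(X, Y):
        # X, Y: arrays of shape (n_frames, d1) and (n_frames, d2), e.g.
        # hidden states from two transformer layers for the same speech frames.
        X = X - X.mean(axis=0, keepdims=True)  # center each representation
        Y = Y - Y.mean(axis=0, keepdims=True)
        # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
        cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
        self_x = np.linalg.norm(X.T @ X, ord="fro")
        self_y = np.linalg.norm(Y.T @ Y, ord="fro")
        return cross / (self_x * self_y)

    # Hypothetical usage: hidden_states[l] holds layer l's frame-level outputs.
    # Evaluating linear_cka over all layer pairs yields the layer-by-layer
    # similarity matrix whose block-like structure is discussed in the abstract.

Cosine similarity and mutual nearest-neighbor alignment would be computed on the same per-layer representations and are omitted here for brevity.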
Original language: English
Title of host publication: Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Number of pages: 5
Publisher: IEEE
Publication date: 2025
ISBN (Print): 979-8-3503-6875-8
ISBN (Electronic): 979-8-3503-6874-1
DOIs
Publication status: Published - 2025
Event: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing - Hyderabad, India
Duration: 6 Apr 2025 – 11 Apr 2025

Conference

Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing
Country/Territory: India
City: Hyderabad
Period: 06/04/2025 – 11/04/2025

Keywords

  • Layer similarity
  • Pruning
  • Redundancy
  • Speech representation learning
  • Transformers
