Abstract
Speech enhancement aims at improving the intelligibility and quality of speech signals corrupted by noise and reverberation, and has applications in many areas, such as hearing aids and communication systems. Recent approaches have employed deep neural networks (DNNs) due to their superior performance over traditional approaches. These systems are usually trained with a large number of noisy and reverberant speech signals. However, the performance of DNN-based systems substantially degrades in acoustic conditions that were not included in the training stage.
This thesis contributes to the understanding of the generalization capabilities of DNN-based speech enhancement systems. A novel generalization assessment framework is proposed, in which DNN-based systems are trained and evaluated in a wide range of conditions using multiple speech, noise and binaural room impulse response (BRIR) databases in a cross-validation fashion. To control for the change in difficulty of the speech enhancement task across databases, a reference system is trained on each test condition and used as a proxy for the task difficulty in that condition. A speech mismatch between training and testing is found to be the main cause of performance degradation, while strong noise and room generalization is observed when the systems are trained with multiple noise and BRIR databases. A simple feedforward neural network is found to outperform state-of-the-art systems in mismatched conditions, unless the training stage includes multiple speech, noise and BRIR databases, which is not common practice in the literature. As the proposed method requires training each system multiple times, different batching strategies are investigated to reduce training time and graphics processing unit memory usage.
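The cross-database evaluation described above can be illustrated with a minimal sketch. It is not the thesis implementation: the leave-one-database-out scheme, the database names, the function names and the relative-gap formula are all assumptions chosen for illustration. The abstract only states that training and testing use multiple databases "in a cross-validation fashion" and that a reference system trained on each test condition serves as a proxy for task difficulty.

```python
from itertools import product


def leave_one_out_folds(databases):
    """Yield (train_list, held_out) pairs, holding out one database at a time.

    Assumption: a leave-one-out split per dimension; the abstract does not
    specify the exact fold layout.
    """
    for held_out in databases:
        train = [db for db in databases if db != held_out]
        yield train, held_out


def cross_database_folds(speech, noise, brir):
    """Combine the per-dimension folds into full train/test conditions."""
    for (s_tr, s_te), (n_tr, n_te), (r_tr, r_te) in product(
        leave_one_out_folds(speech),
        leave_one_out_folds(noise),
        leave_one_out_folds(brir),
    ):
        yield {
            "train": {"speech": s_tr, "noise": n_tr, "brir": r_tr},
            "test": {"speech": s_te, "noise": n_te, "brir": r_te},
        }


def generalization_gap(score_mismatched, score_reference):
    """Relative shortfall of a mismatched system versus the reference system.

    Hypothetical metric: both scores are assumed to be positive improvements
    in some objective measure, with the reference system trained on the test
    condition itself.
    """
    return (score_reference - score_mismatched) / score_reference


# Placeholder database names (hypothetical, not those used in the thesis).
speech_dbs = ["speech_A", "speech_B", "speech_C"]
noise_dbs = ["noise_A", "noise_B", "noise_C"]
brir_dbs = ["brir_A", "brir_B", "brir_C"]

folds = list(cross_database_folds(speech_dbs, noise_dbs, brir_dbs))
```

Normalizing by the reference score is one way to keep a hard test database from being mistaken for poor generalization: a low absolute score only counts against a system if the reference system trained on that same condition does clearly better.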
In the past two years, diffusion models have received increasing attention for the task of speech enhancement. These approaches have shown promising results in mismatched conditions, but are computationally expensive and still poorly understood. Therefore, a systematic investigation of the design space of diffusion models for speech enhancement is conducted. This investigation results in the development of a system that outperforms a popular diffusion-based baseline in terms of perceptual metrics at a lower computational cost. However, results also show that diffusion-based speech enhancement systems do not benefit from large-scale training as much as previous DNN-based systems.
Original language | English |
---|---|
Publisher | DTU Health Technology |
Number of pages | 169 |
Publication status | Published - 2024 |
Series | Contributions to Hearing Research |
Volume | 60 |
Projects
- 1 Finished
- Robust Speech Enhancement in Noisy and Reverberant Environments using Deep Neural Networks
  Gonzalez, P. (PhD Student), May, T. (Main Supervisor), Alstrøm, T. S. (Supervisor), Barker, J. (Examiner) & Serizel, R. (Examiner)
  01/12/2020 → 10/06/2024
  Project: PhD