Abstract
During speech production, the movement of the speech articulators creates visual signals that are temporally aligned with the acoustic speech signal. These visual speaker cues have been found to facilitate speech perception in humans, especially in noisy auditory environments such as "cocktail-party" scenarios. Besides facilitating human speech perception, it is also well established that machines can learn to utilize visual speaker cues to inform auditory representations of speech. Visual cues from target speakers have thus been shown to improve the performance of both automatic speech recognition and speech separation systems compared to audio-only systems. However, while many studies have investigated the temporal correspondences between auditory and visual signals, there is still a lack of knowledge about the nature of these audiovisual (AV) cues and how the two modalities are related.
This thesis aimed to contribute to a better understanding of the relationship between auditory and visual cues created during speech production. By utilizing recent advances in computer vision and data-driven approaches, natural AV speech was investigated across thousands of speakers. First, using a linear canonical correlation analysis (CCA), two primary temporal ranges of envelope fluctuations related to facial motion across speakers were identified. Amplitude envelope modulations distributed around 3-4 Hz were related to mouth openings, whereas 1-2 Hz modulations were related to more global face and head motion. Next, nonlinear neural networks were trained through a self-supervised learning scheme to learn correlated AV embeddings from natural AV speech videos. Highly correlated AV features, primarily located around the mouth and jaw, were identified. Based on these insights, it was examined whether the different AV features could assist a speech separation model in extracting the acoustic speech stream of a target talker from multi-talker audio mixtures. More strongly correlated AV feature embeddings translated to better speech separation performance. Notably, the speech separation models achieved performance comparable to that of more computationally complex systems while showing promise for real-time implementation.
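The thesis itself does not publish code here, but a minimal sketch of the kind of CCA analysis described above could look as follows. All names (`audio`, `sr`, `mouth_opening`, `fps`) and the lag and modulation-band choices are illustrative assumptions rather than the thesis implementation; the sketch simply computes the first canonical correlation between a band-passed amplitude envelope (e.g. the 3-4 Hz range) and a per-frame mouth-opening trajectory using scikit-learn's CCA.

```python
# Minimal sketch (assumed inputs, not the thesis pipeline): relate an audio
# amplitude-envelope modulation band to a mouth-opening trajectory via CCA.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample
from sklearn.cross_decomposition import CCA

def amplitude_envelope(audio, n_frames):
    """Broadband amplitude envelope, resampled to the video frame rate."""
    env = np.abs(hilbert(audio))
    return resample(env, n_frames)

def bandpass(x, fs, lo, hi, order=4):
    """Zero-phase band-pass filter isolating one modulation band (e.g. 3-4 Hz)."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def first_canonical_correlation(audio, sr, mouth_opening, fps, band=(3.0, 4.0), lags=5):
    """First canonical correlation between an envelope modulation band and the
    mouth-opening signal; a few temporal lags give CCA some temporal context."""
    n = len(mouth_opening)
    env = bandpass(amplitude_envelope(audio, n), fps, *band)
    # Stack lagged copies of each signal as feature columns, drop wrapped rows.
    A = np.stack([np.roll(env, k) for k in range(lags)], axis=1)[lags:]
    V = np.stack([np.roll(mouth_opening, k) for k in range(lags)], axis=1)[lags:]
    cca = CCA(n_components=1).fit(A, V)
    a_c, v_c = cca.transform(A, V)
    return np.corrcoef(a_c[:, 0], v_c[:, 0])[0, 1]
```

Repeating such an analysis per speaker and per modulation band (e.g. 1-2 Hz versus 3-4 Hz) is one simple way to compare how strongly different facial-motion signals track different envelope fluctuation ranges.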
Overall, this thesis provided new insights into how auditory and visual speech cues are related and showed their usefulness in audiovisual speech separation.
Original language | English
Publisher | DTU Health Technology
Number of pages | 119
Publication status | Published - 2021
Series | Contributions to Hearing Research
Volume | 50
Projects
Audiovisual speech analysis with deep learning
Pedersen, N. (PhD Student), Tan, Z.-H. (Examiner), Yehia, H. C. (Examiner), May, T. (Examiner), Hjortkjær, J. (Main Supervisor), Dau, T. (Supervisor) & Hansen, L. K. (Supervisor)
01/06/2018 → 08/04/2022
Project: PhD