Audiovisual speech analysis with deep learning

Nicolai Fernández Pedersen

Research output: Book/Report › Ph.D. thesis


Abstract

During speech production, the movement of the speech articulators creates visual signals that are temporally aligned with the acoustic speech signal. These visual speaker cues have been found to facilitate speech perception in humans, especially in noisy auditory environments such as "cocktail-party" scenarios. Besides facilitating human speech perception, it is also well established that machines can learn to utilize visual speaker cues to inform auditory representations of speech. Visual speaker cues from target speakers have thus been shown to improve the performance of both automatic speech recognition systems and speech separation systems compared with audio-only systems. However, while many studies have investigated the temporal correspondences between auditory and visual signals, there is still a lack of knowledge about the nature of these audiovisual (AV) cues and how the two modalities are related.
This thesis aimed to contribute to a better understanding of the relationship between the auditory and visual cues created during speech production. By utilizing recent advances in computer vision and data-driven approaches, natural AV speech was investigated across thousands of speakers. First, using a linear canonical correlation analysis (CCA), two primary temporal ranges of envelope fluctuations related to facial motion across speakers were identified. Amplitude envelope modulations distributed around 3-4 Hz were related to mouth openings, whereas 1-2 Hz modulations were related to more global face and head motion. Next, nonlinear neural networks were trained through a self-supervised learning scheme to learn correlated AV embeddings from natural AV speech videos. Highly correlated AV features, primarily located around the mouth and jaw, were identified. Based on these insights, it was examined whether the different AV features could assist a speech separation model in extracting the acoustic speech stream of a target talker from multi-talker audio mixtures. More strongly correlated AV feature embeddings translated into better speech separation performance. Notably, the speech separation models achieved performance comparable to that of more computationally complex systems while showing promise for real-time implementation.
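As a rough illustration of the linear analysis described above, the following Python sketch relates band-limited amplitude envelope modulations of an audio signal to facial landmark motion with a CCA. This is a minimal sketch, not the thesis code: the input arrays, sampling rates, and the 3-4 Hz modulation band chosen here are assumptions, and the landmark-displacement motion feature is a simplified stand-in for the facial motion features analysed in the thesis.

# Minimal sketch (not the thesis code): linear CCA between a band-limited
# audio amplitude envelope and facial landmark motion.
# Assumed (hypothetical) inputs: `audio`, a mono waveform sampled at `sr` Hz,
# and `landmarks`, tracked 2-D facial landmarks of shape (n_frames, n_points, 2)
# sampled at `fps` video frames per second.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample
from sklearn.cross_decomposition import CCA

def envelope_modulations(audio, sr, fps, band=(3.0, 4.0)):
    """Band-pass filtered amplitude envelope, resampled to the video frame rate."""
    env = np.abs(hilbert(audio))                                  # amplitude envelope
    b, a = butter(2, [band[0] / (sr / 2), band[1] / (sr / 2)], btype="band")
    env_mod = filtfilt(b, a, env)                                 # keep fluctuations in the chosen band
    n_video_frames = int(len(audio) / sr * fps)
    return resample(env_mod, n_video_frames)                      # align samples to video frames

def facial_motion(landmarks):
    """Frame-to-frame landmark displacement as a simple facial motion feature."""
    disp = np.diff(landmarks, axis=0)                             # (n_frames - 1, n_points, 2)
    return disp.reshape(disp.shape[0], -1)                        # one flattened motion vector per frame

def av_canonical_correlation(audio, sr, landmarks, fps):
    """Fit a linear CCA and return the first canonical correlation."""
    X = envelope_modulations(audio, sr, fps)[1:].reshape(-1, 1)   # drop first frame to match np.diff
    Y = facial_motion(landmarks)
    n = min(len(X), len(Y))
    cca = CCA(n_components=1)
    A, V = cca.fit_transform(X[:n], Y[:n])
    return np.corrcoef(A[:, 0], V[:, 0])[0, 1]

In this simplified form, a higher canonical correlation indicates that fluctuations of the audio envelope in the chosen modulation band co-vary more strongly with facial motion; sweeping the band (e.g. 1-2 Hz versus 3-4 Hz) is one way to probe which temporal ranges of the envelope track which kinds of facial movement.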
Overall, this thesis provided new insights into how auditory and visual speech cues are related and showed their usefulness in audiovisual speech separation.
Original language: English
Publisher: DTU Health Technology
Number of pages: 119
Publication status: Published - 2021
Series: Contributions to Hearing Research
Volume: 50
