Towards a feature-based theory of audiovisual integration of speech

Juan Camilo Gil Carvajal

Research output: Book/Report, Ph.D. thesis


Speech perception is facilitated by seeing the mouth movements of the talker, which is particularly useful in noisy listening environments. The mouth gestures of the talker can also modify the auditory phonetic percept. This is evidenced by the McGurk effect, a well-known phenomenon that demonstrates the audiovisual nature of speech perception. The McGurk effect occurs when a speech sound is presented simultaneously with incongruent articulatory mouth movements corresponding to another speech token. Sometimes the McGurk effect results in a fusion of the two consonants presented. For example, dubbing auditory /aba/ onto visual /aga/ usually produces the audiovisual perception of a third consonant, /ada/. In contrast, when auditory /aga/ is dubbed onto visual /aba/, the McGurk effect leads to a combination illusion of hearing both consonants presented, typically /abga/ or /agba/. Despite decades of research on audiovisual speech, it has remained unclear why some audiovisual stimuli elicit McGurk fusions while others produce McGurk combinations, and which audiovisual phonetic features affect the perceived consonant order in the latter case. This PhD project investigated audiovisual integration of speech, with a particular focus on providing behavioral evidence for a feature-based model of audiovisual integration of speech. The role of the timing of phonetic features in audiovisual integration, as well as the perceived consonant order in the McGurk combination illusion, were the key aspects addressed in this thesis. In the first study, the integration of audiovisual speech features was tested using stimuli that consisted of consonant clusters and single consonants, which produced novel illusory percepts.
For example, auditory /abga/ dubbed onto visual /aga/ was mostly perceived as /adga/, which indicated a partial fusion illusion between the initial auditory consonant /b/ and the initial visual gesture for /g/, while the perception of the subsequent auditory consonant /g/ was unaffected. Thus, the results suggested the existence of sequential audiovisual features that are integrated separately. In the second study, the audiovisual perception of phonetic features was investigated in McGurk combination illusions. The effect of timing on the perceived consonant order was investigated by varying either the audiovisual stimulus onset asynchrony (SOA) or the syllabic context, the latter by articulating the consonant in the syllable onset or offset. While varying SOA mostly affected the strength of audiovisual integration, the syllabic context mostly influenced the perceived consonant order. Notably, the asymmetry of the audiovisual temporal integration window was found to be the opposite for vowel-consonant and consonant-vowel stimuli. These results supported the existence of articulatory constraints on audiovisual integration, which are imposed by the visual speech gestures of the talker. The third study further explored whether the effect of syllabic context on the perceived consonant order could be explained by a feature-based approach. The findings showed that the perceived consonant order in McGurk combinations is driven by the timing between the acoustic release burst and the mouth movements of the talker, which provides further support for a feature-based model of audiovisual integration of speech. Overall, this thesis provided experimental evidence that constitutes a valuable foundation for the development of a feature-based model of audiovisual integration of speech.
Original language: English
Publisher: Technical University of Denmark
Number of pages: 121
Publication status: Published - 2020
Series: Contributions to Hearing Research

