The early maximum likelihood estimation model of audiovisual integration in speech perception

Research output: Contribution to journal › Journal article › Research › peer-review



Speech perception is facilitated by seeing the articulatory mouth movements of the talker. This is due to perceptual audiovisual integration, which also causes the McGurk-MacDonald illusion, and for which a comprehensive computational account is still lacking. Decades of research have largely focused on the fuzzy logical model of perception (FLMP), which provides excellent fits to experimental observations but has also been criticized for being too flexible, post hoc, and difficult to interpret. The current study introduces the early maximum likelihood estimation (MLE) model of audiovisual integration in speech perception along with three model variations. In early MLE, integration is based on a continuous internal representation before categorization, which can make the model more parsimonious by imposing constraints that reflect experimental designs. The study also shows that cross-validation can evaluate models of audiovisual integration on typical data sets while taking both goodness-of-fit and model flexibility into account. All models were tested on a published data set previously used for testing the FLMP. Cross-validation favored the early MLE model, while more conventional error measures favored more complex models. This difference between conventional error measures and cross-validation was found to be indicative of overfitting in more complex models such as the FLMP.
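The MLE cue-combination rule that the model builds on weights each sensory cue by its reliability (the inverse of its variance) on a continuous internal representation before categorization. A minimal sketch of that standard rule follows; the function name and the example values are illustrative and not taken from the paper:

```python
def mle_fuse(mu_a, var_a, mu_v, var_v):
    """Fuse auditory and visual estimates by reliability weighting.

    Each cue's weight is proportional to its inverse variance, so the
    fused estimate leans toward the more reliable cue, and the fused
    variance is smaller than either unimodal variance.
    """
    w_a = (1 / var_a) / (1 / var_a + 1 / var_v)  # auditory weight
    w_v = 1 - w_a                                # visual weight
    mu_av = w_a * mu_a + w_v * mu_v              # fused estimate
    var_av = 1 / (1 / var_a + 1 / var_v)         # fused variance
    return mu_av, var_av

# Example: a low-variance (reliable) auditory cue dominates the fusion.
mu_av, var_av = mle_fuse(mu_a=0.2, var_a=0.5, mu_v=1.0, var_v=2.0)
```

With these illustrative values the auditory weight is 0.8, so the fused estimate (0.36) sits close to the auditory cue, and the fused variance (0.4) is below both unimodal variances. In the early MLE model this fusion happens before categorization, which is what distinguishes it from late-integration accounts such as the FLMP.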
Original language: English
Journal: Acoustical Society of America. Journal
Issue number: 5
Pages (from-to): 2884-2891
Publication status: Published - 2015

Bibliographical note

Copyright 2015 Acoustical Society of America. This article may be downloaded for personal use only. Any other use requires prior permission of the author and the Acoustical Society of America.

