Computational modeling of speech intelligibility in adverse conditions

Alexandre Chabot-Leclerc

Research output: Book/Report; Ph.D. thesis; Research


Abstract

The intelligibility of speech is a measure of how well speech is understood in a given situation. Developing models to predict intelligibility can lead to a better understanding of the essential “features” of speech, how those features are extracted by the auditory system, and how they are combined and used to create understanding. This dissertation expands on a model named the speech-based envelope power spectrum model (sEPSM), which uses the signal-to-noise ratio in the envelope power domain (SNRenv) as the decision metric. The sEPSM was analyzed and compared to several other models that use either different front-ends or different decision metrics, such as the audio SNR. The goal was to tease apart the essential components of intelligibility models in a range of conditions known to be challenging. One condition considered speech that was distorted by a phase jitter process, which destroys its spectral integrity. It was shown that the sEPSM could account for the deleterious effects of phase jitter if the analysis stage included an across-channel process that measures the variability of the envelope power across audio frequencies. In another condition of nonlinear distortion, noise reduction via spectral subtraction, it was shown that across-channel processing was not essential. Furthermore, a quantitative model was developed in an attempt to predict the speech intelligibility measured in conditions where listeners are known to benefit from using both ears, compared to using either ear alone, such as in a noisy “cocktail party”. The model represents a binaural extension of the sEPSM, denoted as B-sEPSM. It consists of realizations of the sEPSM for the monaural pathways, combined with an equalization–cancellation (EC) process to model an across-ear noise reduction mechanism. The sEPSM process also operates at the output of the EC process, such that all pathways are directly comparable.
The B-sEPSM was shown to account for intelligibility as a function of the number of maskers, the azimuth of the maskers, the room properties (anechoic or reverberant), the masker types (stationary noise, fluctuating noise, and time-reversed speech), and the interaural time differences of the target and maskers. Finally, simulation results showed that binaural processing was not always necessary in spatial conditions, and that the SNRenv metric could capture aspects of masking that were not considered by models that used the audio SNR as the decision metric. However, none of the models considered could account for the intelligibility in conditions with so-called “informational masking”, because they did not take into account confusions in the decision-making process experienced by the listeners. A possible method for estimating such confusions was proposed, based on a “distance metric” between the envelope power spectrum representation of the speech estimate and of the noise. Overall, the results of this thesis support the hypothesis that the SNRenv is a powerful metric for intelligibility prediction. Furthermore, the B-sEPSM could be used to investigate the impact on intelligibility of different binaural noise reduction techniques, such as beamforming, and of various binaural hearing-aid compression strategies.
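
The SNRenv decision metric named in the abstract can be illustrated with a rough, single-channel sketch. This is not the thesis implementation: the full sEPSM applies an auditory filterbank and a modulation filterbank before computing envelope power and then combines the results across channels, all of which is omitted here. The function names, the broadband Hilbert envelope, and the `floor` value are assumptions made for illustration only. The sketch assumes the commonly cited form of the metric, in which the speech envelope power is estimated as the envelope power of the noisy speech minus that of the noise alone.

```python
import numpy as np


def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with the FFT."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))


def envelope_power(x):
    """AC power of the envelope, normalized by its squared mean (DC)."""
    env = hilbert_envelope(x)
    return np.var(env) / np.mean(env) ** 2


def snr_env(mixture, noise, floor=1e-3):
    """Illustrative SNRenv: (P_env,S+N - P_env,N) / P_env,N.

    `floor` is a hypothetical lower limit that keeps the ratio
    finite and non-negative; its value is an assumption.
    """
    p_mix = envelope_power(mixture)
    p_noise = max(envelope_power(noise), floor)
    p_speech = max(p_mix - p_noise, floor)
    return p_speech / p_noise
```

A larger value means the noisy speech carries more envelope fluctuation than the noise alone, which is the quantity the abstract contrasts with the conventional audio SNR.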
Original language: English
Publisher: DTU Elektro
Number of pages: 141
Publication status: Published - 2016
