Computational speech segregation inspired by principles of auditory processing

Thomas Bentsen

Research output: Book/Report, Ph.D. thesis, Research



Understanding speech in noise under adverse listening conditions can be challenging for many people, in particular hearing-aid users and cochlear-implant recipients. To improve speech understanding, better noise reduction strategies are needed in such devices. The performance of these strategies depends on how well the characteristics of the speech and the noise are known. Therefore, it is necessary to have automatic approaches that can separate the speech from the noise as accurately as possible, which is the overall goal of computational speech segregation. Often, an ideal time-frequency mask is estimated in these approaches. In the mask, the level of speech activity is indicated in each time-frequency unit. The mask is estimated by extracting auditory-inspired features from the noisy speech and subsequently learning the characteristics of the speech and noise with machine-learning techniques. This thesis investigated three approaches within computational speech segregation based on ideal time-frequency mask estimation. The approaches were evaluated in the framework of noise reduction to improve speech understanding of normal-hearing listeners and cochlear-implant recipients in noisy environments. In the first approach, machine-learning techniques were employed in separate auditory frequency bands to classify each mask unit as either speech-dominated or noise-dominated. Words are composed of phonemes that may occupy several neighboring units in the estimated mask. The focus was on how to use this contextual information in speech across time and frequency in computational speech segregation. Exploiting the context across frequency was found to be important. By increasing the amount of considered spectral information, higher measured speech intelligibility was obtained in normal-hearing listeners. On the other hand, exploiting the context across time in computational speech segregation may not be a critical factor for increasing speech intelligibility. Recent approaches within computational speech segregation are based on deep neural networks, and speech intelligibility improvements have successfully been demonstrated in adverse conditions. In a second approach, a deep neural network was therefore employed, and the roles and relative contributions of a selection of components that may be responsible for this success were analyzed. Two components, namely the network architecture and the estimation of an ideal time-frequency mask based on continuous gain values, were found to play a significant role. In a third approach, an application of the estimated time-frequency mask was considered in real-time cochlear-implant processing. A proposed speech coding strategy selects a cochlear-implant channel for electrical stimulation only if the signal-to-noise ratio within that channel is greater than or equal to a local criterion. However, this strategy relies on ideal signal-to-noise ratios, and a noise power estimation stage is therefore required to estimate the signal-to-noise ratios in real-time cochlear-implant processing. The results implied that a noise power estimation with improved noise-tracking capabilities does not necessarily translate to increased speech intelligibility. However, the adaptive channel selection is important for reducing the noise-induced stimulation in cochlear-implant recipients.
Overall, the results of this thesis have implications for the design of computational speech segregation approaches with noise-reduction applications. Furthermore, the results may guide the development of a single cost function, which correlates with speech intelligibility, to assess and optimize the system performance.
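As an illustration of the mask concepts mentioned in the abstract, the following minimal Python sketch computes an ideal binary mask (each unit labelled speech-dominated or noise-dominated by comparing the local signal-to-noise ratio with a local criterion) and a mask of continuous gain values. It is not the implementation used in the thesis: the function names, the 0 dB default criterion, and the ratio-mask formula S / (S + N) are assumptions chosen for illustration, and the inputs are assumed to be per-unit speech and noise powers that are known separately, which is only the case in the ideal setting.

```python
import math

EPS = 1e-12  # small constant to avoid division by zero and log of zero


def local_snr_db(speech_power, noise_power):
    """Local signal-to-noise ratio (dB) of one time-frequency unit."""
    return 10.0 * math.log10((speech_power + EPS) / (noise_power + EPS))


def ideal_binary_mask(speech, noise, lc_db=0.0):
    """Label each unit 1 (speech-dominated) or 0 (noise-dominated).

    speech, noise: nested lists of per-unit powers with identical shape
    (rows are assumed to be frequency channels, columns time frames).
    lc_db: local criterion in dB (the 0 dB default is an assumption).
    """
    return [
        [1 if local_snr_db(s, n) >= lc_db else 0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech, noise)
    ]


def ideal_ratio_mask(speech, noise):
    """Continuous gain in [0, 1] per unit, here defined as S / (S + N).

    This is one common definition; other variants apply an exponent.
    """
    return [
        [s / (s + n + EPS) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech, noise)
    ]


# Hypothetical toy powers for two channels and two frames.
speech = [[4.0, 1.0], [1.0, 9.0]]
noise = [[1.0, 4.0], [2.0, 1.0]]

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
```

The same comparison against a local criterion, applied per channel within one frame, corresponds to the channel-selection rule described for the cochlear-implant coding strategy; in a real-time system the ideal powers would have to be replaced by estimates.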
Original language: English
Publisher: Technical University of Denmark
Number of pages: 133
Publication status: Published - 2018
