Whereas the human auditory system has a remarkable ability to focus on a particular target source in complex multi-source scenarios, developing algorithms that retrieve information about the sound sources in a complex acoustic scene (e.g. localizing and identifying active speech sources) remains a challenging task. This paper presents a robust binaural scene recognizer that simultaneously localizes and classifies a predefined number of target speech sources in the presence of reverberation and interfering noise. The model consists of three stages: localization, speech detection, and speaker identification. First, a binaural front-end localizes relevant sound source activity. The localization is based on supervised learning of azimuth-dependent binaural features, namely interaural time differences (ITDs) and interaural level differences (ILDs). Based on this localization information, a binary mask is estimated that identifies the activity of each sound source on a time-frequency (T-F) basis. Second, a speech detection module determines, for every sound source that has been found, whether the source type is speech or noise. For this purpose, the estimated binary mask and the corresponding spectral features of each sound source candidate are passed to a missing data classifier. Finally, the speaker identity of every detected speech source is recognized. The proposed system is analyzed in simulated adverse conditions that include interfering noise, reverberation, and multiple simultaneous target sources. Compared to a state-of-the-art MFCC recognizer, the proposed model achieves significantly higher speaker recognition accuracy.
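The first stage can be illustrated with a minimal sketch, assuming a much simpler setup than the one described in the abstract. It estimates an ITD and an ILD for one frame of one frequency channel, and builds per-source binary T-F masks from a map of azimuth estimates. The function names, the cross-correlation ITD estimator, the nearest-azimuth mask rule, and the tolerance parameter are all illustrative assumptions; the supervised mapping from binaural features to azimuth used in the paper is not reproduced here.

```python
import numpy as np


def estimate_itd_ild(left, right, fs, max_itd=1e-3):
    """Estimate ITD (seconds) and ILD (dB) for one frame of one channel.

    Hypothetical helper: the ITD is the lag that maximizes the
    cross-correlation c(l) = sum_n left[n] * right[n + l], so a positive
    ITD means the right-ear signal lags the left-ear signal.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    n = len(left)
    max_lag = int(round(max_itd * fs))
    lags = np.arange(-max_lag, max_lag + 1)

    # Cross-correlation restricted to physically plausible lags.
    xcorr = np.array([
        np.sum(left[max(0, -l):n - max(0, l)] * right[max(0, l):n - max(0, -l)])
        for l in lags
    ])
    itd = lags[np.argmax(xcorr)] / fs

    # ILD as the right/left energy ratio in dB (eps avoids log of zero).
    eps = 1e-12
    ild = 10.0 * np.log10((np.sum(right ** 2) + eps) / (np.sum(left ** 2) + eps))
    return itd, ild


def binary_masks(azimuth_map, source_azimuths, tol_deg=10.0):
    """Build one binary T-F mask per source from per-unit azimuth estimates.

    Assumed rule: a T-F unit belongs to a source if that source is the
    nearest one in azimuth and lies within tol_deg degrees of the estimate.
    azimuth_map has shape (channels, frames); the result has shape
    (sources, channels, frames).
    """
    azimuth_map = np.asarray(azimuth_map, dtype=float)
    sources = np.asarray(source_azimuths, dtype=float)
    dist = np.abs(azimuth_map[None, :, :] - sources[:, None, None])
    nearest = np.argmin(dist, axis=0)
    is_nearest = nearest[None, :, :] == np.arange(len(sources))[:, None, None]
    return is_nearest & (dist <= tol_deg)
```

In a full system along the lines of the abstract, each source's mask would then mark which spectral features are treated as reliable by the missing data classifier; that step is left out of this sketch.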
|Title of host publication||Proceedings of Forum Acusticum|
|Publication status||Published - 2011|
|Event||Forum Acusticum 2011 - Aalborg, Denmark; Duration: 26 Jun 2011 → 1 Jul 2011|
|Conference||Forum Acusticum 2011|
|Period||26/06/2011 → 01/07/2011|
|Series||Proceedings of Forum Acusticum|