Abstract
A monaural speech segregation system is presented that estimates the ideal binary mask from noisy
speech based on the supervised learning of amplitude modulation spectrogram (AMS) features.
Instead of using linearly scaled modulation filters with constant absolute bandwidth, an auditory-
inspired modulation filterbank with logarithmically scaled filters is employed. To reduce the
dependency of the AMS features on the overall background noise level, a feature normalization
stage is applied. In addition, a spectro-temporal integration stage is incorporated in order to exploit
the context information about speech activity present in neighboring time-frequency units. In order
to evaluate the generalization performance of the system to unseen acoustic conditions, the speech
segregation system is trained with a limited set of low signal-to-noise ratio (SNR) conditions, but
tested over a wide range of SNRs up to 20 dB. A systematic evaluation of the system demonstrates
that auditory-inspired modulation processing can substantially improve the mask estimation accuracy in the presence of stationary and fluctuating interferers.
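
For readers unfamiliar with the estimation target, the ideal binary mask can be illustrated with a minimal sketch. It assumes the common definition in which a time-frequency unit is labeled 1 when its local SNR exceeds a local criterion and 0 otherwise. The function name `ideal_binary_mask`, the 0 dB criterion, and the toy power values are illustrative assumptions and are not taken from the paper; the sketch also presumes separate access to the speech and noise powers, which is only available when constructing training targets.

```python
import numpy as np


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Label each time-frequency unit 1 if its local SNR exceeds lc_db, else 0.

    speech_power, noise_power: arrays of per-unit power (same shape).
    lc_db: local criterion in dB (0 dB is an assumed, illustrative value).
    """
    eps = np.finfo(float).eps  # guards against division by zero and log of zero
    local_snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (local_snr_db > lc_db).astype(int)


# Toy 2x2 example (hypothetical values): rows are frequency channels, columns are frames.
speech = np.array([[4.0, 0.1], [1.0, 9.0]])
noise = np.array([[1.0, 1.0], [2.0, 3.0]])
mask = ideal_binary_mask(speech, noise)
```

In the system described above, this mask is not computed directly at test time; it is the target that the supervised classifier estimates from the AMS features of the noisy speech.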
Original language | English |
---|---|
Journal | Journal of the Acoustical Society of America |
Volume | 136 |
Issue number | 6 |
Pages (from-to) | 3350–3359 |
ISSN | 0001-4966 |
Publication status | Published - 2014 |