Abstract
We apply machine learning techniques to the problem of separating
multiple speech sources from a single microphone recording.
The method of choice is a sparse non-negative matrix factorization
algorithm, which can learn sparse representations of the data in an
unsupervised manner. This is applied to the learning of personalized
dictionaries from a speech corpus, which in turn are used
to separate the audio stream into its components. We show that
computational savings can be achieved by segmenting the training
data on a phoneme level. To split the data, a conventional speech
recognizer is used. Both the unsupervised and supervised
adaptation schemes yield significant improvements in
terms of the target-to-masker ratio.
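
The abstract does not give the update rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a Euclidean reconstruction cost with an L1 penalty on the activations, multiplicative updates, and unit-norm dictionary columns. The function names `sparse_nmf` and `separate`, the penalty weight `lam`, and the ratio-mask reconstruction are all illustrative assumptions. The inputs are assumed to be non-negative magnitude spectrograms (frequency by time).

```python
import numpy as np

EPS = 1e-9  # guards against division by zero


def sparse_nmf(V, rank=None, lam=0.1, n_iter=200, W=None, seed=0):
    """Approximate V ~= W @ H with non-negative factors and an L1 penalty on H.

    If a dictionary W is supplied it is kept fixed and only H is estimated.
    Assumed cost: ||V - W H||_F^2 + lam * sum(H), with unit-norm columns in W.
    """
    rng = np.random.default_rng(seed)
    learn_W = W is None
    if learn_W:
        W = rng.random((V.shape[0], rank)) + EPS
        W /= np.linalg.norm(W, axis=0, keepdims=True)
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        # Multiplicative update for H; lam in the denominator promotes sparsity.
        H *= (W.T @ V) / (W.T @ W @ H + lam + EPS)
        if learn_W:
            W *= (V @ H.T) / (W @ H @ H.T + EPS)
            # Renormalize the columns of W and rescale H so W @ H is unchanged.
            norms = np.linalg.norm(W, axis=0, keepdims=True) + EPS
            W /= norms
            H *= norms.T
    return W, H


def separate(V_mix, dictionaries, lam=0.1, n_iter=200):
    """Split a mixture spectrogram using one pre-trained dictionary per speaker."""
    W = np.hstack(dictionaries)  # concatenate the personalized dictionaries
    _, H = sparse_nmf(V_mix, lam=lam, n_iter=n_iter, W=W)
    model = W @ H + EPS
    estimates, start = [], 0
    for D in dictionaries:
        stop = start + D.shape[1]
        # Ratio mask: share of the modelled energy explained by this speaker.
        estimates.append(V_mix * (D @ H[start:stop]) / model)
        start = stop
    return estimates
```

In use, one dictionary would be learned per speaker from that speaker's training spectrogram, for example `W_a, _ = sparse_nmf(V_a, rank=64)`, and the mixture would then be split with `separate(V_mix, [W_a, W_b])`. Training a separate, smaller dictionary per phoneme segment, as the abstract describes, would amount to calling `sparse_nmf` once per segment rather than once on the full corpus.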
Original language | English |
---|---|
Title of host publication | Spoken Language Processing, ISCA International Conference on (INTERSPEECH) |
Publication date | 2007 |
Publication status | Published - 2007 |
Event | Spoken Language Processing, ISCA International Conference on (INTERSPEECH) - Duration: 1 Jan 2007 → … |
Conference
Conference | Spoken Language Processing, ISCA International Conference on (INTERSPEECH) |
---|---|
Period | 01/01/2007 → … |