In many branches of spoken language analysis, including automatic speech recognition (ASR), the set of smallest meaningful units of speech is taken to coincide with the set of phones or phonemes. However, fishing for phones is difficult, error-prone, and computationally expensive. We present a machine-learning experiment with an alternative approach: instead of stipulating a basic set of target units, we treat the determination of that set as part of the learning task. Given 18 recordings of Danish talkers performing a simple lab task, our algorithm produced a set of acoustically well-defined units sufficient for identifying all the major semantic elements relevant to the task, whether parts of words, whole words, or sequences of several words. Because the sound encoding was very simple, consisting only of fundamental frequency (F0), harmonics-to-noise ratio (HNR), and intensity samples, the computational complexity was far lower than for phonemic recognition. Our findings show that a linguistic message can be characterized automatically without detailed spectral information or presumptions about the target units. Further, fishing for simple meaningful cues and selectively enhancing them may be a more effective way of achieving intelligibility transfer, which is the end goal of speech-transducing technologies.
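To make the three-parameter encoding concrete, the following is a minimal sketch of how F0, HNR, and intensity might be estimated for a single frame of audio. It is not the extraction procedure used in the study, which the abstract does not specify. The function name `frame_features`, the 75 to 400 Hz pitch search range, the normalized-autocorrelation estimator, and the dB reference (full scale = 1.0) are all assumptions chosen for illustration.

```python
import math


def frame_features(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Return (f0_hz, hnr_db, intensity_db) for one frame of audio samples.

    Hypothetical helper: pitch range and estimator are illustrative
    assumptions, not taken from the paper.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]              # remove DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:                          # digital silence
        return 0.0, float("-inf"), float("-inf")

    # Intensity: mean power in dB relative to an amplitude of 1.0.
    intensity_db = 10.0 * math.log10(energy / n)

    # Search the lag range corresponding to the allowed pitch range.
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), n - 2)
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        head, tail = x[:n - lag], x[lag:]
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if norm == 0.0:
            continue
        r = sum(a * b for a, b in zip(head, tail)) / norm
        # The small margin prefers the shortest lag among near-equal peaks,
        # which avoids picking a multiple of the true period.
        if r > best_r + 1e-6:
            best_lag, best_r = lag, r

    if best_lag == 0:                          # no positive peak: unvoiced
        return 0.0, float("-inf"), intensity_db

    # HNR from the peak correlation r: periodic share r, noise share 1 - r.
    r = min(best_r, 1.0 - 1e-9)                # cap to keep the log finite
    hnr_db = 10.0 * math.log10(r / (1.0 - r))
    return sample_rate / best_lag, hnr_db, intensity_db


# Example: a 40 ms frame of a 200 Hz tone sampled at 8 kHz.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * i / sr) for i in range(320)]
f0, hnr, level = frame_features(tone, sr)
```

Each frame is reduced to three numbers, compared with the dozens of spectral coefficients per frame typical of phoneme-oriented front ends, which is where the lower computational cost mentioned above would come from.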
|Title of host publication||Proceedings of ISAAR 2009|
|Editors||Jörg Buchholz, Torsten Dau, Jakob Christensen-Dalsgaard, Torben Poulsen|
|Publication status||Published - 2009|
|Event||2nd International Symposium on Auditory and Audiological Research: Binaural Processing and Spatial Hearing - Marienlyst, Helsingør, Denmark|
|Duration||26 Aug 2009 → 28 Aug 2009|