Modelling context in automatic speech recognition

Speech is at the core of human communication. Speaking and listening come so naturally to us that we do not have to think about them at all. The underlying cognitive processes are rapid and almost completely subconscious; it is hard, if not impossible, not to understand speech. For computers, on the other hand, recognising speech is a daunting task. A recogniser has to deal with a large number of different voices (influenced, among other things, by emotion, mood and fatigue), the acoustic properties of different environments, dialects, a huge vocabulary and the unlimited creativity of speakers to combine words and to break the rules of grammar. Almost all existing automatic speech recognisers deal with this complexity using statistics over speech sounds (what is the probability that a piece of audio is an a-sound?) and statistics over word combinations. The results of such systems are impressive, but unfortunately not good enough for most applications of speech recognition. This thesis proposes to incorporate context information into the models of speech recognition to achieve better recognition results. Context is defined as knowledge of the speaker, such as gender and dialect, knowledge of the conversation and knowledge of the world. The influence of each of these categories is investigated using data analysis and case studies, and new models for speech recognition are defined. In particular, a model is presented that dynamically adapts the vocabulary of the recogniser to the topic of a conversation, which it can determine automatically.
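The "statistics over word combinations" mentioned in the abstract are conventionally realised as n-gram language models. A minimal bigram sketch (illustrative only; the function names and the maximum-likelihood estimation without smoothing are assumptions for this example, not the thesis's implementation) might look like:

```python
from collections import defaultdict

def train_bigram(sentences):
    """Count unigram and bigram occurrences over whitespace-tokenised sentences,
    with <s> and </s> marking sentence boundaries."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            unigrams[prev] += 1          # count of the history word
            bigrams[(prev, word)] += 1   # count of the word pair
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for an unseen history."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]
```

For example, trained on the two sentences "the cat sat" and "the dog sat", the model assigns P(cat | the) = 0.5, since "the" occurs twice and is followed by "cat" once. Real recognisers additionally smooth these estimates to handle word pairs never seen in training.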

File Type: pdf
File Size: 4 MB
Publication Year: 2008
Author: Wiggers, Pascal
Supervisors: L.J.M. Rothkrantz, H. Koppelaar
Institution: Delft University of Technology
Keywords: automatic speech recognition; language modelling; dynamic Bayesian networks