Hierarchical Language Modeling for One-Stage Stochastic Interpretation of Natural Speech

The thesis deals with automatic interpretation of naturally spoken utterances for limited-domain applications. Specifically, the problem is examined by means of a dialogue system for an airport information application. In contrast to traditional two-stage systems, speech recognition and semantic processing are tightly coupled. This avoids interpretation errors due to early decisions. The presented one-stage decoding approach utilizes a uniform, stochastic knowledge representation based on weighted transition network hierarchies, which describe phonemes, words, word classes and semantic concepts. A robust semantic model, which is estimated by combination of data-driven and rule-based approaches, is part of this representation. The investigation of this hierarchical language model is the focus of this work. Furthermore, methods for modeling out-of-vocabulary words and for evaluating semantic trees are introduced.

Thomae, Matthias — Technische Universität München


Automatic Speaker Characterization; Identification of Gender, Age, Language and Accent from Speech Signals

Speech signals carry important information about a speaker, such as age, gender, language, accent and emotional/psychological state. Automatic recognition of speaker characteristics has a wide range of commercial, medical and forensic applications, such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. This research aims to develop accurate methods and tools to identify different physical characteristics of speakers. Due to the lack of the required databases, our experiments cover only a subset of speaker characteristics: gender recognition, age estimation, language recognition and accent/dialect identification. However, similar approaches and techniques can be applied to identify other characteristics, such as emotional/psychological state. For speaker characterization, we first convert variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms. This is performed by fitting a probability density function to acoustic ...
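
The abstract is truncated here, but the general technique it alludes to (fitting a probability density, commonly a Gaussian mixture model, to a variable-length sequence of acoustic frames and stacking the resulting parameters into one fixed-dimensional vector) can be sketched as follows. The feature dimensions and GMM size are illustrative assumptions, not values from the thesis.

```python
# Illustrative sketch: map a variable-length sequence of acoustic frames to a
# fixed-dimensional vector by fitting a Gaussian mixture model (GMM) and
# stacking its component means into a "supervector".
import numpy as np
from sklearn.mixture import GaussianMixture

def utterance_to_supervector(frames, n_components=8):
    """frames: (n_frames, n_features) array of e.g. MFCC features."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=0).fit(frames)
    # Fixed dimension: n_components * n_features, regardless of utterance length.
    return gmm.means_.reshape(-1)

# Two utterances of different length map to vectors of the same dimension.
short_utt = np.random.randn(200, 13)    # stand-in for 200 MFCC frames
long_utt = np.random.randn(1500, 13)
print(utterance_to_supervector(short_utt).shape)  # (104,)
print(utterance_to_supervector(long_utt).shape)   # (104,)
```

The fixed-dimensional vectors can then be fed to any standard classifier or regressor for gender, age, language or accent prediction.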

Bahari, Mohamad Hasan — KU Leuven


Audio-visual processing and content management techniques, for the study of (human) bioacoustics phenomena

The present doctoral thesis aims towards the development of new long-term, multi-channel, audio-visual processing techniques for the analysis of bioacoustics phenomena. The effort is focused on the study of the physiology of the gastrointestinal system, aiming at the support of medical research for the discovery of gastrointestinal motility patterns and the diagnosis of functional disorders. The term "processing" in this case is quite broad, incorporating the procedures of signal processing, content description, manipulation and analysis, that are applied to all the recorded bioacoustics signals, the auxiliary audio-visual surveillance information (for the monitoring of experiments and the subjects' status), and the extracted audio-video sequences describing the abdominal sound-field alterations. The thesis outline is as follows. The main objective of the thesis, which is the technological support of medical research, is presented in the first chapter. A quick problem definition is initially ...

Dimoulas, Charalampos — Department of Electrical and Computer Engineering, Faculty of Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece


Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech

One desideratum in designing cognitive robots is autonomous learning of communication skills, just as humans learn them. The primary step towards this goal is vocabulary acquisition. Unlike the training procedures of state-of-the-art automatic speech recognition (ASR) systems, vocabulary acquisition cannot rely on prior knowledge of the language in the same way. As infants do, the acquisition process should be data-driven with multi-level abstraction and coupled with multi-modal inputs. To avoid lengthy training efforts in a word-by-word interactive learning process, a clever learning agent should be able to acquire vocabularies from continuous speech automatically. The work presented in this thesis is entitled "Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech". Inspired by the extensively studied techniques in ASR, we design computational models to discover and represent vocabularies from continuous speech with little prior knowledge of the language to ...
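
As a rough illustration of the factorization machinery the title refers to (plain NMF with multiplicative updates, not the thesis's constrained, multi-modal variant), a non-negative data matrix V is decomposed into parts W and activations H:

```python
# Minimal plain NMF sketch (Frobenius-norm multiplicative updates).
# The thesis adds constraints and multi-modal coupling not shown here.
import numpy as np

def nmf(V, rank=10, n_iter=200, eps=1e-9):
    """Factor non-negative V (features x frames) as V ~ W @ H."""
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.randn(257, 400))   # stand-in for a magnitude spectrogram
W, H = nmf(V, rank=20)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```

In a vocabulary-acquisition setting the columns of W play the role of recurring acoustic patterns (candidate word representations) and H their activations over time.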

Sun, Meng — Katholieke Universiteit Leuven


Statistical and Discriminative Language Modeling for Turkish Large Vocabulary Continuous Speech Recognition

Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a high number of Out-of-Vocabulary (OOV) words, which in turn lowers Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order, which leads to non-robust language model estimates. These challenges have mostly been handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation, which needs to be dealt with for higher accuracies. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated and they yield substantial improvements over the word language models. Our novel approach inspired by dynamic vocabulary adaptation mostly recovers the errors caused by over-generation and ...
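
As a hedged sketch of the general idea behind sub-lexical language modeling (the segmentation lexicon and counts below are invented for illustration and are not the grammatical or statistical units used in the thesis): splitting words into smaller units shrinks the vocabulary and lets the model cover unseen surface forms.

```python
# Toy illustration of sub-lexical language modeling: count bigrams over
# sub-word units instead of full word forms. Segmentations are invented.
from collections import Counter

# Hypothetical morpheme-like segmentation of a tiny Turkish corpus.
segmented = [
    ["ev", "+ler", "+de", "</w>"],        # "evlerde"
    ["ev", "+de", "</w>"],                # "evde"
    ["okul", "+lar", "+da", "</w>"],      # "okullarda"
]

unigrams, bigrams = Counter(), Counter()
for units in segmented:
    for a, b in zip(units[:-1], units[1:]):
        unigrams[a] += 1
        bigrams[(a, b)] += 1

def bigram_prob(a, b):
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0

print(bigram_prob("ev", "+de"))   # 0.5
# An unseen surface form such as "okulda" still decomposes into known units
# ("okul", "+da"), so it is not an OOV word at the sub-lexical level.
```

Over-generation shows up exactly here: the same units can also be recombined into sequences that are not valid word forms, which is what the dissertation's recovery approach addresses.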

Arisoy, Ebru — Bogazici University


Modelling context in automatic speech recognition

Speech is at the core of human communication. Speaking and listening come so naturally to us that we do not have to think about them at all. The underlying cognitive processes are very rapid and almost completely subconscious. It is hard, if not impossible, not to understand speech. For computers, on the other hand, recognising speech is a daunting task. A recogniser has to deal with a large number of different voices (influenced, among other things, by emotion, moods and fatigue), the acoustic properties of different environments, dialects, a huge vocabulary and the unlimited creativity of speakers to combine words and to break the rules of grammar. Almost all existing automatic speech recognisers use statistics over speech sounds (what is the probability that a piece of audio is an a-sound?) and statistics over word combinations to deal with this complexity. The ...
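
A minimal sketch of the "statistics over speech sounds" idea mentioned above: model each phone class with a probability density over an acoustic feature and ask how probable a frame is under each class. The feature, the Gaussian parameters and the uniform prior are invented placeholders, not models from the thesis.

```python
# Sketch: per-phone Gaussian models over a 1-D acoustic feature, combined via
# Bayes' rule into "how probable is it that this frame is an a-sound?".
from scipy.stats import norm

# Hypothetical feature (e.g. first formant frequency in Hz) per phone class.
models = {
    "a": norm(loc=750, scale=80),
    "i": norm(loc=300, scale=60),
}

def phone_posteriors(feature_value, prior=0.5):
    likes = {p: m.pdf(feature_value) * prior for p, m in models.items()}
    total = sum(likes.values())
    return {p: v / total for p, v in likes.items()}

print(phone_posteriors(720))   # this frame is far more probably an "a" than an "i"
```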

Wiggers, Pascal — Delft University of Technology


Confidence Measures for Speech/Speaker Recognition and Applications on Turkish LVCSR

Confidence measures for the results of speech/speaker recognition make the systems more useful in real-time applications. Confidence measures provide a test statistic for accepting or rejecting the recognition hypothesis of the speech/speaker recognition system. Speech/speaker recognition systems are usually based on statistical modeling techniques. In this thesis we defined confidence measures for the statistical modeling techniques used in speech/speaker recognition systems. For speech recognition we tested available confidence measures and the newly defined acoustic prior information based confidence measure in two different conditions which cause errors: out-of-vocabulary words and the presence of additive noise. We showed that the newly defined confidence measure performs better in both tests. A review of speech recognition and speaker recognition techniques and some related statistical methods is given throughout the thesis. We also defined ...
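
As a hedged illustration of the role such a test statistic plays (the specific measures defined in the thesis, including the acoustic-prior-based one, are not reproduced here), a posterior-style confidence computed from n-best hypothesis scores can be thresholded to accept or reject the top hypothesis:

```python
# Sketch: turn n-best log-likelihoods into a posterior-like confidence for the
# top hypothesis and accept/reject it against a threshold. Values are invented.
import numpy as np

def top_hypothesis_confidence(nbest_loglikes):
    scores = np.asarray(nbest_loglikes, dtype=float)
    scores -= scores.max()                  # numerical stability
    post = np.exp(scores) / np.exp(scores).sum()
    return post[np.argmax(scores)]

nbest = [-310.2, -314.8, -315.1, -319.0]    # hypothetical combined hypothesis scores
conf = top_hypothesis_confidence(nbest)
print(conf, "accept" if conf > 0.7 else "reject")
```

A low confidence would flag likely error cases such as out-of-vocabulary words or noisy input, which are exactly the two conditions tested in the thesis.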

Mengusoglu, Erhan — Universite de Mons


Models and Software Realization of Russian Speech Recognition based on Morphemic Analysis

Over 20% of European citizens speak Russian, so the task of automatic recognition of continuous Russian speech is of key significance. The main problems of ASR are connected with the complex mechanism of Russian word formation. In total, there exist over 3 million distinct valid word forms, which makes this a very-large-vocabulary ASR task. The thesis presents a novel HMM-based ASR model of Russian that has morphemic levels of speech and language representation. The model includes the developed methods for decomposing the word vocabulary into morphemes and for acoustic and statistical language modelling at the training stage, and a method for word synthesis at the last stage of speech decoding. The presented results of applying the ASR model to voice access to the Yellow Pages directory have shown an essential improvement (above 75%) of the real-time factor while preserving an acceptable word recognition rate ...
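
A heavily simplified, hypothetical sketch of the two directions mentioned above, decomposing word forms into morphemes for modeling and synthesizing word forms back from a decoded morpheme sequence; the "+" marker convention and the tiny segmentation lexicon are invented for illustration and are not the thesis's method:

```python
# Toy illustration: decompose word forms into stem + affix units for modeling,
# then rejoin a decoded unit sequence back into words.
SEGMENTATION = {
    "столами": ["стол", "+ами"],
    "стол": ["стол"],
    "книгами": ["книг", "+ами"],
}

def decompose(words):
    units = []
    for w in words:
        units.extend(SEGMENTATION.get(w, [w]))
    return units

def synthesize(units):
    words, current = [], ""
    for u in units:
        if u.startswith("+"):
            current += u[1:]          # affix: glue onto the running word
        else:
            if current:
                words.append(current)
            current = u               # stem: start a new word
    if current:
        words.append(current)
    return words

decoded = decompose(["книгами", "стол"])
print(decoded)               # ['книг', '+ами', 'стол']
print(synthesize(decoded))   # ['книгами', 'стол']
```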

Karpov, Alexey — St.Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences


Semantic Similarity in Automatic Speech Recognition for Meetings

This thesis investigates the application of language models based on semantic similarity to Automatic Speech Recognition for meetings. We consider data-driven Latent Semantic Analysis based and knowledge-driven WordNet-based models. Latent Semantic Analysis based models are trained for several background domains and it is shown that all background models reduce perplexity compared to the n-gram baseline models, and some background models also significantly improve speech recognition for meetings. A new method for interpolating multiple models is introduced and the relation to cache-based models is investigated. The semantics of the models is investigated through a synonymity task. WordNet-based models are defined for different word-word similarities that use information encoded in the WordNet graph and corpus information. It is shown that these models can significantly improve over baseline random models on the task of word prediction, and that the chosen part-of-speech context is ...
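
A minimal sketch of the Latent Semantic Analysis machinery underlying such background models (the tiny word-document counts, dimensionality and similarity below are invented; the thesis uses large background domains and its own interpolation scheme):

```python
# Minimal LSA sketch: SVD of a word-document count matrix, then cosine
# similarity between words in the reduced latent space. Counts are invented.
import numpy as np

words = ["meeting", "agenda", "minutes", "guitar", "chord"]
# Rows = words, columns = toy "documents" (term counts).
X = np.array([
    [4, 3, 0, 0],   # meeting
    [3, 2, 0, 0],   # agenda
    [2, 3, 1, 0],   # minutes
    [0, 0, 3, 4],   # guitar
    [0, 0, 2, 3],   # chord
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
dims = 2
word_vecs = U[:, :dims] * s[:dims]      # word representations in latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

i, j, k = words.index("meeting"), words.index("agenda"), words.index("guitar")
print(cosine(word_vecs[i], word_vecs[j]))   # high: same latent topic
print(cosine(word_vecs[i], word_vecs[k]))   # low: different topics
```

Such similarity scores are what allows a semantic language model to prefer words that fit the topical context of a meeting, beyond what an n-gram baseline captures.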

Pucher, Michael — Graz University of Technology


The Bionic Electro-Larynx Speech System - Challenges, Investigations, and Solutions

Humans without larynx need to use a substitution voice to re-obtain speech. The electro-larynx (EL) is a widely used device but is known for its unnatural and monotonic speech quality. Previous research tackled these problems, but until now no significant improvements could be reported. The EL speech system is a complex system including hardware (artificial excitation source or sound transducer) and software (control and generation of the artificial excitation signal). It is not enough to consider one separated problem, but all aspects of the EL speech system need to be taken into account. In this thesis we would like to push forward the boundaries of the conventional EL device towards a new bionic electro-larynx speech system. We formulate two overall scenarios: a closed-loop scenario, where EL speech is excited and simultaneously recorded using an EL speech system, and the artificial ...

Fuchs, Anna Katharina — Graz University of Technology, Signal Processing and Speech Communication Laboratory


Source-Filter Model Based Single Channel Speech Separation

In a natural acoustic environment, multiple sources are usually active at the same time. The task of source separation is the estimation of individual source signals from this complex mixture. The challenge of single channel source separation (SCSS) is to recover more than one source from a single observation. Basically, SCSS can be divided into methods that try to mimic the human auditory system and model-based methods, which find a probabilistic representation of the individual sources and employ this prior knowledge for inference. This thesis presents several strategies for the separation of two speech utterances mixed into a single channel and is structured in four parts: The first part reviews factorial models in model-based SCSS and introduces the soft-binary mask for signal reconstruction. This mask shows improved performance compared to the soft and the binary masks in automatic speech recognition ...
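
The exact soft-binary mask defined in the thesis is not reproduced here; as a hedged sketch, the standard soft and binary masks it sits between are computed from estimated source magnitude spectra and applied to the mixture, with a simple blend standing in for the soft-binary variant:

```python
# Sketch of time-frequency masking for single-channel separation. S1 and S2
# are stand-ins for estimated source magnitude spectrograms; the blending
# parameter is an assumption, not the thesis's soft-binary mask definition.
import numpy as np

S1 = np.abs(np.random.randn(257, 100))   # estimated source 1 magnitudes
S2 = np.abs(np.random.randn(257, 100))   # estimated source 2 magnitudes
mix = S1 + S2                            # idealized mixture magnitude

soft_mask = S1**2 / (S1**2 + S2**2 + 1e-12)
binary_mask = (S1 > S2).astype(float)
alpha = 0.5                              # hypothetical soft/binary blend
soft_binary_mask = alpha * soft_mask + (1 - alpha) * binary_mask

est_source1 = soft_binary_mask * mix     # reconstructed source-1 magnitudes
print(est_source1.shape)
```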

Stark, Michael — Graz University of Technology


Discrete-time speech processing with application to emotion recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of audio recordings derived from a call center. The thesis comprises two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems that are developed are prerequisites for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way the volume of audio recordings that needs to be processed is significantly reduced, while the system is automated. Several systems are developed to detect the presence of a dialogue. This is the first effort reported in the international literature in which the audio channel is exclusively exploited. Also, it is the first time that the speaker utterance ...

Kotti, Margarita — Aristotle University of Thessaloniki


Blind Source Separation of functional dynamic MRI signals via Dictionary Learning

Magnetic Resonance Imaging (MRI) is a non-invasive medical imaging technique that allows the exploration of the inner anatomy, tissues, and physiological processes of the body. Among the different MRI applications, functional Magnetic Resonance Imaging (fMRI) has gradually become an essential tool for investigating brain behavior and nowadays plays a fundamental role in clinical and neurophysiological research. Due to its particular nature, specialized signal processing techniques are required to analyze fMRI data properly. Among the various related techniques that have been developed over the years, the General Linear Model (GLM) is one of the most widely used approaches, and it usually appears as a default in many specialized software toolboxes for fMRI. On the other hand, Blind Source Separation (BSS) methods constitute the most common alternative to the GLM, especially when no prior information regarding the brain ...
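
As a brief reminder of the GLM the abstract refers to (the design matrix and data below are random placeholders, not fMRI data): each voxel time series y is modeled as y = X·beta + noise, and beta is estimated by ordinary least squares.

```python
# GLM sketch: fit y = X @ beta + noise for one voxel by ordinary least squares.
# X stands in for a design matrix (e.g. stimulus regressors convolved with a
# hemodynamic response function); y stands in for a voxel time series.
import numpy as np

rng = np.random.default_rng(0)
n_scans, n_regressors = 120, 3
X = np.column_stack([np.ones(n_scans),
                     rng.standard_normal((n_scans, n_regressors - 1))])
true_beta = np.array([1.0, 2.5, 0.0])
y = X @ true_beta + 0.5 * rng.standard_normal(n_scans)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to [1.0, 2.5, 0.0]
```

BSS methods, in contrast, do not require specifying X in advance, which is what makes them attractive when no prior information about the experimental design or brain response is available.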

Morante, Manuel — National and Kapodistrian University of Athens


Kernel PCA and Pre-Image Iterations for Speech Enhancement

In this thesis, we present novel methods to enhance speech corrupted by noise. All methods are based on the processing of complex-valued spectral data. First, kernel principal component analysis (PCA) for speech enhancement is proposed. Subsequently, a simplification of kernel PCA, called pre-image iterations (PI), is derived. This method computes enhanced feature vectors iteratively by linear combination of noisy feature vectors. The weighting for the linear combination is found by a kernel function that measures the similarity between the feature vectors. The kernel variance is a key parameter for the degree of de-noising and has to be set according to the signal-to-noise ratio (SNR). Initially, PI were proposed for speech corrupted by additive white Gaussian noise. To be independent of knowledge about the SNR and to generalize to other stationary noise types, PI are extended by automatic determination of the ...
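
A hedged sketch of the iteration described above: an enhanced vector is computed as a kernel-weighted linear combination of the noisy feature vectors and the step is repeated. The kernel variance, data and iteration count are placeholders, not the thesis's exact formulation.

```python
# Sketch of a pre-image-style iteration: repeatedly replace a feature vector by
# a Gaussian-kernel-weighted average of the noisy feature vectors.
import numpy as np

def preimage_iterate(x0, noisy_vectors, sigma=1.0, n_iter=20):
    x = x0.copy()
    for _ in range(n_iter):
        d2 = np.sum((noisy_vectors - x) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * sigma**2))       # kernel similarities to x
        x = (w[:, None] * noisy_vectors).sum(axis=0) / (w.sum() + 1e-12)
    return x

noisy = np.random.randn(50, 8)                   # stand-in noisy feature vectors
enhanced = preimage_iterate(noisy[0], noisy, sigma=2.0)
print(enhanced.shape)
```

The choice of sigma (the kernel variance) controls how strongly the result is smoothed, which is why the thesis ties it to the signal-to-noise ratio and later determines it automatically.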

Leitner, Christina — Graz University of Technology


Some Contributions to Music Signal Processing and to Mono-Microphone Blind Audio Source Separation

For humans, sound is valuable mostly for its meaning: the voice carries spoken language, and music carries artistic intent. The physiological functioning of hearing is highly developed, as is our understanding of the underlying process. It is a challenge to replicate this analysis using a computer: in many respects, its capabilities do not match those of human beings when it comes to recognising speech or musical instruments from sound alone, to name a few examples. In this thesis, two problems are investigated: source separation and musical processing. The first part investigates source separation using only one microphone. The problem of source separation arises when several audio sources are active at the same moment, mixed together and acquired by some sensors (one in our case). In this kind of situation it is natural for a human to separate and to recognize ...

Schutz, Antony — Eurecom/Mobile
