Improving Speech Recognition for Pluricentric Languages exemplified on Varieties of German

A method is presented to improve speech recognition for pluricentric languages. Both the effect of adaptation of acoustic data and phonetic transcriptions for several subregions of the German speaking area are investigated and discussed. All experiments were carried out for German spoken in Germany and Austria using large telephone databases (Speech-Dat). In the first part triphone-based acoustic models (AMOs) were trained for several regions and their word error rates (WERs) were compared. The WERs vary between 9.89% and 21.78% and demonstrate the importance of regional variety adaptation. In the pronunciation modeling part narrow phonetic transcriptions for a subset of the Austrian database were carried out to derive pronunciation rules for Austrian German and to generate phonetic lexica for Austrian German which are the first of their kind. These lexica were used for both triphone-based and monophone-based AMOs with German and ...

Micha Baum — TU Graz


Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora

When exposed to noise, speakers will modify the way they speak in an effort to maintain intelligible communication. This process, which is referred to as Lombard effect (LE), involves a combination of both conscious and subconscious articulatory adjustment. Speech production variations due to LE can cause considerable degradation in automatic speech recognition (ASR) since they introduce a mismatch between parameters of the speech to be recognized and the ASR system’s acoustic models, which are usually trained on neutral speech. The main objective of this thesis is to analyze the impact of LE on speech production and to propose methods that increase ASR system performance in LE. All presented experiments were conducted on the Czech spoken language, yet, the proposed concepts are assumed applicable to other languages. The first part of the thesis focuses on the design and acquisition of a ...

Boril, Hynek — Czech Technical University in Prague


Diplophonic Voice - Definitions, models, and detection

Voice disorders need to be better understood because they may lead to reduced job chances and social isolation. Correct treatment indication and treatment effect measurements are needed to tackle these problems. They must rely on robust outcome measures for clinical intervention studies. Diplophonia is a severe and often misunderstood sign of voice disorders. Depending on its underlying etiology, diplophonic patients typically receive treatment such as logopedic therapy or phonosurgery. In the current clinical practice diplophonia is determined auditively by the medical doctor, which is problematic from the viewpoints of evidence-based medicine and scientific methodology. The aim of this thesis is to work towards objective (i.e., automatic) detection of diplophonia. A database of 40 euphonic, 40 diplophonic and 40 dysphonic subjects has been acquired. The collected material consists of laryngeal high-speed videos and simultaneous high-quality audio recordings. All material has been ...

Aichinger, Philipp — Division of Phoniatrics-Logopedics, Department of Otorhinolaryngology, Medical University of Vienna; Signal Processing and Speech Communication Laboratory Graz University of Technology, Austria


Statistical and Discriminative Language Modeling for Turkish Large Vocabulary Continuous Speech Recognition

Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a high number of Out-of Vocabulary (OOV) words which in turn lower Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order that leads to non-robust language model estimates. These challenges have been mostly handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation which needs to be dealt with for higher accuracies. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated and they yield substantial improvements over the word language models. Our novel approach inspired by dynamic vocabulary adaptation mostly recovers the errors caused by over-generation and ...

Arisoy, Ebru — Bogazici University


Discrete-time speech processing with application to emotion recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of the audio recordings that are derived from a call center. The thesis is comprised of two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems that are developed are prerequisite for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way the volume of the audio recordings that need to be processed is significantly reduced, while the system is automated. To detect the presence of a dialogue several systems are developed. This is the first effort found in the international literature that the audio channel is exclusively exploited. Also, it is the first time that the speaker utterance ...

Kotti, Margarita — Aristotle University of Thessaloniki


Unsupervised and semi-supervised Non-negative Matrix Factorization methods for brain tumor segmentation using multi-parametric MRI data

Gliomas represent about 80% of all malignant primary brain tumors. Despite recent advancements in glioma research, patient outcome remains poor. The 5 year survival rate of the most common and most malignant subtype, i.e. glioblastoma, is about 5%. Magnetic resonance imaging (MRI) has become the imaging modality of choice in the management of brain tumor patients. Conventional MRI (cMRI) provides excellent soft tissue contrast without exposing the patient to potentially harmful ionizing radiation. Over the past decade, advanced MRI modalities, such as perfusion-weighted imaging (PWI), diffusion-weighted imaging (DWI) and magnetic resonance spectroscopic imaging (MRSI) have gained interest in the clinical field, and their added value regarding brain tumor diagnosis, treatment planning and follow-up has been recognized. Tumor segmentation involves the imaging-based delineation of a tumor and its subcompartments. In gliomas, segmentation plays an important role in treatment planning as well ...

Sauwen, Nicolas — KU Leuven


An Attention Model and its Application in Man-Made Scene Interpretation

The ultimate aim of research into computer vision is designing a system which interprets its surrounding environment in a similar way the human can do effortlessly. However, the state of technology is far from achieving such a goal. In this thesis different components of a computer vision system that are designed for the task of interpreting man-made scenes, in particular images of buildings, are described. The flow of information in the proposed system is bottom-up i.e., the image is first segmented into its meaningful components and subsequently the regions are labelled using a contextual classifier. Starting from simple observations concerning the human vision system and the gestalt laws of human perception, like the law of 'good (simple) shape' and 'perceptual grouping', a blob detector is developed, that identifies components in a 2D image. These components are convex regions of interest, ...

Jahangiri, Mohammad — Imperial College London


Cross-Lingual Voice Conversion

Cross-lingual voice conversion refers to the automatic transformation of a source speaker’s voice to a target speaker’s voice in a language that the target speaker can not speak. It involves a set of statistical analysis, pattern recognition, machine learning, and signal processing techniques. This study focuses on the problems related to cross-lingual voice conversion by discussing open research questions, presenting new methods, and performing comparisons with the state-of-the-art techniques. In the training stage, a Phonetic Hidden Markov Model based automatic segmentation and alignment method is developed for cross-lingual applications which support textindependent and text-dependent modes. Vocal tract transformation function is estimated using weighted speech frame mapping in more detail. Adjusting the weights, similarity to target voice and output quality can be balanced depending on the requirements of the cross- lingual voice conversion application. A context-matching algorithm is developed to reduce ...

Turk, Oytun — Bogazici University


Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech

One desideratum in designing cognitive robots is autonomous learning of communication skills, just like humans. The primary step towards this goal is vocabulary acquisition. Being different from the training procedures of the state-of-the-art automatic speech recognition (ASR) systems, vocabulary acquisition cannot rely on prior knowledge of language in the same way. Like what infants do, the acquisition process should be data-driven with multi-level abstraction and coupled with multi-modal inputs. To avoid lengthy training efforts in a word-by-word interactive learning process, a clever learning agent should be able to acquire vocabularies from continuous speech automatically. The work presented in this thesis is entitled \emph{Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech}. Enlightened by the extensively studied techniques in ASR, we design computational models to discover and represent vocabularies from continuous speech with little prior knowledge of the language to ...

Sun, Meng — Katholieke Universiteit Leuven


Mixed structural models for 3D audio in virtual environments

In the world of Information and communications technology (ICT), strategies for innovation and development are increasingly focusing on applications that require spatial representation and real-time interaction with and within 3D-media environments. One of the major challenges that such applications have to address is user-centricity, reflecting e.g. on developing complexity-hiding services so that people can personalize their own delivery of services. In these terms, multimodal interfaces represent a key factor for enabling an inclusive use of new technologies by everyone. In order to achieve this, multimodal realistic models that describe our environment are needed, and in particular models that accurately describe the acoustics of the environment and communication through the auditory modality are required. Examples of currently active research directions and application areas include 3DTV and future internet, 3D visual-sound scene coding, transmission and reconstruction and teleconferencing systems, to name but ...

Geronazzo, Michele — University of Padova


Mining the ECG: Algorithms and Applications

This research focuses on the development of algorithms to extract diagnostic information from the ECG signal, which can be used to improve automatic detection systems and home monitoring solutions. In the first part of this work, a generically applicable algorithm for model selection in kernel principal component analysis is presented, which was inspired by the derivation of respiratory information from the ECG signal. This method not only solves a problem in biomedical signal processing, but more importantly offers a solution to a long-standing problem in the field of machine learning. Next, a methodology to quantify the level of contamination in a segment of ECG is proposed. This level is used to detect artifacts, and to improve the performance of different classifiers, by removing these artifacts from the training set. Furthermore, an evaluation of three different methodologies to compute the ECG-derived ...

Varon, Carolina — KU Leuven


Oscillator-plus-Noise Modeling of Speech Signals

In this thesis we examine the autonomous oscillator model for synthesis of speech signals. The contributions comprise an analysis of realizations and training methods for the nonlinear function used in the oscillator model, the combination of the oscillator model with inverse filtering, both significantly increasing the number of `successfully' re-synthesized speech signals, and the introduction of a new technique suitable for the re-generation of the noise-like signal component in speech signals. Nonlinear function models are compared in a one-dimensional modeling task regarding their presupposition for adequate re-synthesis of speech signals, in particular considering stability. The considerations also comprise the structure of the nonlinear functions, with the aspect of the possible interpolation between models for different speech sounds. Both regarding stability of the oscillator and the premiss of a nonlinear function structure that may be pre-defined, RBF networks are found a ...

Rank, Erhard — Vienna University of Technology


Automated detection of epileptic seizures in pediatric patients based on accelerometry and surface electromyography

Epilepsy is one of the most common neurological diseases that manifests in repetitive epileptic seizures as a result of an abnormal, synchronous activity of a large group of neurons. Depending on the affected brain regions, seizures produce various severe clinical symptoms. There is no cure for epilepsy and sometimes even medication and other therapies, like surgery, vagus nerve stimulation or ketogenic diet, do not control the number of seizures. In that case, long-term (home) monitoring and automatic seizure detection would enable the tracking of the evolution of the disease and improve objective insight in any responses to medical interventions or changes in medical treatment. Especially during the night, supervision is reduced; hence a large number of seizures is missed. In addition, an alarm should be integrated into the automated seizure detection algorithm for severe seizures in order to help the ...

Milošević, Milica — KU Leuven


Deep Learning for i-Vector Speaker and Language Recognition

Over the last few years, i-vectors have been the state-of-the-art technique in speaker and language recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors but the DL techniques in use are computationally expensive and need speaker or/and phonetic labels for the background data, which are not easily accessible in practice. On the other hand, the lack of speaker-labeled background data makes a big performance gap, in speaker recognition, between two well-known cosine and Probabilistic Linear Discriminant Analysis (PLDA) i-vector scoring techniques. It has recently been a challenge how to fill this gap without speaker labels, which are expensive in practice. Although some unsupervised clustering techniques are proposed to estimate the speaker labels, they cannot accurately estimate the labels. This thesis tries to solve the problems above by using the DL technology in different ways, without ...

Ghahabi, Omid — Universitat Politecnica de Catalunya


Combining anatomical and spectral information to enhance MRSI resolution and quantification: Application to Multiple Sclerosis

Multiple sclerosis is a progressive autoimmune disease that a˙ects young adults. Magnetic resonance (MR) imaging has become an integral part in monitoring multiple sclerosis disease. Conventional MR imaging sequences such as fluid attenuated inversion recovery imaging have high spatial resolution, and can visualise the presence of focal white matter brain lesions in multiple sclerosis disease. Manual delineation of these lesions on conventional MR images is time consuming and su˙ers from intra and inter-rater variability. Among the advanced MR imaging techniques, MR spectroscopic imaging can o˙er complementary information on lesion characterisation compared to conventional MR images. However, MR spectroscopic images have low spatial resolution. Therefore, the aim of this thesis is to automatically segment multiple sclerosis lesions on conventional MR images and use the information from high-resolution conventional MR images to enhance the resolution of MR spectroscopic images. Automatic single time ...

Jain, Saurabh — KU Leuven

The current layout is optimized for mobile phones. Page previews, thumbnails, and full abstracts will remain hidden until the browser window grows in width.

The current layout is optimized for tablet devices. Page previews and some thumbnails will remain hidden until the browser window grows in width.