Data-driven Speech Enhancement: from Non-negative Matrix Factorization to Deep Representation Learning

In natural listening environments, speech signals are easily distorted by variousacoustic interference, which reduces the speech quality and intelligibility of human listening; meanwhile, it makes difficult for many speech-related applications, such as automatic speech recognition (ASR). Thus, many speech enhancement (SE) algorithms have been developed in the past decades. However, most current SE algorithms are difficult to capture underlying speech information (e.g., phoneme) in the SE process. This causes it to be challenging to know what specific information is lost or interfered with in the SE process, which limits the application of enhanced speech. For instance, some SE algorithms aimed to improve human listening usually damage the ASR system. The objective of this dissertation is to develop SE algorithms that have the potential to capture various underlying speech representations (information) and improve the quality and intelligibility of noisy speech. This ...

Xiang, Yang — Aalborg University, Capturi A/S


Speech Enhancement Using Nonnegative Matrix Factorization and Hidden Markov Models

Reducing interference noise in a noisy speech recording has been a challenging task for many years yet has a variety of applications, for example, in handsfree mobile communications, in speech recognition, and in hearing aids. Traditional single-channel noise reduction schemes, such as Wiener filtering, do not work satisfactorily in the presence of non-stationary background noise. Alternatively, supervised approaches, where the noise type is known in advance, lead to higher-quality enhanced speech signals. This dissertation proposes supervised and unsupervised single-channel noise reduction algorithms. We consider two classes of methods for this purpose: approaches based on nonnegative matrix factorization (NMF) and methods based on hidden Markov models (HMM). The contributions of this dissertation can be divided into three main (overlapping) parts. First, we propose NMF-based enhancement approaches that use temporal dependencies of the speech signals. In a standard NMF, the important temporal ...

Mohammadiha, Nasser — KTH Royal Institute of Technology


Models and Software Realization of Russian Speech Recognition based on Morphemic Analysis

Above 20% European citizens speak in Russian therefore the task of automatic recognition of Russian continuous speech has a key significance. The main problems of ASR are connected with the complex mechanism of Russian word-formation. Totally there exist above 3 million diverse valid word-forms that is very large vocabulary ASR task. The thesis presents the novel HMM-based ASR model of Russian that has morphemic levels of speech and language representation. The model includes the developed methods for decomposition of the word vocabulary into morphemes and acoustical and statistical language modelling at the training stage and the method for word synthesis at the last stage of speech decoding. The presented results of application of the ASR model for voice access to the Yellow Pages directory have shown the essential improvement (above 75%) of the real-time factor saving acceptable word recognition rate ...

Karpov, Alexey — St.Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences


Unsupervised and semi-supervised Non-negative Matrix Factorization methods for brain tumor segmentation using multi-parametric MRI data

Gliomas represent about 80% of all malignant primary brain tumors. Despite recent advancements in glioma research, patient outcome remains poor. The 5 year survival rate of the most common and most malignant subtype, i.e. glioblastoma, is about 5%. Magnetic resonance imaging (MRI) has become the imaging modality of choice in the management of brain tumor patients. Conventional MRI (cMRI) provides excellent soft tissue contrast without exposing the patient to potentially harmful ionizing radiation. Over the past decade, advanced MRI modalities, such as perfusion-weighted imaging (PWI), diffusion-weighted imaging (DWI) and magnetic resonance spectroscopic imaging (MRSI) have gained interest in the clinical field, and their added value regarding brain tumor diagnosis, treatment planning and follow-up has been recognized. Tumor segmentation involves the imaging-based delineation of a tumor and its subcompartments. In gliomas, segmentation plays an important role in treatment planning as well ...

Sauwen, Nicolas — KU Leuven


Statistical and Discriminative Language Modeling for Turkish Large Vocabulary Continuous Speech Recognition

Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a high number of Out-of Vocabulary (OOV) words which in turn lower Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order that leads to non-robust language model estimates. These challenges have been mostly handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation which needs to be dealt with for higher accuracies. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated and they yield substantial improvements over the word language models. Our novel approach inspired by dynamic vocabulary adaptation mostly recovers the errors caused by over-generation and ...

Arisoy, Ebru — Bogazici University


Nonnegative Matrix and Tensor Factorizations: Models, Algorithms and Applications

In many fields, such as linear algebra, computational geometry, combinatorial optimization, analytical chemistry and geoscience, nonnegativity of the solution is required, which is either due to the fact that the data is physically nonnegative, or that the mathematical modeling of the problem requires nonnegativity. Image and audio processing are two examples for which the data are physically nonnegative. Probability and graph theory are examples for which the mathematical modeling requires nonnegativity. This thesis is about the nonnegative factorization of matrices and tensors: namely nonnegative matrix factorization (NMF) and nonnegative tensor factorization (NTF). NMF problems arise in a wide range of scenarios such as the aforementioned fields, and NTF problems arise as a generalization of NMF. As the title suggests, the contributions of this thesis are centered on NMF and NTF over three aspects: modeling, algorithms and applications. On the modeling ...

Ang, Man Shun — Université de Mons


Sound Source Separation in Monaural Music Signals

Sound source separation refers to the task of estimating the signals produced by individual sound sources from a complex acoustic mixture. It has several applications, since monophonic signals can be processed more efficiently and flexibly than polyphonic mixtures. This thesis deals with the separation of monaural, or, one-channel music recordings. We concentrate on separation methods, where the sources to be separated are not known beforehand. Instead, the separation is enabled by utilizing the common properties of real-world sound sources, which are their continuity, sparseness, and repetition in time and frequency, and their harmonic spectral structures. One of the separation approaches taken here use unsupervised learning and the other uses model-based inference based on sinusoidal modeling. Most of the existing unsupervised separation algorithms are based on a linear instantaneous signal model, where each frame of the input mixture signal is modeled ...

Virtanen, Tuomas — Tampere University of Technology


Novel texture synthesis methods and their application to image prediction and image inpainting

This thesis presents novel exemplar-based texture synthesis methods for image prediction (i.e., predictive coding) and image inpainting problems. The main contributions of this study can also be seen as extensions to simple template matching, however the texture synthesis problem here is well-formulated in an optimization framework with different constraints. The image prediction problem has first been put into sparse representations framework by approximating the template with a sparsity constraint. The proposed sparse prediction method with locally and adaptive dictionaries has been shown to give better performance when compared to static waveform (such as DCT) dictionaries, and also to the template matching method. The image prediction problem has later been placed into an online dictionary learning framework by adapting conventional dictionary learning approaches for image prediction. The experimental observations show a better performance when compared to H.264/AVC intra and sparse prediction. ...

Turkan, Mehmet — INRIA-Rennes, France


Hierarchical Language Modeling for One-Stage Stochastic Interpretation of Natural Speech

The thesis deals with automatic interpretation of naturally spoken utterances for limited-domain applications. Specifically, the problem is examined by means of a dialogue system for an airport information application. In contrast to traditional two-stage systems, speech recognition and semantic processing are tightly coupled. This avoids interpretation errors due to early decisions. The presented one-stage decoding approach utilizes a uniform, stochastic knowledge representation based on weighted transition network hierarchies, which describe phonemes, words, word classes and semantic concepts. A robust semantic model, which is estimated by combination of data-driven and rule-based approaches, is part of this representation. The investigation of this hierarchical language model is the focus of this work. Furthermore, methods for modeling out-of-vocabulary words and for evaluating semantic trees are introduced.

Thomae, Matthias — Technische Universität München


Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora

When exposed to noise, speakers will modify the way they speak in an effort to maintain intelligible communication. This process, which is referred to as Lombard effect (LE), involves a combination of both conscious and subconscious articulatory adjustment. Speech production variations due to LE can cause considerable degradation in automatic speech recognition (ASR) since they introduce a mismatch between parameters of the speech to be recognized and the ASR system’s acoustic models, which are usually trained on neutral speech. The main objective of this thesis is to analyze the impact of LE on speech production and to propose methods that increase ASR system performance in LE. All presented experiments were conducted on the Czech spoken language, yet, the proposed concepts are assumed applicable to other languages. The first part of the thesis focuses on the design and acquisition of a ...

Boril, Hynek — Czech Technical University in Prague


On Bayesian Methods for Black-Box Optimization: Efficiency, Adaptation and Reliability

Recent advances in many fields ranging from engineering to natural science, require increasingly complicated optimization tasks in the experiment design, for which the target objectives are generally in the form of black-box functions that are expensive to evaluate. In a common formulation of this problem, a designer is expected to solve the black-box optimization tasks via sequentially attempting candidate solutions and receiving feedback from the system. This thesis considers Bayesian optimization (BO) as the black-box optimization framework, and investigates the enhancements on BO from the aspects of efficiency, adaptation and reliability. Generally, BO consists of a surrogate model for providing probabilistic inference and an acquisition function which leverages the probabilistic inference for selecting the next candidate solution. Gaussian process (GP) is a prominent non-parametric surrogate model, and the quality of its inference is a critical factor on the optimality performance ...

Zhang, Yunchuan — King's College London


Automatic Speaker Characterization; Identification of Gender, Age, Language and Accent from Speech Signals

Speech signals carry important information about a speaker such as age, gender, language, accent and emotional/psychological state. Automatic recognition of speaker characteristics has a wide range of commercial, medical and forensic applications such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. This research aims to develop accurate methods and tools to identify different physical characteristics of the speakers. Due to the lack of required databases, among all characteristics of speakers, our experiments cover gender recognition, age estimation, language recognition and accent/dialect identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state. For speaker characterization, we first convert variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms. This is performed by fitting a probability density function to acoustic ...

Bahari, Mohamad Hasan — KU Leuven


Acoustic sensor network geometry calibration and applications

In the modern world, we are increasingly surrounded by computation devices with communication links and one or more microphones. Such devices are, for example, smartphones, tablets, laptops or hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens the possibility for many practical applications. ASN based speech enhancement, source localization, and event detection can be applied for teleconferencing, camera control, automation, or assisted living. For this kind of applications, the awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization ...

Plinge, Axel — TU Dortmund University


Confidence Measures for Speech/Speaker Recognition and Applications on Turkish LVCSR

Con dence measures for the results of speech/speaker recognition make the systems more useful in the real time applications. Con dence measures provide a test statistic for accepting or rejecting the recognition hypothesis of the speech/speaker recognition system. Speech/speaker recognition systems are usually based on statistical modeling techniques. In this thesis we de ned con dence measures for statistical modeling techniques used in speech/speaker recognition systems. For speech recognition we tested available con dence measures and the newly de ned acoustic prior information based con dence measure in two di erent conditions which cause errors: the out-of-vocabulary words and presence of additive noise. We showed that the newly de ned con dence measure performs better in both tests. Review of speech recognition and speaker recognition techniques and some related statistical methods is given through the thesis. We de ned also ...

Mengusoglu, Erhan — Universite de Mons


Discrete-time speech processing with application to emotion recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of the audio recordings that are derived from a call center. The thesis is comprised of two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems that are developed are prerequisite for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way the volume of the audio recordings that need to be processed is significantly reduced, while the system is automated. To detect the presence of a dialogue several systems are developed. This is the first effort found in the international literature that the audio channel is exclusively exploited. Also, it is the first time that the speaker utterance ...

Kotti, Margarita — Aristotle University of Thessaloniki

The current layout is optimized for mobile phones. Page previews, thumbnails, and full abstracts will remain hidden until the browser window grows in width.

The current layout is optimized for tablet devices. Page previews and some thumbnails will remain hidden until the browser window grows in width.