Deep neural networks for source separation and noise-robust speech recognition

This thesis addresses the problem of multichannel audio source separation by exploiting deep neural networks (DNNs). We build upon the classical expectation-maximization (EM) based source separation framework employing a multichannel Gaussian model, in which the sources are characterized by their power spectral densities and their source spatial covariance matrices. We explore and optimize the use of DNNs for estimating these spectral and spatial parameters. Employing the estimated source parameters, we then derive a time-varying multichannel Wiener filter for the separation of each source. We extensively study the impact of various design choices for the spectral and spatial DNNs. We consider different cost functions, time-frequency representations, architectures, and training data sizes. Those cost functions notably include a newly proposed task-oriented signal-to-distortion ratio cost function for spectral DNNs. Furthermore, we present a weighted spatial parameter estimation formula, which generalizes the corresponding exact ...

Nugraha, Aditya Arie — Université de Lorraine


Robust Direction-of-Arrival estimation and spatial filtering in noisy and reverberant environments

The advent of multi-microphone setups on a plethora of commercial devices in recent years has generated a newfound interest in the development of robust microphone array signal processing methods. These methods are generally used to either estimate parameters associated with acoustic scene or to extract signal(s) of interest. In most practical scenarios, the sources are located in the far-field of a microphone array where the main spatial information of interest is the direction-of-arrival (DOA) of the plane waves originating from the source positions. The focus of this thesis is to incorporate robustness against either lack of or imperfect/erroneous information regarding the DOAs of the sound sources within a microphone array signal processing framework. The DOAs of sound sources is by itself important information, however, it is most often used as a parameter for a subsequent processing method. One of the ...

Chakrabarty, Soumitro — Friedrich-Alexander Universität Erlangen-Nürnberg


Adaptation of statistical models for single channel source separation. Application to voice / music separation in songs

Single channel source separation is a quite recent problem of constantly growing interest in the scientific world. However, this problem is still very far to be solved, and even more, it cannot be solved in all its generality. Indeed, since this problem is highly underdetermined, the main difficulty is that a very strong knowledge about the sources is required to be able to separate them. For a grand class of existing separation methods, this knowledge is expressed by statistical source models, notably Gaussian Mixture Models (GMM), which are learned from some training examples. The subject of this work is to study the separation methods based on statistical models in general, and then to apply them to the particular problem of separating singing voice from background music in mono recordings of songs. It can be very useful to propose some satisfactory ...

OZEROV, Alexey — University of Rennes 1


Speech derereverberation in noisy environments using time-frequency domain signal models

Reverberation is the sum of reflected sound waves and is present in any conventional room. Speech communication devices such as mobile phones in hands-free mode, tablets, smart TVs, teleconferencing systems, hearing aids, voice-controlled systems, etc. use one or more microphones to pick up the desired speech signals. When the microphones are not in the proximity of the desired source, strong reverberation and noise can degrade the signal quality at the microphones and can impair the intelligibility and the performance of automatic speech recognizers. Therefore, it is a highly demanded task to process the microphone signals such that reverberation and noise are reduced. The process of reducing or removing reverberation from recorded signals is called dereverberation. As dereverberation is usually a completely blind problem, where the only available information are the microphone signals, and as the acoustic scenario can be non-stationary, ...

Braun, Sebastian — Friedrich-Alexander Universität Erlangen-Nürnberg


Speech Enhancement Algorithms for Audiological Applications

The improvement of speech intelligibility is a traditional problem which still remains open and unsolved. The recent boom of applications such as hands-free communi- cations or automatic speech recognition systems and the ever-increasing demands of the hearing-impaired community have given a definitive impulse to the research in this area. This PhD thesis is focused on speech enhancement for audiological applications. Most of the research conducted in this thesis has been focused on the improvement of speech intelligibility in hearing aids, considering the variety of restrictions and limitations imposed by this type of devices. The combination of source separation techniques and spatial filtering with machine learning and evolutionary computation has originated novel and interesting algorithms which are included in this thesis. The thesis is divided in two main parts. The first one contains a preliminary study of the problem and a ...

Ayllón, David — Universidad de Alcalá


Speech Enhancement Using Nonnegative Matrix Factorization and Hidden Markov Models

Reducing interference noise in a noisy speech recording has been a challenging task for many years yet has a variety of applications, for example, in handsfree mobile communications, in speech recognition, and in hearing aids. Traditional single-channel noise reduction schemes, such as Wiener filtering, do not work satisfactorily in the presence of non-stationary background noise. Alternatively, supervised approaches, where the noise type is known in advance, lead to higher-quality enhanced speech signals. This dissertation proposes supervised and unsupervised single-channel noise reduction algorithms. We consider two classes of methods for this purpose: approaches based on nonnegative matrix factorization (NMF) and methods based on hidden Markov models (HMM). The contributions of this dissertation can be divided into three main (overlapping) parts. First, we propose NMF-based enhancement approaches that use temporal dependencies of the speech signals. In a standard NMF, the important temporal ...

Mohammadiha, Nasser — KTH Royal Institute of Technology


Probabilistic Model-Based Multiple Pitch Tracking of Speech

Multiple pitch tracking of speech is an important task for the segregation of multiple speakers in a single-channel recording. In this thesis, a probabilistic model-based approach for estimation and tracking of multiple pitch trajectories is proposed. A probabilistic model that captures pitch-dependent characteristics of the single-speaker short-time spectrum is obtained a priori from clean speech data. The resulting speaker model, which is based on Gaussian mixture models, can be trained either in a speaker independent (SI) or a speaker dependent (SD) fashion. Speaker models are then combined using an interaction model to obtain a probabilistic description of the observed speech mixture. A factorial hidden Markov model is applied for tracking the pitch trajectories of multiple speakers over time. The probabilistic model-based approach is capable to explicitly incorporate timbral information and all associated uncertainties of spectral structure into the model. While ...

Wohlmayr, Michael — Graz University of Technology


Robust Speech Recognition on Intelligent Mobile Devices with Dual-Microphone

Despite the outstanding progress made on automatic speech recognition (ASR) throughout the last decades, noise-robust ASR still poses a challenge. Tackling with acoustic noise in ASR systems is more important than ever before for a twofold reason: 1) ASR technology has begun to be extensively integrated in intelligent mobile devices (IMDs) such as smartphones to easily accomplish different tasks (e.g. search-by-voice), and 2) IMDs can be used anywhere at any time, that is, under many different acoustic (noisy) conditions. On the other hand, with the aim of enhancing noisy speech, IMDs have begun to embed small microphone arrays, i.e. microphone arrays comprised of a few sensors close each other. These multi-sensor IMDs often embed one microphone (usually at their rear) intended to capture the acoustic environment more than the speaker’s voice. This is the so-called secondary microphone. While classical microphone ...

López-Espejo, Iván — University of Granada


An iterative, residual-based approach to unsupervised musical source separation in single-channel mixtures

This thesis concentrates on a major problem within audio signal processing, the separation of source signals from musical mixtures when only a single mixture channel is available. Source separation is the process by which signals that correspond to distinct sources are identified in a signal mixture and extracted from it. Producing multiple entities from a single one is an extremely underdetermined task, so additional prior information can assist in setting appropriate constraints on the solution set. The approach proposed uses prior information such that: (1) it can potentially be applied successfully to a large variety of musical mixtures, and (2) it requires minimal user intervention and no prior learning/training procedures (i.e., it is an unsupervised process). This system can be useful for applications such as remixing, creative effects, restoration and for archiving musical material for internet delivery, amongst others. Here, ...

Siamantas, Georgios — University of York


Source-Filter Model Based Single Channel Speech Separation

In a natural acoustic environment, multiple sources are usually active at the same time. The task of source separation is the estimation of individual source signals from this complex mixture. The challenge of single channel source separation (SCSS) is to recover more than one source from a single observation. Basically, SCSS can be divided in methods that try to mimic the human auditory system and model-based methods, which find a probabilistic representation of the individual sources and employ this prior knowledge for inference. This thesis presents several strategies for the separation of two speech utterances mixed into a single channel and is structured in four parts: The first part reviews factorial models in model-based SCSS and introduces the soft-binary mask for signal reconstruction. This mask shows improved performance compared to the soft and the binary masks in automatic speech recognition ...

Stark, Michael — Graz University of Technology


Flexible Multi-Microphone Acquisition and Processing of Spatial Sound Using Parametric Sound Field Representations

This thesis deals with the efficient and flexible acquisition and processing of spatial sound using multiple microphones. In spatial sound acquisition and processing, we use multiple microphones to capture the sound of multiple sources being simultaneously active at a rever- berant recording side and process the sound depending on the application at the application side. Typical applications include source extraction, immersive spatial sound reproduction, or speech enhancement. A flexible sound acquisition and processing means that we can capture the sound with almost arbitrary microphone configurations without constraining the application at the ap- plication side. This means that we can realize and adjust the different applications indepen- dently of the microphone configuration used at the recording side. For example in spatial sound reproduction, where we aim at reproducing the sound such that the listener perceives the same impression as if he ...

Thiergart, Oliver — Friedrich-Alexander-Universitat Erlangen-Nurnberg


Deep Learning for i-Vector Speaker and Language Recognition

Over the last few years, i-vectors have been the state-of-the-art technique in speaker and language recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors but the DL techniques in use are computationally expensive and need speaker or/and phonetic labels for the background data, which are not easily accessible in practice. On the other hand, the lack of speaker-labeled background data makes a big performance gap, in speaker recognition, between two well-known cosine and Probabilistic Linear Discriminant Analysis (PLDA) i-vector scoring techniques. It has recently been a challenge how to fill this gap without speaker labels, which are expensive in practice. Although some unsupervised clustering techniques are proposed to estimate the speaker labels, they cannot accurately estimate the labels. This thesis tries to solve the problems above by using the DL technology in different ways, without ...

Ghahabi, Omid — Universitat Politecnica de Catalunya


Spatio-Temporal Speech Enhancement in Adverse Acoustic Conditions

Never before has speech been captured as often by electronic devices equipped with one or multiple microphones, serving a variety of applications. It is the key aspect in digital telephony, hearing devices, and voice-driven human-to-machine interaction. When speech is recorded, the microphones also capture a variety of further, undesired sound components due to adverse acoustic conditions. Interfering speech, background noise and reverberation, i.e. the persistence of sound in a room after excitation caused by a multitude of reflections on the room enclosure, are detrimental to the quality and intelligibility of target speech as well as the performance of automatic speech recognition. Hence, speech enhancement aiming at estimating the early target-speech component, which contains the direct component and early reflections, is crucial to nearly all speech-related applications presently available. In this thesis, we compare, propose and evaluate existing and novel approaches ...

Dietzen, Thomas — KU Leuven


Acoustic sensor network geometry calibration and applications

In the modern world, we are increasingly surrounded by computation devices with communication links and one or more microphones. Such devices are, for example, smartphones, tablets, laptops or hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens the possibility for many practical applications. ASN based speech enhancement, source localization, and event detection can be applied for teleconferencing, camera control, automation, or assisted living. For this kind of applications, the awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization ...

Plinge, Axel — TU Dortmund University


Integrating monaural and binaural cues for sound localization and segregation in reverberant environments

The problem of segregating a sound source of interest from an acoustic background has been extensively studied due to applications in hearing prostheses, robust speech/speaker recognition and audio information retrieval. Computational auditory scene analysis (CASA) approaches the segregation problem by utilizing grouping cues involved in the perceptual organization of sound by human listeners. Binaural processing, where input signals resemble those that enter the two ears, is of particular interest in the CASA field. The dominant approach to binaural segregation has been to derive spatially selective filters in order to enhance the signal in a direction of interest. As such, the problems of sound localization and sound segregation are closely tied. While spatial filtering has been widely utilized, substantial performance degradation is incurred in reverberant environments and more fundamentally, segregation cannot be performed without sufficient spatial separation between sources. This dissertation ...

Woodruff, John — The Ohio State University

The current layout is optimized for mobile phones. Page previews, thumbnails, and full abstracts will remain hidden until the browser window grows in width.

The current layout is optimized for tablet devices. Page previews and some thumbnails will remain hidden until the browser window grows in width.