Robust Speech Recognition on Intelligent Mobile Devices with Dual-Microphone (2017)
Deep neural networks for source separation and noise-robust speech recognition
This thesis addresses the problem of multichannel audio source separation by exploiting deep neural networks (DNNs). We build upon the classical expectation-maximization (EM) based source separation framework employing a multichannel Gaussian model, in which the sources are characterized by their power spectral densities and their source spatial covariance matrices. We explore and optimize the use of DNNs for estimating these spectral and spatial parameters. Employing the estimated source parameters, we then derive a time-varying multichannel Wiener filter for the separation of each source. We extensively study the impact of various design choices for the spectral and spatial DNNs. We consider different cost functions, time-frequency representations, architectures, and training data sizes. Those cost functions notably include a newly proposed task-oriented signal-to-distortion ratio cost function for spectral DNNs. Furthermore, we present a weighted spatial parameter estimation formula, which generalizes the corresponding exact ...
Nugraha, Aditya Arie — Université de Lorraine
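As a hedged illustration of the multichannel Wiener filter described in the abstract above (a minimal sketch, not the exact formulation in the thesis), the following assumes the source power spectral densities and spatial covariance matrices have already been estimated, e.g. by the spectral and spatial DNNs, for a single time-frequency bin:

```python
import numpy as np

def multichannel_wiener(x, psd, R):
    """Separate J sources from an I-channel mixture at one time-frequency bin.

    x   : (I,) complex mixture vector
    psd : (J,) source power spectral densities v_j (assumed already estimated)
    R   : (J, I, I) source spatial covariance matrices R_j (assumed estimated)
    Returns a (J, I) array of estimated multichannel source images.
    """
    J = R.shape[0]
    # Per-source covariance v_j * R_j, and the mixture covariance as their sum
    Sigma = psd[:, None, None] * R
    Sigma_x_inv = np.linalg.inv(Sigma.sum(axis=0))
    # Multichannel Wiener filter W_j = v_j R_j (sum_k v_k R_k)^{-1}, applied to x
    return np.stack([Sigma[j] @ Sigma_x_inv @ x for j in range(J)])
```

Note that the per-source estimates sum back to the mixture, since the filters sum to the identity matrix by construction.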
Robust Direction-of-Arrival estimation and spatial filtering in noisy and reverberant environments
The advent of multi-microphone setups on a plethora of commercial devices in recent years has generated a newfound interest in the development of robust microphone array signal processing methods. These methods are generally used either to estimate parameters associated with the acoustic scene or to extract signal(s) of interest. In most practical scenarios, the sources are located in the far-field of a microphone array, where the main spatial information of interest is the direction-of-arrival (DOA) of the plane waves originating from the source positions. The focus of this thesis is to incorporate robustness against either lack of or imperfect/erroneous information regarding the DOAs of the sound sources within a microphone array signal processing framework. The DOAs of sound sources are important information in themselves; however, they are most often used as parameters for a subsequent processing method. One of the ...
Chakrabarty, Soumitro — Friedrich-Alexander Universität Erlangen-Nürnberg
Non-linear Spatial Filtering for Multi-channel Speech Enhancement
A large part of human speech communication takes place in noisy environments and is supported by technical devices. For example, a hearing-impaired person might use a hearing aid to take part in a conversation in a busy restaurant. These devices, but also telecommunication in noisy environments or voice-controlled assistants, make use of speech enhancement and separation algorithms that improve the quality and intelligibility of speech by separating speakers and suppressing background noise as well as other unwanted effects such as reverberation. If the devices are equipped with more than one microphone, which is very common nowadays, then multi-channel speech enhancement approaches can leverage spatial information in addition to single-channel tempo-spectral information to perform the task. Traditionally, linear spatial filters, so-called beamformers, have been employed to suppress the signal components from directions other than the target direction and thereby enhance the desired ...
Tesch, Kristina — Universität Hamburg
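The linear spatial filters mentioned above can be illustrated with the simplest beamformer, delay-and-sum. The sketch below is a baseline assumed for context, not the non-linear filters developed in the thesis; it assumes a far-field source and a linear array:

```python
import numpy as np

def delay_and_sum(X, mic_pos, doa, fs, c=343.0):
    """Frequency-domain delay-and-sum beamformer, a minimal linear spatial filter.

    X       : (M, F) STFT frame, one row per microphone, F = nfft//2 + 1 bins
    mic_pos : (M,) microphone positions along a line [m] (linear array assumed)
    doa     : broadside angle of the target in radians
    fs      : sampling rate [Hz]
    Returns the (F,) beamformed spectrum.
    """
    M, F = X.shape
    freqs = np.fft.rfftfreq(2 * (F - 1), d=1.0 / fs)
    # Far-field time delays of a plane wave arriving from angle `doa`
    tau = mic_pos * np.sin(doa) / c
    steer = np.exp(-2j * np.pi * np.outer(tau, freqs))    # (M, F) steering matrix
    # Align the channels to the target direction and average
    return (np.conj(steer) * X).sum(axis=0) / M
```

A plane wave from the steered direction passes undistorted, while signals from other directions are attenuated by destructive summation.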
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, a natural and robust human-machine speech interaction still appears to be out of reach, especially when users interact with a distant microphone in noisy and reverberant environments. These disturbances severely hamper the intelligibility of a speech signal, making Distant Speech Recognition (DSR) one of the major open challenges in the field. This thesis addresses this scenario and proposes some novel techniques, architectures, and algorithms to improve the robustness of distant-talking acoustic models. We first elaborate on methodologies for realistic data contamination, with a particular emphasis on DNN training with simulated data. ...
Ravanelli, Mirco — Fondazione Bruno Kessler
Contributions to Single-Channel Speech Enhancement with a Focus on the Spectral Phase
Single-channel speech enhancement refers to the reduction of noise signal components in a single-channel signal composed of both speech and noise. Spectral speech enhancement methods are among the most popular approaches to solving this problem. Since the short-time spectral amplitude has been identified as a highly perceptually relevant quantity, most conventional approaches rely on processing the amplitude spectrum only, ignoring any information that may be contained in the spectral phase. As a consequence, the noisy short-time spectral phase is neither enhanced for the purpose of signal reconstruction nor is it used for refining short-time spectral amplitude estimates. This thesis investigates the use of the spectral phase and its structure in algorithms for single-channel speech enhancement. This includes the analysis of the spectral phase in the context of theoretically optimal speech estimators. The resulting knowledge is exploited in formulating single-channel speech ...
Stahl, Johannes — Graz University of Technology
Speech dereverberation in noisy environments using time-frequency domain signal models
Reverberation is the sum of reflected sound waves and is present in any conventional room. Speech communication devices such as mobile phones in hands-free mode, tablets, smart TVs, teleconferencing systems, hearing aids, voice-controlled systems, etc. use one or more microphones to pick up the desired speech signals. When the microphones are not in the proximity of the desired source, strong reverberation and noise can degrade the signal quality at the microphones and can impair the intelligibility and the performance of automatic speech recognizers. Therefore, there is a strong demand for processing the microphone signals such that reverberation and noise are reduced. The process of reducing or removing reverberation from recorded signals is called dereverberation. As dereverberation is usually a completely blind problem, where the only available information is the microphone signals, and as the acoustic scenario can be non-stationary, ...
Braun, Sebastian — Friedrich-Alexander Universität Erlangen-Nürnberg
Distributed Signal Processing Algorithms for Multi-Task Wireless Acoustic Sensor Networks
Recent technological advances in analogue and digital electronics as well as in hardware miniaturization have taken wireless sensing devices to another level by introducing low-power communication protocols, improved digital signal processing capabilities and compact sensors. When these devices perform a certain pre-defined signal processing task such as the estimation or detection of phenomena of interest, a cooperative scheme through wireless connections can significantly enhance the overall performance, especially in adverse conditions. The resulting network consisting of such connected devices (or nodes) is referred to as a wireless sensor network (WSN). In acoustical applications (e.g., speech enhancement), a variant of WSNs, called wireless acoustic sensor networks (WASNs), can be employed in which the sensing unit at each node consists of a single microphone or a microphone array. The nodes of such a WASN can then cooperate to perform a multi-channel acoustic ...
Hassani, Amin — KU Leuven
Advances in DFT-Based Single-Microphone Speech Enhancement
The interest in the field of speech enhancement emerges from the increased usage of digital speech processing applications like mobile telephony, digital hearing aids and human-machine communication systems in our daily life. The trend to make these applications mobile increases the variety of potential sources for quality degradation. Speech enhancement methods can be used to increase the quality of these speech processing devices and make them more robust under noisy conditions. The name "speech enhancement" refers to a large group of methods that are all meant to improve certain quality aspects of these devices. Examples of speech enhancement algorithms are echo control, bandwidth extension, packet loss concealment and noise reduction. In this thesis we focus on single-microphone additive noise reduction and aim at methods that work in the discrete Fourier transform (DFT) domain. The main objective of the presented research ...
Hendriks, Richard Christian — Delft University of Technology
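The DFT-domain noise reduction that this thesis advances can be illustrated by the classic Wiener spectral gain, a minimal sketch assuming the noise power spectral density has already been estimated (noise PSD tracking is itself a core topic of such work):

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=1e-3):
    """DFT-domain noise reduction via a Wiener gain, a basic instance of the
    spectral-gain methods discussed above.

    noisy_psd : periodogram/PSD estimate of the noisy signal per DFT bin
    noise_psd : noise PSD estimate per DFT bin (assumed already tracked)
    floor     : lower gain limit; limits musical-noise artefacts
    """
    # Crude speech PSD estimate by power subtraction, clipped at zero
    speech_psd = np.maximum(noisy_psd - noise_psd, 0.0)
    gain = speech_psd / np.maximum(noisy_psd, 1e-12)
    return np.maximum(gain, floor)
```

The gain is applied per DFT bin to the noisy spectrum; bins dominated by noise are attenuated towards the floor, while bins dominated by speech pass nearly unchanged.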
Acoustic sensor network geometry calibration and applications
In the modern world, we are increasingly surrounded by computation devices with communication links and one or more microphones. Such devices are, for example, smartphones, tablets, laptops or hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens the possibility for many practical applications. ASN based speech enhancement, source localization, and event detection can be applied for teleconferencing, camera control, automation, or assisted living. For these kinds of applications, the awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization ...
Plinge, Axel — TU Dortmund University
Geometry-aware sound source localization using neural networks
Sound Source Localization (SSL) is the topic within acoustic signal processing which studies methods for the estimation of the position of one or more active sound sources in space, such as human talkers, using signals captured by one or more microphone arrays. It has many applications, including robot orientation, speech enhancement and diarization. Although signal processing-based algorithms have been the standard choice for SSL over past decades, deep neural networks have recently achieved state-of-the-art performance for this task. A drawback of most deep learning-based SSL methods is that they require the training and testing microphone and room geometries to be matched, restricting practical applications of available models. This is particularly relevant when using Distributed Microphone Arrays (DMAs), whose positions are usually set arbitrarily and may change with time. Flexibility with respect to microphone geometry is also desirable for companies maintaining multiple types of ...
Grinstein, Eric — Imperial College London
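The classic signal processing-based SSL baseline referred to above is time-difference-of-arrival estimation with GCC-PHAT; DOAs then follow from the array geometry. A minimal two-channel sketch (not the neural methods proposed in the thesis):

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time difference of arrival between two microphone signals
    using the generalized cross-correlation with phase transform (GCC-PHAT).
    Returns the delay of `sig` relative to `ref` in seconds.
    """
    n = sig.shape[0] + ref.shape[0]
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.maximum(np.abs(cross), 1e-12)   # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Rearrange so index `max_shift` corresponds to zero lag
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

The PHAT weighting whitens the cross-spectrum, which sharpens the correlation peak and adds robustness against reverberation.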
The Removal of Environmental Noise in Cellular Communications by Perceptual Techniques
This thesis describes the application of a perceptually based spectral subtraction algorithm for the enhancement of non-stationary noise corrupted speech. Through examination of speech enhancement techniques, explanations are given for the choice of magnitude spectral subtraction and how the human auditory system can be modelled for frequency domain speech enhancement. It is discovered that the cochlea provides the mechanical speech enhancement in the auditory system through the use of masking. Frequency masking is used in spectral subtraction to improve the algorithm execution time, and to shape the enhancement process, making it sound natural to the ear. A new technique for estimation of background noise is presented, which operates during speech sections as well as pauses. This uses two microphones placed on opposite ends of the cellular handset. Using these, the algorithm determines whether the signal is speech or noise by ...
Tuffy, Mark — University of Edinburgh
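The magnitude spectral subtraction chosen in the thesis can be sketched in its basic form, without the perceptual masking or dual-microphone noise tracking the thesis adds. The over-subtraction factor and spectral floor below are common heuristics, not the thesis's specific parameter choices:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Basic magnitude spectral subtraction per DFT bin.

    noisy_mag : magnitude spectrum of the noisy speech frame
    noise_mag : estimated noise magnitude spectrum
    alpha     : over-subtraction factor (suppresses residual noise peaks)
    beta      : spectral floor factor (limits musical noise)
    """
    sub = noisy_mag - alpha * noise_mag
    # Floor negative/small results at a fraction of the noisy magnitude
    return np.maximum(sub, beta * noisy_mag)
```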
Noise Robust ASR: Missing data techniques and beyond
Speech recognition performance degrades in the presence of background noise. In this thesis, several methods are developed to improve the noise robustness. Most of the work pertains to the use of sparse representations of speech: speech segments are described as a sparse linear combination of example speech segments, exemplars. Using techniques from missing data theory and compressed sensing, it is proposed to find, for each noisy speech observation, a sparse linear combination of exemplars using only speech features that are not corrupted by noise. This linear combination of clean speech exemplars is then used to reconstruct an estimate of the clean speech. Later in the thesis, it is proposed to augment this model by expressing noisy speech as a linear combination of speech and noise exemplars. Additionally, the weights of labelled exemplars in the sparse representation are used directly for ...
Gemmeke, Jort — Radboud University Nijmegen
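The exemplar-based reconstruction described above can be illustrated with a toy solver: non-negative weights over a dictionary of exemplar columns, found by multiplicative updates. This is one simple way to obtain a non-negative (and often sparse-ish) solution; the thesis uses more elaborate sparse solvers and missing-data masks:

```python
import numpy as np

def exemplar_weights(y, D, n_iter=500):
    """Estimate non-negative weights x such that y ~= D x, where the columns of
    D are (clean speech) exemplars and y is an observed feature vector.
    Uses Lee-Seung multiplicative updates; y and D must be non-negative.
    """
    x = np.full(D.shape[1], 1.0 / D.shape[1])
    for _ in range(n_iter):
        # Multiplicative update: keeps x non-negative at every step
        x *= (D.T @ y) / np.maximum(D.T @ (D @ x), 1e-12)
    return x
```

The clean-speech estimate is then simply `D @ x`, the weighted sum of exemplars.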
Post-Filter Optimization for Multichannel Automotive Speech Enhancement
In an automotive environment, the quality of speech communication using hands-free equipment is often deteriorated by interfering car noise. In order to preserve the speech signal without car noise, a multichannel speech enhancement system including a beamformer and a post-filter can be applied. Since employing a beamformer alone is insufficient to substantially reduce the level of car noise, a post-filter has to be applied to provide further noise reduction, especially at low frequencies. In this thesis, two novel post-filter designs along with their optimization for different driving conditions are presented. The first post-filter design utilizes an adaptive smoothing factor for the power spectral density estimation as well as a hybrid noise coherence function. The hybrid noise coherence function is a mixture of the diffuse and the measured noise coherence functions for a specific driving condition. The second post-filter design applies ...
Yu, Huajun — Technische Universität Braunschweig
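The ingredients of the hybrid noise coherence function mentioned above can be sketched as follows. The ideal diffuse-field coherence model is standard; the convex mixing rule below is an assumption for illustration, and the exact combination in the thesis may differ:

```python
import numpy as np

def diffuse_coherence(freqs, d, c=343.0):
    """Coherence of an ideal diffuse (spherically isotropic) noise field between
    two microphones spaced d metres apart: Gamma(f) = sinc(2 f d / c).
    Note np.sinc(x) = sin(pi x) / (pi x), so no extra pi factor is needed.
    """
    return np.sinc(2.0 * freqs * d / c)

def hybrid_coherence(gamma_diffuse, gamma_measured, lam):
    """Hypothetical hybrid coherence: a convex mix of the ideal diffuse model
    and a coherence measured for a specific driving condition."""
    return lam * gamma_diffuse + (1.0 - lam) * gamma_measured
```

At low frequencies the diffuse coherence is close to one, which is exactly why a beamformer alone struggles there and a coherence-based post-filter is needed.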
Fundamental Frequency and Direction-of-Arrival Estimation for Multichannel Speech Enhancement
Audio systems receive the speech signals of interest usually in the presence of noise. The noise has profound impacts on the quality and intelligibility of the speech signals, and it is therefore clear that the noisy signals must be cleaned up before being played back, stored, or analyzed. We can estimate the speech signal of interest from the noisy signals using a priori knowledge about it. A human speech signal is broadband and consists of both voiced and unvoiced parts. The voiced part is quasi-periodic with a time-varying fundamental frequency (or pitch as it is commonly referred to). We model periodic signals as a sum of harmonics. Therefore, we can pass the noisy signals through bandpass filters centered at the frequencies of the harmonics to enhance the signal. In addition, although the frequencies of the harmonics are the ...
Karimian-Azari, Sam — Aalborg University
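The harmonic bandpass filtering idea above can be illustrated with a crude DFT-bin mask that keeps only bins near multiples of the fundamental frequency, a stand-in for a proper bandpass-filter bank. The half-width `width` around each harmonic is an assumed parameter:

```python
import numpy as np

def harmonic_mask_enhance(noisy, f0, fs, n_harm=10, width=2):
    """Keep only DFT bins within `width` bins of the first n_harm multiples of
    the fundamental frequency f0, zeroing everything else.

    noisy : 1-D noisy signal frame
    f0    : fundamental frequency [Hz] (assumed known or estimated)
    fs    : sampling rate [Hz]
    """
    N = noisy.shape[0]
    spec = np.fft.rfft(noisy)
    mask = np.zeros(spec.shape[0])
    for h in range(1, n_harm + 1):
        k = int(round(h * f0 * N / fs))     # bin index of the h-th harmonic
        if k >= mask.shape[0]:
            break
        mask[max(k - width, 0):min(k + width + 1, mask.shape[0])] = 1.0
    return np.fft.irfft(spec * mask, n=N)
```

Harmonic components pass essentially unchanged, while noise energy between the harmonics is removed.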
Enhancement of Periodic Signals: with Application to Speech Signals
The topic of this thesis is the enhancement of noisy, periodic signals with application to speech signals. Generally speaking, enhancement methods can be divided into signal- and noise-driven methods. In this thesis, we focus on the signal-driven approach by employing relevant signal parameters for the enhancement of periodic signals. The enhancement problem consists of two major subproblems: the estimation of relevant parameters or statistics, and the actual noise reduction of the observed signal. We consider both of these subproblems. First, we consider the problem of estimating signal parameters relevant to the enhancement of periodic signals. The fundamental frequency is one example of such a parameter. Furthermore, in multichannel scenarios, the direction-of-arrival of the periodic sources onto an array of sensors is another parameter of relevance. We propose methods for the estimation of the fundamental frequency that have benefits compared to ...
Jensen, Jesper Rindom — Aalborg University
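The fundamental frequency estimation discussed above can be illustrated with a simple autocorrelation-based baseline, not the estimators proposed in the thesis. The search range below assumes a typical adult pitch range:

```python
import numpy as np

def estimate_f0(x, fs, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of a voiced frame by locating the
    autocorrelation peak within a plausible pitch lag range.

    x    : 1-D voiced signal frame
    fs   : sampling rate [Hz]
    fmin, fmax : assumed pitch search range [Hz]
    """
    # Autocorrelation for non-negative lags
    r = np.correlate(x, x, mode="full")[x.shape[0] - 1:]
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return fs / lag
```

More refined estimators (e.g. harmonic model fitting, as pursued in this line of work) improve robustness to noise and avoid octave errors, but the lag-domain peak picking above captures the core idea.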