Informed spatial filters for speech enhancement
In modern devices that provide hands-free speech capture, such as hands-free communication kits and voice-controlled devices, the speech signal received at the microphones is corrupted by background noise, interfering speech signals, and room reverberation. In many practical situations, the microphones are not located near the desired source, and hence the ratio of the desired speech power to the power of the background noise, the interfering speech, and the reverberation at the microphones can be very low, often around or even below 0 dB. In such situations, the comfort of human-to-human communication, as well as the accuracy of automatic speech recognisers in voice-controlled applications, can be significantly degraded. Therefore, effective speech enhancement algorithms are required to process the microphone signals before transmitting them to the far-end side for communication, or before feeding them into a speech recognition engine. This thesis is concerned with multi-microphone speech enhancement in reverberant environments in the presence of background noise and non-stationary interferers, such as interfering speakers. The desired speech signal that needs to be enhanced is usually application-dependent and can originate from one or multiple speakers. The background noise and the non-stationary interferers constitute undesired signals. Specific tasks of interest in this thesis are undesired signal reduction, Blind Source Separation (BSS), and acoustic source detection and tracking.
While single-channel speech enhancement and noise reduction have been extensively studied for more than four decades, efficient solutions to challenging problems such as BSS, acoustic source tracking, and speech enhancement in scenarios with multiple speech sources have emerged more recently, as a result of the rapid development of multi-channel speech processing and the availability of multiple microphones in commercial products, e.g., mobile phones, laptops, smart watches, and hearing aids. The spatial diversity provided by multiple microphones makes it possible to reduce strong non-stationary undesired signals while introducing little or no distortion to the desired speech. In multi-microphone speech enhancement systems, spatio-temporal filters (beamformers) are applied to the microphone signals to obtain an estimate of the desired speech signal. A spatio-temporal filter is a processor that linearly combines the received microphone signals to provide the desired signal estimate. Commonly used optimality criteria for spatio-temporal filter design require knowledge of the spatio-temporal Second-Order Statistics (SOS) of the desired and undesired signals received at the microphones. As the SOS are often unavailable and time-varying in practice, their estimation from the microphone signals is one of the most important factors that determine the quality of the desired signal estimate at the filter output. In general, the SOS need to be estimated in a supervised manner from the microphone signals, such that the SOS of the desired signal are estimated when the desired signal is present and the SOS of the undesired signals are estimated when the desired signal is absent. Hence, an accurate desired signal detector is a fundamental building block for the implementation of data-dependent spatio-temporal filters in practice.
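As a concrete illustration of how a spatial filter is derived from the SOS, the following minimal sketch computes Minimum Variance Distortionless Response (MVDR) weights at a single frequency bin. This is one common instance of the optimality criteria mentioned above, not the specific filter design of this thesis; the variable names (phi_u for the undesired-signal covariance matrix, d for the desired-source propagation vector) are illustrative assumptions.

```python
import numpy as np

def mvdr_weights(phi_u, d):
    """MVDR weights: minimise the undesired output power w^H phi_u w
    subject to the distortionless constraint w^H d = 1."""
    num = np.linalg.solve(phi_u, d)   # phi_u^{-1} d
    return num / (d.conj() @ num)     # normalise so that w^H d = 1

# Toy example: M = 4 microphones with a synthetic Hermitian
# positive-definite undesired-signal covariance matrix.
M = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_u = A @ A.conj().T + M * np.eye(M)             # undesired-signal SOS
d = np.exp(-1j * 2 * np.pi * 0.1 * np.arange(M))   # example propagation vector

w = mvdr_weights(phi_u, d)
```

By construction, applying w to the microphone signals passes the desired signal undistorted (w^H d = 1) while suppressing the undesired components as much as the estimated SOS allow, which is why accurate SOS estimation directly determines the output quality.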
Although the theory of optimal filter design for speech applications is a mature field, many books and contributions assume that the SOS are available, or that they can be estimated in advance in stationary scenarios. However, the implementation of optimal filters in the dynamic scenarios typical of real applications, and the importance of signal detection and online SOS estimation, have been addressed less often in the literature. In this thesis, we address the design of data-dependent speech enhancement frameworks for a range of applications. We propose several application-specific frameworks, each of which consists of designing an appropriate desired signal detector, estimating the SOS of the desired and undesired signals using the detector output, and computing optimal data-dependent spatial filters that estimate the speech signal of interest while reducing undesired signals. As the optimal filters are computed in a supervised manner using the signal detectors, they are referred to as Informed Spatial Filters (ISFs). An underlying assumption in the design of the proposed detectors and ISFs is the sparsity of speech in the Short-Time Fourier Transform (STFT) domain, which means that with a suitably chosen time and frequency resolution of the STFT, each Time-Frequency (TF) bin is dominated by either a single speech source or background noise. Based on this assumption, signal detection is performed at each TF bin, followed by an update of the SOS corresponding to the dominant source at that TF bin. The estimated SOS are then used to compute optimal, time-varying ISFs to extract the desired signals. The ISFs obtained in this manner are able to adapt almost instantaneously to changing acoustic conditions, such as time-varying locations of the desired and undesired sources and time-varying noise statistics.
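The detect-then-update mechanism described above can be sketched as follows. This is a simplified illustration, not the detectors proposed in this thesis: a crude power-ratio test decides which source dominates a TF bin, and only the corresponding covariance (SOS) estimate is updated by recursive averaging. All names (update_sos, process_frame, the threshold) are illustrative assumptions.

```python
import numpy as np

def update_sos(phi, y, alpha=0.9):
    """Exponentially weighted recursive update of a spatial covariance matrix."""
    return alpha * phi + (1 - alpha) * np.outer(y, y.conj())

def process_frame(y, phi_s, phi_u, threshold=2.0):
    """For one TF bin, decide which source dominates and update its SOS.
    y is the vector of microphone STFT coefficients at that bin."""
    inst_power = np.real(y.conj() @ y)       # instantaneous power over all mics
    noise_power = np.real(np.trace(phi_u))   # current undesired-power estimate
    if inst_power > threshold * noise_power:  # crude desired-signal detector
        phi_s = update_sos(phi_s, y)          # bin dominated by desired speech
    else:
        phi_u = update_sos(phi_u, y)          # bin dominated by noise/interference
    return phi_s, phi_u

# Toy run: M = 3 microphones, 200 synthetic noise-like frames at one bin.
M = 3
phi_s = 1e-3 * np.eye(M, dtype=complex)
phi_u = np.eye(M, dtype=complex)
rng = np.random.default_rng(1)
for _ in range(200):
    y = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    phi_s, phi_u = process_frame(y, phi_s, phi_u)
```

The smoothing factor alpha controls the trade-off between estimation variance and tracking speed; a smaller alpha lets the SOS, and hence the ISFs computed from them, adapt faster to moving sources and changing noise statistics.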
Using the informed spatial filtering concept, we develop a general system which, by appropriate design of its building blocks, can be applied to a range of applications, including noise reduction, spatially selective sound acquisition, and online BSS of static and moving sources.
