Non-linear Spatial Filtering for Multi-channel Speech Enhancement

A large part of human speech communication takes place in noisy environments and is supported by technical devices. For example, a hearing-impaired person might use a hearing aid to take part in a conversation in a busy restaurant. Hearing aids, as well as telecommunication systems used in noisy environments and voice-controlled assistants, make use of speech enhancement and separation algorithms that improve the quality and intelligibility of speech by separating speakers and suppressing background noise as well as other unwanted effects such as reverberation. If the devices are equipped with more than one microphone, which is very common nowadays, multi-channel speech enhancement approaches can leverage spatial information in addition to single-channel tempo-spectral information to perform the task. Traditionally, linear spatial filters, so-called beamformers, have been employed to suppress signal components arriving from directions other than the target direction and thereby enhance the desired speech signal. Since the noise reduction of a beamformer alone is insufficient in acoustically challenging scenarios, it is often combined with a single-channel tempo-spectral post-filter. In single-channel speech enhancement and separation, approaches based on deep neural networks (DNNs) have dominated the research landscape for some time. In multi-channel speech enhancement and separation, on the other hand, a change is currently taking place. Initially, DNNs were only integrated into multi-channel systems for tempo-spectral modeling, e.g., for estimating the beamformer parameters, while the spatial processing continued to be performed with a linear beamformer. Today, however, the number of publications that propose replacing the traditional pipeline with end-to-end trained DNNs is steadily increasing. With such an approach, DNNs can be used to realize a filter that integrates both spatial and tempo-spectral processing into a single non-linear operation.
Such joint spatial and tempo-spectral non-linear filters are the subject of this thesis and are referred to as non-linear spatial filters. The first part of the thesis aims to clarify, from a statistical perspective, the benefits that an analytic non-linear spatial filter can offer compared to the traditional beamformer plus post-filter pipeline. A better understanding of the properties of non-linear spatial filters helps to decide whether and in which situations a (DNN-based) non-linear spatial filter should replace the traditional approaches. Based on analytical estimators, we show that a non-linear spatial filter outperforms a beamformer plus post-filter approach if the noise distribution is non-Gaussian. Furthermore, by means of experiments, we demonstrate that the non-linear spatial filter enables more powerful spatial processing that is not bound to the theoretical limits of a linear approach. The second part focuses on the design and analysis of DNN-based joint spatial and tempo-spectral non-linear filters. We analyze the dependencies between the three available sources of information (spatial, spectral, and temporal) and find that the correlations between the frequency bands are particularly important for achieving high spatial selectivity. Regarding the network architecture, this implies that spatial and spectral information should be processed together at an early stage. The DNN-based non-linear spatial filter designed according to this principle significantly outperforms an oracle beamformer plus DNN-based post-filter in difficult scenarios with a high number of interfering speakers and a low number of microphones.
In the third part of the thesis, we add a steering mechanism to the DNN-based non-linear spatial filter so that it can be steered in a chosen target direction. We apply the steerable filter to speech separation tasks and find that the explicit focus on the spatial selectivity of the filter during training is not only beneficial for the overall separation performance but also leads to improved generalization compared to a similar network trained with permutation invariant training (PIT). Overall, this thesis not only contributes to a better theoretical understanding of non-linear spatial filters and their performance potential, but also investigates various aspects of a practical implementation using DNNs. The research culminates in the development of a real-time demonstration of a DNN-based non-linear spatial filter.

File Type: pdf
File Size: 4 MB
Publication Year: 2023
Author: Tesch, Kristina
Supervisors: Timo Gerkmann
Institution: Universität Hamburg
Keywords: speech processing, machine learning, multi-channel, speech enhancement, speech separation