Combining Model-Based and Learning-Based Approaches for Speech Enhancement
In many speech communication devices, such as smartphones, smart speakers, and hearing devices, the microphones capture not only the target speaker but also undesired ambient noise, degrading speech quality and speech intelligibility. Speech enhancement algorithms aim to extract the target speech from the recorded microphone signals by suppressing noise without distorting the target speech. Over the past decade, there has been a shift from model-based statistical signal processing approaches to learning-based data-driven approaches. Although model-based approaches offer interpretability and theoretical guarantees, they often struggle in complex, real-world acoustic scenarios where their assumptions are violated. In contrast, learning-based approaches generally achieve higher performance in such scenarios due to their strong representation capacity, but they may lack interpretability, theoretical guarantees, and robustness when the data observed during inference does not match the training data.
Motivated by the potential to combine the interpretability of model-based approaches with the strong representation capacity of learning-based approaches, the primary objective of this thesis is to develop and evaluate hybrid speech enhancement algorithms that employ a learning-based stage to estimate quantities required by a model-based enhancement stage. The main focus is on investigating whether imposing structure on the estimated quantities (such as correlation matrix structure, correlation vector structure, or spatial structure) improves speech enhancement performance and interpretability and reduces computational complexity. Another focus is on developing geometry-robust hybrid speech enhancement algorithms that can operate with arbitrary microphone array configurations. While the developed algorithms can be used for various speech enhancement applications, our focus is on hearing devices, where low latency is crucial. To this end, we mainly consider causal multi-frame filters in the short-time Fourier transform domain as the model-based enhancement stage, leveraging their inherent low-latency capabilities.
As a first contribution, we propose a hybrid single-microphone speech enhancement approach by embedding the multi-frame minimum variance distortionless response (MFMVDR) filter within a deep learning framework, imposing structure on the required temporal covariance matrices. Simulation results using the deep noise suppression (DNS) 1 challenge dataset demonstrate that the resulting deep MFMVDR filter improves speech enhancement performance compared to a purely learning-based algorithm that does not impose the MFMVDR structure on the filter coefficients. Additionally, imposing structure on the temporal covariance matrices reduces computational complexity while maintaining speech enhancement performance.
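For readers unfamiliar with multi-frame filtering, the following is a minimal numerical sketch of an MVDR-type multi-frame filter for a single time-frequency bin. It is an illustration rather than the thesis implementation: it assumes the commonly used closed form w = Φ⁻¹γ / (γᴴΦ⁻¹γ), where Φ is a temporal covariance matrix of the undesired signal and γ is the speech interframe correlation vector. The function name `mfmvdr_weights`, the diagonal loading, and the random data are hypothetical; in a hybrid approach, Φ and γ would be estimated by a learning-based stage.

```python
import numpy as np

def mfmvdr_weights(phi_u, gamma_x, diag_load=1e-6):
    """MVDR-type multi-frame filter: w = Phi^{-1} gamma / (gamma^H Phi^{-1} gamma).

    phi_u   : (N, N) Hermitian temporal covariance matrix (assumed given)
    gamma_x : (N,) speech interframe correlation vector (assumed given)
    """
    n = phi_u.shape[0]
    # Diagonal loading for numerical stability (an illustrative choice).
    phi = phi_u + diag_load * np.trace(phi_u).real / n * np.eye(n)
    num = np.linalg.solve(phi, gamma_x)      # Phi^{-1} gamma
    return num / (gamma_x.conj() @ num)      # normalize by gamma^H Phi^{-1} gamma

# Synthetic data for one time-frequency bin with N consecutive frames.
N = 5
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
phi_u = A @ A.conj().T + N * np.eye(N)       # Hermitian positive definite
gamma_x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
gamma_x[0] = 1.0                             # normalized w.r.t. the current frame

w = mfmvdr_weights(phi_u, gamma_x)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # noisy multi-frame vector
estimate = np.vdot(w, y)                     # filter output w^H y
```

By construction, the weights satisfy the distortionless constraint wᴴγ = 1, which is the structure that a purely learning-based filter does not enforce.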
Second, we extend the hybrid single-microphone approach to multi-microphone speech enhancement for binaural hearing devices by embedding the binaural spatio-temporal Wiener filter within a deep learning framework, imposing structure on the required spatio-temporal correlation vectors. Simulation results using the DNS 1, DNS 2, CEC 1, and CEC 3 datasets demonstrate that the Kronecker factorization of the speech spatio-temporal correlation vectors into a spatial factor (the relative transfer function (RTF) vector) and a temporal factor reduces computational complexity while maintaining speech enhancement performance and preserving binaural cues. The resulting algorithm also outperforms two causal state-of-the-art binaural speech enhancement algorithms.
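To illustrate why such a factorization reduces the number of quantities to be estimated, the sketch below builds a spatio-temporal vector as the Kronecker product of a temporal factor and a spatial factor. The dimensions, the ordering of the two factors, and all variable names are assumptions made for illustration and may differ from the convention used in the thesis.

```python
import numpy as np

M, N = 4, 5                      # assumed: M microphones, N frames
rng = np.random.default_rng(1)

# Spatial factor: RTF vector, normalized to a reference microphone.
rtf = rng.standard_normal(M) + 1j * rng.standard_normal(M)
rtf[0] = 1.0

# Temporal factor: one complex coefficient per frame.
temporal = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Structured spatio-temporal vector of length M * N.
gamma_st = np.kron(temporal, rtf)

# Number of complex entries with and without the factorization.
n_full = M * N                   # unstructured vector
n_factored = M + N               # two factors

# Reshaped into an N x M matrix, the structured vector has rank one.
rank = np.linalg.matrix_rank(gamma_st.reshape(N, M))
```

With these illustrative dimensions, the unstructured vector has 20 complex entries, whereas the two factors together have 9, and the gap widens as M and N grow.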
Third, we investigate the acoustic interpretability of the estimated RTF vector in the Kronecker factorization of the speech spatio-temporal correlation vector. Since the estimated RTF vector does not reflect the spatial characteristics of the acoustic scenario, we propose a spatial regularization procedure to improve interpretability by imposing spatial structure. Simulation results using the CHiME-3 microphone array demonstrate that the proposed spatial regularization procedure yields accurate estimates of the RTF vector even in reverberant environments without sacrificing speech enhancement performance or increasing computational complexity.
Finally, we propose three procedures to improve the robustness of the mask-based beamformer with attention-based spatial covariance matrix aggregator (ASA) against varying microphone array configurations. These procedures include incorporating random microphone array configurations during training, employing the transform-average-concatenate (TAC) method, and using geometry-robust input features. Simulation results for a moving source using the CHiME-3 and DEMAND microphone arrays demonstrate that the combination of these procedures enables the application to unseen microphone array configurations, consistently outperforming both a baseline mask-based beamformer with recursive smoothing and the original mask-based beamformer with ASA.
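As background for the recursive-smoothing baseline, the following sketch shows one common way to estimate a spatial covariance matrix causally from mask-weighted observations, Φ(t) = αΦ(t−1) + (1−α) m(t) y(t) y(t)ᴴ. The exact update rule, the smoothing factor, and the names `smooth_scm`, `frames`, and `mask` are assumptions for illustration and are not taken from the thesis.

```python
import numpy as np

def smooth_scm(frames, mask, alpha=0.9):
    """Recursively smoothed, mask-weighted spatial covariance matrices.

    frames : (T, M) complex STFT coefficients of one frequency bin
    mask   : (T,) real-valued mask in [0, 1]
    alpha  : smoothing factor (hypothetical value)
    """
    T, M = frames.shape
    phi = np.zeros((M, M), dtype=complex)
    out = np.empty((T, M, M), dtype=complex)
    for t in range(T):
        y = frames[t]
        inst = mask[t] * np.outer(y, y.conj())   # instantaneous masked estimate
        phi = alpha * phi + (1.0 - alpha) * inst # uses past and current frames only
        out[t] = phi
    return out

# Synthetic example: T frames, M microphones.
T, M = 50, 6
rng = np.random.default_rng(2)
frames = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
mask = rng.uniform(0.0, 1.0, size=T)
scms = smooth_scm(frames, mask)
```

A fixed smoothing factor weights past frames in the same way regardless of the signal, which can be a limitation for a moving source. An attention-based aggregator instead learns how to weight the frames.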
