Model-based Techniques and Diffusion Models for Speech Dereverberation
Reverberation occurs in most of our environments and often degrades the intelligibility and quality of human speech, with an aggravated effect on hearing-impaired listeners. Meanwhile, the evolution of technologies for multimedia entertainment, communications and medical applications has led to a greater demand for improved sound quality. Therefore, many embedded devices now include a dereverberation algorithm, which aims to recover the anechoic component of speech. Dereverberation is an arduous task and an ill-posed inverse problem: even perfect knowledge of the room acoustics does not guarantee a perfectly dereverberated signal. Furthermore, in most real-life cases such knowledge is not available, and therefore most dereverberation algorithms are blind, i.e. they must extract information from the reverberant speech signal alone. Traditional dereverberation algorithms derive anechoic speech estimators by exploiting statistical properties of speech signals, distributional assumptions and, when available, knowledge of the room acoustics. These methods are efficient in quiet environments where reverberation and background noise are mild, but fail to perform satisfactorily when conditions become more adverse or when the assumptions underlying their derivations do not hold. Given the recent shift toward data-driven deep learning, numerous speech dereverberation algorithms now rely on the impressive modelling capabilities of deep neural networks (DNNs). These powerful non-linear estimators allow learning-based approaches to largely outperform their traditional counterparts on tasks as difficult as single-channel blind speech dereverberation in the presence of non-stationary measurement noise. However, DNN-based algorithms require more computing resources and often suffer from poor adaptability to conditions unseen in their training data, leading to different failure cases than traditional techniques.
Furthermore, relying solely on DNN-based learning approaches carries the risk of reduced interpretability, thus failing to provide guarantees with respect to user safety and fairness. The opening chapter of this thesis focuses on model-based learning, i.e. hybrid paradigms combining DNNs with domain knowledge such as speech statistical properties, room acoustics or traditional algorithm structures. In the first publication, we present a real-time-capable two-stage algorithm combining traditional speech dereverberation and lightweight DNNs. In the first stage, a DNN-assisted multi-channel linear prediction method removes most of the moderate reverberation accessible within the auto-regressive filter length. The second stage then extracts the target speech by suppressing the statistically uncorrelated residual reverberation from the output of the first stage. The other technique presented in this chapter leverages the signal models behind speech denoising and dereverberation. There, we extend time-frequency masking DNNs to deep filters performing multi-frame filtering in frequency subbands. We observe that deep filters perform better on dereverberation than single-frame masking, as one would intuitively expect from the ideas underlying subband filtering for dereverberation. In contrast, the performance of both approaches is similar when only background noise is present.

In the second chapter, we investigate conditional diffusion-based generative models for speech dereverberation and their relationship to supervised learning and predictive models. Conditional generative models estimate the posterior distribution of anechoic speech given a reverberant recording, in contrast with predictive models, which learn a regression rule between reverberant and anechoic speech. We introduce this chapter with a tutorial on conditional diffusion models for audio restoration.
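The contrast drawn above between single-frame masking and multi-frame deep filtering can be illustrated with a minimal NumPy sketch. All shapes, values and filter taps here are illustrative placeholders (in practice the mask and filter taps would be predicted by a DNN from the noisy STFT), not outputs of the actual algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy STFT of a reverberant signal: (freq_bins, time_frames), complex-valued.
F, T, N = 4, 10, 3   # N: number of past frames used by the deep filter
X = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

# Single-frame masking: one real-valued gain per time-frequency bin.
mask = rng.uniform(0.0, 1.0, size=(F, T))
Y_mask = mask * X

# Deep filtering: a length-N complex filter per time-frequency bin,
# applied along past frames within each frequency subband, so that
# reverberant energy smeared across frames can be combined coherently.
H = rng.standard_normal((F, T, N)) + 1j * rng.standard_normal((F, T, N))
Y_df = np.zeros_like(X)
for n in range(N):
    # Tap n combines frame t with frame t - n within each subband.
    Y_df[:, n:] += H[:, n:, n] * X[:, : T - n]
```

Setting all taps but the first to zero recovers single-frame masking as a special case, which is why deep filtering can only improve on masking given a sufficiently expressive DNN.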
The second contribution is a comparative analysis of predictive methods versus diffusion-based generative models. We contextualize this comparison with respect to various speech restoration tasks such as denoising, dereverberation and bandwidth extension. The study suggests that diffusion models consistently outperform their predictive counterparts across all tasks, and that the quality difference is larger for non-additive degradation models such as reverberation and bandwidth extension. Our next work leverages this analysis to combine predictive and diffusion-based generative models in a principled fashion. We demonstrate that using a predictive model estimate as an intermediate step before diffusion-based generation yields remarkable speech enhancement and dereverberation performance, while simultaneously reducing computational costs compared to traditional diffusion models.

The publications in the last chapter of this dissertation treat dereverberation as an inverse problem. Our initial contribution presents an unsupervised method for informed dereverberation, where diffusion models are applied as unconditional speech priors in Bayesian posterior sampling. We observe that the diffusion-based prior is an effective regularizer for inverse problem solving, yielding state-of-the-art dereverberation performance when the room acoustics are perfectly known. The second work extends the former to the blind scenario where the room acoustics are unknown. Rooted in statistical observations of room properties, we propose to represent the room impulse response by a subband filter with frequency-dependent exponential decays. The resulting approach performs joint dereverberation and room impulse response estimation without any supervision during training.
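The subband room impulse response model described above can be sketched in a few lines of NumPy. All parameter values below (sample rate, band count, the frequency-dependent T60 profile) are illustrative assumptions, not those of the actual publication:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frequency-dependent reverberation times: rooms typically absorb high
# frequencies faster, so T60 decreases with the subband index here.
fs = 16000                        # sample rate (Hz), illustrative
n_bands, n_frames = 6, 200        # STFT subbands and time frames
hop = 128                         # STFT hop size (samples)
t60 = np.linspace(0.8, 0.3, n_bands)   # per-band T60 in seconds

# Per-band exponential decay rate rho_b, chosen so that the envelope
# drops by 60 dB after T60 seconds: exp(-rho * T60) = 10**(-3).
rho = 3.0 * np.log(10) / t60

frame_times = np.arange(n_frames) * hop / fs
envelope = np.exp(-rho[:, None] * frame_times[None, :])   # (bands, frames)

# Subband RIR: a white complex Gaussian carrier shaped by the band-wise
# exponential envelope, giving frequency-dependent decays.
carrier = (rng.standard_normal((n_bands, n_frames))
           + 1j * rng.standard_normal((n_bands, n_frames)))
H_sub = envelope * carrier
```

Parametrizing the room impulse response by a handful of per-band decay rates, rather than thousands of filter taps, is what makes joint estimation alongside the anechoic speech tractable.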
Because the resulting approach requires no supervision at training time, it adapts naturally to new reverberant environments, unlike supervised algorithms whose performance dwindles when the acoustic conditions at test time differ from those seen during training.

In conclusion, this dissertation conducts a principled investigation of DNN-assisted speech dereverberation, ranging from model-based techniques to recent advances in diffusion-based generative models. Throughout, we discuss the applicability of the presented methods to real-life applications, with a particular focus on hearing devices. Through the various analyses conducted in this thesis, we provide evidence that injecting domain knowledge into DNN-based techniques is instrumental in providing interpretable and efficient speech dereverberation algorithms.
