Generative Speech Enhancement in Multimodal Applications

This dissertation advances generative speech enhancement by investigating both unsupervised and supervised machine learning approaches, with a focus on integrating visual information to improve robustness. The work is organized into three main contributions:

The first contribution focuses on unsupervised generative speech enhancement. We explore a Bayesian framework that combines variational autoencoders (VAEs) trained on clean speech with a non-negative matrix factorization (NMF) noise model. We propose to use stochastic temporal convolutional networks (STCNs) with temporal and hierarchical latent variables to capture the dynamic structure of speech, and we employ a Monte Carlo expectation-maximization algorithm to jointly estimate the speech and noise parameters. Replacing the VAE with an STCN in the VAE-NMF framework yields a more expressive generative model of speech and improves enhancement performance. To incorporate visual cues, we propose a disentanglement learning approach for the latent variables, which allows the VAE to be conditioned on voice activity labels inferred from an audio-visual classifier. This conditioning on visual features helps the model learn a more robust speech representation and further improves the quality of the enhanced speech.
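To make the probabilistic structure of this unsupervised framework concrete, the following is a minimal sketch of the standard VAE-NMF mixture model in the short-time Fourier transform (STFT) domain; the symbols (latent variable $\mathbf{z}_n$, NMF factors $\mathbf{W}, \mathbf{H}$, gain $g_n$) are generic placeholders rather than the dissertation's exact notation:

\[
s_{fn} \mid \mathbf{z}_n \sim \mathcal{N}_c\big(0,\ \sigma_f^2(\mathbf{z}_n)\big), \qquad
b_{fn} \sim \mathcal{N}_c\big(0,\ (\mathbf{W}\mathbf{H})_{fn}\big), \qquad
x_{fn} = \sqrt{g_n}\, s_{fn} + b_{fn},
\]
\[
\Rightarrow \quad x_{fn} \mid \mathbf{z}_n \sim \mathcal{N}_c\big(0,\ g_n\,\sigma_f^2(\mathbf{z}_n) + (\mathbf{W}\mathbf{H})_{fn}\big),
\]

where $\sigma_f^2(\mathbf{z}_n)$ is the variance produced at frequency $f$ and frame $n$ by the decoder of the generative speech model (VAE or STCN). In such a framework, the Monte Carlo expectation-maximization algorithm alternates between sampling the latent variables from their intractable posterior (E-step) and updating the noise parameters $\mathbf{W}, \mathbf{H}$ and the gain $g_n$ (M-step); the clean speech estimate is then obtained from a Wiener-like filter built from the estimated variances.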

The second contribution pertains to supervised generative speech enhancement. We investigate diffusion models for high-quality speech restoration and introduce score-based generative models for speech enhancement (SGMSE), a novel method that adapts the diffusion process to learn the posterior of clean speech conditioned on corrupted inputs. Notably, SGMSE is not limited to additive corruptions: it also restores general speech communication artifacts, handling diverse distortions such as background noise, reverberation, bandwidth limitation, codec artifacts, and packet loss. We provide a comprehensive review of diffusion models for audio restoration, highlighting their data-driven nature while also discussing their potential for integration into model-based approaches. We extend SGMSE to audio-visual speech enhancement by conditioning on visual features, and propose a causal processing variant by adapting the network architecture. Additionally, we explore alternative diffusion processes, including the Schrödinger bridge, to improve efficiency and perceptual quality.
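As a rough mathematical sketch of this idea, written in the general score-based SDE formulation (the exact drift, noise schedule, and parameterization used in the dissertation may differ), the forward process diffuses a clean utterance $\mathbf{x}_0$ toward the corrupted observation $\mathbf{y}$, and enhancement solves the corresponding reverse-time SDE with a learned score model:

\[
\mathrm{d}\mathbf{x}_t = \gamma\,(\mathbf{y} - \mathbf{x}_t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w} \qquad \text{(forward)},
\]
\[
\mathrm{d}\mathbf{x}_t = \Big[\gamma\,(\mathbf{y} - \mathbf{x}_t) - g(t)^2\, s_\theta(\mathbf{x}_t, \mathbf{y}, t)\Big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}} \qquad \text{(reverse, solved from } t = T \text{ to } t = 0\text{)},
\]

where $s_\theta(\mathbf{x}_t, \mathbf{y}, t) \approx \nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t \mid \mathbf{y})$ is a score network trained by denoising score matching on pairs of clean and corrupted speech. The Schrödinger bridge mentioned above can be viewed as replacing this diffusion with a process whose endpoint distributions are fixed to the clean and corrupted signals.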

The third contribution of this dissertation is an analysis of generative speech enhancement methods in comparison to predictive approaches. We evaluate both classes of methods on the Expressive Anechoic Recordings of Speech (EARS) dataset, a high-quality 48 kHz speech corpus we curated that covers a variety of speaking styles, emotional prosody, and conversational speech. To facilitate this evaluation, we create two speech enhancement benchmarks based on the EARS dataset, addressing background noise and reverberation, respectively. Our improved model, SGMSE+, consistently outperforms all baseline methods on these benchmarks in both objective measures and subjective listening evaluations.

By addressing critical research questions within unsupervised, supervised, and audio-visual frameworks, this work demonstrates the use of generative models as a powerful paradigm for speech enhancement, with significant implications for reliable communication and audio restoration.

Publication Year: 2025
Author: Julius Richter
Supervisor: Timo Gerkmann
Institution: University of Hamburg
Keywords: Speech Processing, Speech Enhancement, Diffusion Models, Generative Models, Signal Processing, Machine Learning, Audiovisual