Single-Microphone Multi-Frame Speech Enhancement Exploiting Speech Interframe Correlation

Speech communication devices such as hearing aids or mobile phones are often used in acoustically challenging situations, where the desired speech signal is affected by undesired background noise. Since in these situations speech quality and speech intelligibility may be degraded, speech enhancement algorithms are required to suppress the undesired background noise while preserving the desired speech signal. In this thesis, we focus on single-microphone speech enhancement algorithms in the short-time Fourier transform domain, in particular on multi-frame algorithms that aim at exploiting speech correlation across time-frames. In principle, exploiting the speech interframe correlation makes it possible to suppress the undesired background noise while keeping speech distortion low. Existing single-microphone multi-frame speech enhancement algorithms, such as the multi-frame minimum variance distortionless response (MFMVDR) filter and the multi-frame minimum power distortionless response (MFMPDR) filter, depend on the normalized speech correlation vector, which is highly time-varying and hence difficult to estimate accurately. The main objective of this thesis is to develop and evaluate novel robust methods to estimate the normalized speech correlation vector from the noisy microphone signal, either based on robust beamforming approaches or exploiting a low-rank speech model. First, in order to better understand the performance of the MFMVDR and MFMPDR filters, we investigate the sensitivity of both filters to estimation errors in the normalized speech correlation vector. We compare the practically feasible MFMPDR filter with two oracle versions of the MFMVDR filter for different oracle and blind estimates of the normalized speech correlation vector.
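As a minimal numerical sketch of the filter structure discussed above (not the thesis implementation): both filters share the distortionless form w = R^{-1} γ / (γ^H R^{-1} γ), where γ is the normalized speech correlation vector; the MFMVDR filter uses the (oracle) undesired-signal correlation matrix, while the practically feasible MFMPDR filter uses the noisy-speech correlation matrix. Variable names are illustrative.

```python
import numpy as np

def mfmvdr_weights(R_u, gamma):
    """MFMVDR filter weights from the undesired-signal correlation
    matrix R_u and the normalized speech correlation vector gamma:
        w = R_u^{-1} gamma / (gamma^H R_u^{-1} gamma)
    The distortionless constraint w^H gamma = 1 is satisfied by
    construction."""
    Rinv_g = np.linalg.solve(R_u, gamma)
    return Rinv_g / (gamma.conj() @ Rinv_g)

def mfmpdr_weights(R_y, gamma):
    """MFMPDR filter weights: identical structure, but built from the
    noisy-speech correlation matrix R_y, which (unlike R_u) can be
    estimated directly from the microphone signal."""
    Rinv_g = np.linalg.solve(R_y, gamma)
    return Rinv_g / (gamma.conj() @ Rinv_g)
```

With an accurate γ both filters pass the speech component undistorted; the abstract's point is that errors in γ break this distortionless property, which is why its estimation is the central theme of the thesis.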
Simulation results show that accurately estimating the normalized speech correlation vector is crucial, since even small estimation errors degrade the performance of the MFMVDR and MFMPDR filters, resulting in speech distortion and unpleasant artifacts in the background noise. Second, in order to improve the robustness of the practically feasible MFMPDR filter against estimation errors in the normalized speech correlation vector, we investigate the potential of using concepts from robust MPDR beamforming in the context of single-microphone multi-frame speech enhancement. We propose two constrained MFMPDR filters that estimate the normalized speech correlation vector as the vector maximizing the total signal output power spectral density within a spherical uncertainty set. This corresponds to imposing a quadratic inequality constraint on the mismatch vector with respect to the presumed normalized speech correlation vector, e.g., the state-of-the-art maximum-likelihood (ML) estimate. Whereas the singly-constrained (SC) MFMPDR filter only considers the quadratic inequality constraint to estimate the (non-normalized) speech correlation vector, the doubly-constrained (DC) MFMPDR filter integrates a linear normalization constraint into the optimization problem to directly estimate the normalized speech correlation vector. The main novelty is to set the upper bound of the spherical uncertainty set using a trained non-linear mapping function that depends on the time-varying a-priori SNR estimate for each time-frequency point. Simulation results show that the proposed constrained approaches yield a more accurate estimate of the normalized speech correlation vector than the ML estimate.
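The singly-constrained estimation step can be sketched along the lines of classical robust Capon beamforming, under the assumption (standard for this problem class) that maximizing the output power within a spherical uncertainty set reduces to a Lagrange-multiplier search over the eigendecomposition of the noisy correlation matrix. This is only an illustrative sketch: the thesis sets the uncertainty bound eps via a trained SNR-dependent mapping, which is not reproduced here, and the DC variant additionally folds the normalization constraint into the optimization.

```python
import numpy as np

def sc_correlation_vector(R_y, gamma_bar, eps):
    """Robust (Capon-style) estimate of the speech correlation vector:
    find the vector that maximizes the MPDR output power within the
    spherical uncertainty set ||g - gamma_bar||^2 <= eps around the
    presumed vector gamma_bar (e.g., an ML estimate).

    Solved via the EVD of R_y and a bisection search on the Lagrange
    multiplier mu; the mismatch is (I + mu R_y)^{-1} gamma_bar."""
    lam, U = np.linalg.eigh(R_y)
    b = U.conj().T @ gamma_bar

    def mismatch_norm2(mu):
        # squared norm of the mismatch vector for multiplier mu
        return np.sum(np.abs(b) ** 2 / (1.0 + mu * lam) ** 2)

    # mismatch_norm2 is decreasing in mu; bracket the root, then bisect
    lo, hi = 0.0, 1.0
    while mismatch_norm2(hi) > eps:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mismatch_norm2(mid) > eps:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    # g_hat = gamma_bar - (I + mu R_y)^{-1} gamma_bar
    return gamma_bar - U @ (b / (1.0 + mu * lam))
```

In the SC variant the resulting (non-normalized) vector would subsequently be renormalized so that its first element equals one; the DC variant avoids this two-step procedure by imposing the linear normalization constraint directly.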
An instrumental and a perceptual evaluation show that both constrained MFMPDR filters lead to a more natural speech quality and less noise distortion, but a more conservative noise reduction performance than the state-of-the-art ML-MFMPDR filter, where the DC-MFMPDR filter is preferred in terms of overall quality compared to the SC-MFMPDR filter and the ML-MFMPDR filter. Third, assuming that speech signals can be modeled using a low-rank model, we propose two matrix-based methods to estimate the normalized speech correlation vector, namely the matrix-subtraction (MS) method and the subspace-decomposition (SD) method. Both methods are based on the eigenvalue decomposition of a matrix, which is either constructed by subtracting the estimated normalized noise correlation matrix from the estimated normalized noisy speech correlation matrix or by prewhitening the estimated normalized noisy speech correlation matrix with the estimated normalized noise correlation matrix. We propose to estimate the speech model order for each time-frequency point by incorporating the a-priori SNR into the minimum description length selection criterion. Simulation results show that the proposed matrix-based SD method yields a more accurate estimate of the normalized speech correlation vector than the vector-based ML estimate. Instrumental performance measures indicate that the MFMPDR filter using the proposed SD estimator leads to a better speech quality and more noise reduction than the ML-MFMPDR filter, while keeping speech distortion low. Finally, the results of a subjective listening test confirm that the overall quality for the MFMPDR filters using the proposed SD estimator and the proposed DC estimator is significantly better than for the state-of-the-art ML-MFMPDR filter.
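The two matrix-based estimators can be sketched as follows, assuming for simplicity a rank-1 speech model (the SNR-dependent MDL order selection described in the abstract is omitted) and normalization of the estimate so that its first element equals one. Matrix and function names are illustrative, not the thesis notation.

```python
import numpy as np
from scipy.linalg import eigh

def ms_estimate(G_y, G_n):
    """Matrix-subtraction (MS) estimate: EVD of the difference of the
    normalized noisy-speech and noise correlation matrices. Under a
    rank-1 speech model the principal eigenvector of the difference
    spans the speech subspace."""
    _, V = np.linalg.eigh(G_y - G_n)
    v = V[:, -1]                 # principal eigenvector (largest eigenvalue)
    return v / v[0]              # normalize: first element = 1

def sd_estimate(G_y, G_n):
    """Subspace-decomposition (SD) estimate: generalized EVD, i.e. the
    EVD of the noisy matrix prewhitened by the noise matrix. The
    principal generalized eigenvector is de-whitened with G_n and
    renormalized."""
    _, V = eigh(G_y, G_n)        # solves G_y v = w G_n v, w ascending
    g = G_n @ V[:, -1]           # de-whitening
    return g / g[0]
```

For exact rank-1 speech and known correlation matrices both sketches recover the same normalized speech correlation vector; in practice they behave differently under estimation errors, which is where the SD method's advantage reported in the abstract comes in.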

File Type: pdf
File Size: 4 MB
Publication Year: 2020
Author: Dörte Fischer
Supervisor: Simon Doclo
Institution: University of Oldenburg, Germany
Keywords: single-microphone speech enhancement, multi-frame filters, interframe correlation, spherical uncertainty set, low-rank model, subspace decomposition