Enhancement of Speech Signals – with a Focus on Voiced Speech Models
The topic of this thesis is speech enhancement with a focus on models of voiced speech. Speech is divided into two subcategories depending on the characteristics of the signal: voiced speech and unvoiced speech. In this thesis, we primarily focus on the voiced parts and utilise the structure of the signal for speech enhancement. The basis for our models is the harmonic model, which is widely used for voiced speech because it describes periodic signals perfectly.

First, we consider the problem of non-stationarity in the speech signal. The speech signal changes its characteristics continuously over time, whereas most speech analysis and enhancement methods assume stationarity within segments of 20-30 ms. We propose to modify the model so that the fundamental frequency can vary linearly over time by introducing a chirp rate into the model. Filters are derived based on this model, and it is shown that they perform better than filters based on the traditional harmonic model. The filter design requires estimates of the fundamental frequency and chirp rate; therefore, an iterative nonlinear least squares method for estimating these parameters jointly is suggested. The estimator reaches the Cramér-Rao bound, and the iterative approach makes the method faster than searching the original two-dimensional space for the optimal combination of fundamental frequency and chirp rate.

To counteract the effect of non-stationarity further, we suggest that the segment length should not be fixed but should depend on the signal at the given moment. Thereby, short segments can be used when the signal characteristics vary quickly, and long segments when the characteristics are more stationary. We propose to choose the segment length according to the maximum a posteriori criterion and show that segmentation based on the chirp model yields longer segments than segmentation based on the harmonic model.
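To make the harmonic chirp model and the joint nonlinear least squares (NLS) estimation concrete, the following is a minimal numpy sketch. The function names, the toy parameter grid, and the noiseless test signal are illustrative assumptions, not taken from the thesis; for brevity it scans the full two-dimensional grid rather than the faster iterative refinement the thesis proposes.

```python
import numpy as np

def chirp_basis(n, omega0, alpha, L):
    """Complex basis for a harmonic chirp model with L harmonics.

    Column l is exp(j*l*(omega0*n + 0.5*alpha*n**2)), so the instantaneous
    frequency of each harmonic varies linearly with time (chirp rate alpha).
    """
    phase = omega0 * n + 0.5 * alpha * n**2
    return np.exp(1j * np.outer(phase, np.arange(1, L + 1)))

def nls_cost(x, omega0, alpha, L):
    """NLS cost: energy of x projected onto the model subspace.

    The linear amplitudes are solved in closed form by least squares,
    leaving a nonlinear maximisation over (omega0, alpha) only.
    """
    Z = chirp_basis(np.arange(len(x)), omega0, alpha, L)
    amps = np.linalg.lstsq(Z, x, rcond=None)[0]   # closed-form amplitudes
    xhat = Z @ amps                               # projection onto span(Z)
    return float(np.real(np.vdot(xhat, xhat)))

# synthetic voiced segment with a linearly varying fundamental (noiseless)
N, L = 200, 3
true_w0, true_alpha = 0.2, 1e-4
x = chirp_basis(np.arange(N), true_w0, true_alpha, L) @ np.array([1.0, 0.5, 0.25])

# coarse 2-D grid over (omega0, alpha); the grid contains the true values,
# so the maximiser recovers them on this noiseless example
w_grid = np.linspace(0.15, 0.25, 41)
a_grid = np.linspace(-2e-4, 2e-4, 21)
best = max((nls_cost(x, w, a, L), w, a) for w in w_grid for a in a_grid)
```

Setting `alpha = 0` reduces the basis to the traditional harmonic model, which is why the chirp model can only fit a voiced segment at least as well.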
This suggests that the chirp model fits the voiced speech signal better.

Other deviations from the perfect harmonic model can also occur. As is well known from stiff-stringed musical instruments, the frequencies of the harmonics in speech may deviate from the perfect harmonic relationship. We propose to take these deviations into account by extending the harmonic model to the inharmonic model, in which small perturbations of each harmonic frequency can occur. Three different methods for estimating the inharmonicities are compared, and it is shown that including the estimates in the filter design leads to better performance than a filter based on the traditional harmonic model.

We also propose a subspace perspective on speech enhancement based on a joint diagonalisation of the desired signal and noise. The eigenvectors obtained from this operation are used to construct a filter that estimates the noise, and the desired signal is estimated by subtracting the noise estimate from the observed signal. The filter is very flexible in that it can trade off noise reduction against signal distortion depending on how many eigenvectors are used in the filter design. In voiced speech periods, the number of eigenvectors used in the filter can also be chosen based on the harmonic model, since the number of harmonics in the speech signal is closely related to the best choice of the number of eigenvectors.

The papers in this thesis show that it can be beneficial to extend the traditional harmonic model to include the non-stationarity and inharmonicity of speech. The derived filters perform better than filters based on the harmonic model in terms of signal-to-noise ratio and signal distortion. The voiced speech models can also be used to form a noise covariance matrix estimate, which can be used in other algorithms such as the proposed joint diagonalisation based method.
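As a rough illustration of the joint diagonalisation idea above, the following numpy/scipy sketch jointly diagonalises a toy desired-signal covariance and a white-noise covariance via the generalized eigenproblem, then builds a noise-estimating filter from the noise-subspace eigenvectors. The covariance construction, the rank Q, and the variable names are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np
from scipy.linalg import eigh

M, Q = 8, 4                           # filter length and assumed signal rank

# toy covariances (stand-ins for estimates from voiced speech): a rank-Q
# desired-signal covariance built from a few sinusoids, plus white noise
n = np.arange(M)
A = np.stack([np.cos(0.7 * (k + 1) * n) for k in range(Q)], axis=1)
Rx = A @ A.T / M                      # desired-signal covariance, rank Q
Rv = 0.1 * np.eye(M)                  # noise covariance

# joint diagonalisation via the generalized eigenproblem Rx b = lam Rv b;
# scipy normalises so that B.T @ Rv @ B = I and B.T @ Rx @ B = diag(lam)
lam, B = eigh(Rx, Rv)
order = np.argsort(lam)[::-1]         # sort eigenvalues descending
lam, B = lam[order], B[:, order]

# the M - Q trailing eigenvectors span the noise-dominated subspace; since
# B.T @ Rv @ B = I implies inv(B.T) = Rv @ B, the noise-estimating filter is
E_noise = B[:, Q:]
H_noise = Rv @ E_noise @ E_noise.T    # estimates the noise from the observation
H_signal = np.eye(M) - H_noise        # desired signal = observation - noise estimate
```

Varying how many eigenvectors go into `E_noise` is what trades noise reduction against signal distortion; in voiced segments, the harmonic model suggests tying this number to the number of harmonics.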
