Prediction and Optimization of Speech Intelligibility in Adverse Conditions

In digital speech-communication systems such as mobile phones, public address systems and hearing aids, conveying the message is one of the most important goals. This can be challenging, since the intelligibility of the speech may be harmed at various stages before, during and after the transmission process from sender to receiver. Causes of such adverse conditions include background noise, an unreliable internet connection during a Skype conversation, or a hearing impairment of the receiver. To overcome this, many speech-communication systems include speech processing algorithms, such as noise reduction, to compensate for these signal degradations. To determine the effect of these signal-processing solutions on speech intelligibility, the speech signal has to be evaluated by means of a listening test with human listeners. However, such tests are costly and time consuming. As an alternative, reliable and fast machine-driven intelligibility predictors are of interest, since they might replace listening tests, at least in some stages of the algorithm development process. Two important issues exist with current intelligibility predictors. (1) Many of these methods cannot reliably predict the effect of more advanced nonlinear signal processing algorithms on speech intelligibility. (2) Typically, these measures are based on very complex auditory models or use average statistics of minutes of running speech, which makes it difficult to design new (real-time) speech processing solutions that are optimal with respect to such a measure. To this end we propose several new measures whose predictions agree well with the intelligibility of nonlinearly processed speech. The newly proposed measures have low computational complexity and are mathematically tractable, which makes them suitable for the optimization of new signal processing solutions that aim to improve speech intelligibility.
An important stage in many speech intelligibility predictors is the use of an auditory model. In the first part of this thesis we show that a general sophisticated auditory model can be greatly simplified, while preserving accurate predictions of psychoacoustic listening experiments. The resulting simplified model allows masking thresholds to be computed from analytic expressions, whereas advanced state-of-the-art models typically need computationally demanding adaptive procedures. Its mathematical properties are successfully exploited by optimally redistributing speech energy such that the speech intelligibility is improved when played back in a noisy environment, without modifying the signal-to-noise ratio.
In the design process of new intelligibility predictors we first analyse the strengths and weaknesses of existing measures. In total, 17 different measures are evaluated for intelligibility prediction of time-frequency weighted noisy speech. We show that, despite high correlation with the listening test scores, several measures cannot predict the difference in intelligibility before and after signal processing. We explain the failure of one state-of-the-art measure to predict intelligibility by its sensitivity to the DFT phase components. Issues with existing measures for intelligibility prediction are highlighted, and a general normalization procedure is proposed as a pre-processing step, which improves their correlation with speech intelligibility. We propose a new short-time objective intelligibility measure (STOI) which shows high correlation with the intelligibility of time-frequency weighted noisy speech, including noise-reduced and vocoded speech. In general, STOI correlates better with speech intelligibility than five other state-of-the-art objective intelligibility models.
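The idea of a short-time, correlation-based measure can be illustrated with a minimal sketch. It is not the STOI definition from the thesis: the function name, the envelope inputs (bands by frames) and the 30-frame segment length are assumptions chosen for illustration, and steps such as the time-frequency decomposition and any normalization of the degraded signal are omitted.

```python
import numpy as np


def short_time_correlation_score(clean_env, degraded_env, seg_len=30, eps=1e-12):
    """Average per-band correlation between clean and degraded envelopes,
    computed over short overlapping segments (hypothetical helper).

    clean_env, degraded_env: arrays of shape (bands, frames).
    seg_len: segment length in frames (assumed value, for illustration).
    """
    clean_env = np.asarray(clean_env, dtype=float)
    degraded_env = np.asarray(degraded_env, dtype=float)
    if clean_env.shape != degraded_env.shape:
        raise ValueError("envelopes must have the same shape")
    n_frames = clean_env.shape[1]
    if n_frames < seg_len:
        raise ValueError("signal is shorter than one segment")

    segment_scores = []
    for start in range(n_frames - seg_len + 1):
        x = clean_env[:, start:start + seg_len]
        y = degraded_env[:, start:start + seg_len]
        # Remove the per-band mean within the segment.
        x = x - x.mean(axis=1, keepdims=True)
        y = y - y.mean(axis=1, keepdims=True)
        # Per-band correlation coefficient; eps guards against flat segments.
        denom = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
        corr = np.sum(x * y, axis=1) / np.maximum(denom, eps)
        segment_scores.append(corr.mean())

    # Average over all segments to obtain a single score.
    return float(np.mean(segment_scores))
```

Because the score is an average of correlation coefficients over short segments, it stays simple enough to analyse, which is the property the thesis emphasizes for optimization purposes.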
One important difference between STOI and other measures is its analysis length, which is on the order of a few hundred milliseconds rather than complete sentences or frames of 20-30 ms. In the final part of this thesis we show that, owing to its simple structure, STOI can be interpreted as a mathematical norm, which is applied in a new channel-selection technique for cochlear-implant simulations. Several intelligibility predictors indicate large intelligibility improvements for the new STOI-based method compared to a peak-picking algorithm.
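For context, the peak-picking baseline can be sketched as follows, assuming the common "N-of-M" form in which the N channels with the largest envelope are kept in each frame. The function name and array layout are hypothetical, and the STOI-based selection method of the thesis is not reproduced here.

```python
import numpy as np


def peak_picking_selection(envelopes, n_select):
    """Return a boolean mask marking, per frame, the n_select channels
    with the largest envelope value (hypothetical helper).

    envelopes: array of shape (channels, frames).
    """
    env = np.asarray(envelopes, dtype=float)
    n_channels, n_frames = env.shape
    if not 1 <= n_select <= n_channels:
        raise ValueError("n_select must be between 1 and the number of channels")

    # After partitioning along the channel axis, the last n_select rows
    # hold the indices of the largest values in each frame.
    kth = n_channels - n_select
    top = np.argpartition(env, kth, axis=0)[kth:, :]

    mask = np.zeros(env.shape, dtype=bool)
    mask[top, np.arange(n_frames)] = True
    return mask
```

Peak picking looks only at the envelope magnitude in each frame, whereas a method derived from an intelligibility measure would choose channels according to that measure.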

File Type: pdf
File Size: 3 MB
Publication Year: 2013
Author: Taal, Cees
Supervisors: R. L. Lagendijk, R. Heusdens, R. C. Hendriks
Institution: Delft University of Technology
Keywords: noise reduction, objective measure, speech enhancement, speech intelligibility prediction, auditory modeling, perceptual model, near-end speech enhancement, intelligibility improvement, transients, cochlear implants, channel selection