Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora
When exposed to noise, speakers modify the way they speak in an effort to maintain intelligible communication. This process, which is referred to as the Lombard effect (LE), involves a combination of both conscious and subconscious articulatory adjustments. Speech production variations due to LE can cause considerable degradation in automatic speech recognition (ASR), since they introduce a mismatch between the parameters of the speech to be recognized and the ASR system's acoustic models, which are usually trained on neutral speech. The main objective of this thesis is to analyze the impact of LE on speech production and to propose methods that increase ASR system performance under LE. All presented experiments were conducted on the Czech spoken language, yet the proposed concepts are assumed to be applicable to other languages. The first part of the thesis focuses on the design and acquisition of a speech database comprising utterances produced in neutral conditions (neutral speech) and in simulated noisy conditions (Lombard speech), and on the analysis of the speech production differences between these two speech modalities. A majority of the previous studies on the role of LE in ASR neglected the importance of the communication loop in evoking the Lombard effect, and instead analyzed data from subjects who read text in noise without being given feedback on whether their speech was intelligible. In this thesis, a novel setup introduces a communication factor into the Lombard recordings. An analysis of the recordings shows considerable differences between neutral and Lombard data for a number of speech production parameters. In ASR experiments, the performance of both large and small vocabulary recognizers degrades severely when switching from neutral to LE tasks. The second part of the thesis describes the design of new methods intended to reduce the impact of LE on ASR. The methods employ LE equalization, robust features, and model adjustments.
The goal of LE equalization is to transform Lombard speech tokens towards neutral before they enter the acoustic models of the ASR engine. For this purpose, a modified vocal tract length normalization and a formant-driven frequency warping are designed, both significantly improving the recognition performance under LE. In addition, a commercial voice conversion framework is evaluated and found to be partially effective for LE equalization. A set of robust features is proposed in a data-driven design. Filter banks that better reflect the distribution of linguistic content in frequency are constructed and used as replacements for the mel and Bark filter banks in MFCC (mel frequency cepstral coefficients) and PLP (perceptual linear prediction) front-ends. When employed in a recognition system on LE data, the novel features considerably outperform standard MFCC and PLP front-ends as well as state-of-the-art MR-RASTA (multi-resolution relative spectra) and Expolog front-ends. In the domain of model adjustments, an independently furnished acoustic model adaptation, which transforms neutral models towards Lombard speech characteristics, is shown to provide a substantial performance improvement on LE speech data. Finally, a two-stage recognition system (TSR) utilizing neutral/LE classification and style-specific acoustic modeling is proposed. Compared to multi-stage systems presented in other studies, TSR requires only neutral samples for training the style-specific models. On a mixture of neutral and Lombard utterances, TSR also significantly outperforms discrete style-specific recognizers. These contributions serve to advance both knowledge and algorithm development for speech recognition under the Lombard effect.
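
To give a concrete sense of what frequency warping in a vocal tract length normalization step can look like, the following is a minimal sketch of the generic, textbook piecewise-linear warp. It is not the modified normalization or the formant-driven warping proposed in the thesis, whose details are not given here. The function name `vtln_warp`, the warping factor `alpha`, the 8 kHz band edge `f_max`, and the `knee_ratio` of 0.85 are all illustrative assumptions.

```python
def vtln_warp(freq_hz, alpha, f_max=8000.0, knee_ratio=0.85):
    """Generic piecewise-linear VTLN frequency warp (illustrative only).

    Frequencies below a knee are scaled by `alpha`; above the knee a second
    linear segment is used so that `f_max` always maps onto itself.
    All parameter names and default values are assumptions for this sketch.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= freq_hz <= f_max:
        raise ValueError("freq_hz must lie in [0, f_max]")

    # Place the knee so that alpha * knee never exceeds knee_ratio * f_max.
    knee = knee_ratio * f_max / max(alpha, 1.0)
    if freq_hz <= knee:
        return alpha * freq_hz

    # Second segment joins (knee, alpha * knee) to (f_max, f_max).
    slope = (f_max - alpha * knee) / (f_max - knee)
    return alpha * knee + slope * (freq_hz - knee)


if __name__ == "__main__":
    # Warp a few filter-bank centre frequencies with a hypothetical factor.
    for f in (500.0, 1000.0, 4000.0, 8000.0):
        print(f, round(vtln_warp(f, alpha=1.1), 1))
```

In a front-end of this kind, such a warp would typically be applied to the filter-bank centre frequencies before cepstral coefficients are computed, with the warping factor chosen per speaker or per utterance. How the factor is selected for Lombard speech is specific to the thesis and is not reproduced in this sketch.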
