Statistical Parametric Speech Synthesis Based on the Degree of Articulation
Nowadays, speech synthesis is part of various everyday applications. The ultimate goal of such technologies is to extend the possibilities of interaction with the machine, in order to get closer to human-like communication. However, current state-of-the-art systems often lack realism: although high-quality speech synthesis can be produced by many researchers and companies around the world, synthetic voices are generally perceived as hyperarticulated. In any case, their degree of articulation is fixed once and for all. The present thesis falls within the more general quest for enriching expressivity in speech synthesis. The main idea is to improve statistical parametric speech synthesis, whose most famous example is Hidden Markov Model (HMM) based speech synthesis, by introducing a control of the degree of articulation, so as to enable synthesizers to automatically adapt their way of speaking to the contextual situation, as humans do. The degree of articulation, which is probably the least studied prosodic parameter, is characterized by modifications of phonetic context, of speech rate and of spectral dynamics (vocal tract rate of change). It depends upon the surrounding environment and the communication context, and provides information on the relationship between the speaker and the listener(s). According to Lindblom’s “H and H” theory, speakers are expected to vary their output along a continuum of hypo- and hyperarticulated speech. Compared to the neutral case, hyperarticulated speech tends to maximize the clarity of the speech signal by increasing the articulatory effort needed to produce it, while hypoarticulated speech is produced with minimal articulatory effort. The work presented in this PhD thesis provides a detailed study of the analysis and synthesis of hypo- and hyperarticulated speech in the framework of HMM-based speech synthesis.
This framework is very convenient for creating a synthesizer whose speaker characteristics and speaking styles can be easily modified. In order to achieve this goal, a new French database consisting of three distinct and parallel sets (one for each articulation degree to be studied, i.e. neutral, hypoarticulated and hyperarticulated speech) was recorded. This database allows: i) the study of both acoustic and phonetic modifications due to changes in articulatory effort; ii) the design of a high-quality speech synthesizer integrating a continuous control of the articulation degree. This first requires addressing the issue of speaking style adaptation, in order to derive hypo- and hyperarticulated speech from the neutral synthesizer. Once this is done, interpolation and extrapolation of the resulting models make it possible to finely tune the voice so that it is generated with the desired articulatory effort. Secondly, we perform a perceptual study of speech with a variable articulation degree, specifically focusing on: i) the internal mechanisms leading to the perception of the degree of articulation by listeners (i.e. cepstrum, prosody and phonetic transcription adaptation, as well as the complete adaptation); ii) how intelligibility and various other voice dimensions are affected. Based on the ensuing conclusions, we finally implement an automatic modification of the degree of articulation in an existing standard neutral voice for which no hypo- or hyperarticulated recordings are available.
