A Multimodal Approach to Audiovisual Text-to-Speech Synthesis

Speech, consisting of an auditory and a visual signal, has always been the most important means of communication between humans. It is well known that optimal conveyance of a message requires that the receiver can perceive both the auditory and the visual speech signal. Nowadays, people interact with computer systems countless times in everyday situations. Since the ultimate goal is to make this interaction feel completely natural and familiar, the best way to interact with a computer system is by means of speech. As in speech communication between humans, the most appropriate human-machine interaction consists of audiovisual speech signals. To allow a computer system to convey a spoken message to its users, an audiovisual speech synthesizer is needed that generates novel audiovisual speech signals from a given text.

This dissertation focuses on the development of a single-phase audiovisual speech synthesis approach, in which both speech modes are generated simultaneously. The proposed synthesis strategy constructs the desired speech signal by concatenating audiovisual speech segments that contain an original combination of auditory and visual speech information. This maximizes the level of audiovisual coherence between the two synthetic speech modes. High-quality audiovisual speech synthesis is achieved through multiple optimizations to the synthesizer, such as a normalization of the original visual speech data and a smoothing of the synthetic visual speech that does not affect the audiovisual coherence. Through the construction of a new, extensive Dutch audiovisual speech database, the first system capable of high-quality photorealistic audiovisual speech synthesis for Dutch is developed. Various subjective perception experiments confirm that maximizing the level of audiovisual coherence is indeed necessary for an optimal perception of the synthetic audiovisual speech signal.
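The single-phase idea above can be sketched in a few lines of Python. In this toy model (all names and data are hypothetical, not the dissertation's implementation), each database segment keeps its original audio and video together, so every selected unit carries a naturally coherent audiovisual pair; a real system would additionally score candidates with target and join costs over a lattice of alternatives.

```python
from dataclasses import dataclass

@dataclass
class AVSegment:
    """One database unit: a phoneme with its ORIGINAL audio+video pair."""
    phoneme: str
    audio: str   # placeholder for a recorded audio waveform
    video: str   # placeholder for the co-recorded video frames

def synthesize(target_phonemes, database):
    """Greedy single-phase selection (illustrative only): for each target
    phoneme, pick a segment that carries both modalities at once, so the
    output inherits the recording's audiovisual coherence."""
    output = []
    for ph in target_phonemes:
        candidates = [s for s in database if s.phoneme == ph]
        if not candidates:
            raise ValueError(f"no database segment for phoneme {ph!r}")
        output.append(candidates[0])  # a real system would cost-rank these
    return output
```

Because audio and video are never selected independently, the synthetic speech modes cannot drift apart, which is the coherence property the dissertation argues for.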
For visual-only speech synthesis, the speech information can be described by either phoneme or viseme labels. The synthesis quality attainable using phoneme labels is compared with that attained using both standardized and speaker-dependent many-to-one phoneme-to-viseme mappings. In addition, novel context-dependent many-to-many phoneme-to-viseme mapping strategies are investigated and evaluated for synthesis.
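A many-to-one phoneme-to-viseme mapping collapses phonemes that look alike on the lips into a single visual class. The sketch below illustrates the idea only; the viseme classes and labels are hypothetical examples, not the standardized or speaker-dependent mappings evaluated in the dissertation.

```python
# Hypothetical many-to-one phoneme-to-viseme table: several phonemes
# that share a similar lip shape map to one viseme label.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "a": "V_open",
}

def phonemes_to_visemes(phonemes):
    """Collapse a phoneme sequence into viseme labels (many-to-one);
    unknown phonemes fall back to a neutral class."""
    return [PHONEME_TO_VISEME.get(p, "V_neutral") for p in phonemes]
```

A context-dependent many-to-many mapping, by contrast, would let the same phoneme map to different visemes depending on its neighbours, e.g. keyed on (previous phoneme, phoneme, next phoneme) triples instead of single phonemes.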

File Type: pdf
File Size: 17 MB
Publication Year: 2013
Author: Mattheyses, Wesley
Supervisor: Werner Verhelst
Institution: Vrije Universiteit Brussel
Keywords: visual speech synthesis, audiovisual speech synthesis, audiovisual speech perception, phoneme-to-viseme mapping