High-Quality Vocoding Design with Signal Processing for Speech Synthesis and Voice Conversion

This Ph.D. thesis focuses on developing a system for high-quality speech synthesis and voice conversion. Vocoder-based speech analysis, manipulation, and synthesis play a crucial role in many kinds of statistical parametric speech research. Although some vocoding methods yield close-to-natural synthesized speech, they are typically computationally expensive and thus unsuitable for real-time implementation, especially in embedded environments. There is therefore a need for simple and computationally feasible digital signal processing algorithms that generate high-quality, natural-sounding synthesized speech. In this dissertation, I propose a solution that extracts optimal acoustic features, together with a new waveform generator, to achieve higher sound quality and conversion accuracy by applying advances in deep learning, while remaining computationally efficient. This challenge resulted in five thesis groups, briefly summarized below. First, I introduce a new method to shape the high-frequency component of the unvoiced excitation by estimating the temporal envelope of the residual signal; I showed experimentally that this approach helps achieve accurate approximations of natural speech. Second, I propose a new type of noise masking that reduces the perceptual effect of residual noise and allows a proper reconstruction of the noise characteristics. The results suggest that the continuous masking approach gives better speech quality than the traditional binary techniques in the literature. Next, I address the estimation of the fundamental frequency (F0, also known as pitch tracking) on clean and noisy speech signals, which is a key step in speech processing applications. I describe novel approaches that can be used to enhance and optimize existing F0 estimation algorithms.
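The abstract mentions Kalman filtering as one of the adaptive techniques for obtaining a robust, continuous F0 trajectory. As a purely illustrative sketch of the general idea (not the thesis's actual algorithm), a scalar Kalman filter with a random-walk state model can smooth a noisy F0 contour; the process and measurement variances `q` and `r` are assumed tuning parameters, not values from the thesis:

```python
import numpy as np

def kalman_smooth_f0(f0_noisy, q=1.0, r=25.0):
    """Smooth a continuous F0 track (Hz) with a scalar Kalman filter.

    Random-walk state model: f0[t] = f0[t-1] + w,  w ~ N(0, q)
    Observation model:       z[t]  = f0[t]   + v,  v ~ N(0, r)
    """
    x = float(f0_noisy[0])  # state estimate
    p = 1.0                 # state variance
    out = np.empty(len(f0_noisy), dtype=float)
    for t, z in enumerate(f0_noisy):
        p = p + q                # predict: variance grows by process noise
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # update state toward the observation
        p = (1.0 - k) * p        # update variance
        out[t] = x
    return out

# toy example: a slowly varying 120 Hz contour with measurement noise
rng = np.random.default_rng(0)
true_f0 = 120.0 + 10.0 * np.sin(np.linspace(0, 2 * np.pi, 200))
noisy_f0 = true_f0 + rng.normal(0.0, 5.0, size=true_f0.shape)
smooth_f0 = kalman_smooth_f0(noisy_f0)
```

Larger `r` (less trust in the measurement) gives a smoother but laggier trajectory; a full implementation would also handle octave errors and voicing transitions.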
Three adaptive techniques, based on Kalman filtering, time warping, and instantaneous frequency, have been developed to achieve robust and accurate continuous F0 estimation. These approaches yield higher accuracy and smoother continuous F0 trajectories on both noisy and clean speech. In addition, I propose and experimentally validate a new excitation harmonic-to-noise ratio (HNR) parameter: added to the voiced and unvoiced components, it indicates the degree of voicing in the excitation and reduces the buzziness caused by the vocoder. I then build and implement deep-learning-based acoustic modeling using deep feed-forward and sequence-to-sequence recurrent neural networks. Perceptual and acoustic experiments have shown that the developed vocoder can be applied within the proposed learning framework and demonstrated its superiority over hidden-Markov-model-based text-to-speech (HMM-TTS). Afterwards, I propose a new continuous sinusoidal model (CSM) that is applicable in statistical frameworks: it provides a vocoder with a fixed, low number of parameters and generates high-quality synthetic speech compared to state-of-the-art speech models. I also combine CSM with deep learning based on bidirectional long short-term memory (LSTM) networks to provide more natural and intelligible TTS. Finally, I apply the two vocoders using continuous parameters (the source-filter and sinusoidal models) within a voice conversion framework, and I show experimentally that the proposed models give state-of-the-art similarity results. Overall, this Ph.D. dissertation has established competitive alternative vocoders for speech analysis and synthesis systems. The proposed models and methods demonstrate that they are compelling choices for statistical parametric speech synthesis and voice conversion.
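The abstract introduces an excitation HNR parameter as an indicator of the degree of voicing. As a hedged illustration of what a harmonic-to-noise ratio measures (the classic autocorrelation-based estimate in the style of Boersma, not the thesis's specific excitation parameter), a per-frame HNR in dB can be sketched as follows; the function name and default pitch range are assumptions for this example:

```python
import numpy as np

def frame_hnr(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate the harmonic-to-noise ratio (dB) of one analysis frame.

    The normalized autocorrelation peak r within the candidate pitch-lag
    range approximates the fraction of periodic (harmonic) energy, so
    HNR = 10 * log10(r / (1 - r)).
    """
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                        # normalize by energy at lag 0
    lo = int(fs / f0_max)                  # shortest candidate pitch lag
    hi = min(int(fs / f0_min), len(ac) - 1)
    r = np.clip(np.max(ac[lo:hi]), 1e-6, 1.0 - 1e-6)
    return 10.0 * np.log10(r / (1.0 - r))

# a strongly periodic 150 Hz frame should score a clearly higher HNR
# than a white-noise frame of the same length
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
voiced = np.sin(2 * np.pi * 150.0 * t)
noise = np.random.default_rng(1).normal(size=t.shape)
```

A high HNR indicates a strongly voiced excitation; low or negative values indicate noise-dominated (unvoiced) excitation, which is what makes such a parameter useful for controlling buzziness in a vocoder.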

File Type: pdf
File Size: 7 MB
Publication Year: 2020
Author: Al-Radhi Mohammed Salah
Supervisors: Prof. Géza Németh, Dr. Tamás Gábor Csapó
Institution: Budapest University of Technology and Economics
Keywords: Deep learning, Signal processing, Speech synthesis, AI conversation, Neural vocoder, Voice conversion, Noise masking