Probabilistic Model-Based Multiple Pitch Tracking of Speech

Multiple pitch tracking of speech is an important task for the segregation of multiple speakers in a single-channel recording. In this thesis, a probabilistic model-based approach for estimation and tracking of multiple pitch trajectories is proposed. A probabilistic model that captures pitch-dependent characteristics of the single-speaker short-time spectrum is obtained a priori from clean speech data. The resulting speaker model, which is based on Gaussian mixture models, can be trained in either a speaker-independent (SI) or a speaker-dependent (SD) fashion. Speaker models are then combined using an interaction model to obtain a probabilistic description of the observed speech mixture. A factorial hidden Markov model is applied for tracking the pitch trajectories of multiple speakers over time. The probabilistic model-based approach can explicitly incorporate timbral information and all associated uncertainties of spectral structure into the model. While SI models allow ad hoc use in situations where the speakers in a recording are unknown, SD models have the great advantage that pitch trajectories can be assigned to their corresponding speakers. The accuracy of the proposed method is evaluated on two speech databases and compared to a state-of-the-art algorithm for multi-pitch tracking of speech. Two problems related to the proposed approach are addressed. (i) Exact inference is computationally demanding, mainly because the solution is obtained by considering all possible pitch combinations across speakers. A novel method for approximate inference based on likelihood pruning is proposed. The method is based on computationally efficient upper and lower bounds on the likelihood of pitch combinations. The approximate method is experimentally evaluated in terms of accuracy and time requirements, and results for tracking the pitch of three simultaneously talking speakers are demonstrated.
(ii) Any mismatch between training and testing conditions (such as different acoustic channel conditions or gain mismatches) deteriorates the accuracy of multi-pitch tracking. It is desirable to adapt speaker models to novel environmental conditions during multi-pitch tracking, i.e., in situations where only a mixture of speakers is available. We propose a modification of the maximum likelihood linear regression (MLLR) technique where the adaptation of model parameters is constrained to modifications of the spectral envelope. This constraint is beneficial in cases where little adaptation data is available. Based on this, we propose a novel expectation-maximization (EM) algorithm for adaptation of speaker models from speech mixtures, and demonstrate tracking results obtained for a distant-talking scenario of two speakers that includes room reverberation.
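
To illustrate why exact inference in problem (i) grows expensive, the following is a minimal sketch of Viterbi decoding over the joint pitch-state space of a two-speaker factorial hidden Markov model. It is not the implementation used in the thesis: the function name `joint_viterbi`, the argument names, and the assumption that both speakers share one transition matrix are illustrative only, and the observation log-likelihoods are passed in as a precomputed array instead of being derived from the GMM speaker models and the interaction model.

```python
import numpy as np

def joint_viterbi(log_obs, log_trans, log_prior):
    """Exact Viterbi decoding for two independent chains with a joint observation.

    log_obs   : (T, K, K) array, log p(frame t | pitch state i of speaker 1,
                pitch state j of speaker 2). Hypothetical input; in practice it
                would come from the speaker and interaction models.
    log_trans : (K, K) array, log_trans[a, b] = log p(next state b | state a),
                assumed here to be shared by both speakers.
    log_prior : (K,) array of initial log-probabilities, also shared.
    Returns two lists of length T with the decoded state indices per speaker.
    """
    T, K, _ = log_obs.shape
    # Initial joint score for every pitch pair (i, j).
    delta = log_prior[:, None] + log_prior[None, :] + log_obs[0]
    back = np.zeros((T, K, K, 2), dtype=int)

    for t in range(1, T):
        # score[i_prev, j_prev, i, j]: every previous pair to every current pair.
        # This K**4 table is the source of the high computational cost.
        score = (delta[:, :, None, None]
                 + log_trans[:, None, :, None]
                 + log_trans[None, :, None, :])
        flat = score.reshape(K * K, K, K)
        best = flat.argmax(axis=0)
        delta = flat.max(axis=0) + log_obs[t]
        back[t, :, :, 0], back[t, :, :, 1] = np.unravel_index(best, (K, K))

    # Backtrack from the best final pitch pair.
    i, j = np.unravel_index(delta.argmax(), (K, K))
    path = [(int(i), int(j))]
    for t in range(T - 1, 0, -1):
        i, j = back[t, i, j]
        path.append((int(i), int(j)))
    path.reverse()
    return [p[0] for p in path], [p[1] for p in path]
```

With K pitch states per speaker, each frame requires on the order of K^4 operations for two speakers, and the exponent grows with every additional speaker. This is the kind of cost that a pruning scheme based on likelihood bounds aims to reduce by discarding unlikely pitch combinations before they are scored exactly.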

File Type: pdf
File Size: 2 MB
Publication Year: 2012
Author: Wohlmayr, Michael
Supervisors: Franz Pernkopf, Gernot Kubin
Institution: Graz University of Technology
Keywords: Speech analysis, multipitch tracking, factorial hidden Markov model