Automatic Speaker Characterization: Identification of Gender, Age, Language and Accent from Speech Signals
Speech signals carry important information about a speaker, such as age, gender, language, accent, and emotional/psychological state. Automatic recognition of speaker characteristics has a wide range of commercial, medical, and forensic applications, such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. This research aims to develop accurate methods and tools to identify different physical characteristics of speakers. Due to the lack of required databases, among all speaker characteristics our experiments cover gender recognition, age estimation, language recognition, and accent/dialect identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state. For speaker characterization, we first convert variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms. This is performed by fitting a probability density function to the acoustic features extracted from the speech signals. Since the distribution of acoustic features is complex, Gaussian mixture models (GMMs) are applied to model it. Due to the lack of data, it is not possible to build a separate acoustic model for each short utterance. Therefore, parametric utterance adaptation methods are applied to adapt a universal background model (UBM) to the characteristics of each utterance, and the parameters of the adapted GMM characterize the corresponding utterance. An effective approach adapts the UBM to a speech signal using the maximum a posteriori (MAP) scheme; the Gaussian means of the adapted GMM are then extracted and concatenated to form a Gaussian mean supervector for the given utterance. Finally, a classification or regression algorithm is used to identify the speaker characteristics.
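The MAP mean adaptation and supervector construction described above can be sketched as follows. This is an illustrative sketch, not the thesis's actual system: the toy UBM, the synthetic "MFCC" frames, and the relevance factor of 16 are all assumptions introduced here for demonstration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_supervector(ubm, features, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means to one utterance,
    followed by concatenation into a Gaussian mean supervector."""
    post = ubm.predict_proba(features)           # frame posteriors, shape (T, C)
    n_c = post.sum(axis=0)                       # zeroth-order (soft count) statistics
    f_c = post.T @ features                      # first-order statistics, shape (C, D)
    alpha = (n_c / (n_c + relevance))[:, None]   # data-dependent adaptation coefficients
    # Interpolate between the utterance statistics and the UBM prior means
    adapted = alpha * (f_c / np.maximum(n_c[:, None], 1e-10)) + (1 - alpha) * ubm.means_
    return adapted.reshape(-1)                   # supervector of length C * D

# Toy UBM trained on synthetic "background" feature frames (stand-in for MFCCs)
rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 13))
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(background)

utterance = rng.normal(loc=0.5, size=(300, 13))  # frames of one utterance
sv = map_adapt_supervector(ubm, utterance)
print(sv.shape)  # 8 components x 13 dimensions = (104,)
```

With few frames assigned to a component, alpha stays small and the adapted mean falls back to the UBM prior, which is what makes MAP adaptation usable for short utterances.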
While effective, Gaussian mean supervectors have a high dimensionality, resulting in high computational cost and difficulty in obtaining a robust model when data are limited. In the field of speaker recognition, recent advances using the i-vector framework have increased classification accuracy considerably. This framework, which provides a compact representation of an utterance in the form of a low-dimensional feature vector, applies a simple factor analysis to the GMM means. Motivated by this success, the i-vector framework is applied to the age estimation problem. In this approach, each utterance is modeled by its corresponding i-vector. Then, within-class covariance normalization (WCCN) is used for session variability compensation. Finally, least squares support vector regression (LSSVR) is applied to estimate the age of speakers. The proposed method is trained and tested on telephone conversations from the National Institute of Standards and Technology (NIST) 2010 and 2008 speaker recognition evaluation (SRE) databases. Evaluation results show that the proposed method yields a significantly lower mean absolute estimation error and a higher Pearson correlation coefficient between chronological and estimated speaker age compared to different conventional schemes. Finally, the effect of two major factors influencing the proposed age estimation system, namely utterance length and spoken language, is analyzed. Our experiments on age estimation show that GMM weights carry important information about the speaker. However, state-of-the-art language/speaker recognition systems usually do not use this information. In this research, a non-negative factor analysis (NFA) approach is developed for GMM weight decomposition and adaptation. This modeling suggests a new low-dimensional utterance representation method, which uses a factor analysis similar to that of the i-vector framework.
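The WCCN-plus-regression stage of the age estimation pipeline can be sketched as below, assuming i-vectors have already been extracted. The i-vector dimensionality, the synthetic data, and the session labels are hypothetical, and scikit-learn's standard SVR is used as a stand-in for the LSSVR used in the thesis (LSSVR is not available in scikit-learn).

```python
import numpy as np
from sklearn.svm import SVR

def wccn_projection(vectors, labels, ridge=1e-6):
    """Compute the WCCN projection B with B.T @ W @ B = I, where W is the
    average within-class covariance of the vectors."""
    D = vectors.shape[1]
    W = np.zeros((D, D))
    classes = np.unique(labels)
    for c in classes:
        X = vectors[labels == c]
        W += np.cov(X, rowvar=False, bias=True)
    W /= len(classes)
    # B is the Cholesky factor of the inverse within-class covariance
    return np.linalg.cholesky(np.linalg.inv(W + ridge * np.eye(D)))

rng = np.random.default_rng(1)
ivecs = rng.normal(size=(400, 50))        # hypothetical 50-dim i-vectors
ages = 20 + 40 * rng.random(400)          # hypothetical speaker ages
sessions = rng.integers(0, 40, size=400)  # class labels for WCCN

B = wccn_projection(ivecs, sessions)
projected = ivecs @ B                     # session-compensated i-vectors
reg = SVR(kernel="rbf").fit(projected, ages)
pred = reg.predict(projected)
print(pred.shape)
```

WCCN whitens the within-class scatter, so directions dominated by session variability are down-weighted before the regressor sees the vectors.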
The obtained subspace vectors are then applied in conjunction with i-vectors to the language/dialect recognition problem. The suggested approach is evaluated on the NIST 2011 and RATS language recognition evaluation (LRE) corpora and on the QCRI Arabic dialect recognition evaluation (DRE) corpus. The assessment results show that the proposed adaptation method yields more accurate recognition results than three conventional weight adaptation approaches, namely maximum likelihood re-estimation, non-negative matrix factorization, and a subspace multinomial model. Experimental results also show that intermediate-level fusion of i-vectors and NFA subspace vectors improves the performance of the state-of-the-art i-vector framework. Motivated by the success of the NFA framework in language/dialect recognition, we introduce a hybrid architecture combining the NFA approach and the i-vector framework for the speaker age estimation problem. Evaluation on the NIST 2010 and 2008 SRE corpora shows that the proposed hybrid architecture improves the results of the i-vector framework considerably.
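The intermediate-level fusion above amounts to combining the two per-utterance representations before the back-end classifier. The minimal sketch below assumes precomputed i-vectors and NFA subspace vectors (synthetic here) and uses concatenation with a logistic-regression back end as one plausible instantiation; the dimensions and classifier are illustrative choices, not those of the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
ivectors = rng.normal(size=(300, 40))     # from the GMM-means subspace (i-vector)
nfa_vectors = rng.normal(size=(300, 20))  # from the GMM-weights subspace (NFA)
languages = rng.integers(0, 3, size=300)  # hypothetical language labels

# Intermediate-level fusion: join the two representations per utterance,
# then train a single back-end classifier on the fused vectors
fused = np.hstack([ivectors, nfa_vectors])
clf = LogisticRegression(max_iter=1000).fit(fused, languages)
print(fused.shape)  # (300, 60)
```

Fusing at this level lets the back end exploit complementary information in the means and weights jointly, rather than averaging two separate systems' scores.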
