Automatic Recognition of Ageing Speakers

The process of ageing causes changes to the voice over time. There have been significant research efforts in the automatic speaker recognition community towards improving performance in the presence of everyday variability. The influence of long-term variability due to vocal ageing has, however, received only marginal attention. In this Thesis, the impact of vocal ageing on speaker verification and forensic speaker recognition is assessed, and novel methods are proposed to counteract its effect. The Trinity College Dublin Speaker Ageing (TCDSA) database, compiled for this study, is first introduced. Containing 26 speakers, with recordings spanning an age difference of between 28 and 58 years per speaker, it is the largest longitudinal speech database in the public domain. A Gaussian Mixture Model-Universal Background Model (GMM-UBM) speaker verification experiment demonstrates a progressive decline in the scores of genuine speakers as the age difference between ...

Kelly, Finnian — Trinity College Dublin

Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora

When exposed to noise, speakers will modify the way they speak in an effort to maintain intelligible communication. This process, which is referred to as the Lombard effect (LE), involves a combination of both conscious and subconscious articulatory adjustment. Speech production variations due to LE can cause considerable degradation in automatic speech recognition (ASR) since they introduce a mismatch between parameters of the speech to be recognized and the ASR system’s acoustic models, which are usually trained on neutral speech. The main objective of this thesis is to analyze the impact of LE on speech production and to propose methods that increase ASR system performance in LE. All presented experiments were conducted on the Czech spoken language, yet the proposed concepts are assumed to be applicable to other languages. The first part of the thesis focuses on the design and acquisition of a ...

Boril, Hynek — Czech Technical University in Prague

Modelling context in automatic speech recognition

Speech is at the core of human communication. Speaking and listening come so naturally to us that we do not have to think about them at all. The underlying cognitive processes are very rapid and almost completely subconscious. It is hard, if not impossible, not to understand speech. For computers, on the other hand, recognising speech is a daunting task. A recogniser has to deal with a large number of different voices (influenced, among other things, by emotion, moods and fatigue), the acoustic properties of different environments, dialects, a huge vocabulary and the unlimited creativity of speakers in combining words and breaking the rules of grammar. Almost all existing automatic speech recognisers use statistics over speech sounds (what is the probability that a piece of audio is an a-sound?) and statistics over word combinations to deal with this complexity. The ...

Wiggers, Pascal — Delft University of Technology

Discrete-time speech processing with application to emotion recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of audio recordings derived from a call center. The thesis comprises two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems that are developed are a prerequisite for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way the volume of the audio recordings that need to be processed is significantly reduced, while the system is automated. To detect the presence of a dialogue, several systems are developed. This is the first effort found in the international literature in which the audio channel is exclusively exploited. Also, it is the first time that the speaker utterance ...

Kotti, Margarita — Aristotle University of Thessaloniki

Probabilistic Model-Based Multiple Pitch Tracking of Speech

Multiple pitch tracking of speech is an important task for the segregation of multiple speakers in a single-channel recording. In this thesis, a probabilistic model-based approach for estimation and tracking of multiple pitch trajectories is proposed. A probabilistic model that captures pitch-dependent characteristics of the single-speaker short-time spectrum is obtained a priori from clean speech data. The resulting speaker model, which is based on Gaussian mixture models, can be trained either in a speaker independent (SI) or a speaker dependent (SD) fashion. Speaker models are then combined using an interaction model to obtain a probabilistic description of the observed speech mixture. A factorial hidden Markov model is applied for tracking the pitch trajectories of multiple speakers over time. The probabilistic model-based approach is capable of explicitly incorporating timbral information and all associated uncertainties of spectral structure into the model. While ...

Wohlmayr, Michael — Graz University of Technology

Video person recognition strategies using head motion and facial appearance

In this doctoral dissertation, we principally explore the use of the temporal information available in video sequences for person and gender recognition; in particular, we focus on the analysis of head and facial motion, and their potential application as biometric identifiers. We also investigate how to exploit as much video information as possible for the automatic recognition; more precisely, we examine the possibility of integrating the head and mouth motion information with facial appearance into a multimodal biometric system, and we study the extraction of novel spatio-temporal facial features for recognition. We initially present a person recognition system that exploits the unconstrained head motion information, extracted by tracking a few facial landmarks in the image plane. In particular, we detail how each video sequence is firstly pre-processed by semiautomatically detecting the face, and then automatically tracking the facial landmarks over ...

Matta, Federico — Eurécom / Multimedia communications

Fusing prosodic and acoustic information for speaker recognition

Automatic speaker recognition is the use of a machine to identify an individual from a spoken sentence. Recently, this technology has seen increasing use in applications such as access control, transaction authentication, law enforcement, forensics, and system customisation, among others. One of the central questions addressed by this field is what it is in the speech signal that conveys speaker identity. Traditionally, automatic speaker recognition systems have relied mostly on short-term features related to the spectrum of the voice. However, human speaker recognition relies on other sources of information; therefore, there is reason to believe that these sources can also play an important role in the automatic speaker recognition task, adding complementary knowledge to the traditional spectrum-based recognition systems and thus improving their accuracy. The main objective of this thesis is to add prosodic information to a traditional ...

Farrus, Mireia — Universitat Politecnica de Catalunya

Confidence Measures for Speech/Speaker Recognition and Applications on Turkish LVCSR

Confidence measures for the results of speech/speaker recognition make the systems more useful in real-time applications. Confidence measures provide a test statistic for accepting or rejecting the recognition hypothesis of the speech/speaker recognition system. Speech/speaker recognition systems are usually based on statistical modeling techniques. In this thesis we defined confidence measures for statistical modeling techniques used in speech/speaker recognition systems. For speech recognition we tested available confidence measures and the newly defined acoustic prior information based confidence measure in two different conditions which cause errors: out-of-vocabulary words and the presence of additive noise. We showed that the newly defined confidence measure performs better in both tests. A review of speech recognition and speaker recognition techniques and some related statistical methods is given throughout the thesis. We also defined ...

Mengusoglu, Erhan — Universite de Mons
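
The accept/reject role of a confidence measure described in the abstract above can be illustrated with a minimal sketch. This is not the thesis's method: it assumes scalar features, a single-Gaussian model for the recognition hypothesis and a single-Gaussian alternative model, and it uses the average per-frame log-likelihood ratio as the test statistic. All function names, model parameters and the threshold are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def confidence(frames, hyp_model, alt_model):
    """Average per-frame log-likelihood ratio (hypothesis vs. alternative).

    hyp_model and alt_model are (mean, variance) tuples; both are
    illustrative stand-ins for real acoustic models.
    """
    total = 0.0
    for x in frames:
        total += log_gaussian(x, *hyp_model) - log_gaussian(x, *alt_model)
    return total / len(frames)


def accept(frames, hyp_model, alt_model, threshold=0.0):
    """Accept the recognition hypothesis if the statistic reaches the threshold."""
    return confidence(frames, hyp_model, alt_model) >= threshold


if __name__ == "__main__":
    hyp = (0.0, 1.0)   # assumed model of the recognised unit
    alt = (3.0, 1.0)   # assumed background / alternative model
    print(accept([0.1, -0.2, 0.3], hyp, alt))  # frames close to the hypothesis
    print(accept([2.9, 3.1], hyp, alt))        # frames close to the alternative
```

In practice the threshold would be tuned on held-out data to trade false acceptances against false rejections.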

Signal processing algorithms for wireless acoustic sensor networks

Recent academic developments have initiated a paradigm shift in the way spatial sensor data can be acquired. Traditional localized and regularly arranged sensor arrays are replaced by sensor nodes that are randomly distributed over the entire spatial field, and which communicate with each other or with a master node through wireless communication links. Together, these nodes form a so-called ‘wireless sensor network’ (WSN). Each node of a WSN has a local sensor array and a signal processing unit to perform computations on the acquired data. The advantage of WSNs compared to traditional (wired) sensor arrays is that many more sensors can be used that physically cover the full spatial field, which typically yields more variety (and thus more information) in the signals. It is likely that future data acquisition, control and physical monitoring will heavily rely on this type of ...

Bertrand, Alexander — Katholieke Universiteit Leuven

A multimicrophone approach to speech processing in a smart-room environment

Recent advances in computer technology and in speech and language processing have made new forms of person-machine communication and computer assistance to human activities appear feasible. In particular, interest in the development of new challenging applications in indoor environments equipped with multiple multimodal sensors, also known as smart-rooms, has grown considerably. In general, it is well known that the quality of speech signals captured by microphones that can be located several meters away from the speakers is severely degraded by acoustic noise and room reverberation. In the context of the development of hands-free speech applications in smart-room environments, the use of obtrusive sensors like close-talking microphones is usually not allowed, and consequently, speech technologies must operate on the basis of distant-talking recordings. In such conditions, speech technologies that usually perform reasonably well in free of noise and ...

Abad, Alberto — Universitat Politecnica de Catalunya

Forensic Evaluation of the Evidence Using Automatic Speaker Recognition Systems

This Thesis is focused on the use of automatic speaker recognition systems for forensic identification, in what is called forensic automatic speaker recognition. More generally, forensic identification aims at individualization, defined as the certainty of distinguishing an object or person from any other in a given population. This objective is followed by the analysis of the forensic evidence, understood as the comparison between two samples of material, such as glass, blood, speech, etc. An automatic speaker recognition system can be used in order to perform such comparison between some recovered speech material of questioned origin (e.g., an incriminating wire-tapping) and some control speech material coming from a suspect (e.g., recordings acquired in police facilities). However, the evaluation of such evidence is not a trivial issue at all. In fact, the debate about the presentation of forensic evidence in a court ...

Ramos, Daniel — Universidad Autonoma de Madrid

Support Vector Machine Based Approach for Speaker Characterization

This doctoral thesis focuses on the development of algorithms for speaker characterisation by voice. Specifically, it discusses characterisation of the speaker’s identity and of the emotional state detectable in the voice, using the state-of-the-art Support Vector Machine (SVM) classifier. The first part deals with the SVM classifier used in the classification experiments, where we search for a more sophisticated way of selecting SVM model parameters. We also successfully apply optimization methods (PSO and GA algorithms) to two classification problems. The second part of the thesis deals with emotion recognition in continuous/dimensional space.

Hric, Martin — University of Žilina

New strategies for single-channel speech separation

We present new results on single-channel speech separation and suggest a new separation approach to improve the speech quality of separated signals from an observed mixture. The key idea is to derive a mixture estimator based on sinusoidal parameters. The proposed estimator is aimed at finding sinusoidal parameters in the form of codevectors from vector quantization (VQ) codebooks pre-trained for speakers that, when combined, best fit the observed mixed signal. The selected codevectors are then used to reconstruct the recovered signals for the speakers in the mixture. Compared to the log-max mixture estimator used in binary masks and the Wiener filtering approach, it is observed that the proposed method achieves an acceptable perceptual speech quality with less cross-talk at different signal-to-signal ratios. Moreover, the method is independent of pitch estimates and reduces the computational complexity of the ...

Pejman Mowlaee — Department of Electronic Systems, Aalborg University
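
The codebook search outlined in the abstract above (choosing one codevector per speaker so that their combination best fits the observed mixture) can be sketched as a small exhaustive search. This is only an illustration under stated assumptions, not the estimator proposed in the thesis: the codevectors are plain numeric tuples, the combination is assumed to be element-wise addition, and the fit is measured by squared error. The function name and toy codebooks are hypothetical.

```python
from itertools import product


def best_codevector_pair(mixture, codebook_a, codebook_b):
    """Return the (a, b) pair whose element-wise sum is closest to the mixture.

    Exhaustive search over all pairs; the additive combination and the
    squared-error criterion are assumptions made for this sketch.
    """
    def squared_error(pair):
        a, b = pair
        return sum((m - (x + y)) ** 2 for m, x, y in zip(mixture, a, b))

    return min(product(codebook_a, codebook_b), key=squared_error)


if __name__ == "__main__":
    # Toy two-dimensional codebooks, one per (hypothetical) speaker.
    speaker_a = [(1.0, 0.0), (0.0, 1.0)]
    speaker_b = [(2.0, 2.0), (0.5, 0.5)]
    observed = (0.6, 1.4)
    print(best_codevector_pair(observed, speaker_a, speaker_b))
```

The cost of the exhaustive search grows with the product of the codebook sizes, so a practical system would need a cheaper search strategy.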

Statistical Parametric Speech Synthesis Based on the Degree of Articulation

Nowadays, speech synthesis is part of various daily life applications. The ultimate goal of such technologies consists in extending the possibilities of interaction with the machine, in order to get closer to human-like communication. However, current state-of-the-art systems often lack realism: although high-quality speech synthesis can be produced by many researchers and companies around the world, synthetic voices are generally perceived as hyperarticulated. In any case, their degree of articulation is fixed once and for all. The present thesis falls within the more general quest for enriching expressivity in speech synthesis. The main idea consists in improving statistical parametric speech synthesis, whose most famous example is Hidden Markov Model (HMM) based speech synthesis, by introducing a control of the articulation degree, so as to enable synthesizers to automatically adapt their way of speaking to the contextual situation, like humans ...

Picart, Benjamin — Université de Mons (UMONS)

Statistical and Discriminative Language Modeling for Turkish Large Vocabulary Continuous Speech Recognition

Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a high number of Out-of-Vocabulary (OOV) words, which in turn lower Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order that leads to non-robust language model estimates. These challenges have been mostly handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation, which needs to be dealt with for higher accuracies. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated and they yield substantial improvements over the word language models. Our novel approach inspired by dynamic vocabulary adaptation mostly recovers the errors caused by over-generation and ...

Arisoy, Ebru — Bogazici University
