Automatic Speaker Characterization: Identification of Gender, Age, Language and Accent from Speech Signals

Speech signals carry important information about a speaker, such as age, gender, language, accent and emotional/psychological state. Automatic recognition of speaker characteristics has a wide range of commercial, medical and forensic applications, such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. This research aims to develop accurate methods and tools to identify different physical characteristics of speakers. Owing to the lack of the required databases, among all speaker characteristics our experiments cover gender recognition, age estimation, language recognition and accent/dialect identification. However, similar approaches and techniques can be applied to identify other characteristics, such as emotional/psychological state. For speaker characterization, we first convert variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms. This is performed by fitting a probability density function to acoustic ...

Bahari, Mohamad Hasan — KU Leuven
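The abstract above describes mapping variable-duration utterances to fixed-dimensional vectors by fitting a probability density function to acoustic features. A minimal sketch of one common instance of this idea, a GMM mean supervector, is given below; the toy feature array is made up for illustration, and the details are not necessarily those of the thesis.

```python
# Sketch: map a variable-length sequence of acoustic feature vectors to a
# fixed-dimensional "supervector" by fitting a diagonal-covariance GMM and
# stacking its component means. Assumes features (e.g. MFCCs) are already
# extracted into an (n_frames, n_dims) array.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(features, n_components=8, seed=0):
    """Fit a GMM to one utterance and return its stacked mean vectors."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=seed).fit(features)
    return gmm.means_.reshape(-1)

# Toy usage with random "features"; real systems typically adapt a universal
# background model rather than fitting each utterance from scratch.
utterance = np.random.randn(500, 13)      # 500 frames x 13 coefficients
vector = gmm_supervector(utterance)       # shape: (8 * 13,) = (104,)
print(vector.shape)
```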


Discrete-time speech processing with application to emotion recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of audio recordings derived from a call center. The thesis comprises two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems developed are a prerequisite for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way the volume of audio recordings that needs to be processed is significantly reduced, while the system is automated. To detect the presence of a dialogue, several systems are developed. This is the first effort reported in the international literature in which the audio channel is exclusively exploited. Also, it is the first time that the speaker utterance ...

Kotti, Margarita — Aristotle University of Thessaloniki
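The speaker segmentation task mentioned above is commonly cast as change-point detection between adjacent audio windows. The abstract does not name the specific method, so the sketch below shows a generic delta-BIC (Bayesian Information Criterion) test purely as an illustration.

```python
# Sketch: generic delta-BIC test for a speaker change point between two
# adjacent windows of acoustic feature vectors (not necessarily the method
# developed in the thesis).
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Positive values suggest a change point between windows x1 and x2."""
    x = np.vstack([x1, x2])
    n, d = x.shape
    n1, n2 = len(x1), len(x2)
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(x) - n1 * logdet(x1) - n2 * logdet(x2)) - lam * penalty

# Toy usage: two windows drawn from clearly different distributions.
a = np.random.randn(200, 13)
b = 2.0 * np.random.randn(200, 13) + 1.0
print(delta_bic(a, b) > 0)   # likely True: a change point is detected
```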


Models and Software Realization of Russian Speech Recognition based on Morphemic Analysis

More than 20% of European citizens speak Russian, so the task of automatic recognition of continuous Russian speech is of key significance. The main problems of ASR are connected with the complex mechanism of Russian word formation. In total there exist more than 3 million distinct valid word forms, which makes this a very large vocabulary ASR task. The thesis presents a novel HMM-based ASR model for Russian with morphemic levels of speech and language representation. The model includes the developed methods for decomposing the word vocabulary into morphemes, for acoustic and statistical language modelling at the training stage, and for word synthesis at the last stage of speech decoding. The presented results of applying the ASR model to voice access to the Yellow Pages directory show a substantial improvement (above 75%) in the real-time factor while retaining an acceptable word recognition rate ...

Karpov, Alexey — St.Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences
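The decomposition of the word vocabulary into morphemes is the step that keeps the multi-million-word-form vocabulary tractable. The thesis's actual decomposition and language-modelling methods are not reproduced here; the sketch below only illustrates the effect on vocabulary size with a toy, hypothetical suffix list.

```python
# Sketch: toy illustration of how morphemic decomposition shrinks an ASR
# vocabulary. The suffix list and greedy splitting rule are purely
# hypothetical and stand in for the methods developed in the thesis.
SUFFIXES = ["ами", "ах", "ов", "ам", "ы", "а"]   # toy Russian-like endings

def split_word(word):
    """Greedily strip one known suffix; return stem and suffix units."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[: -len(suf)], "+" + suf]
    return [word]

word_forms = ["домами", "домах", "домов", "дома",
              "столами", "столах", "столов", "стола"]
morphs = {unit for w in word_forms for unit in split_word(w)}
print(len(word_forms), "word forms ->", len(morphs), "morph units")   # 8 -> 6
```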


Robust Speech Recognition on Intelligent Mobile Devices with Dual-Microphone

Despite the outstanding progress made on automatic speech recognition (ASR) over the last decades, noise-robust ASR still poses a challenge. Tackling acoustic noise in ASR systems is more important than ever before for a twofold reason: 1) ASR technology has begun to be extensively integrated in intelligent mobile devices (IMDs) such as smartphones to easily accomplish different tasks (e.g. search-by-voice), and 2) IMDs can be used anywhere at any time, that is, under many different acoustic (noisy) conditions. On the other hand, with the aim of enhancing noisy speech, IMDs have begun to embed small microphone arrays, i.e. microphone arrays comprising a few closely spaced sensors. These multi-sensor IMDs often embed one microphone (usually at the rear) intended to capture the acoustic environment more than the speaker’s voice. This is the so-called secondary microphone. While classical microphone ...

López-Espejo, Iván — University of Granada
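Because the secondary (rear) microphone mainly captures the acoustic environment, it can serve as a noise reference for enhancing the primary channel. The sketch below is a generic dual-channel spectral-subtraction front end, not the technique developed in the thesis; the signals are synthetic placeholders.

```python
# Sketch: generic dual-microphone spectral subtraction. The rear microphone
# is treated as a noise reference whose magnitude spectrum is subtracted from
# the primary channel; this is only an illustration of the dual-channel idea.
import numpy as np
from scipy.signal import stft, istft

def dual_mic_enhance(primary, secondary, fs, alpha=1.0, floor=0.05):
    _, _, P = stft(primary, fs=fs, nperseg=512)
    _, _, S = stft(secondary, fs=fs, nperseg=512)
    mag = np.abs(P) - alpha * np.abs(S)           # subtract the noise estimate
    mag = np.maximum(mag, floor * np.abs(P))      # apply a spectral floor
    _, enhanced = istft(mag * np.exp(1j * np.angle(P)), fs=fs)
    return enhanced

# Toy usage with synthetic signals standing in for the two channels.
fs = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
noise = 0.5 * np.random.randn(fs)
print(dual_mic_enhance(clean + noise, noise, fs).shape)
```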


Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech

One desideratum in designing cognitive robots is autonomous learning of communication skills, just as humans learn. The primary step towards this goal is vocabulary acquisition. Unlike the training procedures of state-of-the-art automatic speech recognition (ASR) systems, vocabulary acquisition cannot rely on prior knowledge of the language in the same way. As infants do, the acquisition process should be data-driven, with multi-level abstraction, and coupled with multi-modal inputs. To avoid lengthy training efforts in a word-by-word interactive learning process, a clever learning agent should be able to acquire vocabularies from continuous speech automatically. The work presented in this thesis is entitled "Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech". Drawing on extensively studied techniques in ASR, we design computational models to discover and represent vocabularies from continuous speech with little prior knowledge of the language to ...

Sun, Meng — Katholieke Universiteit Leuven
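Non-negative matrix factorization, the core tool named in the title, decomposes a non-negative data matrix into additive parts that can be interpreted as recurring acoustic patterns. Below is a minimal, generic NMF sketch on random non-negative data; it does not reproduce the constraints or the multi-modal coupling developed in the thesis.

```python
# Sketch: plain non-negative matrix factorization V ~ W H. In vocabulary
# acquisition work the columns of V typically hold utterance-level acoustic
# co-occurrence statistics and the columns of W act as learned word-like
# patterns; here V is just random non-negative data for illustration.
import numpy as np
from sklearn.decomposition import NMF

V = np.abs(np.random.randn(200, 50))        # 200 features x 50 utterances
model = NMF(n_components=10, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(V)                  # 200 x 10 pattern matrix
H = model.components_                       # 10 x 50 activation matrix
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))   # relative fit error
```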


Transmission over Time- and Frequency-Selective Mobile Wireless Channels

The wireless communication industry has experienced rapid growth in recent years, and digital cellular systems are currently designed to provide high data rates at high terminal speeds. High data rates give rise to intersymbol interference (ISI) due to so-called multipath fading. Such an ISI channel is called frequency-selective. On the other hand, due to terminal mobility and/or receiver frequency offset, the received signal is subject to frequency shifts (Doppler shifts). Doppler shifts induce time-selectivity characteristics. The Doppler effect in conjunction with ISI gives rise to a so-called doubly selective channel (frequency- and time-selective). In addition to the channel effects, the analog front-end may suffer from an imbalance between the I and Q branch amplitudes and phases, as well as from carrier frequency offset. These analog front-end imperfections then result in an additional and significant degradation in system performance, especially ...

Barhumi, Imad — Katholieke Universiteit Leuven
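A doubly selective channel combines multipath delay spread (frequency selectivity) with Doppler-induced time variation (time selectivity). As a hedged illustration of that combination, the sketch below passes a symbol stream through a toy two-tap channel whose taps rotate with individual Doppler shifts; all parameter values are arbitrary.

```python
# Sketch: toy doubly selective (time- and frequency-selective) channel. Two
# multipath taps cause intersymbol interference, and each tap rotates with
# its own Doppler shift, so the channel also changes over time. Parameter
# values are arbitrary; np.roll is a crude stand-in for a true delay line.
import numpy as np

def doubly_selective(symbols, delays=(0, 3), gains=(1.0, 0.5),
                     doppler=(0.001, 0.003)):
    n = np.arange(len(symbols))
    out = np.zeros(len(symbols), dtype=complex)
    for d, g, fd in zip(delays, gains, doppler):
        tap = g * np.exp(2j * np.pi * fd * n)   # time-varying tap gain
        out += tap * np.roll(symbols, d)        # delayed (ISI) path
    return out

qpsk = np.exp(1j * np.pi / 2 * np.random.randint(0, 4, 1000))
received = doubly_selective(qpsk)
print(received[:3])
```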


Statistical and Discriminative Language Modeling for Turkish Large Vocabulary Continuous Speech Recognition

Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a high number of Out-of-Vocabulary (OOV) words, which in turn lower Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order, which leads to non-robust language model estimates. These challenges have mostly been handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation, which needs to be dealt with to obtain higher accuracy. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated, and they yield substantial improvements over word language models. Our novel approach, inspired by dynamic vocabulary adaptation, mostly recovers the errors caused by over-generation and ...

Arisoy, Ebru — Bogazici University
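The first challenge named above is the high out-of-vocabulary (OOV) rate of word-based vocabularies for an agglutinative language. The sketch below only shows how an OOV rate is measured against a fixed vocabulary, with made-up tokens; the grammatical and statistical sub-lexical segmentations studied in the dissertation are not reproduced here.

```python
# Sketch: measuring the out-of-vocabulary (OOV) rate of test tokens against a
# fixed recognition vocabulary. Reducing this rate is one of the motivations
# for morph-based vocabularies in agglutinative languages.
def oov_rate(test_tokens, vocabulary):
    vocab = set(vocabulary)
    missing = sum(1 for token in test_tokens if token not in vocab)
    return missing / max(len(test_tokens), 1)

# Toy example with hypothetical Turkish-like tokens.
vocab = ["ev", "evler", "git", "gitti"]
test = ["evlerimizden", "gitti", "ev", "gelecek"]
print(f"OOV rate: {oov_rate(test, vocab):.2f}")   # 2 of 4 tokens unseen -> 0.50
```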


Signal Processing for Ultra Wideband Transceivers

In this thesis, novel implementation approaches for standardized and non-standardized ultra-wideband (UWB) systems are presented. These implementation approaches include signal processing algorithms to achieve processing of UWB signals in transceiver front-ends and in digital back-ends. A parallelization of the transceiver in the frequency domain has been achieved with hybrid filterbank transceivers. The standardized MB-OFDM signaling scheme allows parallelization in the frequency domain by distributing the orthogonal multicarrier modulation onto multiple units. Furthermore, the channel’s response to wideband signals has been parallelized in the frequency domain and the effects of the parallelization have been investigated. Slight performance decreases are observed, where the limiting effects are truncated sidelobes and filter mismatches in analog front-ends. Measures for the performance loss have been defined. For UWB signal generation, a novel broadband signal generation approach is presented. For that purpose, multiple digital-to-analog converters ...

Krall, Christoph — Graz University of Technology
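The standardized MB-OFDM scheme mentioned above is a multicarrier modulation, which is what makes frequency-domain parallelization possible. A minimal, generic OFDM modulator (IFFT plus cyclic prefix) is sketched below; it does not model the hybrid filterbank front end, the multiband hopping, or the parallelized DAC architecture studied in the thesis.

```python
# Sketch: generic OFDM symbol generation (IFFT + cyclic prefix), the basic
# multicarrier modulation underlying MB-OFDM. Subcarrier count and prefix
# length are arbitrary example values.
import numpy as np

def ofdm_modulate(subcarrier_symbols, cp_len=32):
    """Map one block of frequency-domain symbols to a time-domain OFDM symbol."""
    time_symbol = np.fft.ifft(subcarrier_symbols)
    return np.concatenate([time_symbol[-cp_len:], time_symbol])  # prepend CP

n_subcarriers = 128
qpsk = np.exp(1j * np.pi / 2 * np.random.randint(0, 4, n_subcarriers))
tx_symbol = ofdm_modulate(qpsk)
print(tx_symbol.shape)   # (160,) = 128 samples + 32-sample cyclic prefix
```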


Digital compensation of front-end non-idealities in broadband communication systems

The wireless communication industry has seen tremendous growth in the last few decades. The ever-increasing demand to stay connected at home, at work, and on the move, with voice and data applications, has sustained the need for more sophisticated end-user devices. A typical smart communication device these days consists of a radio system that can access a mixture of mobile cellular services (GSM, UMTS, etc.), indoor wireless broadband services (WLAN 802.11b/g/n), short-range and low-energy personal communications (Bluetooth), positioning and navigation systems (GPS), etc. A smart device capable of meeting all these requirements has to be highly flexible and should be able to reconfigure radio transmitters and receivers as and when required. Further, the radio modules used in these devices should be extremely small so that the device itself is portable. In addition, the device should also be economical ...

Tandur, Deepaknath — Katholieke Universiteit Leuven
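One of the front-end non-idealities addressed by this line of work is the amplitude and phase imbalance between the I and Q branches. As a hedged illustration, and not the compensation schemes developed in the thesis, the sketch below models a frequency-independent IQ imbalance and inverts it digitally when the imbalance parameters are assumed known.

```python
# Sketch: frequency-independent IQ imbalance r = alpha*s + beta*conj(s) and
# its exact digital inversion when alpha and beta are known. Real systems,
# as in the thesis, must estimate these parameters and also handle
# frequency-selective imbalance and carrier frequency offset.
import numpy as np

def apply_iq_imbalance(s, gain_db=0.5, phase_deg=3.0):
    g = 10 ** (gain_db / 20)
    phi = np.deg2rad(phase_deg)
    alpha = 0.5 * (1 + g * np.exp(1j * phi))
    beta = 0.5 * (1 - g * np.exp(-1j * phi))
    return alpha * s + beta * np.conj(s), alpha, beta

def compensate(r, alpha, beta):
    return (np.conj(alpha) * r - beta * np.conj(r)) / (abs(alpha) ** 2 - abs(beta) ** 2)

s = np.exp(1j * np.pi / 2 * np.random.randint(0, 4, 1000))   # QPSK symbols
r, a, b = apply_iq_imbalance(s)
print(np.max(np.abs(compensate(r, a, b) - s)))               # ~0: imbalance removed
```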


Generalized Noncoherent Ultra-Wideband Receivers

This thesis investigates noncoherent multi-channel ultra-wideband receivers. Noncoherent ultra-wideband receivers promise low power consumption and low processing complexity as they, in contrast to coherent receiver architectures, dispense with complex carrier frequency and phase recovery. Unfortunately, their peak data rate is limited by the delay spread of the multipath radio channel. Noncoherent multi-channel receivers can break this rate limit owing to their capability to demodulate multi-carrier signals. Such receivers use an analog front-end to separate the received signals into their sub-channels. In this work, the modeling and optimization of realistic front-end components are addressed, and their impact on the system performance of noncoherent multi-channel ultra-wideband receivers is analyzed. With a proposed generalized mathematical framework, it is shown that there exists a variety of noncoherent multi-channel receiver types with similar system performance which differ only in their front-end filters. It ...

Pedroß-Engel, Andreas — Graz University of Technology
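Noncoherent reception avoids carrier and phase recovery altogether; the textbook example is an energy detector that squares and integrates the received signal over each symbol interval. The sketch below shows that basic single-channel idea for on-off keying; it is a generic illustration, not the generalized multi-channel framework of the thesis.

```python
# Sketch: noncoherent energy detection of on-off-keyed pulses. The receiver
# squares and integrates the signal over each symbol interval instead of
# recovering carrier frequency and phase. Signal model and threshold are toy
# values for illustration only.
import numpy as np

def energy_detect(rx, samples_per_symbol, threshold):
    usable = len(rx) // samples_per_symbol * samples_per_symbol
    frames = rx[:usable].reshape(-1, samples_per_symbol)
    energy = np.sum(frames ** 2, axis=1)        # square and integrate
    return (energy > threshold).astype(int)

sps = 64
bits = np.random.randint(0, 2, 100)
tx = np.repeat(bits, sps) * np.random.randn(100 * sps)   # toy wideband pulses
rx = tx + 0.1 * np.random.randn(100 * sps)               # additive noise
print(np.mean(energy_detect(rx, sps, threshold=sps * 0.25) == bits))  # ~1.0
```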


Perceptually-Based Signal Features for Environmental Sound Classification

This thesis addresses the problem of automatically classifying environmental sounds, i.e., any non-speech or non-music sounds that can be found in the environment. Broadly speaking, two main processes are needed to perform such classification: signal feature extraction, to compose representative sound patterns, and the machine learning technique that classifies those patterns. The main focus of this research is on the former, studying relevant signal features that optimally represent the sound characteristics since, according to several references, this is a key issue in attaining robust recognition. This type of audio signal differs in many ways from speech or music signals, so specific features should be determined and adapted to their characteristics. In this sense, new signal features, inspired by the human auditory system and the human perception of sound, are proposed to improve ...

Valero, Xavier — La Salle-Universitat Ramon Llull
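The two processes named in the abstract, feature extraction and pattern classification, can be illustrated with a generic baseline: clip-level MFCC statistics fed to a support vector machine. This is not the perceptually inspired feature set proposed in the thesis, and the clips and labels below are synthetic stand-ins for a real environmental-sound corpus.

```python
# Sketch: generic environmental-sound classification baseline. Each clip is
# summarized by the mean and standard deviation of its MFCCs and classified
# with an SVM; the perceptual features proposed in the thesis are not shown.
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Synthetic one-second "clips" standing in for real recordings; a real corpus
# would be loaded with librosa.load and carry labels such as rain or traffic.
sr = 22050
clips = [np.random.randn(sr), 0.1 * np.random.randn(sr), np.random.randn(sr)]
labels = ["rain", "traffic", "rain"]
X = np.vstack([clip_features(y, sr) for y in clips])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:1]))
```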


Source-Filter Model Based Single Channel Speech Separation

In a natural acoustic environment, multiple sources are usually active at the same time. The task of source separation is the estimation of individual source signals from this complex mixture. The challenge of single channel source separation (SCSS) is to recover more than one source from a single observation. Basically, SCSS can be divided into methods that try to mimic the human auditory system and model-based methods, which find a probabilistic representation of the individual sources and employ this prior knowledge for inference. This thesis presents several strategies for the separation of two speech utterances mixed into a single channel and is structured in four parts: The first part reviews factorial models in model-based SCSS and introduces the soft-binary mask for signal reconstruction. This mask shows improved performance compared to the soft and the binary masks in automatic speech recognition ...

Stark, Michael — Graz University of Technology
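Mask-based reconstruction assigns each time-frequency cell of the mixture to the sources, either exclusively (binary mask) or proportionally (soft mask). The sketch below computes these two standard masks from estimated source magnitude spectrograms; the soft-binary mask introduced in the thesis is not reproduced here.

```python
# Sketch: standard binary and soft (Wiener-style) time-frequency masks built
# from estimated source magnitude spectrograms S1 and S2. Real estimates come
# from trained source models; random matrices are used here for illustration.
import numpy as np

def binary_mask(S1, S2):
    return (S1 >= S2).astype(float)              # winner takes the cell

def soft_mask(S1, S2, eps=1e-12):
    return S1 ** 2 / (S1 ** 2 + S2 ** 2 + eps)   # proportional sharing

S1 = np.abs(np.random.randn(257, 100))           # toy source-1 magnitudes
S2 = np.abs(np.random.randn(257, 100))           # toy source-2 magnitudes
mixture = S1 + S2                                # crude magnitude mixture
source1_soft = soft_mask(S1, S2) * mixture
source1_hard = binary_mask(S1, S2) * mixture
print(source1_soft.shape, source1_hard.shape)
```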


Improving Speech Recognition for Pluricentric Languages exemplified on Varieties of German

A method is presented to improve speech recognition for pluricentric languages. The effects of adapting both acoustic data and phonetic transcriptions for several subregions of the German-speaking area are investigated and discussed. All experiments were carried out for German spoken in Germany and Austria using large telephone databases (Speech-Dat). In the first part, triphone-based acoustic models (AMOs) were trained for several regions and their word error rates (WERs) were compared. The WERs vary between 9.89% and 21.78% and demonstrate the importance of regional variety adaptation. In the pronunciation modeling part, narrow phonetic transcriptions of a subset of the Austrian database were produced to derive pronunciation rules for Austrian German and to generate phonetic lexica for Austrian German, which are the first of their kind. These lexica were used for both triphone-based and monophone-based AMOs with German and ...

Baum, Micha — TU Graz
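The regional comparisons above are reported as word error rates (WER), i.e. the edit distance between the recognized and the reference word strings normalized by the reference length. A small, generic WER function is sketched below; the example sentences are made up.

```python
# Sketch: word error rate (WER) computed as the Levenshtein (edit) distance
# between hypothesis and reference word sequences, divided by the number of
# reference words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("guten morgen wien", "guten morgen in wien"))   # 1 insertion / 3 words
```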


Wavelet Analysis For Robust Speech Processing and Applications

In this work, we study the application of wavelet analysis for robust speech processing. Reliable time-scale (TS) features which characterize the relevant phonetic classes, such as voiced (V), unvoiced (UV), silence (S), mixed-excitation, and stop sounds, are extracted. By training neural and Bayesian networks, the classification rates provided by only 7 TS features are mostly similar to those obtained with 13 MFCC features. The TS features are further enhanced to design a reliable and low-complexity V/UV/S classifier. Quantile filtering and slope tracking are used for deriving adaptive thresholds. A robust voice activity detector is then built and used as a pre-processing stage to improve the performance of a speaker verification system. Based on wavelet shrinkage, a statistical wavelet filtering (SWF) method is designed for speech enhancement. Non-stationary and colored noise is handled by employing quantile filtering and time-frequency adaptive ...

Pham, Van Tuan — Graz University of Technology
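Wavelet shrinkage, named above as the basis of the statistical wavelet filtering method, denoises a signal by thresholding its wavelet coefficients. The sketch below is the standard universal-threshold, soft-thresholding version using PyWavelets, not the thesis's SWF method with quantile filtering and time-frequency adaptation.

```python
# Sketch: standard wavelet-shrinkage denoising with a universal threshold and
# soft thresholding of the detail coefficients. The noise level is estimated
# from the finest detail band; the test signal is a synthetic noisy sinusoid.
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=4):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745         # robust noise estimate
    thr = sigma * np.sqrt(2 * np.log(len(x)))               # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(x)]

fs = 8000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 200 * t) + 0.3 * np.random.randn(fs)
print(wavelet_denoise(noisy).shape)
```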


Statistical Parametric Speech Synthesis Based on the Degree of Articulation

Nowadays, speech synthesis is part of various daily life applications. The ultimate goal of such technologies is to extend the possibilities of interaction with the machine, in order to get closer to human-like communication. However, current state-of-the-art systems often lack realism: although high-quality speech synthesis can be produced by many researchers and companies around the world, synthetic voices are generally perceived as hyperarticulated. In any case, their degree of articulation is fixed once and for all. The present thesis falls within the more general quest for enriching expressivity in speech synthesis. The main idea consists in improving statistical parametric speech synthesis, whose most famous example is Hidden Markov Model (HMM) based speech synthesis, by introducing a control of the degree of articulation, so as to enable synthesizers to automatically adapt their way of speaking to the contextual situation, like humans ...

Picart, Benjamin — Université de Mons (UMONS)
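One generic way to expose a continuous degree-of-articulation control in a statistical parametric synthesizer is to interpolate between model parameters trained on hypo- and hyper-articulated speech. The sketch below shows that interpolation idea on Gaussian mean vectors only; it is a simplified illustration, not the adaptation scheme developed in the thesis.

```python
# Sketch: a continuous articulation-degree control obtained by linearly
# interpolating Gaussian mean vectors between two statistical models, one for
# hypo-articulated and one for hyper-articulated speech. Model sizes are toy
# values for illustration.
import numpy as np

def interpolate_means(means_hypo, means_hyper, degree):
    """degree = 0 -> hypo-articulated, 1 -> hyper-articulated."""
    return (1.0 - degree) * means_hypo + degree * means_hyper

hypo = np.random.randn(10, 40)     # 10 states x 40-dimensional spectral means
hyper = np.random.randn(10, 40)
neutral = interpolate_means(hypo, hyper, degree=0.5)
print(neutral.shape)
```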
