Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech

One desideratum in designing cognitive robots is the autonomous learning of communication skills, as humans do. The primary step towards this goal is vocabulary acquisition. Unlike the training procedures of state-of-the-art automatic speech recognition (ASR) systems, vocabulary acquisition cannot rely on prior knowledge of language in the same way. As with infants, the acquisition process should be data-driven with multi-level abstraction and coupled with multi-modal inputs. To avoid lengthy training efforts in a word-by-word interactive learning process, a clever learning agent should be able to acquire vocabularies from continuous speech automatically. The work presented in this thesis is entitled "Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech". Inspired by the extensively studied techniques in ASR, we design computational models to discover and represent vocabularies from continuous speech with little prior knowledge of the language to ...

Sun, Meng — Katholieke Universiteit Leuven


Performative Statistical Parametric Speech Synthesis Applied To Interactive Designs

This dissertation introduces interactive designs in the context of statistical parametric synthesis. The objective is to develop methods and designs that enrich human-computer interaction by enabling computers (or other devices) to have more expressive and adjustable voices. First, we tackle the problem of interactive controls and present a novel method for performative HMM-based synthesis (pHTS). Second, we apply interpolation methods, initially developed for the traditional HMM-based speech synthesis system, in the interactive framework of pHTS. Third, we integrate articulatory control in our interactive approach. Fourth, we present a collection of interactive applications based on our work. Finally, we unify our research into an open source library, Mage. To our knowledge, Mage is the first system for interactive programming of HMM-based synthesis that allows realtime manipulation of all speech production levels. It has also been used in cases that ...

Astrinaki, Maria — University of Mons


Audio Visual Speech Enhancement

This thesis presents a novel approach to speech enhancement by exploiting the bimodality of speech production and the correlation that exists between audio and visual speech information. An analysis of the correlation of a range of audio and visual features reveals significant correlation between visual speech features and audio filterbank features. The amount of correlation was also found to be greater when the correlation is analysed within individual phonemes rather than across all phonemes. This led to building a Gaussian Mixture Model (GMM) that is capable of estimating filterbank features from visual features. Phoneme-specific GMMs gave lower filterbank estimation errors, and a phoneme transcription is decoded using an audio-visual Hidden Markov Model (HMM). Clean filterbank estimates along with mean noise estimates were then utilised to construct visually-derived Wiener filters that are able to enhance noisy speech. The mean noise ...

Almajai, Ibrahim — University of East Anglia


Enhancement of Periodic Signals: with Application to Speech Signals

The topic of this thesis is the enhancement of noisy, periodic signals with application to speech signals. Generally speaking, enhancement methods can be divided into signal- and noise-driven methods. In this thesis, we focus on the signal-driven approach by employing relevant signal parameters for the enhancement of periodic signals. The enhancement problem consists of two major subproblems: the estimation of relevant parameters or statistics, and the actual noise reduction of the observed signal. We consider both of these subproblems. First, we consider the problem of estimating signal parameters relevant to the enhancement of periodic signals. The fundamental frequency is one example of such a parameter. Furthermore, in multichannel scenarios, the direction-of-arrival of the periodic sources onto an array of sensors is another parameter of relevance. We propose methods for the estimation of the fundamental frequency that have benefits compared to ...

Jensen, Jesper Rindom — Aalborg University


Fire Detection Algorithms Using Multimodal Signal and Image Analysis

Dynamic textures are common in natural scenes. Examples of dynamic textures in video include fire, smoke, clouds, volatile organic compound (VOC) plumes in infra-red (IR) videos, trees in the wind, sea and ocean waves, etc. Researchers have extensively studied 2-D textures and related problems in the fields of image processing and computer vision. On the other hand, there is very little research on dynamic texture detection in video. In this dissertation, signal and image processing methods developed for detection of a specific set of dynamic textures are presented. Signal and image processing methods are developed for the detection of flames and smoke in open and large spaces with a range of up to 30 m to the camera in visible-range and infra-red (IR) video. Smoke is semi-transparent at the early stages of fire. Edges present in image frames with smoke start losing their sharpness ...

Toreyin, Behcet Ugur — Bilkent University


Structured and Sequential Representations For Human Action Recognition

Human action recognition is one of the most challenging problems in the computer vision domain, and plays an emerging role in various fields of study. In this thesis, we investigate structured and sequential representations of spatio-temporal data for recognizing human actions and for measuring action performance quality. In video sequences, we characterize each action with a graphical structure of its spatio-temporal interest points, and each such interest point is qualified by its cuboid descriptors. In the case of depth data, an action is represented by the sequence of skeleton joints. Given such descriptors, we solve the human action recognition problem through a hyper-graph matching formulation. The hyper-graph matching problem is known to be NP-complete. We simplify the problem in two stages to enable a fast solution: In the first stage, we take into consideration the physical constraints such as time ...

Celiktutan, Oya — Bogazici University


Advances in DFT-Based Single-Microphone Speech Enhancement

The interest in the field of speech enhancement emerges from the increased usage of digital speech processing applications like mobile telephony, digital hearing aids and human-machine communication systems in our daily life. The trend to make these applications mobile increases the variety of potential sources for quality degradation. Speech enhancement methods can be used to increase the quality of these speech processing devices and make them more robust under noisy conditions. The name "speech enhancement" refers to a large group of methods that are all meant to improve certain quality aspects of these devices. Examples of speech enhancement algorithms are echo control, bandwidth extension, packet loss concealment and noise reduction. In this thesis we focus on single-microphone additive noise reduction and aim at methods that work in the discrete Fourier transform (DFT) domain. The main objective of the presented research ...

Hendriks, Richard Christian — Delft University of Technology


Novel texture synthesis methods and their application to image prediction and image inpainting

This thesis presents novel exemplar-based texture synthesis methods for image prediction (i.e., predictive coding) and image inpainting problems. The main contributions of this study can also be seen as extensions to simple template matching; however, the texture synthesis problem here is well-formulated in an optimization framework with different constraints. The image prediction problem has first been put into a sparse representations framework by approximating the template with a sparsity constraint. The proposed sparse prediction method with locally adaptive dictionaries has been shown to give better performance when compared to static waveform (such as DCT) dictionaries, and also to the template matching method. The image prediction problem has later been placed into an online dictionary learning framework by adapting conventional dictionary learning approaches for image prediction. The experimental observations show a better performance when compared to H.264/AVC intra and sparse prediction. ...

Turkan, Mehmet — INRIA-Rennes, France


Bayesian Fusion of Multi-band Images: A Powerful Tool for Super-resolution

Hyperspectral (HS) imaging, which consists of acquiring the same scene in several hundreds of contiguous spectral bands (a three dimensional data cube), has opened a new range of relevant applications, such as target detection [MS02], classification [C.-03] and spectral unmixing [BDPD+12]. However, while HS sensors provide abundant spectral information, their spatial resolution is generally more limited. Thus, fusing the HS image with other highly resolved images of the same scene, such as multispectral (MS) or panchromatic (PAN) images, is an interesting problem. The problem of fusing a high spectral and low spatial resolution image with an auxiliary image of higher spatial but lower spectral resolution, also known as multi-resolution image fusion, has been explored for many years [AMV+11]. From an application point of view, this problem is also important as motivated by recent national programs, e.g., the Japanese next-generation space-borne ...

Wei, Qi — University of Toulouse


Speech recognition in noisy conditions using missing feature approach

The research in this thesis addresses the problem of automatic speech recognition in noisy environments. Automatic speech recognition systems achieve acceptable performance in noise-free conditions, but this performance degrades dramatically in the presence of additive noise. This is mainly due to the mismatch between the training and the noisy operating conditions. In the time-frequency representation of the noisy speech signal, some of the clean speech features are masked by noise. In this case the clean speech features cannot be correctly estimated from the noisy speech and therefore they are considered as missing or unreliable. In order to improve the performance of speech recognition systems in additive noise conditions, special attention should be paid to the problems of detection and compensation of these unreliable features. This thesis is concerned with the problem of missing features applied to automatic speaker-independent speech recognition. ...

Renevey, Philippe — Swiss Federal Institute of Technology


Sound Source Separation in Monaural Music Signals

Sound source separation refers to the task of estimating the signals produced by individual sound sources from a complex acoustic mixture. It has several applications, since monophonic signals can be processed more efficiently and flexibly than polyphonic mixtures. This thesis deals with the separation of monaural, or one-channel, music recordings. We concentrate on separation methods where the sources to be separated are not known beforehand. Instead, the separation is enabled by utilizing the common properties of real-world sound sources, which are their continuity, sparseness, and repetition in time and frequency, and their harmonic spectral structures. One of the separation approaches taken here uses unsupervised learning and the other uses model-based inference based on sinusoidal modeling. Most of the existing unsupervised separation algorithms are based on a linear instantaneous signal model, where each frame of the input mixture signal is modeled ...

Virtanen, Tuomas — Tampere University of Technology


Contributions to Statistical Modeling for Minimum Mean Square Error Estimation in Speech Enhancement

This thesis deals with minimum mean square error (MMSE) speech enhancement schemes in the short-time Fourier transform (STFT) domain with a focus on statistical models for speech and corresponding estimators. MMSE speech enhancement approaches taking speech presence uncertainty (SPU) into account usually consist of a common MMSE estimator for speech and an a posteriori speech presence probability (SPP) estimator. It is shown that both estimators should be based on the same statistical speech model, as they are in the same estimation framework and assume the same a priori knowledge. In order to give a synopsis of consistent MMSE estimation under SPU, typical common MMSE estimators and a posteriori SPP estimators are recapitulated. Furthermore, a new specific a posteriori SPP estimator is derived based on a novel statistical model for speech. Then, a synopsis of approaches to consistent MMSE estimation under ...

Fodor, Balázs — Technische Universität Braunschweig


Speech Enhancement Algorithms for Audiological Applications

The improvement of speech intelligibility is a traditional problem which still remains open and unsolved. The recent boom of applications such as hands-free communications or automatic speech recognition systems and the ever-increasing demands of the hearing-impaired community have given a definitive impulse to the research in this area. This PhD thesis is focused on speech enhancement for audiological applications. Most of the research conducted in this thesis has been focused on the improvement of speech intelligibility in hearing aids, considering the variety of restrictions and limitations imposed by this type of device. The combination of source separation techniques and spatial filtering with machine learning and evolutionary computation has given rise to novel and interesting algorithms which are included in this thesis. The thesis is divided into two main parts. The first one contains a preliminary study of the problem and a ...

Ayllón, David — Universidad de Alcalá


Probabilistic Model-Based Multiple Pitch Tracking of Speech

Multiple pitch tracking of speech is an important task for the segregation of multiple speakers in a single-channel recording. In this thesis, a probabilistic model-based approach for estimation and tracking of multiple pitch trajectories is proposed. A probabilistic model that captures pitch-dependent characteristics of the single-speaker short-time spectrum is obtained a priori from clean speech data. The resulting speaker model, which is based on Gaussian mixture models, can be trained either in a speaker independent (SI) or a speaker dependent (SD) fashion. Speaker models are then combined using an interaction model to obtain a probabilistic description of the observed speech mixture. A factorial hidden Markov model is applied for tracking the pitch trajectories of multiple speakers over time. The probabilistic model-based approach is capable of explicitly incorporating timbral information and all associated uncertainties of spectral structure into the model. While ...

Wohlmayr, Michael — Graz University of Technology


Radial Basis Function Network Robust Learning Algorithms in Computer Vision Applications

This thesis introduces new learning algorithms for Radial Basis Function (RBF) networks. An RBF network is a feed-forward two-layer neural network used for functional approximation or pattern classification applications. The proposed training algorithms are based on robust statistics. Their theoretical performance has been assessed and compared with that of classical algorithms for training RBF networks. The applications of RBF networks described in this thesis consist of simultaneously modeling moving object segmentation and optical flow estimation in image sequences and 3-D image modeling and segmentation. A Bayesian classifier model is used for the representation of the image sequence and 3-D images. This employs an energy based description of the probability functions involved. The energy functions are represented by RBF networks whose inputs are various features drawn from the images and whose outputs are objects. The hidden units embed kernel functions. Each kernel ...

Bors, Adrian G. — Aristotle University of Thessaloniki
