Sparsity in Linear Predictive Coding of Speech

This thesis deals with developing improved modeling methods for speech and audio processing based on recent developments in sparse signal representation. In particular, this work is motivated by the need to address some of the limitations of the well-known linear prediction (LP) based all-pole models currently applied in many modern speech and audio processing systems. In the first part of this thesis, we introduce \emph{Sparse Linear Prediction}, a set of speech processing tools created by introducing sparsity constraints into the LP framework. This approach defines predictors that seek a sparse residual rather than a minimum-variance one, which has direct applications to coding and is also consistent with the speech production model of voiced speech, where the excitation of the all-pole filter is modeled as an impulse train. Introducing sparsity into the LP framework also leads to the development of the ...
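As a rough, self-contained illustration of the core idea (not code from the thesis): a predictor that favours a sparse residual can be approximated by minimising the l1 norm of the prediction error instead of its variance, e.g. via iteratively reweighted least squares (IRLS). The function name, prediction order, and iteration count below are all illustrative.

```python
import numpy as np

def sparse_lp(x, order, iters=20, eps=1e-8):
    """Estimate LP coefficients by minimising the l1 norm of the
    residual via IRLS, rather than the usual l2 norm.
    x: 1-D signal frame, order: prediction order (illustrative sketch)."""
    n = len(x)
    # Prediction matrix: row t holds the `order` past samples of x[t].
    X = np.column_stack([x[order - k - 1:n - k - 1] for k in range(order)])
    y = x[order:]
    w = np.ones(len(y))
    a = np.zeros(order)
    for _ in range(iters):
        # Weighted least-squares step: scale rows by sqrt of the weights.
        Xw = X * w[:, None]
        a, *_ = np.linalg.lstsq(Xw, y * w, rcond=None)
        r = y - X @ a
        # Reweight: rows with small residual get large weight,
        # so the solution approaches the l1-optimal predictor.
        w = 1.0 / np.sqrt(np.abs(r) + eps)
    return a, y - X @ a
```

On a voiced-speech-like signal (an all-pole filter driven by an impulse train), most residual samples can be driven to zero, which an l2 criterion does not encourage.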

Giacobello, Daniele — Aalborg University


Advances in Perceptual Stereo Audio Coding Using Linear Prediction Techniques

A wide range of techniques for coding single-channel speech and audio signals has been developed over the last few decades. In addition to pure redundancy reduction, sophisticated source and receiver models have been considered for reducing the bit-rate. Traditionally, speech and audio coders are based on different principles, and thus each of them offers certain advantages. With the advent of high-capacity channels, networks, and storage systems, the bit-rate versus quality compromise will no longer be the major issue; instead, attributes like low delay, scalability, computational complexity, and error concealment in packet-oriented networks are expected to be the major selling factors. Typical audio coders such as MP3 and AAC are based on subband or transform coding techniques that are not easily reconcilable with a low-delay requirement. The reasons for their inherently longer delay are the relatively long band-splitting filters ...

Biswas, Arijit — Technische Universiteit Eindhoven


Embedded Optimization Algorithms for Perceptual Enhancement of Audio Signals

This thesis investigates the design and evaluation of an embedded optimization framework for the perceptual enhancement of audio signals degraded by linear and/or nonlinear distortion. In general, audio signal enhancement aims to improve the perceived audio quality, speech intelligibility, or another desired perceptual attribute of the distorted audio signal by applying a real-time digital signal processing algorithm. In the designed embedded optimization framework, the audio signal enhancement problem under consideration is formulated and solved as a per-frame numerical optimization problem, making it possible to compute the enhanced audio signal frame that is optimal according to a desired perceptual attribute. The first stage of the embedded optimization framework consists of formulating the per-frame optimization problem aimed at maximally enhancing the desired perceptual attribute, by explicitly incorporating a suitable model of human sound perception. The second stage of ...

Defraene, Bruno — KU Leuven


Identification of versions of the same musical composition by processing audio descriptions

Automatically making sense of digital information, and especially of digital music documents, is an important problem our modern society is facing. In fact, there are still many tasks that, although easily performed by humans, cannot be effectively performed by a computer. In this work we focus on one such task: the identification of musical piece versions (alternate renditions of the same musical composition, such as cover songs, live recordings, remixes, etc.). In particular, we adopt a computational approach based solely on the information provided by the audio signal. We propose a system for version identification that is robust to the main musical changes between versions, including timbre, tempo, key, and structure changes. Such a system exploits nonlinear time series analysis tools and standard methods for quantitative music description, and it does not make use of a specific modeling strategy ...

Serra, Joan — Universitat Pompeu Fabra


Simulation Methods for Linear and Nonlinear Time Series Models with Application to Distorted Audio Signals

This dissertation is concerned with the development of Markov chain Monte Carlo (MCMC) methods for the Bayesian restoration of degraded audio signals. First, the Bayesian approach to time series modelling is reviewed, then established MCMC methods are introduced. The first problem to be addressed is that of model order uncertainty. A reversible-jump sampler is proposed which can move between models of different order. It is shown that faster convergence can be achieved by exploiting the analytic structure of the time series model. This approach to model order uncertainty is applied to the problem of noise reduction using the simulation smoother. The effects of incorrect autoregressive (AR) model orders are demonstrated, and a mixed model order MCMC noise reduction scheme is developed. Nonlinear time series models are surveyed, and the advantages of linear-in-the-parameters models explained. A nonlinear AR (NAR) model, ...

Troughton, Paul Thomas — University of Cambridge


Adaptive Sparse Coding and Dictionary Selection

Sparse coding is the approximation/representation of signals with the minimum number of coefficients, using an overcomplete set of elementary functions. Such approximations/representations have found numerous applications in source separation, denoising, coding, and compressed sensing. The adaptation of the sparse approximation framework to the signal coding problem is investigated in this thesis. Open problems include the selection of appropriate models and their orders, coefficient quantization, and the sparse approximation method. Some of these questions are addressed in this thesis and novel methods are developed. Because almost all recent communication and storage systems are digital, a simple method for computing quantized sparse approximations is introduced in the first part. The model selection problem is investigated next. The linear model can be adapted to better fit a given signal class. It can also be designed based on some a priori information ...
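To make "approximation with the minimum number of coefficients over an overcomplete set" concrete, here is a minimal sketch (not from the thesis) of Orthogonal Matching Pursuit, a standard greedy sparse approximation algorithm; the dictionary contents and sizes are illustrative.

```python
import numpy as np

def omp(D, y, n_atoms):
    """Orthogonal Matching Pursuit: greedily select columns of the
    overcomplete dictionary D (assumed unit-norm) to approximate y
    with at most n_atoms nonzero coefficients. Illustrative sketch."""
    residual = y.copy()
    support = []
    coeffs = np.zeros(D.shape[1])
    sol = np.zeros(0)
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    coeffs[support] = sol
    return coeffs
```

Quantizing the few surviving coefficients, rather than a full transform, is the kind of step the coding application described above builds on.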

Yaghoobi, Mehrdad — University of Edinburgh


Nonlinear analysis of speech from a synthesis perspective

With the emergence of nonlinear dynamical systems analysis over recent years it has become clear that conventional time domain and frequency domain approaches to speech synthesis may be far from optimal. Using state space reconstructions of the time domain speech signal it is, at least in theory, possible to investigate a number of invariant geometrical measures for the underlying system which give a more thorough understanding of the dynamics of the system and therefore the form that any model should take. This thesis introduces a number of nonlinear dynamical analysis tools which are then applied to a database of vowels to extract the underlying invariant geometrical properties. The results of this analysis are then applied, using ideas taken from nonlinear dynamics, to the problem of speech synthesis and a novel synthesis technique is described and demonstrated. The tools used for ...
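The "state space reconstruction of the time domain speech signal" mentioned above is typically performed with the method of delays (Takens embedding): each state vector stacks time-delayed copies of the scalar signal. A minimal sketch, with illustrative embedding dimension and delay (not parameters taken from the thesis):

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Reconstruct a state-space trajectory from a scalar time series
    via the method of delays: row t is
    (x[t], x[t + tau], ..., x[t + (dim - 1) * tau])."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[k * tau : k * tau + n] for k in range(dim)])
```

Invariant geometrical measures (e.g. correlation dimension or Lyapunov exponents) are then estimated from the embedded trajectory rather than from the raw waveform.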

Banbrook, Mike — University Of Edinburgh


Music Language Models for Automatic Music Transcription

Much like natural language, music is highly structured, with strong priors on the likelihood of note sequences. In automatic speech recognition (ASR), these priors are called language models, which are used in addition to acoustic models and contribute greatly to the success of today's systems. However, in Automatic Music Transcription (AMT), ASR's musical equivalent, Music Language Models (MLMs) are rarely used. AMT can be defined as the process of extracting a symbolic representation from an audio signal, describing which notes were played at what time. In this thesis, we investigate the design of MLMs using recurrent neural networks (RNNs) and their use for AMT. We first look into MLM performance on a polyphonic prediction task. We observe that using musically-relevant timesteps results in desirable MLM behaviour, which is not reflected in the usual evaluation metrics. We compare our model against benchmark ...

Ycart, Adrien — Queen Mary University of London


Audio Signal Processing for Binaural Reproduction with Improved Spatial Perception

Binaural technology aims to reproduce three-dimensional auditory scenes with a high level of realism by providing the auditory display with spatial hearing information. This technology has various applications in virtual acoustics, architectural acoustics, telecommunication and auditory science. One key element in binaural technology is the actual binaural signals, produced by filtering a sound-field with free-field head related transfer functions (HRTFs). With the increased popularity of spherical microphone arrays for sound-field recording, methods have been developed for rendering binaural signals from these recordings. The use of spherical arrays naturally leads to processing methods that are formulated in the spherical harmonics (SH) domain. For accurate SH representation, high-order functions, of both the sound-field and the HRTF, are required. However, a limited number of microphones, on one hand, and challenges in acquiring high resolution individual HRTFs, on the other hand, impose limitations on ...

Ben-Hur, Zamir — Ben-Gurion University of the Negev


Speech dereverberation in noisy environments using time-frequency domain signal models

Reverberation is the sum of reflected sound waves and is present in any conventional room. Speech communication devices such as mobile phones in hands-free mode, tablets, smart TVs, teleconferencing systems, hearing aids, and voice-controlled systems use one or more microphones to pick up the desired speech signals. When the microphones are not in the proximity of the desired source, strong reverberation and noise can degrade the signal quality at the microphones and can impair the intelligibility and the performance of automatic speech recognizers. There is therefore a strong demand for processing the microphone signals such that reverberation and noise are reduced. The process of reducing or removing reverberation from recorded signals is called dereverberation. As dereverberation is usually a completely blind problem, where the only available information is the microphone signals, and as the acoustic scenario can be non-stationary, ...

Braun, Sebastian — Friedrich-Alexander Universität Erlangen-Nürnberg


Dialogue Enhancement and Personalization - Contributions to Quality Assessment and Control

The production and delivery of audio for television involve many creative and technical challenges. One of them concerns the level balance between the foreground speech (also referred to as dialogue) and the background elements, e.g., music, sound effects, and ambient sounds. Background elements are fundamental to the narrative and to creating an engaging atmosphere, but they can mask the dialogue, which the audience wishes to follow in a comfortable way. Very different individual factors among the audience clash with the creative freedom of the content creators. As a result, service providers receive regular complaints about difficulties in understanding the dialogue because of too-loud background sounds. While this has been a known issue for at least three decades, works analyzing the problem and up-to-date statistics were scarce before the contributions in this work. Enabling the ...

Torcoli, Matteo — Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)


Modern Optimization Methods for Interpolation of Missing Sections in Audio Signals

Damage to audio signals is common in practice, yet undesirable. Information loss can occur due to improper recording (low sample rate or dynamic range), transmission errors (sample dropout), media damage, or noise. The removal of such disturbances is possible by solving inverse problems. Specifically, this work focuses on the situation where sections of an audio signal with lengths on the order of tens of milliseconds are completely lost, and the goal is to interpolate the missing samples based on the unimpaired context and a suitable signal model. The first part of the dissertation is devoted to convex and non-convex optimization methods, which are designed to find a solution to the interpolation problem based on the assumption of sparsity of the time-frequency spectrum. The general background and some algorithms are taken from the literature and adapted to the interpolation problem, ...
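As a toy illustration of sparsity-based interpolation (not the algorithms developed in the dissertation): assume the signal is sparse in the DFT domain, soft-threshold the spectrum, and re-impose the known samples on every iteration. The threshold, frame length, and iteration count below are illustrative.

```python
import numpy as np

def inpaint_sparse(x, mask, lam=0.05, iters=200):
    """Fill missing samples (mask == False) under a DFT-sparsity prior.
    ISTA-style loop: soft-threshold the spectrum, then restore the
    observed samples (data consistency). Illustrative sketch."""
    est = np.where(mask, x, 0.0)
    n = len(est)
    for _ in range(iters):
        c = np.fft.fft(est) / n
        # Complex soft thresholding promotes a sparse spectrum.
        mag = np.maximum(np.abs(c), 1e-12)
        c = np.where(mag > lam, (1.0 - lam / mag) * c, 0.0)
        est = np.real(np.fft.ifft(c * n))
        # Data consistency: keep the observed samples unchanged.
        est[mask] = x[mask]
    return est
```

For a gap of a few samples in a nearly periodic signal, the thresholded spectrum concentrates on the true frequencies and the gap is filled with a consistent continuation (with some amplitude shrinkage from the soft threshold, which more refined algorithms correct).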

Mokrý, Ondřej — Brno University of Technology


Multi-microphone speech enhancement: An integration of a priori and data-dependent spatial information

A speech signal captured by multiple microphones is often subject to reduced intelligibility and quality due to the presence of noise and room acoustic interferences. Multi-microphone speech enhancement systems therefore aim at the suppression or cancellation of such undesired signals without substantial distortion of the speech signal. A fundamental aspect of the design of several multi-microphone speech enhancement systems is the spatial information which relates each microphone signal to the desired speech source. This spatial information is unknown in practice and has to be estimated. Under certain conditions, however, the estimated spatial information can be inaccurate, which subsequently degrades the performance of a multi-microphone speech enhancement system. This doctoral dissertation focuses on the development and evaluation of acoustic signal processing algorithms to address this issue. Specifically, as opposed to conventional means of estimating ...

Ali, Randall — KU Leuven


Music Pre-Processing for Cochlear Implants

A Cochlear Implant (CI) is a medical device that enables profoundly hearing-impaired people to perceive sounds by electrically stimulating the auditory nerve using an electrode array implanted in the cochlea. The focus of most research on signal processing for CIs has been on strategies to improve speech understanding in quiet and in background noise, since the main aim of implanting a CI was (and still is) to restore the ability to communicate. Most CI users perform quite well in terms of speech understanding. On the other hand, music perception and appreciation are generally very poor. The main goal of this PhD project was to investigate and improve music enjoyment in CI users. An initial experiment with multi-track recordings was carried out to examine the music mixing preferences for different instruments in polyphonic or complex music. In ...

Buyens, Wim — KU Leuven


An iterative, residual-based approach to unsupervised musical source separation in single-channel mixtures

This thesis concentrates on a major problem within audio signal processing, the separation of source signals from musical mixtures when only a single mixture channel is available. Source separation is the process by which signals that correspond to distinct sources are identified in a signal mixture and extracted from it. Producing multiple entities from a single one is an extremely underdetermined task, so additional prior information can assist in setting appropriate constraints on the solution set. The approach proposed uses prior information such that: (1) it can potentially be applied successfully to a large variety of musical mixtures, and (2) it requires minimal user intervention and no prior learning/training procedures (i.e., it is an unsupervised process). This system can be useful for applications such as remixing, creative effects, restoration and for archiving musical material for internet delivery, amongst others. Here, ...

Siamantas, Georgios — University of York
