Lapped Nonuniform Orthogonal Transforms with Compact Support

Filterbanks are an integral part of most perceptual coder systems, tasked with shaping the noise produced by the quantizer in the encoder. Because of this shaping, the quantizer noise can then be controlled to stay below the masking threshold of the human ear, and become inaudible. In most current perceptual coders, an unmodified MDCT is used as the filterbank, as the MDCT has many properties that make it a good choice for this scenario. One disadvantage, however, is the uniform time-frequency resolution of the MDCT. This stands in contrast to the human auditory system, which has a non-uniform time-frequency resolution. This mismatch results in an unexploited gap that, if closed, could lead to a more efficient perceptual audio coder. Previous work has attempted to design non-uniform filterbanks using MDCTs and subband merging, but with system ...

Werner, Nils — Friedrich-Alexander-Universität Erlangen-Nürnberg
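
The abstract centers on the MDCT analysis filterbank and its uniform time-frequency resolution. As a point of reference, the following minimal sketch (Python with NumPy, using a sine window and one common sign/normalization convention, both assumed here for illustration) computes the MDCT of 50 % overlapped frames; it is not the nonuniform construction proposed in the thesis.

```python
import numpy as np

def mdct(frame):
    """MDCT of one frame of length 2N, returning N coefficients (sine window)."""
    two_n = len(frame)
    n_bands = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_bands)
    window = np.sin(np.pi / two_n * (n + 0.5))
    phase = np.pi / n_bands * (n[None, :] + 0.5 + n_bands / 2) * (k[:, None] + 0.5)
    return np.cos(phase) @ (window * frame)

# Uniform time-frequency tiling: 50 % overlapped frames, N bands each.
signal = np.random.randn(4096)
N = 512
spectra = [mdct(signal[i:i + 2 * N]) for i in range(0, len(signal) - 2 * N + 1, N)]
```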


Efficient Perceptual Audio Coding Using Cosine and Sine Modulated Lapped Transforms

The increasing number of simultaneous input and output channels utilized in immersive audio configurations, primarily in broadcasting applications, has renewed industrial requirements for efficient audio coding schemes with low bit-rate and complexity. This thesis presents a comprehensive review and extension of conventional approaches for perceptual coding of arbitrary multichannel audio signals. Particular emphasis is given to use cases ranging from two-channel stereophonic to six-channel 5.1-surround setups, with or without the application-specific constraint of low algorithmic coding latency. Conventional perceptual audio codecs share six common algorithmic components, all of which are examined extensively in this thesis. The first is a signal-adaptive filterbank, constructed using instances of the real-valued modified discrete cosine transform (MDCT), to obtain spectral representations of successive portions of the incoming discrete time signal. Within this MDCT spectral domain, various intra- and inter-channel optimizations, most of which are of ...

Helmrich, Christian R. — Friedrich-Alexander-Universität Erlangen-Nürnberg
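
As one concrete example of the kind of inter-channel optimization mentioned above, the sketch below shows a plain mid/side transform of two channel spectra in an orthonormal scaling. It is an illustrative textbook tool for MDCT-domain stereo coding, not a description of the joint-channel processing developed in the thesis.

```python
import numpy as np

def ms_transform(left, right):
    """Mid/side transform of two channel spectra (orthonormal scaling)."""
    mid = (left + right) / np.sqrt(2.0)
    side = (left - right) / np.sqrt(2.0)
    return mid, side

def ms_inverse(mid, side):
    """Inverse transform back to left/right channel spectra."""
    return (mid + side) / np.sqrt(2.0), (mid - side) / np.sqrt(2.0)
```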


The Removal of Environmental Noise in Cellular Communications by Perceptual Techniques

This thesis describes the application of a perceptually based spectral subtraction algorithm for the enhancement of speech corrupted by non-stationary noise. Through examination of speech enhancement techniques, explanations are given for the choice of magnitude spectral subtraction and for how the human auditory system can be modelled for frequency-domain speech enhancement. It is discovered that the cochlea provides the mechanical speech enhancement in the auditory system through the use of masking. Frequency masking is used in spectral subtraction to improve the algorithm's execution time and to shape the enhancement process, making it sound natural to the ear. A new technique for estimation of background noise is presented, which operates during speech sections as well as pauses. It uses two microphones placed on opposite ends of the cellular handset. Using these, the algorithm determines whether the signal is speech or noise, by ...

Tuffy, Mark — University Of Edinburgh
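
For orientation, here is a minimal single-channel magnitude spectral subtraction sketch; the frame length, over-subtraction factor, and spectral floor are arbitrary illustrative values, and the two-microphone noise estimator and masking-based perceptual shaping described in the abstract are not reproduced.

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag, frame_len=512, hop=256,
                         oversub=2.0, floor=0.05):
    """Single-channel magnitude spectral subtraction with over-subtraction and a floor.

    `noise_mag` is an estimated noise magnitude spectrum of length frame_len // 2 + 1.
    """
    window = np.hanning(frame_len)
    enhanced = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)
        magnitude = np.abs(spectrum)
        cleaned = np.maximum(magnitude - oversub * noise_mag, floor * magnitude)
        # 50 % overlap-add; the Hann analysis window sums to (approximately) one.
        enhanced[start:start + frame_len] += np.fft.irfft(
            cleaned * np.exp(1j * np.angle(spectrum)), n=frame_len)
    return enhanced
```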


Sparsity in Linear Predictive Coding of Speech

This thesis deals with developing improved modeling methods for speech and audio processing based on recent developments in sparse signal representation. In particular, this work is motivated by the need to address some of the limitations of the well-known linear prediction (LP) based all-pole models currently applied in many modern speech and audio processing systems. In the first part of this thesis, we introduce Sparse Linear Prediction, a set of speech processing tools created by introducing sparsity constraints into the LP framework. This approach defines predictors that look for a sparse residual rather than a minimum-variance one, with direct applications to coding, and it is also consistent with the speech production model of voiced speech, where the excitation of the all-pole filter is modeled as an impulse train. Introducing sparsity in the LP framework also leads to the development of the ...

Giacobello, Daniele — Aalborg University
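
A rough sketch of the "sparse residual instead of minimum variance" idea: the 2-norm criterion of classical LP is replaced by an l1-type criterion, approximated here with iteratively reweighted least squares. The solver, order, and iteration counts are assumptions for illustration, not the formulations used in the thesis.

```python
import numpy as np

def sparse_lp(x, order=10, iters=20, eps=1e-6):
    """Linear prediction with an l1 (sparse) residual criterion via IRLS."""
    rows = len(x) - order
    # Row n of X holds the `order` samples preceding x[order + n] (lags 1..order).
    X = np.column_stack([x[order - k - 1: order - k - 1 + rows] for k in range(order)])
    target = x[order:]
    weights = np.ones(rows)
    coeffs = np.zeros(order)
    for _ in range(iters):
        w = np.sqrt(weights)
        coeffs, *_ = np.linalg.lstsq(w[:, None] * X, w * target, rcond=None)
        residual = target - X @ coeffs
        weights = 1.0 / np.maximum(np.abs(residual), eps)  # reweighting approximates the 1-norm
    return coeffs, target - X @ coeffs
```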


Mixed structural models for 3D audio in virtual environments

In the world of information and communications technology (ICT), strategies for innovation and development are increasingly focusing on applications that require spatial representation and real-time interaction with and within 3D-media environments. One of the major challenges that such applications have to address is user-centricity, reflected, for example, in the development of complexity-hiding services that allow people to personalize their own delivery of services. In these terms, multimodal interfaces represent a key factor for enabling an inclusive use of new technologies by everyone. In order to achieve this, multimodal realistic models that describe our environment are needed; in particular, models that accurately describe the acoustics of the environment and communication through the auditory modality are required. Examples of currently active research directions and application areas include 3DTV and the future internet, 3D visual-sound scene coding, transmission and reconstruction, and teleconferencing systems, to name but ...

Geronazzo, Michele — University of Padova


Source-Filter Model Based Single Channel Speech Separation

In a natural acoustic environment, multiple sources are usually active at the same time. The task of source separation is the estimation of the individual source signals from this complex mixture. The challenge of single-channel source separation (SCSS) is to recover more than one source from a single observation. Basically, SCSS methods can be divided into those that try to mimic the human auditory system and model-based methods, which find a probabilistic representation of the individual sources and employ this prior knowledge for inference. This thesis presents several strategies for the separation of two speech utterances mixed into a single channel and is structured in four parts: The first part reviews factorial models in model-based SCSS and introduces the soft-binary mask for signal reconstruction. This mask shows improved performance compared to the soft and the binary masks in automatic speech recognition ...

Stark, Michael — Graz University of Technology
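
The soft and binary masks mentioned in the abstract are standard; the sketch below computes both from estimated per-source power spectrograms and adds a simple convex blend as a stand-in for the idea of an intermediate mask. The thesis's soft-binary mask is defined differently in the text; this is only an illustration of the trade-off.

```python
import numpy as np

def soft_mask(s1_power, s2_power):
    """Wiener-style soft mask from estimated per-source power spectrograms."""
    return s1_power / (s1_power + s2_power + 1e-12)

def binary_mask(s1_power, s2_power):
    """Binary mask: 1 where source 1 dominates a time-frequency cell."""
    return (s1_power > s2_power).astype(float)

def blended_mask(s1_power, s2_power, beta=0.5):
    """Simple convex blend between the binary and soft masks (illustration only)."""
    return beta * binary_mask(s1_power, s2_power) + (1.0 - beta) * soft_mask(s1_power, s2_power)
```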


Joint Source-Cryptographic-Channel Coding for Real-Time Secure Voice Communications on Voice Channels

The growing risk of privacy violation and espionage associated with the rapid spread of mobile communications has renewed interest in the original concept of sending encrypted voice as an audio signal over arbitrary voice channels. The usual methods used for encrypted data transmission over analog telephony turned out to be inadequate for modern vocal links (cellular networks, VoIP) equipped with voice compression, voice activity detection, and adaptive noise suppression algorithms. The limited available bandwidth, nonlinear channel distortion, and signal fading motivate the investigation of a dedicated, joint approach to speech encoding and encryption adapted to modern noisy voice channels. This thesis aims to develop, analyze, and validate secure and efficient schemes for real-time speech encryption and transmission via modern voice channels. In addition to speech encryption, this study covers the security and operational aspects of the whole voice communication system, as this ...

Krasnowski, Piotr — Université Côte d'Azur


Perceptually-Based Signal Features for Environmental Sound Classification

This thesis addresses the problem of automatically classifying environmental sounds, i.e., any non-speech or non-music sounds that can be found in the environment. Broadly speaking, two main processes are needed to perform such classification: signal feature extraction, to compose representative sound patterns, and the machine learning technique that performs the classification of such patterns. The main focus of this research is on the former, studying relevant signal features that optimally represent the sound characteristics since, according to several references, this is a key issue in attaining robust recognition. This type of audio signal differs in many ways from speech or music signals, so specific features should be determined and adapted to its own characteristics. In this sense, new signal features, inspired by the human auditory system and the human perception of sound, are proposed to improve ...

Valero, Xavier — La Salle-Universitat Ramon Llull
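
As a generic baseline for perceptually inspired feature extraction, the sketch below computes log energies in auditory-motivated frequency bands from a single windowed frame; the band edges are left to the caller, and the specific features proposed in the thesis are not reproduced here.

```python
import numpy as np

def log_band_energies(frame, sample_rate, band_edges_hz):
    """Log energies of one windowed frame in auditory-motivated frequency bands."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    energies = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = spectrum[(freqs >= lo) & (freqs < hi)]
        energies.append(np.log(band.sum() + 1e-12))
    return np.array(energies)

# Example: roughly logarithmically spaced bands between 50 Hz and 8 kHz.
edges = np.geomspace(50.0, 8000.0, num=21)
```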


Embedded Optimization Algorithms for Perceptual Enhancement of Audio Signals

This thesis investigates the design and evaluation of an embedded optimization framework for the perceptual enhancement of audio signals which are degraded by linear and/or nonlinear distortion. In general, audio signal enhancement has the goal of improving the perceived audio quality, speech intelligibility, or another desired perceptual attribute of the distorted audio signal by applying a real-time digital signal processing algorithm. In the designed embedded optimization framework, the audio signal enhancement problem under consideration is formulated and solved as a per-frame numerical optimization problem, allowing the computation of the enhanced audio signal frame that is optimal according to a desired perceptual attribute. The first stage of the embedded optimization framework consists of the formulation of the per-frame optimization problem aimed at maximally enhancing the desired perceptual attribute, by explicitly incorporating a suitable model of human sound perception. The second stage of ...

Defraene, Bruno — KU Leuven
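
To make the per-frame numerical optimization idea concrete, here is an illustrative declipping-style instance: reliable samples are kept close to the observation, weighted by a hypothetical perceptual importance vector, a smoothness term fills in the clipped samples, and bound constraints keep them beyond the clipping level. The thesis's actual problem formulations and perceptual model are more elaborate.

```python
import numpy as np
from scipy.optimize import lsq_linear

def enhance_frame(observed, clipped, clip_level, weights, smooth=1e-2):
    """One frame of enhancement posed as a small bound-constrained least-squares problem."""
    n = len(observed)
    reliable = ~clipped
    fit = np.diag(weights)[reliable]             # weighted data-fit rows (reliable samples only)
    diff2 = np.diff(np.eye(n), n=2, axis=0)      # second-difference smoothness rows
    A = np.vstack([fit, np.sqrt(smooth) * diff2])
    b = np.concatenate([weights[reliable] * observed[reliable], np.zeros(n - 2)])
    lower = np.full(n, -np.inf)
    upper = np.full(n, np.inf)
    lower[clipped & (observed > 0)] = clip_level   # clipped peaks must lie beyond the level
    upper[clipped & (observed < 0)] = -clip_level
    return lsq_linear(A, b, bounds=(lower, upper)).x
```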


Exploiting Correlation Noise Modeling in Wyner-Ziv Video Coding

Wyner-Ziv (WZ) video coding is a particular case of distributed video coding, a new video coding paradigm, based on the Slepian-Wolf and Wyner-Ziv theorems, which mainly exploits the source correlation at the decoder rather than only at the encoder, as in predictive video coding. Therefore, this new coding paradigm may provide a flexible allocation of complexity between the encoder and the decoder and in-built channel error robustness, interesting features for emerging applications such as low-power video surveillance and visual sensor networks, among others. Although some progress has been made in the last eight years, the rate-distortion performance of WZ video coding is still far from the maximum performance attained with predictive video coding. The WZ video coding compression efficiency depends critically on the capability to model the correlation noise between the original information at the encoder and its estimation generated ...

Brites, Catarina — Instituto Superior Tecnico (IST)
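
Correlation noise in WZ coding is commonly modeled as zero-mean Laplacian; the sketch below estimates the scale parameter from a residual between the source and its side-information estimate. The online, band- and coefficient-level estimators studied in the thesis are considerably more refined.

```python
import numpy as np

def laplacian_alpha(residual):
    """Scale parameter of a zero-mean Laplacian correlation-noise model, alpha = sqrt(2 / variance)."""
    variance = np.mean(np.square(residual))
    return np.sqrt(2.0 / max(float(variance), 1e-12))
```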


Advances in Perceptual Stereo Audio Coding Using Linear Prediction Techniques

A wide range of techniques for coding single-channel speech and audio signals has been developed over the last few decades. In addition to pure redundancy reduction, sophisticated source and receiver models have been considered for reducing the bit-rate. Traditionally, speech and audio coders are based on different principles and thus each of them offers certain advantages. With the advent of high-capacity channels, networks, and storage systems, the bit-rate versus quality compromise will no longer be the major issue; instead, attributes like low delay, scalability, computational complexity, and error concealment in packet-oriented networks are expected to be the major selling factors. Typical audio coders such as MP3 and AAC are based on subband or transform coding techniques that are not easily reconcilable with a low-delay requirement. The reasons for their inherently longer delay are the relatively long band-splitting filters ...

Biswas, Arijit — Technische Universiteit Eindhoven


A Computational Framework for Sound Segregation in Music Signals

Music is built from sound, ultimately resulting from an elaborate interaction between the sound-generating properties of physical objects (i.e. music instruments) and the sound perception abilities of the human auditory system. Humans, even without any kind of formal music training, are typically able to extract, almost unconsciously, a great amount of relevant information from a musical signal. Features such as the beat of a musical piece, the main melody of a complex musical arrangement, the sound sources and events occurring in a complex musical mixture, the song structure (e.g. verse, chorus, bridge) and the musical genre of a piece, are just some examples of the level of knowledge that a naive listener is commonly able to extract just from listening to a musical piece. In order to do so, the human auditory system uses a variety of cues ...

Martins, Luis Gustavo — Universidade do Porto


Audio Signal Processing for Binaural Reproduction with Improved Spatial Perception

Binaural technology aims to reproduce three-dimensional auditory scenes with a high level of realism by providing the auditory display with spatial hearing information. This technology has various applications in virtual acoustics, architectural acoustics, telecommunication and auditory science. One key element in binaural technology is the actual binaural signals, produced by filtering a sound field with free-field head-related transfer functions (HRTFs). With the increased popularity of spherical microphone arrays for sound-field recording, methods have been developed for rendering binaural signals from these recordings. The use of spherical arrays naturally leads to processing methods that are formulated in the spherical harmonics (SH) domain. For an accurate SH representation, high-order functions of both the sound field and the HRTF are required. However, a limited number of microphones, on one hand, and challenges in acquiring high-resolution individual HRTFs, on the other hand, impose limitations on ...

Ben-Hur, Zamir — Ben-Gurion University of the Negev
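
The core SH-domain rendering step can be written as a per-frequency inner product between the sound-field coefficients and the HRTF coefficients. The sketch below assumes both are given as arrays of shape (num_freqs, (N + 1)**2) and glosses over normalization, ordering, and conjugation conventions, which vary between formulations.

```python
import numpy as np

def binaural_from_sh(anm, hnm_ear):
    """Per-frequency inner product over SH coefficients giving one ear's spectrum.

    Both inputs have shape (num_freqs, (N + 1) ** 2); conjugation, ordering and
    normalization conventions are assumed to match and are not handled here.
    """
    return np.sum(anm * hnm_ear, axis=1)
```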


Prediction and Optimization of Speech Intelligibility in Adverse Conditions

In digital speech-communication systems like mobile phones, public address systems and hearing aids, conveying the message is one of the most important goals. This can be challenging since the intelligibility of the speech may be harmed at various stages before, during and after the transmission process from sender to receiver. Causes of such adverse conditions include background noise, an unreliable internet connection during a Skype conversation, or a hearing impairment of the receiver. To overcome this, many speech-communication systems include speech processing algorithms, such as noise reduction, to compensate for these signal degradations. To determine the effect of these signal-processing-based solutions on speech intelligibility, the speech signal has to be evaluated by means of a listening test with human listeners. However, such tests are costly and time-consuming. As an alternative, reliable and fast machine-driven intelligibility predictors are ...

Taal, Cees — Delft University of Technology
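
A much-simplified sketch of an intrusive, machine-driven intelligibility predictor: average the per-band linear correlation between clean and degraded temporal envelopes taken from the same auditory filterbank. The predictors developed and evaluated in the thesis are more refined; this only illustrates the alternative to listening tests.

```python
import numpy as np

def envelope_correlation_score(clean_env, degraded_env):
    """Mean per-band linear correlation between clean and degraded temporal envelopes.

    Both inputs are (num_bands, num_frames) envelope matrices from the same filterbank.
    """
    scores = []
    for band in range(clean_env.shape[0]):
        c = clean_env[band] - clean_env[band].mean()
        d = degraded_env[band] - degraded_env[band].mean()
        denom = np.linalg.norm(c) * np.linalg.norm(d) + 1e-12
        scores.append(float(c @ d) / denom)
    return float(np.mean(scores))
```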


Integrating monaural and binaural cues for sound localization and segregation in reverberant environments

The problem of segregating a sound source of interest from an acoustic background has been extensively studied due to applications in hearing prostheses, robust speech/speaker recognition and audio information retrieval. Computational auditory scene analysis (CASA) approaches the segregation problem by utilizing grouping cues involved in the perceptual organization of sound by human listeners. Binaural processing, where input signals resemble those that enter the two ears, is of particular interest in the CASA field. The dominant approach to binaural segregation has been to derive spatially selective filters in order to enhance the signal in a direction of interest. As such, the problems of sound localization and sound segregation are closely tied. While spatial filtering has been widely utilized, substantial performance degradation is incurred in reverberant environments and, more fundamentally, segregation cannot be performed without sufficient spatial separation between sources. This dissertation ...

Woodruff, John — The Ohio State University
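
One binaural cue used throughout CASA systems is the interaural time difference; a minimal full-band cross-correlation estimator is sketched below, with a hypothetical 1 ms lag range. The per-band, reverberation-robust localization and the joint localization/segregation developed in the dissertation are not shown.

```python
import numpy as np

def estimate_itd(left, right, sample_rate, max_lag_ms=1.0):
    """Interaural time difference (seconds) as the lag maximizing the cross-correlation."""
    max_lag = int(sample_rate * max_lag_ms / 1000.0)
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.sum(left[max(0, -lag): len(left) - max(0, lag)] *
                    right[max(0, lag): len(right) - max(0, -lag)])
             for lag in lags]
    return lags[int(np.argmax(xcorr))] / sample_rate
```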
