Acoustic sensor network geometry calibration and applications

In the modern world, we are increasingly surrounded by computation devices with communication links and one or more microphones. Such devices are, for example, smartphones, tablets, laptops or hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens the possibility for many practical applications. ASN based speech enhancement, source localization, and event detection can be applied for teleconferencing, camera control, automation, or assisted living. For this kind of applications, the awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization ...

Plinge, Axel — TU Dortmund University


Sound Event Detection by Exploring Audio Sequence Modelling

Everyday sounds in real-world environments are a powerful source of information by which humans can interact with their environments. Humans can infer what is happening around them by listening to everyday sounds. At the same time, it is a challenging task for a computer algorithm in a smart device to automatically recognise, understand, and interpret everyday sounds. Sound event detection (SED) is the process of transcribing an audio recording into sound event tags with onset and offset time values. This involves classification and segmentation of sound events in the given audio recording. SED has numerous applications in everyday life which include security and surveillance, automation, healthcare monitoring, multimedia information retrieval, and assisted living technologies. SED is to everyday sounds what automatic speech recognition (ASR) is to speech and automatic music transcription (AMT) is to music. The fundamental questions in designing ...

[Pankajakshan], [Arjun] — Queen Mary University of London


Efficient parametric modeling, identification and equalization of room acoustics

Room acoustic signal enhancement (RASE) applications, such as digital equalization, acoustic echo and feedback cancellation, which are commonly found in communication devices and audio equipment, aim at processing the acoustic signals with the final goal of improving the perceived sound quality in rooms. In order to do so, signal processing algorithms require the acoustic response of the room to be represented by means of parametric models and to be identified from the input and output signals of the room acoustic system. In particular, a good model should be both accurate, thus capturing those features of room acoustics that are physically and perceptually most relevant, and efficient, so that it can be implemented as a digital filter and used in practical signal processing tasks. This thesis addresses the fundamental question in room acoustic signal processing concerning the appropriateness of different parametric ...

Vairetti, Giacomo — KU Leuven


Digital Processing Based Solutions for Life Science Engineering Recognition Problems

The field of Life Science Engineering (LSE) is rapidly expanding and predicted to grow strongly in the next decades. It covers areas of food and medical research, plant and pests’ research, and environmental research. In each research area, engineers try to find equations that model a certain life science problem. Once found, they research different numerical techniques to solve for the unknown variables of these equations. Afterwards, solution improvement is examined by adopting more accurate conventional techniques, or developing novel algorithms. In particular, signal and image processing techniques are widely used to solve those LSE problems require pattern recognition. However, due to the continuous evolution of the life science problems and their natures, these solution techniques can not cover all aspects, and therefore demanding further enhancement and improvement. The thesis presents numerical algorithms of digital signal and image processing to ...

Hussein, Walid — Technische Universität München


Mixed structural models for 3D audio in virtual environments

In the world of Information and communications technology (ICT), strategies for innovation and development are increasingly focusing on applications that require spatial representation and real-time interaction with and within 3D-media environments. One of the major challenges that such applications have to address is user-centricity, reflecting e.g. on developing complexity-hiding services so that people can personalize their own delivery of services. In these terms, multimodal interfaces represent a key factor for enabling an inclusive use of new technologies by everyone. In order to achieve this, multimodal realistic models that describe our environment are needed, and in particular models that accurately describe the acoustics of the environment and communication through the auditory modality are required. Examples of currently active research directions and application areas include 3DTV and future internet, 3D visual-sound scene coding, transmission and reconstruction and teleconferencing systems, to name but ...

Geronazzo, Michele — University of Padova


Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora

When exposed to noise, speakers will modify the way they speak in an effort to maintain intelligible communication. This process, which is referred to as Lombard effect (LE), involves a combination of both conscious and subconscious articulatory adjustment. Speech production variations due to LE can cause considerable degradation in automatic speech recognition (ASR) since they introduce a mismatch between parameters of the speech to be recognized and the ASR system’s acoustic models, which are usually trained on neutral speech. The main objective of this thesis is to analyze the impact of LE on speech production and to propose methods that increase ASR system performance in LE. All presented experiments were conducted on the Czech spoken language, yet, the proposed concepts are assumed applicable to other languages. The first part of the thesis focuses on the design and acquisition of a ...

Boril, Hynek — Czech Technical University in Prague


Deep Learning for Audio Effects Modeling

Audio effects modeling is the process of emulating an audio effect unit and seeks to recreate the sound, behaviour and main perceptual features of an analog reference device. Audio effect units are analog or digital signal processing systems that transform certain characteristics of the sound source. These transformations can be linear or nonlinear, time-invariant or time-varying and with short-term and long-term memory. Most typical audio effect transformations are based on dynamics, such as compression; tone such as distortion; frequency such as equalization; and time such as artificial reverberation or modulation based audio effects. The digital simulation of these audio processors is normally done by designing mathematical models of these systems. This is often difficult because it seeks to accurately model all components within the effect unit, which usually contains mechanical elements together with nonlinear and time-varying analog electronics. Most existing ...

Martínez Ramírez, Marco A — Queen Mary University of London


Synthetic reproduction of head-related transfer functions by using microphone arrays

Spatial hearing for human listeners is based on the interaural as well as on the monaural analysis of the signals arriving at both ears, enabling the listeners to assign certain spatial components to these signals. This spatial aspect gets lost when the signals are reproduced via headphones without considering the acoustical influence of the head and torso, i.e. head-related transfer function (HRTFs). A common procedure to take into account spatial aspects in a binaural reproduction is to use so-called artificial heads. Artificial heads are replicas of a human head and torso with average anthropometric geometries and built-in microphones in the ears. Although, the signals recorded with artificial heads contain relevant spatial aspects, binaural recordings using artificial heads often suffer from front-back confusions and the perception of the sound source being inside the head (internalization). These shortcomings can be attributed to ...

Rasumow, Eugen — University of Oldenburg


Audio Watermarking, Steganalysis Using Audio Quality Metrics, and Robust Audio Hashing

We propose a technique for the problem of detecting the very presence of hidden messages in an audio object. The detector is based on the characteristics of the denoised residuals of the audio file. Our proposition is established upon the presupposition that the hidden message in a cover object leaves statistical evidence that can be detected with the use of some audio distortion measures. The distortions caused by hidden message are measured in terms of objective and perceptual quality metrics. The detector discriminates between cover and stego files using a selected subset of features and an SVM classifier. We have evaluated the detection performance of the proposed steganalysis technique with the well-known watermarking and steganographic methods. We present novel and robust audio fingerprinting techniques based on the summarization of the time-frequency spectral characteristics of an audio object. The perceptual hash ...

Ozer, Hamza — Bogazici University


A Computational Framework for Sound Segregation in Music Signals

Music is built from sound, ultimately resulting from an elaborate interaction between the sound-generating properties of physical objects (i.e. music instruments) and the sound perception abilities of the human auditory system. Humans, even without any kind of formal music training, are typically able to ex- tract, almost unconsciously, a great amount of relevant information from a musical signal. Features such as the beat of a musical piece, the main melody of a complex musical ar- rangement, the sound sources and events occurring in a complex musical mixture, the song structure (e.g. verse, chorus, bridge) and the musical genre of a piece, are just some examples of the level of knowledge that a naive listener is commonly able to extract just from listening to a musical piece. In order to do so, the human auditory system uses a variety of cues ...

Martins, Luis Gustavo — Universidade do Porto


Discrete-time speech processing with application to emotion recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of the audio recordings that are derived from a call center. The thesis is comprised of two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems that are developed are prerequisite for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way the volume of the audio recordings that need to be processed is significantly reduced, while the system is automated. To detect the presence of a dialogue several systems are developed. This is the first effort found in the international literature that the audio channel is exclusively exploited. Also, it is the first time that the speaker utterance ...

Kotti, Margarita — Aristotle University of Thessaloniki


Emotion assessment for affective computing based on brain and peripheral signals

Current Human-Machine Interfaces (HMI) lack of “emotional intelligence”, i.e. they are not able to identify human emotional states and take this information into account to decide on the proper actions to execute. The goal of affective computing is to fill this lack by detecting emotional cues occurring during Human-Computer Interaction (HCI) and synthesizing emotional responses. In the last decades, most of the studies on emotion assessment have focused on the analysis of facial expressions and speech to determine the emotional state of a person. Physiological activity also includes emotional information that can be used for emotion assessment but has received less attention despite of its advantages (for instance it can be less easily faked than facial expressions). This thesis reports on the use of two types of physiological activities to assess emotions in the context of affective computing: the activity ...

Chanel, Guillaume — University of Geneva


Diplophonic Voice - Definitions, models, and detection

Voice disorders need to be better understood because they may lead to reduced job chances and social isolation. Correct treatment indication and treatment effect measurements are needed to tackle these problems. They must rely on robust outcome measures for clinical intervention studies. Diplophonia is a severe and often misunderstood sign of voice disorders. Depending on its underlying etiology, diplophonic patients typically receive treatment such as logopedic therapy or phonosurgery. In the current clinical practice diplophonia is determined auditively by the medical doctor, which is problematic from the viewpoints of evidence-based medicine and scientific methodology. The aim of this thesis is to work towards objective (i.e., automatic) detection of diplophonia. A database of 40 euphonic, 40 diplophonic and 40 dysphonic subjects has been acquired. The collected material consists of laryngeal high-speed videos and simultaneous high-quality audio recordings. All material has been ...

Aichinger, Philipp — Division of Phoniatrics-Logopedics, Department of Otorhinolaryngology, Medical University of Vienna; Signal Processing and Speech Communication Laboratory Graz University of Technology, Austria


Oscillator-plus-Noise Modeling of Speech Signals

In this thesis we examine the autonomous oscillator model for synthesis of speech signals. The contributions comprise an analysis of realizations and training methods for the nonlinear function used in the oscillator model, the combination of the oscillator model with inverse filtering, both significantly increasing the number of `successfully' re-synthesized speech signals, and the introduction of a new technique suitable for the re-generation of the noise-like signal component in speech signals. Nonlinear function models are compared in a one-dimensional modeling task regarding their presupposition for adequate re-synthesis of speech signals, in particular considering stability. The considerations also comprise the structure of the nonlinear functions, with the aspect of the possible interpolation between models for different speech sounds. Both regarding stability of the oscillator and the premiss of a nonlinear function structure that may be pre-defined, RBF networks are found a ...

Rank, Erhard — Vienna University of Technology


Contributions to the Information Fusion : application to Obstacle Recognition in Visible and Infrared Images

The interest for the intelligent vehicle field has been increased during the last years, must probably due to an important number of road accidents. Many accidents could be avoided if a device attached to the vehicle would assist the driver with some warnings when dangerous situations are about to appear. In recent years, leading car developers have recorded significant efforts and support research works regarding the intelligent vehicle field where they propose solutions for the existing problems, especially in the vision domain. Road detection and following, pedestrian or vehicle detection, recognition and tracking, night vision, among others are examples of applications which have been developed and improved recently. Still, a lot of challenges and unsolved problems remain in the intelligent vehicle domain. Our purpose in this thesis is to design an Obstacle Recognition system for improving the road security by ...

Apatean, Anca Ioana — Institut National des Sciences Appliquées de Rouen

The current layout is optimized for mobile phones. Page previews, thumbnails, and full abstracts will remain hidden until the browser window grows in width.

The current layout is optimized for tablet devices. Page previews and some thumbnails will remain hidden until the browser window grows in width.