Perceptually-Based Signal Features for Environmental Sound Classification

This thesis addresses the problem of automatically classifying environmental sounds, i.e., any non-speech or non-music sounds found in the environment. Broadly speaking, two main processes are needed to perform such classification: signal feature extraction, which composes representative sound patterns, and the machine learning technique that classifies those patterns. The main focus of this research is on the former: studying relevant signal features that optimally represent sound characteristics, since, according to several references, this is key to attaining robust recognition. These audio signals differ in many respects from speech and music signals, so specific features should be determined and adapted to their particular characteristics. In this sense, new signal features, inspired by the human auditory system and the human perception of sound, are proposed to improve ...
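
As a rough illustration of the perceptual flavour of such features (not the thesis's actual feature set), the sketch below computes log-energies from a mel-spaced triangular filterbank in plain NumPy; the sample rate, filter count and FFT size are arbitrary assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """Convert Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank_energies(frame, sr=16000, n_filters=26, n_fft=512):
    """Log-energies of a mel-spaced triangular filterbank for one frame."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Filter edges equally spaced on the perceptual (mel) scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2))
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        # Triangular weighting rising lo -> ctr and falling ctr -> hi.
        rising = np.clip((freqs - lo) / (ctr - lo), 0.0, 1.0)
        falling = np.clip((hi - freqs) / (hi - ctr), 0.0, 1.0)
        energies[i] = np.sum(spectrum * np.minimum(rising, falling))
    return np.log(energies + 1e-10)

# Example: features for one 32 ms frame of white noise.
frame = np.random.randn(512)
print(mel_filterbank_energies(frame).shape)  # (26,)
```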

Valero, Xavier — La Salle-Universitat Ramon Llull


Interpretable Machine Learning for Machine Listening

Recent years have witnessed significant interest in interpretable machine learning (IML) research, which develops techniques to analyse machine learning (ML) models. Understanding ML models is essential to gain trust in their predictions and to improve datasets, model architectures and training techniques. The majority of effort in IML research has gone into analysing models that classify images or structured data, and comparatively little work analyses models for other domains. This research focuses on developing novel IML methods and on extending existing methods to understand machine listening models that analyse audio. In particular, this thesis reports the results of three studies that apply three different IML methods to analyse five singing voice detection (SVD) models that predict singing voice activity in musical audio excerpts. The first study introduces SoundLIME (SLIME), a method to generate temporal, spectral or time-frequency explanations ...
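
The SLIME details are only hinted at in this excerpt, so the following is a minimal LIME-style sketch of a *temporal* explanation, assuming a black-box `predict_prob` callable standing in for an SVD model: time segments are randomly silenced and a linear surrogate is fitted on the binary occlusion masks. It illustrates the general idea, not the published SLIME implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def temporal_explanation(audio, predict_prob, n_segments=10, n_samples=200, seed=0):
    """LIME-style temporal explanation: which time segments drive a prediction?

    predict_prob: callable mapping a 1-D audio array to a scalar probability
    (e.g., singing-voice presence). Returns one weight per segment; a larger
    weight means the segment contributes more to the prediction.
    """
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, len(audio), n_segments + 1, dtype=int)
    # Random binary masks: which segments are kept in each perturbed sample.
    masks = rng.integers(0, 2, size=(n_samples, n_segments))
    masks[0] = 1  # include the unperturbed input
    preds = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = audio.copy()
        for s in range(n_segments):
            if mask[s] == 0:
                perturbed[bounds[s]:bounds[s + 1]] = 0.0  # silence the segment
        preds[i] = predict_prob(perturbed)
    # Fit an interpretable linear surrogate on the binary mask space.
    surrogate = Ridge(alpha=1.0).fit(masks, preds)
    return surrogate.coef_

# Toy usage with a stand-in "model" that responds to energy in the middle.
audio = np.random.randn(16000)
toy_model = lambda x: float(np.mean(x[6000:10000] ** 2))
print(np.round(temporal_explanation(audio, toy_model), 3))
```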

Mishra, Saumitra — Queen Mary University of London


A Geometric Deep Learning Approach to Sound Source Localization and Tracking

The localization and tracking of sound sources using microphone arrays is a problem that, although it has attracted attention from the signal processing research community for decades, remains open. In recent years, deep learning models have surpassed the state of the art established by classic signal processing techniques, but these models still struggle to handle rooms with strong reverberation or to track multiple sources that dynamically appear and disappear, especially when no criterion can be applied to classify or order them. In this thesis, we follow the ideas of the Geometric Deep Learning framework to propose new models and techniques that advance the state of the art in the aforementioned scenarios. As the input of our models, we use acoustic power maps computed with the SRP-PHAT algorithm, a classic signal processing technique that allows us to estimate the acoustic energy ...
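
Since the abstract names SRP-PHAT explicitly, here is a compact sketch of how such an acoustic power map can be computed: GCC-PHAT between every microphone pair, steered to a grid of candidate source positions. The array geometry, grid and sample rate are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

C = 343.0  # speed of sound (m/s)

def srp_phat_map(signals, mic_pos, grid, sr):
    """SRP-PHAT acoustic power map.

    signals: (n_mics, n_samples) array; mic_pos: (n_mics, 3) positions;
    grid: (n_points, 3) candidate source positions. Returns (n_points,) power.
    """
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sr)
    power = np.zeros(len(grid))
    for i, j in combinations(range(n_mics), 2):
        # PHAT weighting: keep only the phase of the cross-spectrum.
        cross = spectra[i] * np.conj(spectra[j])
        cross /= np.abs(cross) + 1e-12
        # Expected TDOA for every candidate grid point.
        tdoa = (np.linalg.norm(grid - mic_pos[i], axis=1)
                - np.linalg.norm(grid - mic_pos[j], axis=1)) / C
        # Steer the generalized cross-correlation to each candidate delay.
        power += np.real(np.exp(2j * np.pi * freqs[None, :] * tdoa[:, None])
                         @ cross)
    return power

# Toy usage: 4 microphones and a coarse horizontal grid at z = 1 m.
sr, n = 16000, 4096
mic_pos = np.array([[0, 0, 1], [0.1, 0, 1], [0, 0.1, 1], [0.1, 0.1, 1]])
grid = np.stack(np.meshgrid(np.linspace(0, 3, 16), np.linspace(0, 3, 16), [1.0]),
                axis=-1).reshape(-1, 3)
signals = np.random.randn(4, n)
print(srp_phat_map(signals, mic_pos, grid, sr).argmax())
```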

Diaz-Guerra, David — University of Zaragoza


Wavelet Analysis for Robust Speech Processing and Applications

In this work, we study the application of wavelet analysis to robust speech processing. Reliable time-scale (TS) features are extracted that characterize the relevant phonetic classes, such as voiced (V), unvoiced (UV), silence (S), mixed-excitation, and stop sounds. By training neural and Bayesian networks, classification rates obtained with only 7 TS features are shown to be broadly similar to those obtained with 13 MFCC features. The TS features are further enhanced to design a reliable and low-complexity V/UV/S classifier. Quantile filtering and slope tracking are used to derive adaptive thresholds. A robust voice activity detector is then built and used as a pre-processing stage to improve the performance of a speaker verification system. Based on wavelet shrinkage, a statistical wavelet filtering (SWF) method is designed for speech enhancement. Non-stationary and colored noise is handled by employing quantile filtering and time-frequency adaptive ...
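
The exact seven TS features are not given in this excerpt, so the sketch below only shows the general recipe under stated assumptions: per-scale log-energies from a multi-level Haar wavelet decomposition, plus a quantile-based adaptive threshold of the kind the abstract mentions.

```python
import numpy as np

def haar_dwt_energies(frame, levels=5):
    """Log-energy per scale from a multi-level Haar wavelet decomposition."""
    a = frame.astype(float)
    energies = []
    for _ in range(levels):
        if len(a) % 2:  # keep the length even for pairwise averaging
            a = a[:-1]
        approx = (a[0::2] + a[1::2]) / np.sqrt(2.0)
        detail = (a[0::2] - a[1::2]) / np.sqrt(2.0)
        energies.append(np.log(np.sum(detail ** 2) + 1e-10))
        a = approx
    energies.append(np.log(np.sum(a ** 2) + 1e-10))
    return np.array(energies)  # (levels + 1,) fine-to-coarse description

def adaptive_threshold(feature_track, q=0.2):
    """Quantile-filtered threshold: one simple take on noise-adaptive decisions."""
    return np.quantile(feature_track, q)

# Toy V/UV/S-style decision on the coarsest-scale energy of a noisy signal.
x = np.random.randn(16000) * np.r_[np.zeros(8000), np.ones(8000)]
frames = x.reshape(-1, 400)
track = np.array([haar_dwt_energies(f)[-1] for f in frames])
print(track > adaptive_threshold(track, q=0.5))
```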

Pham, Van Tuan — Graz University of Technology


Robust Lung Sound and Acoustic Scene Classification

Auscultation with a stethoscope enables physicians to recognize pathological changes of the lung. It is a fast and inexpensive diagnostic method. However, it has several disadvantages: it is subjective, i.e., the evaluation of lung sounds depends on the experience of the physician; it cannot provide continuous monitoring; and it requires a trained expert. Furthermore, the characteristic content of lung sounds lies in the low-frequency range, where human hearing has limited sensitivity and recordings are susceptible to noise artifacts. Exploiting advances in digital recording devices, signal processing and machine learning, computational methods for the analysis of lung sounds have become a successful and effective approach. Computational lung sound analysis is beneficial for computer-supported diagnosis, digital storage and monitoring in critical care. Besides computational lung sound analysis, the recognition of acoustic contextual information is important in various applications. The motivation for recent research on ...
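
As a hedged sketch of a typical computational lung-sound front end (not the thesis's pipeline), the code below band-passes a recording to an assumed lung-sound band and extracts log-spectrogram frames; the band edges and all other parameters are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, stft

def lung_sound_features(x, sr=8000, band=(50.0, 2000.0)):
    """Band-pass to the (assumed) lung-sound band, then log-spectrogram frames."""
    sos = butter(4, band, btype="bandpass", fs=sr, output="sos")
    x = sosfiltfilt(sos, x)                # zero-phase filtering
    _, _, Z = stft(x, fs=sr, nperseg=256)  # time-frequency representation
    return np.log(np.abs(Z) + 1e-10).T     # one feature vector per frame

x = np.random.randn(4 * 8000)  # 4 s toy recording
print(lung_sound_features(x).shape)
```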

Truc Nguyen — SPSC - TUGraz


Application of Sound Source Separation Methods to Advanced Spatial Audio Systems

This thesis belongs to the field of Sound Source Separation (SSS). It addresses the development and evaluation of SSS techniques for application in the resynthesis of high-realism sound scenes by means of Wave Field Synthesis (WFS). Because the vast majority of audio recordings are preserved in two-channel stereo format, special up-converters are required to use advanced spatial audio reproduction formats such as WFS. This is because WFS needs the original source signals to be available in order to accurately synthesize the acoustic field inside an extended listening area; thus, an object-based mix is required. Source separation problems in digital signal processing are those in which several signals have been mixed together and the objective is to recover the original signals. Therefore, SSS algorithms can be applied to existing two-channel mixtures to ...
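
The thesis's own algorithms are not described in this excerpt; the sketch below illustrates one classic family of two-channel SSS it could build on: binary time-frequency masking driven by inter-channel level differences (in the spirit of DUET). The fixed bin edges stand in for a proper clustering step.

```python
import numpy as np
from scipy.signal import stft, istft

def pan_mask_separation(left, right, sr, n_sources=2):
    """Separate a stereo mix by bucketing inter-channel level differences.

    Assumes amplitude-panned (anechoic) mixing: each T-F bin is assigned to
    the source whose panning position it matches best (binary masking).
    """
    _, _, L = stft(left, fs=sr, nperseg=1024)
    _, _, R = stft(right, fs=sr, nperseg=1024)
    # Panning indicator per bin: 0 = hard left, 1 = hard right.
    ratio = np.abs(R) / (np.abs(L) + np.abs(R) + 1e-12)
    edges = np.linspace(0.0, 1.0, n_sources + 1)
    sources = []
    for k in range(n_sources):
        mask = (ratio >= edges[k]) & (ratio < edges[k + 1] + (k == n_sources - 1))
        _, s = istft((L + R) * mask, fs=sr, nperseg=1024)
        sources.append(s)
    return sources

# Toy usage: two differently panned noise sources.
sr, n = 16000, 16000
s1, s2 = np.random.randn(n), np.random.randn(n)
left, right = 0.9 * s1 + 0.2 * s2, 0.1 * s1 + 0.8 * s2
est = pan_mask_separation(left, right, sr)
print([len(e) for e in est])
```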

Cobos, Maximo — Universidad Politecnica de Valencia


Oscillator-plus-Noise Modeling of Speech Signals

In this thesis we examine the autonomous oscillator model for the synthesis of speech signals. The contributions comprise an analysis of realizations and training methods for the nonlinear function used in the oscillator model, the combination of the oscillator model with inverse filtering (both significantly increasing the number of 'successfully' re-synthesized speech signals), and the introduction of a new technique for re-generating the noise-like signal component of speech signals. Nonlinear function models are compared in a one-dimensional modeling task with regard to their suitability for adequate re-synthesis of speech signals, in particular considering stability. The considerations also cover the structure of the nonlinear functions, with a view to possible interpolation between models for different speech sounds. Both regarding the stability of the oscillator and the premise of a nonlinear function structure that may be pre-defined, RBF networks are found a ...
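
To make the oscillator idea concrete, here is a minimal sketch, assuming a delay-embedding state and a least-squares RBF network as the nonlinear function: the model is trained to predict the next sample, then run autonomously by feeding predictions back. Whether such a free-running loop stays stable is precisely the issue the thesis examines; this toy version may well diverge.

```python
import numpy as np

def fit_rbf(X, y, n_centers=50, width=1.0, ridge=1e-6, seed=0):
    """Least-squares RBF network: y ~ sum_k w_k * exp(-||x - c_k||^2 / width)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_centers, replace=False)]
    Phi = np.exp(-np.sum((X[:, None, :] - centers[None]) ** 2, axis=2) / width)
    w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(n_centers), Phi.T @ y)
    return centers, w

def oscillate(x0, centers, w, width, n_steps):
    """Run the trained map autonomously: feed predictions back as input."""
    state = list(x0)
    out = []
    for _ in range(n_steps):
        phi = np.exp(-np.sum((np.array(state) - centers) ** 2, axis=1) / width)
        nxt = float(phi @ w)
        out.append(nxt)
        state = state[1:] + [nxt]  # slide the delay embedding
    return np.array(out)

# Train on a toy "voiced" waveform, then free-run the oscillator.
t = np.arange(2000)
x = np.sin(2 * np.pi * 100 * t / 8000) + 0.3 * np.sin(2 * np.pi * 200 * t / 8000)
dim = 4
X = np.stack([x[i:i + dim] for i in range(len(x) - dim)])
y = x[dim:]
centers, w = fit_rbf(X, y, width=0.5)
print(oscillate(x[:dim], centers, w, 0.5, 200)[:5])
```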

Rank, Erhard — Vienna University of Technology


Emotion assessment for affective computing based on brain and peripheral signals

Current Human-Machine Interfaces (HMI) lack “emotional intelligence”: they are not able to identify human emotional states and take this information into account when deciding which actions to execute. The goal of affective computing is to fill this gap by detecting emotional cues occurring during Human-Computer Interaction (HCI) and synthesizing emotional responses. In recent decades, most studies on emotion assessment have focused on the analysis of facial expressions and speech to determine a person's emotional state. Physiological activity also carries emotional information that can be used for emotion assessment, but it has received less attention despite its advantages (for instance, it is less easily faked than facial expressions). This thesis reports on the use of two types of physiological activity to assess emotions in the context of affective computing: the activity ...
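
As an illustration of the physiological-feature side (the EEG bands are standard, everything else is an assumption rather than the thesis's protocol), the sketch below computes relative band powers per channel from a Welch PSD estimate, a common starting point for emotion classification from brain signals.

```python
import numpy as np
from scipy.signal import welch

# Conventional EEG bands in Hz; peripheral channels would use other statistics.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_power_features(eeg, sr=256):
    """Relative band power per channel from a Welch PSD estimate.

    eeg: (n_channels, n_samples). Returns (len(BANDS) * n_channels,) features.
    """
    freqs, psd = welch(eeg, fs=sr, nperseg=2 * sr, axis=-1)
    total = psd.sum(axis=-1) + 1e-12
    feats = []
    for lo, hi in BANDS.values():
        sel = (freqs >= lo) & (freqs < hi)
        feats.append(psd[..., sel].sum(axis=-1) / total)
    return np.concatenate(feats)

eeg = np.random.randn(4, 256 * 10)  # 4 channels, 10 s of toy "EEG"
print(band_power_features(eeg).shape)  # (12,)
```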

Chanel, Guillaume — University of Geneva


Audio-visual processing and content management techniques for the study of (human) bioacoustics phenomena

The present doctoral thesis aims at the development of new long-term, multi-channel, audio-visual processing techniques for the analysis of bioacoustic phenomena. The effort focuses on the study of the physiology of the gastrointestinal system, aiming to support medical research on the discovery of gastrointestinal motility patterns and the diagnosis of functional disorders. The term "processing" is used here in a broad sense, incorporating the procedures of signal processing, content description, manipulation and analysis that are applied to all the recorded bioacoustic signals, to the auxiliary audio-visual surveillance information (for monitoring the experiments and the subjects' status), and to the extracted audio-video sequences describing the abdominal sound-field alterations. The thesis outline is as follows. The main objective of the thesis, namely the technological support of medical research, is presented in the first chapter. A quick problem definition is initially ...

Dimoulas, Charalampos — Department of Electrical and Computer Engineering, Faculty of Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece


Acoustic Event Detection: Feature, Evaluation and Dataset Design

It takes more time to think of a silent scene, action or event than to find one that emits sound. Not only speech or music: almost everything that happens is accompanied by, or results in, one or more sounds mixed together. This makes acoustic event detection (AED) one of the most researched topics in audio signal processing today, and interest in it will probably not decline in the near future. This is due to the thirst for understanding and digitally abstracting more and more events in life from the enormous amount of audio recorded by thousands of applications in our daily routine. But it is also a result of two intrinsic properties of audio: it does not require a direct line of sight to be perceived, and it is less intrusive to record than images or video. Many applications such ...
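
The excerpt stays at the motivational level, so the following is only a baseline-style sketch of the simplest possible AED: short-time energy compared against a noise-adaptive (median + MAD) threshold, with contiguous active frames merged into candidate events. All parameters are assumptions.

```python
import numpy as np

def detect_events(x, sr, frame_ms=20, k=3.0):
    """Flag frames whose short-time energy exceeds a noise-adaptive threshold.

    Returns (start, end) sample indices of contiguous candidate events.
    A deliberately simple baseline; real AED uses learned features and models.
    """
    hop = int(sr * frame_ms / 1000)
    n_frames = len(x) // hop
    energy = np.array([np.mean(x[i * hop:(i + 1) * hop] ** 2)
                       for i in range(n_frames)])
    # Median + k * MAD: robust against the events themselves.
    mad = np.median(np.abs(energy - np.median(energy)))
    active = energy > np.median(energy) + k * mad
    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            events.append((start * hop, i * hop))
            start = None
    if start is not None:
        events.append((start * hop, n_frames * hop))
    return events

sr = 16000
x = np.r_[0.01 * np.random.randn(sr), np.random.randn(sr), 0.01 * np.random.randn(sr)]
print(detect_events(x, sr))
```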

Mina Mounir — KU Leuven, ESAT STADIUS


High-Quality Vocoding Design with Signal Processing for Speech Synthesis and Voice Conversion

This Ph.D. thesis focuses on developing a system for high-quality speech synthesis and voice conversion. Vocoder-based speech analysis, manipulation, and synthesis play a crucial role in many kinds of statistical parametric speech research. Although there are vocoding methods that yield close-to-natural synthesized speech, they are typically computationally expensive and thus not suitable for real-time implementation, especially in embedded environments. Therefore, there is a need for simple and computationally feasible digital signal processing algorithms that generate high-quality, natural-sounding synthesized speech. In this dissertation, I propose a solution for extracting optimal acoustic features and a new waveform generator that achieve higher sound quality and conversion accuracy by applying advances in deep learning, while remaining computationally efficient. This challenge resulted in five thesis groups, which are briefly summarized below. I first introduce a new method to shape the ...
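
For orientation, here is a hedged sketch of the classic source-filter loop that any vocoder-based synthesizer performs: an impulse-train or noise excitation filtered by an LPC spectral envelope. It illustrates the mechanism only and is not the continuous vocoder proposed in the dissertation.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc_from_frame(frame, order=12):
    """LPC coefficients via the autocorrelation normal equations."""
    r = np.correlate(frame, frame, "full")[len(frame) - 1:][:order + 1]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.r_[1.0, -a]  # [1, a1, ..., ap] for an all-pole filter

def synth_frame(lpc, f0, voiced, sr=16000, n=320, gain=1.0):
    """One frame of pulse-or-noise excitation shaped by the LPC envelope."""
    if voiced:
        excitation = np.zeros(n)
        excitation[::int(sr / f0)] = 1.0       # impulse train at the pitch period
    else:
        excitation = 0.1 * np.random.randn(n)  # noise excitation for unvoiced
    return lfilter([gain], lpc, excitation)

# Analyze a toy voiced frame, then resynthesize it from its parameters.
sr, n = 16000, 320
frame = np.sin(2 * np.pi * 120 * np.arange(n) / sr) * np.hanning(n)
y = synth_frame(lpc_from_frame(frame), f0=120.0, voiced=True, sr=sr, n=n)
print(y.shape)
```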

Al-Radhi Mohammed Salah — Budapest University of Technology and Economics


Discrete-time speech processing with application to emotion recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of audio recordings derived from a call center. The thesis comprises two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems developed there are prerequisites for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way, the volume of audio recordings that must be processed is significantly reduced, and the system is automated. Several systems are developed to detect the presence of a dialogue. This is the first effort in the international literature in which the audio channel alone is exploited. It is also the first time that the speaker utterance ...
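
Speaker change detection of this kind is classically done with the Bayesian Information Criterion; the sketch below scores a candidate change point by comparing one Gaussian model against two. The method choice is an assumption made for illustration rather than a description of the thesis's detector.

```python
import numpy as np

def bic_change_score(X, t, penalty=1.0):
    """Delta-BIC for a speaker change at frame t in features X (n_frames, dim).

    A positive score favours two Gaussian models (a change) over one.
    """
    n, d = X.shape
    def logdet_cov(Y):
        cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]
    r = n * logdet_cov(X) - t * logdet_cov(X[:t]) - (n - t) * logdet_cov(X[t:])
    n_params = d + d * (d + 1) / 2  # mean + covariance parameters
    return 0.5 * r - 0.5 * penalty * n_params * np.log(n)

# Toy usage: two "speakers" with different feature statistics.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 12)), rng.normal(2, 3, (200, 12))])
scores = [bic_change_score(X, t) for t in range(50, 350)]
print(50 + int(np.argmax(scores)))  # should land near frame 200
```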

Kotti, Margarita — Aristotle University of Thessaloniki


Diplophonic Voice - Definitions, models, and detection

Voice disorders need to be better understood because they may lead to reduced job chances and social isolation. Correct treatment indication and measurement of treatment effects are needed to tackle these problems, and they must rely on robust outcome measures for clinical intervention studies. Diplophonia is a severe and often misunderstood sign of voice disorders. Depending on the underlying etiology, diplophonic patients typically receive treatments such as logopedic therapy or phonosurgery. In current clinical practice, diplophonia is determined auditively by the physician, which is problematic from the viewpoints of evidence-based medicine and scientific methodology. The aim of this thesis is to work towards objective (i.e., automatic) detection of diplophonia. A database of 40 euphonic, 40 diplophonic and 40 dysphonic subjects has been acquired. The collected material consists of laryngeal high-speed videos and simultaneous high-quality audio recordings. All material has been ...
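
Since diplophonia involves two simultaneous oscillations, a crude audio-only check is to look for two strong pitch candidates in the cepstrum, as sketched below. This is purely illustrative: the thesis combines high-speed video with audio, and a real detector needs far more care.

```python
import numpy as np

def pitch_candidates(x, sr, fmin=60.0, fmax=500.0, n_peaks=2):
    """Return the n_peaks strongest pitch candidates (Hz) from the real cepstrum.

    Two strong, non-harmonically-related candidates hint at diplophonia.
    """
    spec = np.log(np.abs(np.fft.rfft(x * np.hanning(len(x)))) + 1e-10)
    ceps = np.fft.irfft(spec)
    qmin, qmax = int(sr / fmax), int(sr / fmin)
    region = ceps[qmin:qmax]
    # Pick local maxima in the plausible quefrency range, strongest first.
    peaks = [q for q in range(1, len(region) - 1)
             if region[q] > region[q - 1] and region[q] > region[q + 1]]
    peaks.sort(key=lambda q: region[q], reverse=True)
    return [sr / (qmin + q) for q in peaks[:n_peaks]]

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 110 * t) + np.sin(2 * np.pi * 180 * t)  # two oscillators
print(pitch_candidates(x, sr))
```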

Aichinger, Philipp — Division of Phoniatrics-Logopedics, Department of Otorhinolaryngology, Medical University of Vienna; Signal Processing and Speech Communication Laboratory Graz University of Technology, Austria


Advanced Signal Processing Concepts for Multi-Dimensional Communication Systems

The widespread use of the mobile internet and smart applications has led to explosive growth in mobile data traffic. With the rise of smart homes, smart buildings, and smart cities, this demand keeps growing, since future communication systems will require the integration of multiple networks serving diverse sectors, domains and applications, such as multimedia, virtual and augmented reality, machine-to-machine (M2M) communication and the Internet of Things (IoT), automotive applications, and many more. Future communication systems will therefore be required not only to provide Gbps wireless connectivity but also to fulfill other requirements, such as low latency and massive machine-type connectivity, while ensuring quality of service. Without significant technological advances to increase system capacity, the existing telecommunications infrastructure will be unable to support these multi-dimensional requirements. This creates a strong demand for suitable waveforms with ...

Cheema, Sher Ali — Technische Universität Ilmenau


Space-Time Parametric Approach to Extended Audio Reality (SP-EAR)

The term extended reality refers to all possible interactions between real and virtual (computer-generated) elements and environments. The extended reality field is growing rapidly, primarily through augmented and virtual reality applications. The former allows users to bring digital elements into the real world, while the latter lets us experience and interact with an entirely virtual environment. While current extended reality implementations focus primarily on the visual domain, the impact of auditory perception on a fully immersive experience must not be underestimated: effective handling of the acoustic content can markedly enrich user engagement. We refer to Extended Audio Reality (EAR) as the subset of extended reality operations related to the audio domain. In this thesis, we propose a parametric approach to EAR, conceived in order to provide an effective and ...
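
To make the parametric idea concrete, the sketch below represents a sound object as a (signal, direction) pair and renders it into first-order Ambisonics (ACN/SN3D). The choice of Ambisonics is an assumption for illustration, not necessarily the representation used in the thesis.

```python
import numpy as np

def encode_foa(signal, azimuth, elevation):
    """Encode a mono sound object into first-order Ambisonics (ACN/SN3D).

    The object is a (signal, direction) parametric pair; a renderer can
    re-synthesize it for any loudspeaker layout or head orientation.
    """
    az, el = np.deg2rad(azimuth), np.deg2rad(elevation)
    gains = np.array([1.0,                       # W (omnidirectional)
                      np.sin(az) * np.cos(el),   # Y
                      np.sin(el),                # Z
                      np.cos(az) * np.cos(el)])  # X
    return gains[:, None] * signal               # shape (4, n_samples)

# A parametric scene: two objects mixed in the Ambisonics domain.
sr = 16000
t = np.arange(sr) / sr
scene = (encode_foa(np.sin(2 * np.pi * 440 * t), azimuth=30, elevation=0)
         + encode_foa(0.1 * np.random.randn(sr), azimuth=-90, elevation=10))
print(scene.shape)  # (4, 16000)
```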

Pezzoli Mirco — Politecnico di Milano
