Integration of Neural Networks and Probabilistic Spatial Models for Acoustic Blind Source Separation

Despite a lot of progress in speech separation, enhancement, and automatic speech recognition realistic meeting recognition is still fairly unsolved. Most research on speech separation either focuses on spectral cues to address single-channel recordings or spatial cues to separate multi-channel recordings and exclusively either rely on neural networks or probabilistic graphical models. Integrating a spatial clustering approach and a deep learning approach using spectral cues in a single framework can significantly improve automatic speech recognition performance and improve generalizability given that a neural network profits from a vast amount of training data while the probabilistic counterpart adapts to the current scene. This thesis at hand, therefore, concentrates on the integration of two fairly disjoint research streams, namely single-channel deep learning-based source separation and multi-channel probabilistic model-based source separation. It provides a general framework to integrate spatial and spectral cues in ...

Drude, Lukas — Paderborn University


Signal Quantization and Approximation Algorithms for Federated Learning

Distributed signal or information processing using Internet of Things (IoT), facilitates real-time monitoring of signals, for example, environmental pollutants, health indicators, and electric energy consumption in a smart city. Despite the promising capabilities of IoTs, these distributed deployments often face the challenge of data privacy and communication rate constraints. In traditional machine learning, training data is moved to a data center, which requires massive data movement from distributed IoT devices to a third-party location, thus raising concerns over privacy and inefficient use of communication resources. Moreover, the growing network size, model size, and data volume combined lead to unusual complexity in the design of optimization algorithms beyond the compute capability of a single device. This necessitates novel system architectures to ensure stable and secure operations of such networks. Federated learning (FL) architecture, a novel distributed learning paradigm introduced by McMahan ...

A, Vijay — Indian Institute of Technology Bombay


Deep neural networks for source separation and noise-robust speech recognition

This thesis addresses the problem of multichannel audio source separation by exploiting deep neural networks (DNNs). We build upon the classical expectation-maximization (EM) based source separation framework employing a multichannel Gaussian model, in which the sources are characterized by their power spectral densities and their source spatial covariance matrices. We explore and optimize the use of DNNs for estimating these spectral and spatial parameters. Employing the estimated source parameters, we then derive a time-varying multichannel Wiener filter for the separation of each source. We extensively study the impact of various design choices for the spectral and spatial DNNs. We consider different cost functions, time-frequency representations, architectures, and training data sizes. Those cost functions notably include a newly proposed task-oriented signal-to-distortion ratio cost function for spectral DNNs. Furthermore, we present a weighted spatial parameter estimation formula, which generalizes the corresponding exact ...

Nugraha, Aditya Arie — Université de Lorraine


Non-linear Spatial Filtering for Multi-channel Speech Enhancement

A large part of human speech communication takes place in noisy environments and is supported by technical devices. For example, a hearing-impaired person might use a hearing aid to take part in a conversation in a busy restaurant. These devices, but also telecommunication in noisy environments or voiced-controlled assistants, make use of speech enhancement and separation algorithms that improve the quality and intelligibility of speech by separating speakers and suppressing background noise as well as other unwanted effects such as reverberation. If the devices are equipped with more than one microphone, which is very common nowadays, then multi-channel speech enhancement approaches can leverage spatial information in addition to single-channel tempo-spectral information to perform the task. Traditionally, linear spatial filters, so-called beamformers, have been employed to suppress the signal components from other than the target direction and thereby enhance the desired ...

Tesch, Kristina — Universität Hamburg


Automated Face Recognition from Low-resolution Imagery

Recently, significant advances in the field of automated face recognition have been achieved using computer vision, machine learning, and deep learning methodologies. However, despite claims of super-human performance of face recognition algorithms on select key benchmark tasks, there remain several open problems that preclude the general replacement of human face recognition work with automated systems. State-of-the-art automated face recognition systems based on deep learning methods are able to achieve high accuracy when the face images they are tasked with recognizing subjects from are of sufficiently high quality. However, low image resolution remains one of the principal obstacles to face recognition systems, and their performance in the low-resolution regime is decidedly below human capabilities. In this PhD thesis, we present a systematic study of modern automated face recognition systems in the presence of image degradation in various forms. Based on our ...

Grm, Klemen — University of Ljubljana


Deep Learning of GNSS Signal Detection

Global Navigation Satellite Systems (GNSS) is the de facto technology for Position, Navigation, and Timing (PNT) applications when it is available. GNSS relies on one or more satellite constellations that transmit ranging signals, which a receiver can use to self-localize. Signal acquisition is a crucial step in GNSS receivers, which is typically solved by maximizing the so-called Cross Ambiguity Function (CAF) resulting from a hypothesis testing problem. The CAF is a two-dimensional function that is related to the correlation between the received signal and a local code replica for every possible delay/Doppler pair, which is then maximized for signal detection and coarse synchronization. The outcome of this statistical process decides whether the signal from a particular satellite is present or absent in the received signal, as well as provides a rough estimate of its associated code delay and Doppler frequency, ...

Borhani Darian,Parisa — Northeastern University


Deep Learning for Event Detection, Sequence Labelling and Similarity Estimation in Music Signals

When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece – be it for an academic interest in emulating human perception, or for practical applications –, we can thus not directly replicate the steps taken by a human. We can, however, exploit that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations – referred to as deep neural networks – and promises to obtain better results or require less domain knowledge than more traditional techniques. In particular, I employ ...

Schlüter, Jan — Department of Computational Perception, Johannes Kepler University Linz


Acoustic Event Detection: Feature, Evaluation and Dataset Design

It takes more time to think of a silent scene, action or event than finding one that emanates sound. Not only speaking or playing music but almost everything that happens is accompanied with or results in one or more sounds mixed together. This makes acoustic event detection (AED) one of the most researched topics in audio signal processing nowadays and it will probably not see a decline anywhere in the near future. This is due to the thirst for understanding and digitally abstracting more and more events in life via the enormous amount of recorded audio through thousands of applications in our daily routine. But it is also a result of two intrinsic properties of audio: it doesn’t need a direct sight to be perceived and is less intrusive to record when compared to image or video. Many applications such ...

Mina Mounir — KU Leuven, ESAT STADIUS


Non-intrusive Quality Evaluation of Speech Processed in Noisy and Reverberant Environments

In many speech applications such as hands-free telephony or voice-controlled home assistants, the distance between the user and the recording microphones can be relatively large. In such a far-field scenario, the recorded microphone signals are typically corrupted by noise and reverberation, which may severely degrade the performance of speech recognition systems and reduce intelligibility and quality of speech in communication applications. In order to limit these effects, speech enhancement algorithms are typically applied. The main objective of this thesis is to develop novel speech enhancement algorithms for noisy and reverberant environments and signal-based measures to evaluate these algorithms, focusing on solutions that are applicable in realistic scenarios. First, we propose a single-channel speech enhancement algorithm for joint noise and reverberation reduction. The proposed algorithm uses a spectral gain to enhance the input signal, where the gain is computed using a ...

Cauchi, Benjamin — University of Oldenburg


A Geometric Deep Learning Approach to Sound Source Localization and Tracking

The localization and tracking of sound sources using microphone arrays is a problem that, even if it has attracted attention from the signal processing research community for decades, remains open. In recent years, deep learning models have surpassed the state-of-the-art that had been established by classic signal processing techniques, but these models still struggle with handling rooms with strong reverberations or tracking multiple sources that dynamically appear and disappear, especially when we cannot apply any criteria to classify or order them. In this thesis, we follow the ideas of the Geometric Deep Learning framework to propose new models and techniques that mean an advance of the state-of-the-art in the aforementioned scenarios. As the input of our models, we use acoustic power maps computed using the SRP-PHAT algorithm, a classic signal processing technique that allows us to estimate the acoustic energy ...

Diaz-Guerra, David — University of Zaragoza


Robust Direction-of-Arrival estimation and spatial filtering in noisy and reverberant environments

The advent of multi-microphone setups on a plethora of commercial devices in recent years has generated a newfound interest in the development of robust microphone array signal processing methods. These methods are generally used to either estimate parameters associated with acoustic scene or to extract signal(s) of interest. In most practical scenarios, the sources are located in the far-field of a microphone array where the main spatial information of interest is the direction-of-arrival (DOA) of the plane waves originating from the source positions. The focus of this thesis is to incorporate robustness against either lack of or imperfect/erroneous information regarding the DOAs of the sound sources within a microphone array signal processing framework. The DOAs of sound sources is by itself important information, however, it is most often used as a parameter for a subsequent processing method. One of the ...

Chakrabarty, Soumitro — Friedrich-Alexander Universität Erlangen-Nürnberg


Bayesian data fusion for distributed learning

This dissertation explores the intersection of data fusion, federated learning, and Bayesian methods, with a focus on their applications in indoor localization, GNSS, and image processing. Data fusion involves integrating data and knowledge from multiple sources. It becomes essential when data is only available in a distributed fashion or when different sensors are used to infer a quantity of interest. Data fusion typically includes raw data fusion, feature fusion, and decision fusion. In this thesis, we will concentrate on feature fusion. Distributed data fusion involves merging sensor data from different sources to estimate an unknown process. Bayesian framework is often used because it can provide an optimal and explainable feature by preserving the full distribution of the unknown given the data, called posterior, over the estimated process at each agent. This allows for easy and recursive merging of sensor data ...

Peng Wu — Northeastern University


Modeling and Digital Mitigation of Transmitter Imperfections in Radio Communication Systems

To satisfy the continuously growing demands for higher data rates, modern radio communication systems employ larger bandwidths and more complex waveforms. Furthermore, radio devices are expected to support a rich mixture of standards such as cellular networks, wireless local-area networks, wireless personal area networks, positioning and navigation systems, etc. In general, a "smart'' device should be flexible to support all these requirements while being portable, cheap, and energy efficient. These seemingly conflicting expectations impose stringent radio frequency (RF) design challenges which, in turn, call for their proper understanding as well as developing cost-effective solutions to address them. The direct-conversion transceiver architecture is an appealing analog front-end for flexible and multi-standard radio systems. However, it is sensitive to various circuit impairments, and modern communication systems based on multi-carrier waveforms such as Orthogonal Frequency Division Multiplexing (OFDM) and Orthogonal Frequency Division Multiple ...

Kiayani, Adnan — Tampere University of Technology


Representation and Metric Learning Advances for Deep Neural Network Face and Speaker Biometric Systems

The increasing use of technological devices and biometric recognition systems in people daily lives has motivated a great deal of research interest in the development of effective and robust systems. However, there are still some challenges to be solved in these systems when Deep Neural Networks (DNNs) are employed. For this reason, this thesis proposes different approaches to address these issues. First of all, we have analyzed the effect of introducing the most widespread DNN architectures to develop systems for face and text-dependent speaker verification tasks. In this analysis, we observed that state-of-the-art DNNs established for many tasks, including face verification, did not perform efficiently for text-dependent speaker verification. Therefore, we have conducted a study to find the cause of this poor performance and we have noted that under certain circumstances this problem is due to the use of a ...

Mingote, Victoria — University of Zaragoza


Data-driven Speech Enhancement: from Non-negative Matrix Factorization to Deep Representation Learning

In natural listening environments, speech signals are easily distorted by variousacoustic interference, which reduces the speech quality and intelligibility of human listening; meanwhile, it makes difficult for many speech-related applications, such as automatic speech recognition (ASR). Thus, many speech enhancement (SE) algorithms have been developed in the past decades. However, most current SE algorithms are difficult to capture underlying speech information (e.g., phoneme) in the SE process. This causes it to be challenging to know what specific information is lost or interfered with in the SE process, which limits the application of enhanced speech. For instance, some SE algorithms aimed to improve human listening usually damage the ASR system. The objective of this dissertation is to develop SE algorithms that have the potential to capture various underlying speech representations (information) and improve the quality and intelligibility of noisy speech. This ...

Xiang, Yang — Aalborg University, Capturi A/S

The current layout is optimized for mobile phones. Page previews, thumbnails, and full abstracts will remain hidden until the browser window grows in width.

The current layout is optimized for tablet devices. Page previews and some thumbnails will remain hidden until the browser window grows in width.