Acoustic Event Detection: Feature, Evaluation and Dataset Design

It takes more effort to think of a silent scene, action or event than to find one that emanates sound. Not only speaking or playing music, but almost everything that happens is accompanied by, or results in, one or more sounds mixed together. This makes acoustic event detection (AED) one of the most researched topics in audio signal processing today, and interest in it is unlikely to decline in the near future. This is driven by the desire to understand and digitally abstract ever more events in daily life from the enormous amount of audio recorded through thousands of everyday applications. But it is also a result of two intrinsic properties of audio: it does not require a direct line of sight to be perceived, and it is less intrusive to record than image or video. Many applications, such as context-based indexing, health monitoring and smart environments, profit from the techniques developed for AED, yet results are still far from perfect. For instance, automatic music transcription (AMT) usually needs corrections by expert musicians, and voice-controlled applications often require a voice command or word to be repeated to a mobile phone before it is understood. This is due to the challenging nature of the AED task: it is not merely classifying an event into one of a predefined set of classes, but rather singling out a target event from anything else. In this thesis we focus on two AED applications which, at first sight, seem to come from two different worlds. The first is note onset detection (NOD), an atomic component of many music applications such as fingerprinting for search engines and recommender systems, digital effects, or simply AMT. As its name suggests, the target events of NOD are the starting instants of musical notes.
For the second application, howling detection (HD), the target is more of an artifact than a desired, enjoyable event: howling is the kind of beep that appears when a closed feedback loop forms between a microphone and a loudspeaker, as frequently occurs in public address (PA) systems and hearing aids (HAs). An HD algorithm is expected to produce some sort of activation function signaling the resonant frequency as soon as the howling starts, so that it can be filtered out automatically. Surprisingly, both events share a specific time-frequency pattern, which is the key idea behind the spectral sparsity feature suggested in this work. After introducing AED in Part I, a general three-step processing scheme for the detection of pattern-specific events in audio signals is sketched out, inspired by the work done for NOD. This is followed by a summary of the state-of-the-art methods used for each of the steps. Part I ends by comparing the different metrics and techniques traditionally used for NOD and HD performance evaluation, pointing out how unsuitable they are for handling the imbalanced nature of the datasets used in both problems. Moreover, it suggests a framework for fairer evaluation and more generalizable results using precision-recall (PR) curves and k-fold cross-validation scores. The two main parts of the thesis reside in Parts II and III, discussing the challenges and suggesting possible solutions for the two applications of interest, NOD and HD, respectively. The contributions for each can be divided into three groups following the problems' solution steps: feature design, annotated dataset generation and evaluation enhancement. A feature based on spectral sparsity with two flavors, normalised identification of note onsets based on spectral sparsity (NINOS2) and NINOS2-Transposed (NINOS2-T), is suggested for detecting note onsets and howling frequencies, respectively.
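The exact NINOS2 and NINOS2-T definitions are given in Parts II and III; as a purely illustrative sketch of what a spectral sparsity measure looks like, one common choice is the l1/l2 norm ratio of an STFT magnitude frame (the function name and the ratio used here are hypothetical, not the thesis definition):

```python
import numpy as np

def spectral_sparsity(mag_frame, eps=1e-12):
    """Sparsity of one STFT magnitude frame via the l1/l2 norm ratio.

    A frame whose energy is spread over many bins (e.g. a broadband
    note transient) yields a higher ratio than a frame dominated by a
    few bins (e.g. steady sinusoidal partials or a howling tone).
    Illustrative only; NOT the exact NINOS2 measure from the thesis.
    """
    l1 = np.sum(np.abs(mag_frame))
    l2 = np.sqrt(np.sum(np.asarray(mag_frame, dtype=float) ** 2))
    return l1 / (l2 + eps)

# Peaky spectrum (one dominant bin) vs flat spectrum (energy everywhere)
peaky = np.zeros(128); peaky[10] = 1.0
flat = np.ones(128) / 128
print(spectral_sparsity(peaky))  # ~1.0  -> sparse frame
print(spectral_sparsity(flat))   # ~11.3 -> dense frame
```

Tracking such a ratio over time per frame (for NOD) or per frequency bin (for HD, hence "transposed") gives a detection function whose extrema can be thresholded to flag target events.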
When tested on a dataset of synthetically mixed musical note onsets, NINOS2 outperformed the state-of-the-art NOD feature, Logarithmic Spectral Flux (LSF), for the sustained-strings instruments, pushing the F1-score above the 50% mark. This group of pitched non-percussive instruments is quite challenging as they have softer onsets, i.e., slowly building-up transients. A novel pre-processing step preceding the application of the NINOS2 detection function is found to contribute to this performance increase. The pre-processing consists of retaining a subset of frequencies that is traditionally neglected but found here to be tightly related to onsets. For HD, NINOS2-T achieved a higher average area under the PR curve (PR-AUC) than all standalone HD features found in the literature, for both music and speech examples. The performance of NINOS2-T remained the highest when restricting the evaluation to early howling detection. Existing datasets for both problems are relatively limited in terms of quantity and quality. For NOD, the available datasets are mainly manually annotated by two or more experts, limiting their availability due to the expensive annotation process. Moreover, the annotation is subjective and note-context dependent. A similar situation exists for HD, where datasets consist of recorded and manually annotated howling, or are sometimes poorly simulated by sinusoidal superposition. Part II starts by introducing a MATLAB tool, "Mix-Notes", which is developed for generating automatically annotated NOD datasets. In Part III, a large HD dataset is created by simulating a closed-loop system, using several acoustic impulse responses (AIRs) to cover a wide range of howling frequencies, and applying the simulated system to different music and speech input files.
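The closed-loop simulation is specified in full in Part III; the underlying idea can be sketched as a microphone signal that sums the source with the delayed, filtered and amplified loudspeaker output, so that howling builds up whenever the loop gain exceeds the stability margin. All names, the toy AIR and the gain values below are illustrative assumptions, not the thesis implementation (which uses measured AIRs):

```python
import numpy as np

def simulate_feedback_loop(x, h, gain):
    """Sample-by-sample simulation of a closed electro-acoustic loop:

        y[n] = x[n] + gain * sum_k h[k] * y[n-1-k]

    i.e. the microphone picks up the source x plus the loudspeaker
    output fed back through the acoustic impulse response h and the
    amplifier gain. If |gain * H(f)| > 1 at some frequency, the output
    grows unboundedly there: howling. Illustrative sketch only.
    """
    y = np.zeros(len(x))
    for n in range(len(x)):
        fb = 0.0
        for k in range(min(len(h), n)):
            fb += h[k] * y[n - 1 - k]
        y[n] = x[n] + gain * fb
    return y

# Toy AIR: a single echo delayed by 64 samples with 0.9 attenuation
h = np.zeros(64); h[63] = 0.9
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 2000) * 0.01

y_stable = simulate_feedback_loop(x, h, gain=0.5)  # loop gain 0.45 < 1: bounded
y_howl = simulate_feedback_loop(x, h, gain=1.5)    # loop gain 1.35 > 1: howling
print(np.max(np.abs(y_stable)), np.max(np.abs(y_howl)))
```

Pairing such simulated outputs with the known AIR resonances yields automatically annotated howling examples, avoiding manual labeling entirely.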
On top of using these datasets for testing the suggested NOD and HD features, a further NOD experiment is carried out in which a real NOD dataset is augmented with a semi-synthetic dataset, created using the "Mix-Notes" tool, for training a state-of-the-art data-driven convolutional neural network (CNN) model. This is done to overcome the limited availability of annotated real datasets. When running the experiment on piano excerpts, using two different augmentation strategies, preliminary results show better and more stable performance. To ensure a fair NOD evaluation, a novel parameter, the overall time shift in annotations, is proposed in Part II. While consistently omitted in the literature when F1-scores are reported, this parameter proves crucial for making results comparable across datasets and algorithms. The best-case F1-score can vary drastically when this overall time shift in annotations is taken into account, and it is found beneficial to use it as a tunable hyperparameter when training a deep data-driven model on datasets that are annotated differently. The performance of HD features is traditionally compared for a subset of howling candidates using the receiver operating characteristic (ROC) metric. The use of howling candidates is intended to differentiate between howling and signal components and results in a fairly well-balanced dataset, yet it excludes the detection of early howling and ringing. To overcome this limitation, in Part III we suggest a novel HD approach considering all frequency bins as howling candidates. Since this yields a highly imbalanced dataset, for which ROC evaluation has been proven unsuitable, we propose to use the PR curve and PR-AUC evaluation metrics instead. Moreover, the PR assessment uses a grid of equidistant thresholds in order to evaluate the robustness of the HD features to threshold variations. While searching for answers to the different NOD and HD problems, questions never stopped popping up.
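The PR-over-a-threshold-grid idea can be illustrated with a minimal sketch: sweep equidistant thresholds over a feature's scores, compute precision and recall at each, and integrate the resulting curve. The function name, the toy scores and the class imbalance below are hypothetical, chosen only to mimic the "few howling bins among many signal bins" setting:

```python
import numpy as np

def pr_over_threshold_grid(scores, labels, n_thresholds=101):
    """Precision and recall of a detection feature over a grid of
    equidistant thresholds -- suited to highly imbalanced data, where
    ROC curves paint an overly optimistic picture. Illustrative sketch.
    """
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    precision, recall = [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        precision.append(tp / (tp + fp) if tp + fp > 0 else 1.0)
        recall.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    return thresholds, np.array(precision), np.array(recall)

# Imbalanced toy data: 5 positive bins among 1000 (as when every
# frequency bin is a howling candidate)
rng = np.random.default_rng(0)
labels = np.zeros(1000, dtype=int); labels[:5] = 1
scores = rng.normal(0.0, 1.0, 1000)
scores[:5] += 5.0  # positives score higher on average

_, p, r = pr_over_threshold_grid(scores, labels)
# Trapezoidal area under the PR curve (recall decreases along the grid)
pr_auc = float(np.sum((r[:-1] - r[1:]) * (p[:-1] + p[1:]) / 2))
```

Averaging such a PR-AUC over many input files, and inspecting how precision and recall move along the equidistant grid, quantifies both a feature's detection quality and its robustness to the choice of threshold.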
Part IV revisits some learned lessons, discusses various open questions and suggests some future steps for further research in the presented topics.

File Type: pdf
File Size: 12 MB
Publication Year: 2020
Author: Mina Mounir
Supervisors: Toon van Waterschoot, Peter Karsmakers
Institution: KU Leuven, ESAT STADIUS
Keywords: Audio processing, Music Information Retrieval, Machine Learning, Evaluation