Information-Theoretic Measures of Predictability for Music Content Analysis

This thesis is concerned with determining similarity in musical audio, for the purpose of applications in music content analysis. With the aim of determining similarity, we consider the problem of representing temporal structure in music. To represent temporal structure, we propose to compute information-theoretic measures of predictability in sequences. We apply our measures to track-wise representations obtained from musical audio; thereafter we consider the obtained measures predictors of musical similarity. We demonstrate that our approach benefits music content analysis tasks based on musical similarity. For the intermediate-specificity task of cover song identification, we compare contrasting discrete-valued and continuous-valued measures of pairwise predictability between sequences. In the discrete case, we devise a method for computing the normalised compression distance (NCD) which accounts for correlation between sequences. We observe that our measure improves average performance over NCD, for sequential compression algorithms. In ...

Foster, Peter — Queen Mary University of London


Melody Extraction from Polyphonic Music Signals

Music was the first mass-market industry to be completely restructured by digital technology, and today we can have access to thousands of tracks stored locally on our smartphone and millions of tracks through cloud-based music services. Given the vast quantity of music at our fingertips, we now require novel ways of describing, indexing, searching and interacting with musical content. In this thesis we focus on a technology that opens the door to a wide range of such applications: automatically estimating the pitch sequence of the melody directly from the audio signal of a polyphonic music recording, also referred to as melody extraction. Whilst identifying the pitch of the melody is something human listeners can do quite well, doing this automatically is highly challenging. We present a novel method for melody extraction based on the tracking and characterisation of the pitch ...

Salamon, Justin — Universitat Pompeu Fabra


Automated audio captioning with deep learning methods

In the audio research field, the majority of machine learning systems focus on recognizing a limited number of sound events. However, when a machine interacts with real data, it must be able to handle much more varied and complex situations. To tackle this problem, annotators use natural language, which allows any sound information to be summarized. Automated Audio Captioning (AAC) was introduced recently to develop systems capable of automatically producing a description of any type of sound in text form. This task concerns all kinds of sound events such as environmental, urban, domestic sounds, sound effects, music or speech. This type of system could be used by people who are deaf or hard of hearing, and could improve the indexing of large audio databases. In the first part of this thesis, we present the state of the art of the ...

Labbé, Étienne — IRIT


A Geometric Deep Learning Approach to Sound Source Localization and Tracking

The localization and tracking of sound sources using microphone arrays is a problem that, even if it has attracted attention from the signal processing research community for decades, remains open. In recent years, deep learning models have surpassed the state-of-the-art that had been established by classic signal processing techniques, but these models still struggle with handling rooms with strong reverberations or tracking multiple sources that dynamically appear and disappear, especially when we cannot apply any criteria to classify or order them. In this thesis, we follow the ideas of the Geometric Deep Learning framework to propose new models and techniques that mean an advance of the state-of-the-art in the aforementioned scenarios. As the input of our models, we use acoustic power maps computed using the SRP-PHAT algorithm, a classic signal processing technique that allows us to estimate the acoustic energy ...

Diaz-Guerra, David — University of Zaragoza


Some Contributions to Music Signal Processing and to Mono-Microphone Blind Audio Source Separation

For humans, the sound is valuable mostly for its meaning. The voice is spoken language, music, artistic intent. Its physiological functioning is highly developed, as well as our understanding of the underlying process. It is a challenge to replicate this analysis using a computer: in many aspects, its capabilities do not match those of human beings when it comes to speech or instruments music recognition from the sound, to name a few. In this thesis, two problems are investigated: the source separation and the musical processing. The first part investigates the source separation using only one Microphone. The problem of sources separation arises when several audio sources are present at the same moment, mixed together and acquired by some sensors (one in our case). In this kind of situation it is natural for a human to separate and to recognize ...

Schutz, Antony — Eurecome/Mobile


A Computational Framework for Sound Segregation in Music Signals

Music is built from sound, ultimately resulting from an elaborate interaction between the sound-generating properties of physical objects (i.e. music instruments) and the sound perception abilities of the human auditory system. Humans, even without any kind of formal music training, are typically able to ex- tract, almost unconsciously, a great amount of relevant information from a musical signal. Features such as the beat of a musical piece, the main melody of a complex musical ar- rangement, the sound sources and events occurring in a complex musical mixture, the song structure (e.g. verse, chorus, bridge) and the musical genre of a piece, are just some examples of the level of knowledge that a naive listener is commonly able to extract just from listening to a musical piece. In order to do so, the human auditory system uses a variety of cues ...

Martins, Luis Gustavo — Universidade do Porto


Towards Automatic Extraction of Harmony Information from Music Signals

In this thesis we address the subject of automatic extraction of harmony information from audio recordings. We focus on chord symbol recognition and methods for evaluating algorithms designed to perform that task. We present a novel six-dimensional model for equal tempered pitch space based on concepts from neo-Riemannian music theory. This model is employed as the basis of a harmonic change detection function which we use to improve the performance of a chord recognition algorithm. We develop a machine readable text syntax for chord symbols and present a hand labelled chord transcription collection of 180 Beatles songs annotated using this syntax. This collection has been made publicly available and is already widely used for evaluation purposes in the research community. We also introduce methods for comparing chord symbols which we subsequently use for analysing the statistics of the transcription collection. ...

Harte, Christopher — Queen Mary, University of London


Deep Learning for Event Detection, Sequence Labelling and Similarity Estimation in Music Signals

When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece – be it for an academic interest in emulating human perception, or for practical applications –, we can thus not directly replicate the steps taken by a human. We can, however, exploit that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations – referred to as deep neural networks – and promises to obtain better results or require less domain knowledge than more traditional techniques. In particular, I employ ...

Schlüter, Jan — Department of Computational Perception, Johannes Kepler University Linz


Audio Watermarking, Steganalysis Using Audio Quality Metrics, and Robust Audio Hashing

We propose a technique for the problem of detecting the very presence of hidden messages in an audio object. The detector is based on the characteristics of the denoised residuals of the audio file. Our proposition is established upon the presupposition that the hidden message in a cover object leaves statistical evidence that can be detected with the use of some audio distortion measures. The distortions caused by hidden message are measured in terms of objective and perceptual quality metrics. The detector discriminates between cover and stego files using a selected subset of features and an SVM classifier. We have evaluated the detection performance of the proposed steganalysis technique with the well-known watermarking and steganographic methods. We present novel and robust audio fingerprinting techniques based on the summarization of the time-frequency spectral characteristics of an audio object. The perceptual hash ...

Ozer, Hamza — Bogazici University


Music Language Models for Automatic Music Transcription

Much like natural language, music is highly structured, with strong priors on the likelihood of note sequences. In automatic speech recognition (ASR), these priors are called language models, which are used in addition to acoustic models and participate greatly to the success of today's systems. However, in Automatic Music Transcription (AMT), ASR's musical equivalent, Music Language Models (MLMs) are rarely used. AMT can be defined as the process of extracting a symbolic representation from an audio signal, describing which notes were played at what time. In this thesis, we investigate the design of MLMs using recurrent neural networks (RNNs) and their use for AMT. We first look into MLM performance on a polyphonic prediction task. We observe that using musically-relevant timesteps results in desirable MLM behaviour, which is not reflected in usual evaluation metrics. We compare our model against benchmark ...

Ycart, Adrien — Queen Mary University of London


Algorithmic Analysis of Complex Audio Scenes

In this thesis, we examine the problem of algorithmic analysis of complex audio scenes with a special emphasis on natural audio scenes. One of the driving goals behind this work is to develop tools for monitoring the presence of animals in areas of interest based on their vocalisations. This task, which often occurs in the evaluation of nature conservation measures, leads to a number of subproblems in audio scene analysis. In order to develop and evaluate pattern recognition algorithms for animal sounds, a representative collection of such sounds is necessary. Building such a collection is beyond the scope of a single researcher and we therefore use data from the Animal Sound Archive of the Humboldt University of Berlin. Although a large portion of well annotated recordings from this archive has been available in digital form, little infrastructure for searching and ...

Bardeli, Rolf — University of Bonn


Acoustic Event Detection: Feature, Evaluation and Dataset Design

It takes more time to think of a silent scene, action or event than finding one that emanates sound. Not only speaking or playing music but almost everything that happens is accompanied with or results in one or more sounds mixed together. This makes acoustic event detection (AED) one of the most researched topics in audio signal processing nowadays and it will probably not see a decline anywhere in the near future. This is due to the thirst for understanding and digitally abstracting more and more events in life via the enormous amount of recorded audio through thousands of applications in our daily routine. But it is also a result of two intrinsic properties of audio: it doesn’t need a direct sight to be perceived and is less intrusive to record when compared to image or video. Many applications such ...

Mina Mounir — KU Leuven, ESAT STADIUS


Decompositions Parcimonieuses Structurees: Application a la presentation objet de la musique

The amount of digital music available both on the Internet and by each listener has considerably raised for about ten years. The organization and the accessibillity of this amount of data demand that additional informations are available, such as artist, album and song names, musical genre, tempo, mood or other symbolic or semantic attributes. Automatic music indexing has thus become a challenging research area. If some tasks are now correctly handled for certain types of music, such as automatic genre classification for stereotypical music, music instrument recoginition on solo performance and tempo extraction, others are more difficult to perform. For example, automatic transcription of polyphonic signals and instrument ensemble recognition are still limited to some particular cases. The goal of our study is not to obain a perfect transcription of the signals and an exact classification of all the instruments ...

Leveau, Pierre — Universite Pierre et Marie Curie, Telecom ParisTech


Guitar Tablature Generation with Deep Learning

The burgeoning of deep learning-based music generation has overlooked the potential of symbolic representations tailored for fretted instruments. Guitar tablatures offer an advantageous approach to represent prescriptive information about music performance, often missing from standard MIDI representations. This dissertation tackles a gap in symbolic music generation by developing models that predict both musical structures and expressive guitar performance techniques. We first present DadaGP, a dataset comprising over 25k songs converted from the Guitar Pro tablature format to a dedicated token format suiting sequence models such as the Transformer. To establish a benchmark, we first introduce a baseline unconditional model for guitar tablature generation, by training a Transformer-XL architecture on the DadaGP dataset. We explored various architecture configurations and experimented with two different tokenisation approaches. Delving into controllability of the generative process, we introduce methods for manipulating the output's instrumentation (inst-CTRL) ...

Sarmento, Pedro — Queen Mary University of London


Online Machine Learning for Graph Topology Identi fication from Multiple Time Series

High dimensional time series data are observed in many complex systems. In networked data, some of the time series are influenced by other time series. Identifying these relations encoded in a graph structure or topology among the time series is of paramount interest in certain applications since the identifi ed structure can provide insights about the underlying system and can assist in inference tasks. In practice, the underlying topology is usually sparse, that is, not all the participating time series influence each other. The goal of this dissertation pertains to study the problem of sparse topology identi fication under various settings. Topology identi fication from time series is a challenging task. The first major challenge in topology identi fication is that the assumption of static topology does not hold always in practice since most of the practical systems are evolving ...

Zaman, Bakht — University of Agder, Norway

The current layout is optimized for mobile phones. Page previews, thumbnails, and full abstracts will remain hidden until the browser window grows in width.

The current layout is optimized for tablet devices. Page previews and some thumbnails will remain hidden until the browser window grows in width.