Deep Learning for Event Detection, Sequence Labelling and Similarity Estimation in Music Signals

When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece, be it for an academic interest in emulating human perception or for practical applications, we thus cannot directly replicate the steps taken by a human. We can, however, exploit the fact that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations (referred to as deep neural networks) and promises to obtain better results or require less domain knowledge than more traditional techniques. In particular, I employ fully-connected neural networks for music and speech detection and to accelerate music similarity measures, and convolutional neural networks for detecting note onsets, musical segment boundaries and singing voice. In doing so, I evaluate both how well and in what way the networks solve the respective tasks. Using the example of singing voice detection, I additionally develop data augmentation methods to learn from only a few annotated music pieces, and a recipe to obtain temporally accurate predictions from inaccurate training examples. The results of my work surpass the previous state of the art in all the tasks considered. The learned solutions are similar to existing hand-designed approaches, but are more extensively optimized than is possible by hand. Both findings indicate that the same methods could also yield substantial improvements for other machine listening problems. The self-contained description of my work, including a thorough introduction to all relevant deep learning and signal processing techniques, and my contributions to several open-source software projects are intended to help other researchers and practitioners accomplish exactly that. In conclusion, this thesis both advances the state of the art in five concrete applications and, on a higher level, participates in the ongoing democratization of deep learning.

File Type: pdf
File Size: 6 MB
Publication Year: 2017
Author: Schlüter, Jan
Supervisor: Gerhard Widmer
Institution: Department of Computational Perception, Johannes Kepler University Linz
Keywords: machine learning, deep learning, multilayer perceptron, convolutional neural network, music information retrieval, music detection, speech detection, vocal activity detection, sequence labelling, music similarity estimation, onset detection, event detection, boundary detection, structural segmentation, music segmentation, singing voice detection, data augmentation, weak labels