Music Language Models for Automatic Music Transcription
Much like natural language, music is highly structured, with strong priors on the likelihood of note sequences. In automatic speech recognition (ASR), these priors are called language models; used alongside acoustic models, they contribute greatly to the success of today's systems. However, in Automatic Music Transcription (AMT), ASR's musical equivalent, Music Language Models (MLMs) are rarely used. AMT can be defined as the process of extracting from an audio signal a symbolic representation describing which notes were played at what time. In this thesis, we investigate the design of MLMs using recurrent neural networks (RNNs) and their use for AMT. We first examine MLM performance on a polyphonic prediction task, comparing our model against benchmark MLMs. We observe that using musically-relevant timesteps results in desirable MLM behaviour, which is not reflected in the usual evaluation metrics. We propose new intrinsic metrics to capture these aspects, and combine them into a parametric loss. We show that the loss parameters influence the behaviour of the model in consistent patterns, and that tuning them can improve AMT performance. In particular, we find no relation between MLM cross-entropy, the most commonly-used training loss, and the F-measure used in AMT evaluations. We then investigate various methods to refine the outputs of acoustic models. First, we train neural networks to map acoustic model outputs to ground-truth binary piano rolls, a task we call transduction. In a first experiment, we propose the use of musically-relevant timesteps, and show that they yield better results than timesteps of constant duration, although most of the improvement comes from note durations being rhythmically quantised. We investigate various neural architectures and training methods for this task, and show that our proposed Convolutional Neural Network (CNN) architecture trained with an F-measure loss yields the greatest improvement.
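To make the F-measure mentioned above concrete, the following is a minimal sketch of a frame-level F-measure computed between two binary piano rolls (pitch-by-frame matrices of note activations). The function name, the toy arrays, and the epsilon smoothing are illustrative choices, not the exact formulation used in the thesis.

```python
import numpy as np

def frame_f_measure(output, target, eps=1e-8):
    """Frame-level F-measure between two binary piano rolls.

    Both arrays have shape (n_pitches, n_frames), with 1 where a note
    is active. `eps` guards against division by zero.
    """
    tp = np.logical_and(output, target).sum()   # true positives
    precision = tp / (output.sum() + eps)       # correct / predicted
    recall = tp / (target.sum() + eps)          # correct / reference
    return 2 * precision * recall / (precision + recall + eps)

# Toy example: two pitches over four frames.
est = np.array([[1, 1, 0, 0],
                [0, 1, 1, 0]])
ref = np.array([[1, 1, 1, 0],
                [0, 1, 0, 0]])
print(round(frame_f_measure(est, ref), 3))  # → 0.75
```

Because the F-measure is computed from hard counts, it is not differentiable as-is; training with an "F-measure loss" requires a smoothed surrogate, for instance replacing the binary outputs with the network's continuous activations.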
We also use symbolic MLMs to refine the outputs of an acoustic model, by favouring output sequences deemed more likely by the MLM. We propose a novel blending model to dynamically combine the predictions from the acoustic model and the MLM. We compare the use of beat-related timesteps against timesteps of fixed duration, showing that beat-related timesteps improve results, even when using noisy, automatically-detected beat positions. Finally, we investigate the perceptual relevance of common AMT metrics. We conduct an online listening test to assess the similarity between benchmark AMT system outputs and the original input. We examine the agreement between ratings and AMT metrics, showing that while agreement is high in clear-cut cases, it drops when the difference in metrics is smaller. We propose a new evaluation metric trained to approximate ratings based on newly-proposed musical features, showing significantly better agreement with ratings than previous metrics.
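The dynamic combination of acoustic and language model predictions can be pictured as a per-timestep convex combination. This is only an illustrative sketch: in the thesis the blending weights are produced by a learned model, whereas here the `alpha` values are hypothetical placeholders.

```python
import numpy as np

def blend(p_acoustic, p_mlm, alpha):
    """Convex combination of acoustic and MLM note probabilities.

    `alpha` can vary per timestep, letting the blend lean on the
    acoustic model in some frames and on the MLM in others.
    """
    return alpha * p_acoustic + (1 - alpha) * p_mlm

# Hypothetical probabilities for one pitch over three timesteps.
p_ac = np.array([0.9, 0.2, 0.6])     # acoustic model output
p_lm = np.array([0.7, 0.1, 0.8])     # MLM prediction
alpha = np.array([0.8, 0.5, 0.3])    # placeholder blending weights
print(blend(p_ac, p_lm, alpha))      # → [0.86 0.15 0.74]
```

Thresholding the blended probabilities then yields the refined binary piano roll, with the MLM pulling the output towards musically plausible sequences.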
