Automatic Audio-to-Score Piano Transcription with Deep Neural Networks
Automatic music transcription (AMT) is a core task in music information retrieval, aiming to convert audio recordings of musical performances into human- or machine-readable score formats. AMT has various applications, including music search, analysis, tutoring, and generation. While AMT research has traditionally focused on note-level transcription, i.e., generating mid-level representations such as piano rolls or note sequences, recent years have seen growing interest in score-level transcription. Score-level transcription seeks to produce a musical representation that includes not only notes but also rhythm, voice, and other score annotations.
In this thesis, we present our work on automatic audio-to-score transcription for piano performances. We begin by preparing two datasets, each representing a distinct musical style and level of expressive performance. We then explore two approaches: pipeline-based methods that first predict a note-level representation and subsequently convert it into a score-level format; and holistic methods that directly transcribe audio recordings into score format.
For the pipeline-based methods, we focus on the second stage: converting a note sequence into a score format. We propose a convolutional-recurrent neural network to track beats and extend the model to predict a MIDI score. The model outperformed two commercial software solutions, highlighting the advantage of tracking expressive temporal changes in musical performances.
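To illustrate why tracked beats matter when converting a performed note sequence into a score, the following is a minimal sketch of beat-relative quantization. It is not the model proposed in this thesis: the function names, the linear interpolation between beats, and the `subdivisions` grid are all assumptions made for illustration. Given beat times that follow the performer's tempo changes, each note onset is mapped to a position in beats and then snapped to a grid.

```python
from bisect import bisect_right
from fractions import Fraction


def onset_to_beat_position(onset, beat_times):
    """Map an onset time (seconds) to a fractional beat index.

    Assumes beat_times is sorted and has at least two entries. The position
    is linearly interpolated between the two beats surrounding the onset.
    """
    if len(beat_times) < 2:
        raise ValueError("need at least two beat times")
    # Index of the last beat at or before the onset, clamped so that
    # beat_times[i + 1] always exists.
    i = bisect_right(beat_times, onset) - 1
    i = max(0, min(i, len(beat_times) - 2))
    start, end = beat_times[i], beat_times[i + 1]
    return i + (onset - start) / (end - start)


def quantize(position, subdivisions=4):
    """Snap a beat position to the nearest 1/subdivisions of a beat."""
    return Fraction(round(position * subdivisions), subdivisions)


def quantize_onsets(onsets, beat_times, subdivisions=4):
    """Quantize onset times (seconds) to score positions (beats)."""
    return [
        quantize(onset_to_beat_position(t, beat_times), subdivisions)
        for t in onsets
    ]


# Hypothetical performance that slows down: beats get further apart.
beat_times = [0.0, 0.5, 1.1, 1.8]
onsets = [0.0, 0.26, 0.5, 0.8, 1.45]
positions = quantize_onsets(onsets, beat_times)
```

Because positions are measured relative to the tracked beats rather than to a fixed tempo, the onset at 1.45 s still lands halfway through the third beat even though that beat lasts longer than the first.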
For the holistic methods, we use sequence-to-sequence models that convert an audio spectrogram into a symbolic music score representation. We first explore RNN-based models using a LilyPond score representation, and demonstrate that multitask learning, in which the model jointly predicts a piano roll representation, improves performance. We then examine Transformer models, adopting an event-based symbolic score representation. Our results show that Transformers outperform RNNs, and that a long short-term decoding strategy improves the model's performance both in note-level transcription and in capturing long-term musical features in score-level transcription.
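As a rough illustration of what an event-based symbolic score representation can look like as a decoder target, the sketch below flattens a list of notes into a token sequence. The token names (`pos_`, `pitch_`, `dur_`) and the note tuple layout are hypothetical and are not the vocabulary used in this thesis.

```python
def notes_to_events(notes):
    """Flatten notes into a sequence of string tokens.

    notes: iterable of (onset_beats, duration_beats, midi_pitch) tuples.
    A position token is emitted once per distinct onset, followed by a
    pitch token and a duration token for each note at that onset.
    The vocabulary here is an assumption made for illustration only.
    """
    events = []
    current_onset = None
    for onset, duration, pitch in sorted(notes):
        if onset != current_onset:
            events.append(f"pos_{onset}")
            current_onset = onset
        events.append(f"pitch_{pitch}")
        events.append(f"dur_{duration}")
    return events


# Hypothetical input: a two-note chord on beat 0, then a single note on beat 1.
example_notes = [(0, 1, 60), (0, 1, 64), (1, 2, 67)]
example_events = notes_to_events(example_notes)
```

A sequence-to-sequence model would be trained to emit such tokens one at a time, conditioned on the audio spectrogram.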
In summary, this thesis explores audio-to-score piano transcription through two approaches: pipeline-based methods and holistic methods. We hope our work can inspire future research in AMT and related fields such as music generation, music education, and performance analysis.
