Automatic Audio-to-Score Piano Transcription with Deep Neural Networks

Automatic music transcription (AMT) is a core task in music information retrieval, aiming to convert audio recordings of musical performances into human- or machine-readable score formats. AMT has various applications, including music search, analysis, tutoring, and generation. While AMT research has traditionally focused on note-level transcription, i.e., generating mid-level representations such as piano rolls or note sequences, recent years have seen growing interest in score-level transcription. Score-level transcription seeks to produce a musical representation that includes not only notes but also rhythm, voice, and other score annotations.

In this thesis, we present our work on automatic audio-to-score transcription for piano performances. We begin by preparing two datasets, each representing a distinct musical style and level of expressive performance. We then explore two approaches: pipeline-based methods that first predict a note-level representation and subsequently convert it into a score-level format; and holistic methods that directly transcribe audio recordings into score format.

For the pipeline-based methods, we focus on the second stage: converting a note sequence into a score format. We propose a convolutional-recurrent neural network to track beats and extend the model to predict a MIDI score. The model outperformed two commercial software solutions, highlighting the advantage of tracking expressive temporal changes in musical performances.
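
To illustrate why beat tracking matters for the second stage, here is a minimal sketch of how note onsets could be mapped onto a musical time grid once beat times are known. It is not the thesis model: the beat times and onsets are made-up values, the function names are hypothetical, and the interpolate-then-round rule is a simple stand-in for a learned quantisation step.

```python
from bisect import bisect_right
from fractions import Fraction


def onset_to_beat_position(onset, beat_times):
    """Map an onset time (seconds) to a fractional beat index.

    Assumes beat_times is sorted and has at least two entries; the position
    is linearly interpolated between the two neighbouring beats.
    """
    # Index of the last beat at or before the onset, clamped so i + 1 is valid.
    i = bisect_right(beat_times, onset) - 1
    i = max(0, min(i, len(beat_times) - 2))
    start, end = beat_times[i], beat_times[i + 1]
    return i + (onset - start) / (end - start)


def quantize(position, subdivisions=4):
    """Snap a fractional beat position to the nearest 1/subdivisions of a beat."""
    return Fraction(round(position * subdivisions), subdivisions)


# Hypothetical tracked beats from a performance that gradually slows down.
beat_times = [0.00, 0.50, 1.02, 1.60, 2.25]
# Hypothetical note onsets (seconds) from a note-level transcription.
onsets = [0.01, 0.26, 0.49, 1.30, 1.93]

positions = [quantize(onset_to_beat_position(t, beat_times)) for t in onsets]
print([str(p) for p in positions])
```

Because positions are measured relative to the tracked beats rather than to a fixed tempo, the same rhythmic figure lands on the same grid points even when the performer speeds up or slows down.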

For the holistic methods, we use sequence-to-sequence models, which convert an audio spectrogram into a symbolic music score representation. We first explore RNN-based models using a LilyPond score representation, and demonstrate that multitask learning, in which the model jointly predicts a piano roll representation, improves performance. We then examine Transformer models, adopting an event-based symbolic score representation. Our results show that Transformers outperform RNNs, and that a long short-term decoding strategy improves the model’s performance both in note-level transcription and in capturing long-term musical features in score-level transcription.
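
As a rough illustration of what an event-based symbolic score representation can look like, the sketch below flattens a few quantised notes into a token sequence that a sequence-to-sequence decoder could emit one token at a time. The token names (`bar`, `pos_*`, `pitch_*`, `dur_*`) and the function are assumptions made for this example, not the vocabulary used in the thesis.

```python
from fractions import Fraction


def notes_to_events(notes, beats_per_bar=4):
    """Flatten (onset_beat, midi_pitch, duration_beats) notes into tokens.

    Hypothetical vocabulary: "bar" marks a new bar, "pos_x" the position in
    beats within the bar, "pitch_x" a MIDI pitch, "dur_x" a length in beats.
    """
    events = []
    current_bar = -1
    for onset, pitch, duration in sorted(notes):
        bar, pos = divmod(Fraction(onset), beats_per_bar)
        # Emit one "bar" token per bar boundary crossed, including empty bars.
        while current_bar < bar:
            events.append("bar")
            current_bar += 1
        events.append(f"pos_{pos}")
        events.append(f"pitch_{pitch}")
        events.append(f"dur_{Fraction(duration)}")
    return events


# Three made-up notes: C4 and E4 in the first bar, G4 at the start of the second.
notes = [(0, 60, 1), (1, 64, Fraction(1, 2)), (4, 67, 2)]
events = notes_to_events(notes)
print(events)
```

A representation of this kind keeps rhythm and bar structure explicit in the output sequence, which is what distinguishes a score-level target from a piano roll.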

In summary, this thesis explores audio-to-score piano transcription through two approaches: pipeline-based methods and holistic methods. We hope it can inspire future research in AMT and related fields such as music generation, music education, and performance analysis.

Publication Year: 2026
Author: Lele Liu
Supervisors: Emmanouil Benetos, Veronica Morfi, Simon Dixon
Institution: Queen Mary University of London
Keywords: automatic music transcription, audio-to-score, deep learning