Automatic Transcription of Polyphonic Music Exploiting Temporal Evolution

Automatic music transcription is the process of converting an audio recording into a symbolic representation using musical notation. It has numerous applications in music information retrieval, computational musicology, and the creation of interactive systems. Even for expert musicians, transcribing polyphonic pieces of music is not a trivial task, and while the problem of automatic pitch estimation for monophonic signals is considered to be solved, the creation of an automated system able to transcribe polyphonic music without setting restrictions on the degree of polyphony and the instrument type still remains open. In this thesis, research on automatic transcription is performed by explicitly incorporating information on the temporal evolution of sounds. First efforts address the problem by focusing on signal processing techniques and by proposing audio features utilising temporal characteristics. Techniques for note onset and offset detection are also utilised for improving ...

Benetos, Emmanouil — Centre for Digital Music, Queen Mary University of London


Some Contributions to Music Signal Processing and to Mono-Microphone Blind Audio Source Separation

For humans, sound is valuable mostly for its meaning: the voice carries spoken language, music, artistic intent. Our physiological hearing apparatus is highly developed, as is our understanding of the underlying processes. Replicating this analysis with a computer is a challenge: in many respects, its capabilities do not match those of human beings when it comes to recognizing speech or musical instruments from sound, to name a few examples. In this thesis, two problems are investigated: source separation and music signal processing. The first part investigates source separation using only one microphone. The problem of source separation arises when several audio sources are active at the same time, mixed together and acquired by sensors (one in our case). In this kind of situation it is natural for a human to separate and to recognize ...

Schutz, Antony — Eurecom/Mobile


Interactive Real-time Musical Systems

This thesis focuses on the development of automatic accompaniment systems. We investigate previous systems and look at a range of approaches that have been attempted for the problem of beat tracking. Most beat trackers are intended for the purposes of music information retrieval where a ‘black box’ approach is tested on a wide variety of music genres. We highlight some of the difficulties facing offline beat trackers and design a new approach for the problem of real-time drum tracking, developing a system, B-Keeper, which makes reasonable assumptions on the nature of the signal and is provided with useful prior knowledge. Having developed the system with offline studio recordings, we look to test the system with human players. Existing offline evaluation methods seem less suitable for a performance system, since we also wish to evaluate the interaction between musician and ...

Robertson, Andrew — Queen Mary, University of London


Acoustic Event Detection: Feature, Evaluation and Dataset Design

It takes more effort to think of a silent scene, action or event than to find one that emits sound. Not only speaking or playing music but almost everything that happens is accompanied by, or results in, one or more sounds mixed together. This makes acoustic event detection (AED) one of the most researched topics in audio signal processing today, and it will probably not see a decline in the near future. This is due to the thirst for understanding and digitally abstracting more and more events in life via the enormous amount of audio recorded through thousands of applications in our daily routine. But it is also a result of two intrinsic properties of audio: it does not require a direct line of sight to be perceived, and it is less intrusive to record than image or video. Many applications such ...

Mounir, Mina — KU Leuven, ESAT STADIUS


Deep Learning for Event Detection, Sequence Labelling and Similarity Estimation in Music Signals

When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece – be it for an academic interest in emulating human perception, or for practical applications – we thus cannot directly replicate the steps taken by a human. We can, however, exploit the fact that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations – referred to as deep neural networks – and promises to obtain better results or require less domain knowledge than more traditional techniques. In particular, I employ ...

Schlüter, Jan — Department of Computational Perception, Johannes Kepler University Linz


Representation and Metric Learning Advances for Deep Neural Network Face and Speaker Biometric Systems

The increasing use of technological devices and biometric recognition systems in people's daily lives has motivated a great deal of research interest in the development of effective and robust systems. However, there are still some challenges to be solved in these systems when Deep Neural Networks (DNNs) are employed. For this reason, this thesis proposes different approaches to address these issues. First of all, we have analyzed the effect of introducing the most widespread DNN architectures to develop systems for face and text-dependent speaker verification tasks. In this analysis, we observed that state-of-the-art DNNs established for many tasks, including face verification, did not perform efficiently for text-dependent speaker verification. Therefore, we have conducted a study to find the cause of this poor performance and we have noted that under certain circumstances this problem is due to the use of a ...

Mingote, Victoria — University of Zaragoza


Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora

When exposed to noise, speakers will modify the way they speak in an effort to maintain intelligible communication. This process, which is referred to as the Lombard effect (LE), involves a combination of both conscious and subconscious articulatory adjustment. Speech production variations due to LE can cause considerable degradation in automatic speech recognition (ASR), since they introduce a mismatch between parameters of the speech to be recognized and the ASR system’s acoustic models, which are usually trained on neutral speech. The main objective of this thesis is to analyze the impact of LE on speech production and to propose methods that increase ASR system performance under LE. All presented experiments were conducted on spoken Czech, yet the proposed concepts are assumed to be applicable to other languages. The first part of the thesis focuses on the design and acquisition of a ...

Boril, Hynek — Czech Technical University in Prague


Structured Sparse Decompositions: Application to Object-Based Representation of Music

The amount of digital music available, both on the Internet and to each listener, has grown considerably over the past ten years. Organizing and accessing this amount of data demands that additional information be available, such as artist, album and song names, musical genre, tempo, mood or other symbolic or semantic attributes. Automatic music indexing has thus become a challenging research area. While some tasks are now handled correctly for certain types of music, such as automatic genre classification for stereotypical music, musical instrument recognition in solo performances and tempo extraction, others are more difficult to perform. For example, automatic transcription of polyphonic signals and instrument ensemble recognition are still limited to some particular cases. The goal of our study is not to obtain a perfect transcription of the signals and an exact classification of all the instruments ...

Leveau, Pierre — Université Pierre et Marie Curie, Télécom ParisTech


Statistical and Discriminative Language Modeling for Turkish Large Vocabulary Continuous Speech Recognition

Turkish, being an agglutinative language with rich morphology, presents challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. First, the agglutinative nature of Turkish leads to a high number of Out-of-Vocabulary (OOV) words, which in turn lower Automatic Speech Recognition (ASR) accuracy. Second, Turkish has a relatively free word order that leads to non-robust language model estimates. These challenges have mostly been handled by using meaningful segmentations of words, called sub-lexical units, in language modeling. However, a shortcoming of sub-lexical units is over-generation, which needs to be dealt with for higher accuracies. This dissertation aims to address the challenges of Turkish in LVCSR. Grammatical and statistical sub-lexical units for language modeling are investigated, and they yield substantial improvements over the word language models. Our novel approach inspired by dynamic vocabulary adaptation mostly recovers the errors caused by over-generation and ...
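The OOV argument above can be illustrated with a toy segmentation sketch. The morph inventory and the greedy matcher below are hypothetical illustrations, not the grammatical or statistical units used in the thesis; the point is only that a small inventory of sub-lexical units can cover full word forms that never occur whole in the training vocabulary.

```python
# Toy illustration of sub-lexical coverage in an agglutinative language.
# The morph inventory is a hand-picked (hypothetical) set of Turkish morphs.
morphs = {"ev", "ler", "de", "ki"}

def segment(word):
    """Greedy longest-match segmentation of a word into known morphs.

    Returns the list of morphs, or None if the word cannot be covered
    by the inventory (i.e. it would remain out-of-vocabulary).
    """
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in morphs:
                units.append(word[i:j])
                i = j
                break
        else:
            return None  # no morph matches at position i
    return units

# "evlerdeki" ("the one in the houses") was never seen as a whole word,
# yet all of its morphs are in the inventory:
assert segment("evlerdeki") == ["ev", "ler", "de", "ki"]
# A word with no matching morphs stays uncovered:
assert segment("xyz") is None
```

A word-level vocabulary would need every inflected form listed explicitly; the morph inventory covers their combinations, which is why sub-lexical modeling reduces OOV rates.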

Arisoy, Ebru — Bogazici University


Towards Automatic Extraction of Harmony Information from Music Signals

In this thesis we address the subject of automatic extraction of harmony information from audio recordings. We focus on chord symbol recognition and methods for evaluating algorithms designed to perform that task. We present a novel six-dimensional model for equal-tempered pitch space based on concepts from neo-Riemannian music theory. This model is employed as the basis of a harmonic change detection function which we use to improve the performance of a chord recognition algorithm. We develop a machine-readable text syntax for chord symbols and present a hand-labelled chord transcription collection of 180 Beatles songs annotated using this syntax. This collection has been made publicly available and is already widely used for evaluation purposes in the research community. We also introduce methods for comparing chord symbols which we subsequently use for analysing the statistics of the transcription collection. ...
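As a rough illustration of what a machine-readable chord syntax enables, the sketch below parses labels of a simplified root:quality/bass form. The regular expression and field names are illustrative assumptions; the actual syntax defined in the thesis is richer, covering degree lists and omitted notes.

```python
# Minimal sketch of parsing chord labels of the form root:quality/bass,
# e.g. "C#:min7/b3". Simplified for illustration; not the full syntax.
import re

CHORD = re.compile(r"^([A-G][#b]*)(?::([a-z0-9]+))?(?:/([#b]?\d+))?$")

def parse_chord(label):
    """Split a chord label into root, quality and bass degree fields."""
    m = CHORD.match(label)
    if m is None:
        raise ValueError(f"not a valid chord label: {label}")
    root, quality, bass = m.groups()
    # Defaults: a bare root note means a major triad over its own root.
    return {"root": root, "quality": quality or "maj", "bass": bass or "1"}

assert parse_chord("C#:min7/b3") == {"root": "C#", "quality": "min7", "bass": "b3"}
assert parse_chord("Bb") == {"root": "Bb", "quality": "maj", "bass": "1"}
```

Because every label decomposes into the same fields, chord symbols from different annotators or algorithms can be compared field by field, which is what evaluation requires.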

Harte, Christopher — Queen Mary, University of London


Integration of Neural Networks and Probabilistic Spatial Models for Acoustic Blind Source Separation

Despite much progress in speech separation, enhancement, and automatic speech recognition, realistic meeting recognition is still largely unsolved. Most research on speech separation either focuses on spectral cues to address single-channel recordings or on spatial cues to separate multi-channel recordings, and relies exclusively on either neural networks or probabilistic graphical models. Integrating a spatial clustering approach and a deep learning approach using spectral cues in a single framework can significantly improve automatic speech recognition performance and improve generalizability, given that a neural network profits from a vast amount of training data while the probabilistic counterpart adapts to the current scene. This thesis therefore concentrates on the integration of two fairly disjoint research streams, namely single-channel deep learning-based source separation and multi-channel probabilistic model-based source separation. It provides a general framework to integrate spatial and spectral cues in ...

Drude, Lukas — Paderborn University


Automatic Speaker Characterization: Identification of Gender, Age, Language and Accent from Speech Signals

Speech signals carry important information about a speaker such as age, gender, language, accent and emotional/psychological state. Automatic recognition of speaker characteristics has a wide range of commercial, medical and forensic applications such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. This research aims to develop accurate methods and tools to identify different physical characteristics of the speakers. Due to the lack of required databases, our experiments cover, among all speaker characteristics, gender recognition, age estimation, language recognition and accent/dialect identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state. For speaker characterization, we first convert variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms. This is performed by fitting a probability density function to acoustic ...
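The duration-normalization step described above can be sketched as follows, using a single diagonal Gaussian in place of the richer density models typically fitted in practice (a simplifying assumption for illustration): the parameters of the fitted density form a vector whose dimension depends only on the feature dimensionality, not on the utterance length.

```python
# Sketch: map a variable-length sequence of acoustic feature frames to a
# fixed-dimensional vector by fitting a density to the frame distribution.
# Here the density is a single diagonal Gaussian (illustrative choice).
import numpy as np

def fixed_vector(frames):
    """frames: (n_frames, n_features) array of acoustic features.

    Fit a diagonal Gaussian to the frames and stack its parameters
    (per-dimension mean and variance) into one fixed-length vector.
    """
    mu = frames.mean(axis=0)
    var = frames.var(axis=0)
    return np.concatenate([mu, var])

rng = np.random.default_rng(0)
v_short = fixed_vector(rng.normal(size=(150, 13)))   # short utterance
v_long = fixed_vector(rng.normal(size=(1200, 13)))   # long utterance
# Both utterances map to vectors of the same dimension (2 * 13 = 26),
# so a standard classifier or regressor can consume them directly.
assert v_short.shape == v_long.shape == (26,)
```

Standard classification and regression algorithms expect fixed-size inputs; this mapping is what makes them applicable to utterances of arbitrary duration.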

Bahari, Mohamad Hasan — KU Leuven


Improving Speech Recognition for Pluricentric Languages exemplified on Varieties of German

A method is presented to improve speech recognition for pluricentric languages. The effects of adapting both acoustic data and phonetic transcriptions for several subregions of the German-speaking area are investigated and discussed. All experiments were carried out for German spoken in Germany and Austria using large telephone databases (SpeechDat). In the first part, triphone-based acoustic models (AMOs) were trained for several regions and their word error rates (WERs) compared. The WERs vary between 9.89% and 21.78%, demonstrating the importance of regional variety adaptation. In the pronunciation modeling part, narrow phonetic transcriptions for a subset of the Austrian database were carried out to derive pronunciation rules for Austrian German and to generate phonetic lexica for Austrian German, which are the first of their kind. These lexica were used for both triphone-based and monophone-based AMOs with German and ...

Baum, Micha — TU Graz


Identification of Versions of the Same Musical Composition by Processing Audio Descriptions

Automatically making sense of digital information, and especially of digital music documents, is an important problem our modern society is facing. In fact, there are still many tasks that, although easily performed by humans, cannot be effectively performed by a computer. In this work we focus on one such task: the identification of versions of a musical piece (alternate renditions of the same musical composition, such as cover songs, live recordings, remixes, etc.). In particular, we adopt a computational approach based solely on the information provided by the audio signal. We propose a system for version identification that is robust to the main musical changes between versions, including timbre, tempo, key and structure changes. Such a system exploits nonlinear time series analysis tools and standard methods for quantitative music description, and it does not make use of a specific modeling strategy ...

Serrà, Joan — Universitat Pompeu Fabra


Discrete-Time Speech Processing with Application to Emotion Recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of audio recordings derived from a call center. The thesis comprises two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems developed are prerequisites for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way, the volume of audio recordings that need to be processed is significantly reduced, while the system is automated. To detect the presence of a dialogue, several systems are developed. This is the first effort in the international literature in which the audio channel is exclusively exploited. Also, it is the first time that the speaker utterance ...

Kotti, Margarita — Aristotle University of Thessaloniki
