Making music through real-time voice timbre analysis: machine learning and timbral control

People can achieve rich musical expression through vocal sound -- see for example human beatboxing, which achieves a wide timbral variety through a range of extended techniques. Yet the vocal modality is under-exploited as a controller for music systems. If we can analyse a vocal performance suitably in real time, then this information could be used to create voice-based interfaces with the potential for intuitive and fulfilling levels of expressive control. Conversely, many modern techniques for music synthesis do not imply any particular interface. Should a given parameter be controlled via a MIDI keyboard, or a slider/fader, or a rotary dial? Automatic vocal analysis could provide a fruitful basis for expressive interfaces to such electronic musical instruments. The principal questions in applying vocal-based control are how to extract musically meaningful information from the voice signal in real time, and how ...
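A concrete, minimal example of a real-time timbral descriptor of the kind such an interface might extract (an illustrative assumption, not the thesis's actual feature set) is the spectral centroid, a coarse correlate of perceived brightness:

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of one audio frame:
    a cheap per-frame timbral descriptor usable in real time."""
    mags = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    if mags.sum() == 0:
        return 0.0
    return float((freqs * mags).sum() / mags.sum())

# A pure 1 kHz tone has its centroid at 1 kHz.
sr = 16000
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 1000 * t)
print(round(spectral_centroid(tone, sr)))  # → 1000
```

A voice-driven controller would map several such descriptors, updated every few milliseconds, onto synthesis parameters.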

Stowell, Dan — Queen Mary University of London


Pitch-informed solo and accompaniment separation

This thesis addresses the development of a system for pitch-informed solo and accompaniment separation capable of separating main instruments from the music accompaniment regardless of the track's musical genre or the type of accompaniment. For the solo instrument, only pitched monophonic instruments were considered, in a single-channel scenario where no panning or spatial location information is available. In the proposed method, pitch information is used as an initial stage of a sinusoidal modeling approach that attempts to estimate the spectral information of the solo instrument from a given audio mixture. Instead of estimating the solo instrument on a frame-by-frame basis, the proposed method gathers information on tone objects to perform separation. Tone-based processing allowed the inclusion of novel processing stages for attack refinement, transient interference reduction, common amplitude modulation (CAM) of tone objects, and for better ...
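As a toy illustration of the pitch-informed idea (not the thesis's separation pipeline), once an f0 estimate is available, spectral information for the solo instrument can be read off near the expected harmonic frequencies:

```python
import numpy as np

def harmonic_magnitudes(frame, sr, f0, n_harmonics=5):
    """Toy pitch-informed analysis: for a known f0, pick the peak
    magnitude in a narrow band around each expected harmonic."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    mags = []
    for h in range(1, n_harmonics + 1):
        band = (freqs > h * f0 * 0.95) & (freqs < h * f0 * 1.05)
        mags.append(float(spec[band].max()) if band.any() else 0.0)
    return mags

sr = 16000
t = np.arange(4096) / sr
# Harmonic tone: fundamental at 220 Hz, weaker 2nd harmonic.
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
m = harmonic_magnitudes(x, sr, 220, n_harmonics=3)
print(m[1] / m[0])  # close to 0.5 (window scalloping shifts it slightly)
```

A real system would refine these estimates over time and subtract the modeled sinusoids from the mixture.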

Cano Cerón, Estefanía — Ilmenau University of Technology


Melody Extraction from Polyphonic Music Signals

Music was the first mass-market industry to be completely restructured by digital technology, and today we can have access to thousands of tracks stored locally on our smartphone and millions of tracks through cloud-based music services. Given the vast quantity of music at our fingertips, we now require novel ways of describing, indexing, searching and interacting with musical content. In this thesis we focus on a technology that opens the door to a wide range of such applications: automatically estimating the pitch sequence of the melody directly from the audio signal of a polyphonic music recording, also referred to as melody extraction. Whilst identifying the pitch of the melody is something human listeners can do quite well, doing this automatically is highly challenging. We present a novel method for melody extraction based on the tracking and characterisation of the pitch ...
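Extracting melody from polyphony is far harder than monophonic pitch tracking, but the monophonic building block gives a feel for the task. A minimal autocorrelation-based f0 estimator, illustrative only and not the method of the thesis:

```python
import numpy as np

def f0_autocorr(frame, sr, fmin=80.0, fmax=1000.0):
    """Monophonic f0 estimate: pick the autocorrelation peak in the
    lag range corresponding to [fmin, fmax]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(2048) / sr
print(round(f0_autocorr(np.sin(2 * np.pi * 440 * t), sr), 1))  # close to 440 Hz
```

The integer-lag grid quantises the estimate; practical trackers interpolate the peak and add temporal continuity constraints, and polyphonic melody extraction must additionally select the melody among many concurrent pitches.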

Salamon, Justin — Universitat Pompeu Fabra


Adaptive Edge-Enhanced Correlation Based Robust and Real-Time Visual Tracking Framework and Its Deployment in Machine Vision Systems

An adaptive edge-enhanced correlation-based robust and real-time visual tracking framework, and two machine vision systems based on the framework, are proposed. The visual tracking algorithm can track any object of interest in video acquired from a stationary or moving camera. It can handle real-world problems such as noise, clutter, occlusion, uneven illumination, varying appearance, orientation, scale, and velocity of the maneuvering object, and object fading and obscuration in low-contrast video at various zoom levels. The proposed machine vision systems are an active camera tracking system and a vision-based system for a UGV (unmanned ground vehicle) to handle a road intersection. The core of the proposed visual tracking framework is an Edge Enhanced Back-propagation neural-network Controlled Fast Normalized Correlation (EE-BCFNC), which makes the object localization stage efficient and robust to noise, object fading, obscuration, and uneven ...
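The EE-BCFNC combines several learned components; the plain zero-mean normalized cross-correlation it builds on can be sketched as follows (an illustrative sketch with hypothetical names, not the thesis's implementation):

```python
import numpy as np

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equal-size patches:
    +1 for a perfect match, invariant to brightness gain and offset."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

def match(image, template):
    """Exhaustive search: slide the template, return best top-left corner."""
    th, tw = template.shape
    ih, iw = image.shape
    scores = np.array([[ncc(image[y:y + th, x:x + tw], template)
                        for x in range(iw - tw + 1)]
                       for y in range(ih - th + 1)])
    y, x = map(int, np.unravel_index(np.argmax(scores), scores.shape))
    return (y, x), scores[y, x]

rng = np.random.default_rng(0)
img = rng.random((40, 40))
tmpl = img[10:18, 22:30].copy()
pos, score = match(img, 2.0 * tmpl + 0.3)  # gain/offset changed
print(pos, round(score, 3))  # → (10, 22) 1.0
```

The gain/offset invariance shown here is what makes normalized correlation robust to uneven illumination; the thesis's contribution lies in making such a search fast and adaptive.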

Ahmed, Javed — Electrical (Telecom.) Engineering Department, National University of Sciences and Technology, Rawalpindi, Pakistan.


Modeling Perceived Quality for Imaging Applications

People of all generations are making more and more use of digital imaging systems in their daily lives. The image content rendered by these digital imaging systems differs greatly in perceived quality depending on the system and its applications. To be able to optimize the experience of viewers of this content, understanding and modeling perceived image quality is essential. Research on modeling image quality in a full-reference framework --- where the original content can be used as a reference --- is well established in the literature. In many current applications, however, perceived image quality needs to be modeled in a no-reference framework in real time. As a consequence, the model needs to quantitatively predict the perceived quality of a degraded image without being able to compare it to its original version, and has to achieve this with limited computational complexity in order ...

Liu, Hantao — Delft University of Technology


Acoustic Event Detection: Feature, Evaluation and Dataset Design

It takes more time to think of a silent scene, action or event than to find one that emanates sound. Not only speaking or playing music, but almost everything that happens is accompanied by, or results in, one or more sounds mixed together. This makes acoustic event detection (AED) one of the most researched topics in audio signal processing today, and it will probably not see a decline in the near future. This is due to the thirst for understanding and digitally abstracting more and more events in life via the enormous amount of audio recorded through thousands of applications in our daily routine. But it is also a result of two intrinsic properties of audio: it does not require a direct line of sight to be perceived, and it is less intrusive to record than image or video. Many applications such ...

Mina Mounir — KU Leuven, ESAT STADIUS


Sound Event Detection by Exploring Audio Sequence Modelling

Everyday sounds in real-world environments are a powerful source of information by which humans can interact with their environments. Humans can infer what is happening around them by listening to everyday sounds. At the same time, it is a challenging task for a computer algorithm in a smart device to automatically recognise, understand, and interpret everyday sounds. Sound event detection (SED) is the process of transcribing an audio recording into sound event tags with onset and offset time values. This involves classification and segmentation of sound events in the given audio recording. SED has numerous applications in everyday life which include security and surveillance, automation, healthcare monitoring, multimedia information retrieval, and assisted living technologies. SED is to everyday sounds what automatic speech recognition (ASR) is to speech and automatic music transcription (AMT) is to music. The fundamental questions in designing ...
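The transcription format described above, event tags with onset and offset times, can be made concrete with a small sketch. The labels are hypothetical, and the hard part, per-frame classification, is omitted:

```python
def frames_to_events(frame_labels, hop_s):
    """Collapse per-frame class labels into (event, onset, offset) tuples:
    the segmentation step that turns classifier output into an SED transcript."""
    events, start, prev = [], None, None
    for i, lab in enumerate(frame_labels + [None]):  # sentinel flushes last run
        if lab != prev:
            if prev not in (None, "silence"):
                events.append((prev, round(start * hop_s, 3), round(i * hop_s, 3)))
            start, prev = i, lab
    return events

labels = ["silence", "dog_bark", "dog_bark", "dog_bark", "silence", "siren", "siren"]
print(frames_to_events(labels, hop_s=0.5))
# → [('dog_bark', 0.5, 2.0), ('siren', 2.5, 3.5)]
```

This makes explicit how SED couples classification (the label per frame) with segmentation (the onset/offset boundaries).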

Pankajakshan, Arjun — Queen Mary University of London


Music Language Models for Automatic Music Transcription

Much like natural language, music is highly structured, with strong priors on the likelihood of note sequences. In automatic speech recognition (ASR), these priors are called language models, which are used in addition to acoustic models and contribute greatly to the success of today's systems. However, in Automatic Music Transcription (AMT), ASR's musical equivalent, Music Language Models (MLMs) are rarely used. AMT can be defined as the process of extracting a symbolic representation from an audio signal, describing which notes were played at what time. In this thesis, we investigate the design of MLMs using recurrent neural networks (RNNs) and their use for AMT. We first look into MLM performance on a polyphonic prediction task. We observe that using musically relevant timesteps results in desirable MLM behaviour, which is not reflected in the usual evaluation metrics. We compare our model against benchmark ...
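To make the notion of a music language model concrete, here is a far simpler stand-in than the RNNs studied in the thesis: a bigram model over note sequences with add-one smoothing (illustrative only):

```python
from collections import Counter, defaultdict

class BigramNoteModel:
    """Minimal music language model: P(next note | current note) estimated
    from counts with add-one smoothing."""
    def __init__(self, sequences, vocab):
        self.vocab = vocab
        self.counts = defaultdict(Counter)
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.counts[a][b] += 1

    def prob(self, current, nxt):
        total = sum(self.counts[current].values())
        return (self.counts[current][nxt] + 1) / (total + len(self.vocab))

# Training melodies as MIDI pitches (C major scale fragments).
melodies = [[60, 62, 64, 65, 67], [60, 62, 64, 62, 60]]
model = BigramNoteModel(melodies, vocab=range(128))
# A whole-step to 62 after 60 is more likely than a half-step to 61.
print(model.prob(60, 62) > model.prob(60, 61))  # → True
```

An AMT system would combine such sequence priors with an acoustic model's frame-wise pitch evidence, exactly as ASR combines language and acoustic models.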

Ycart, Adrien — Queen Mary University of London


Understanding and Assessing Quality of Experience in Immersive Communications

eXtended Reality (XR) technology, also called Mixed Reality (MR), is in constant development and improvement in terms of hardware and software to offer relevant experiences to users. One of the advances in XR has been the introduction of real visual information into the virtual environment, offering a more natural interaction with the scene and a greater acceptance of the technology. Another advance has been achieved with the representation of the scene through a video that covers the entire environment, called 360-degree or omnidirectional video. These videos are acquired by cameras with omnidirectional lenses that cover the full 360 degrees of the scene and are generally viewed through a head-tracked Head Mounted Display (HMD). Users only visualize a subset of the 360-degree scene, called the viewport, which changes with the users' viewing direction, determined by the movements of ...

Orduna, Marta — Universidad Politécnica de Madrid


Mixed structural models for 3D audio in virtual environments

In the world of information and communications technology (ICT), strategies for innovation and development are increasingly focusing on applications that require spatial representation and real-time interaction with and within 3D-media environments. One of the major challenges such applications must address is user-centricity, reflected, for example, in the development of complexity-hiding services that let people personalize their own delivery of services. In these terms, multimodal interfaces represent a key factor for enabling an inclusive use of new technologies by everyone. Achieving this requires multimodal realistic models that describe our environment, and in particular models that accurately describe the acoustics of the environment and communication through the auditory modality. Examples of currently active research directions and application areas include 3DTV and the future internet; 3D visual-sound scene coding, transmission, and reconstruction; and teleconferencing systems, to name but ...

Geronazzo, Michele — University of Padova


Monitoring Infants by Automatic Video Processing

The objective of this work is the development of non-invasive, low-cost systems for monitoring and automatically diagnosing specific neonatal diseases by means of the analysis of suitable video signals. We focus on monitoring infants potentially at risk of diseases characterized by the presence or absence of rhythmic movements of one or more body parts. Seizures and respiratory diseases are specifically considered, but the approach is general. Seizures are defined as sudden neurological and behavioural alterations. They are age-dependent phenomena and the most common sign of central nervous system dysfunction. Neonatal seizures have their onset within the 28th day of life in newborns at term and within the 44th week of conceptional age in preterm infants. Their main causes are hypoxic-ischaemic encephalopathy, intracranial haemorrhage, and sepsis. Studies indicate an incidence rate of neonatal seizures of 2‰ of live births, and 11‰ for preterm ...

Cattani, Luca — University of Parma (Italy)


Acoustic sensor network geometry calibration and applications

In the modern world, we are increasingly surrounded by computing devices with communication links and one or more microphones, such as smartphones, tablets, laptops, and hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens up many practical applications: ASN-based speech enhancement, source localization, and event detection can be applied to teleconferencing, camera control, automation, or assisted living. For these kinds of applications, awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization ...
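A core primitive beneath both geometry calibration and speaker localization in ASNs is estimating the relative time delay between two microphone signals. A minimal cross-correlation sketch (illustrative, not the thesis's method):

```python
import numpy as np

def estimate_delay(x, y):
    """Estimate the delay of y relative to x (in samples) from the peak
    of their cross-correlation: the TDOA building block."""
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr)) - (len(x) - 1)

rng = np.random.default_rng(1)
src = rng.standard_normal(1000)
mic1 = src
mic2 = np.concatenate([np.zeros(7), src])[:1000]  # arrives 7 samples later
print(estimate_delay(mic1, mic2))  # → 7
```

From pairwise delays between many nodes, both the source positions and (in calibration) the unknown microphone geometry can be estimated jointly.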

Plinge, Axel — TU Dortmund University


Exploiting Sparsity for Efficient Compression and Analysis of ECG and Fetal-ECG Signals

Over the last decade there has been increasing interest in solutions for the continuous monitoring of health status with wireless, and in particular wearable, devices that provide remote analysis of physiological data. The use of wireless technologies has introduced new problems, such as the transmission of huge amounts of data within the constraints of devices with limited battery life. An accurate and energy-efficient telemonitoring system can be designed by reducing the amount of data that must be transmitted, which is still a challenging task on devices with both computational and energy constraints. Furthermore, it is not sufficient merely to collect and transmit data; algorithms that provide real-time analysis are also needed. In this thesis, we address the problems of compression and analysis of physiological data using the emerging frameworks of Compressive Sensing (CS) and sparse ...
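As a toy stand-in for CS reconstruction (not the algorithms developed in the thesis), Orthogonal Matching Pursuit recovers a k-sparse vector from a small number of random measurements:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily pick the column most
    correlated with the residual, then re-fit on the chosen support."""
    residual, support = y.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(42)
n, m, k = 128, 50, 3            # signal length, measurements, sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[5, 50, 90]] = [1.0, -2.0, 1.5]
x_hat = omp(A, A @ x_true, k)
print(np.allclose(x_hat, x_true, atol=1e-6))
```

Only m = 50 measurements suffice for a length-128 signal here because the signal is 3-sparse: this under-sampling is what makes CS attractive for battery-constrained telemonitoring.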

Da Poian, Giulia — University of Udine


Towards Automatic Extraction of Harmony Information from Music Signals

In this thesis we address the subject of automatic extraction of harmony information from audio recordings. We focus on chord symbol recognition and methods for evaluating algorithms designed to perform that task. We present a novel six-dimensional model for equal-tempered pitch space based on concepts from neo-Riemannian music theory. This model is employed as the basis of a harmonic change detection function which we use to improve the performance of a chord recognition algorithm. We develop a machine-readable text syntax for chord symbols and present a hand-labelled chord transcription collection of 180 Beatles songs annotated using this syntax. This collection has been made publicly available and is already widely used for evaluation purposes in the research community. We also introduce methods for comparing chord symbols which we subsequently use for analysing the statistics of the transcription collection. ...
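A minimal parser for a simplified subset of such a chord syntax, of the form root:shorthand with an optional /bass degree, might look like this (a sketch only; the published syntax is richer, supporting explicit interval lists and omissions):

```python
import re

# Simplified subset: <root>:<shorthand>(/<bass>), e.g. "C:maj", "A:min7", "G:7/5".
CHORD_RE = re.compile(r"^([A-G][#b]?)(?::([a-z0-9]+))?(?:/([#b]?\d+))?$")

def parse_chord(label):
    m = CHORD_RE.match(label)
    if not m:
        raise ValueError(f"unparsable chord label: {label}")
    root, shorthand, bass = m.groups()
    # A bare root note defaults to a major triad in root position.
    return {"root": root, "shorthand": shorthand or "maj", "bass": bass or "1"}

print(parse_chord("A:min7"))  # → {'root': 'A', 'shorthand': 'min7', 'bass': '1'}
print(parse_chord("G:7/5"))   # → {'root': 'G', 'shorthand': '7', 'bass': '5'}
```

A machine-readable syntax of this kind is what makes large hand-labelled collections usable for automated evaluation.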

Harte, Christopher — Queen Mary, University of London


Realtime and Accurate Musical Control of Expression in Voice Synthesis

In the early days of speech synthesis research, understanding voice production attracted the attention of scientists with the goal of producing intelligible speech. Later, the need to produce more natural voices led researchers to use prerecorded voice databases containing speech units, reassembled by a concatenation algorithm. As computing capacity grew, the length of units increased, going from diphones to non-uniform units in the so-called unit selection framework, using a strategy referred to as 'take the best, modify the least'. Today the new challenge in voice synthesis is the production of expressive speech or singing. The mainstream solution to this problem is based on the “there is no data like more data” paradigm: emotion-specific databases are recorded and emotion-specific units are segmented. In this thesis, we propose to restart the expressive speech synthesis problem, from its original voice ...

D'Alessandro, N. — Université de Mons
