Deep Learning for Event Detection, Sequence Labelling and Similarity Estimation in Music Signals

When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece – be it for an academic interest in emulating human perception, or for practical applications – we thus cannot directly replicate the steps taken by a human. We can, however, exploit the fact that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations – referred to as deep neural networks – and promises to obtain better results or require less domain knowledge than more traditional techniques. In particular, I employ ...

Schlüter, Jan — Department of Computational Perception, Johannes Kepler University Linz


Voice biometric system security: Design and analysis of countermeasures for replay attacks

Voice biometric systems use automatic speaker verification (ASV) technology for user authentication. Although it is among the most convenient means of biometric authentication, the robustness and security of ASV in the face of spoofing attacks (or presentation attacks) is of growing concern and is now well acknowledged by the research community. A spoofing attack involves illegitimate access to personal data of a targeted user. Replay is among the simplest attacks to mount, yet it is difficult to detect reliably, and it is the focus of this thesis. This research focuses on the analysis and design of existing and novel countermeasures for replay attack detection in ASV, organised in two major parts. The first part of the thesis investigates existing methods for spoofing detection from several perspectives. I first study the generalisability of hand-crafted features for replay detection that show promising results ...

Bhusan Chettri — Queen Mary University of London


Advanced time-domain methods for nuclear magnetic resonance spectroscopy data analysis

Over the past years, magnetic resonance spectroscopy (MRS) has been of significant importance both as a fundamental research technique in different fields and as a diagnostic tool in medical environments. With MRS, for example, spectroscopic information, such as the concentrations of chemical substances, can be determined non-invasively. To that end, the signals are first modeled by an appropriate model function and mathematical techniques are subsequently applied to determine the model parameters. In this thesis, signal processing algorithms are developed to quantify in-vivo and ex-vivo MRS signals. These are usually characterized by a poor signal-to-noise ratio, overlapping peaks, deviations from the model function and in some cases the presence of disturbing components (e.g. the residual water in proton spectra). The work presented in this thesis addresses a part of the total effort to provide accurate, efficient and automatic data analysis ...

Vanhamme, Leentje — Katholieke Universiteit Leuven


Diplophonic Voice - Definitions, models, and detection

Voice disorders need to be better understood because they may lead to reduced job prospects and social isolation. Correct treatment indication and treatment effect measurements are needed to tackle these problems. They must rely on robust outcome measures for clinical intervention studies. Diplophonia is a severe and often misunderstood sign of voice disorders. Depending on its underlying etiology, diplophonic patients typically receive treatment such as logopedic therapy or phonosurgery. In current clinical practice, diplophonia is determined auditively by the medical doctor, which is problematic from the viewpoints of evidence-based medicine and scientific methodology. The aim of this thesis is to work towards objective (i.e., automatic) detection of diplophonia. A database of 40 euphonic, 40 diplophonic and 40 dysphonic subjects has been acquired. The collected material consists of laryngeal high-speed videos and simultaneous high-quality audio recordings. All material has been ...

Aichinger, Philipp — Division of Phoniatrics-Logopedics, Department of Otorhinolaryngology, Medical University of Vienna; Signal Processing and Speech Communication Laboratory Graz University of Technology, Austria


Making music through real-time voice timbre analysis: machine learning and timbral control

People can achieve rich musical expression through vocal sound: see, for example, human beatboxing, which achieves a wide timbral variety through a range of extended techniques. Yet the vocal modality is under-exploited as a controller for music systems. If we can analyse a vocal performance suitably in real time, then this information could be used to create voice-based interfaces with the potential for intuitive and fulfilling levels of expressive control. Conversely, many modern techniques for music synthesis do not imply any particular interface. Should a given parameter be controlled via a MIDI keyboard, or a slider/fader, or a rotary dial? Automatic vocal analysis could provide a fruitful basis for expressive interfaces to such electronic musical instruments. The principal questions in applying vocal-based control are how to extract musically meaningful information from the voice signal in real time, and how ...

Stowell, Dan — Queen Mary University of London


Mixed structural models for 3D audio in virtual environments

In the world of information and communications technology (ICT), strategies for innovation and development are increasingly focusing on applications that require spatial representation and real-time interaction with and within 3D-media environments. One of the major challenges that such applications have to address is user-centricity, which involves, for example, developing complexity-hiding services so that people can personalize their own delivery of services. In these terms, multimodal interfaces represent a key factor for enabling an inclusive use of new technologies by everyone. In order to achieve this, multimodal realistic models that describe our environment are needed, and in particular models that accurately describe the acoustics of the environment and communication through the auditory modality are required. Examples of currently active research directions and application areas include 3DTV and future internet, 3D visual-sound scene coding, transmission and reconstruction, and teleconferencing systems, to name but ...

Geronazzo, Michele — University of Padova


Learning from structured EEG and fMRI data supporting the diagnosis of epilepsy

Epilepsy is a neurological condition that manifests in epileptic seizures as a result of an abnormal, synchronous activity of a large group of neurons. Depending on the affected brain regions, seizures produce various severe clinical symptoms. Epilepsy cannot be cured and in many cases is not controlled by medication either. Surgical resection of the region responsible for generating the epileptic seizures might offer a remedy for these patients. Electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) measure the changes of brain activity in time over different locations of the brain. As such, they provide valuable information on the nature, the timing and the spatial origin of the epileptic activity. Unfortunately, both techniques record activity of different brain and artefact sources as well. Hence, EEG and fMRI signals are characterised by a low signal-to-noise ratio. Data quality and the vast amount ...

Hunyadi, Borbála — KU Leuven


Prediction and Optimization of Speech Intelligibility in Adverse Conditions

In digital speech-communication systems like mobile phones, public address systems and hearing aids, conveying the message is one of the most important goals. This can be challenging since the intelligibility of the speech may be harmed at various stages before, during and after the transmission process from sender to receiver. Causes which create such adverse conditions include background noise, an unreliable internet connection during a Skype conversation or a hearing impairment of the receiver. To overcome this, many speech-communication systems include speech processing algorithms, such as noise reduction, to compensate for these signal degradations. To determine the effect of these signal-processing-based solutions on speech intelligibility, the speech signal has to be evaluated by means of a listening test with human listeners. However, such tests are costly and time-consuming. As an alternative, reliable and fast machine-driven intelligibility predictors are ...

Taal, Cees — Delft University of Technology


Geometric Approach to Statistical Learning Theory through Support Vector Machines (SVM) with Application to Medical Diagnosis

This thesis deals with problems of Pattern Recognition in the framework of Machine Learning (ML) and, specifically, Statistical Learning Theory (SLT), using Support Vector Machines (SVMs). The focus of this work is on the geometric interpretation of SVMs, which is accomplished through the notion of Reduced Convex Hulls (RCHs), and its impact on the derivation of new, efficient algorithms for the solution of the general SVM optimization task. The contributions of this work are the extension of the mathematical framework of RCHs, the derivation of novel geometric algorithms for SVMs and, finally, the application of the SVM algorithms to the field of Medical Image Analysis and Diagnosis (Mammography). Geometric SVM Framework's extensions: The geometric interpretation of SVMs is based on the notion of Reduced Convex Hulls. Although the geometric approach to SVMs is very intuitive, its usefulness was restricted by ...

Mavroforakis, Michael — University of Athens


Sparse approximation and dictionary learning with applications to audio signals

Over-complete transforms have recently become the focus of a wealth of research in signal processing, machine learning, statistics and related fields. Their great modelling flexibility makes it possible to find sparse representations and approximations of data that in turn prove to be very efficient in a wide range of applications. Sparse models express signals as linear combinations of a few basis functions called atoms taken from a so-called dictionary. Finding the optimal dictionary from a set of training signals of a given class is the objective of dictionary learning and the main focus of this thesis. The experimental evidence presented here focuses on the processing of audio signals, and the role of sparse algorithms in audio applications is accordingly highlighted. The first main contribution of this thesis is the development of a pitch-synchronous transform where the frame-by-frame analysis of audio data ...

Barchiesi, Daniele — Queen Mary University of London


Mounir, Mina

It takes more time to think of a silent scene, action or event than to find one that emanates sound. Not only speaking or playing music but almost everything that happens is accompanied by or results in one or more sounds mixed together. This makes acoustic event detection (AED) one of the most researched topics in audio signal processing nowadays, and it will probably not see a decline in the near future. This is due to the thirst for understanding and digitally abstracting more and more events in life via the enormous amount of recorded audio through thousands of applications in our daily routine. But it is also a result of two intrinsic properties of audio: it does not need a direct line of sight to be perceived and is less intrusive to record when compared to image or video. Many applications such ...

Mina Mounir — KU Leuven, ESAT STADIUS


Central and peripheral mechanisms: a multimodal approach to understanding and restoring human motor control

All human actions involve motor control. Even the simplest movement requires the coordinated recruitment of many muscles, orchestrated by neuronal circuits in the brain and the spinal cord. As a consequence, lesions affecting the central nervous system, such as stroke, can lead to a wide range of motor impairments. While a certain degree of recovery can often be achieved by harnessing the plasticity of the motor hierarchy, patients typically struggle to regain full motor control. In this context, technology-assisted interventions offer the prospect of intense, controllable and quantifiable motor training. Yet, clinical outcomes remain comparable to conventional approaches, suggesting the need for a paradigm shift towards customized knowledge-driven treatments to fully exploit their potential. In this thesis, we argue that a detailed understanding of healthy and impaired motor pathways can foster the development of therapies optimally engaging plasticity. To this ...

Kinany, Nawal — Ecole Polytechnique Fédérale de Lausanne (EPFL)


Modeling and Clustering Analysis of Pulmonary Crackles

The objective of this study is to perform two complementary analyses of pulmonary crackles, i.e. modeling and clustering, in order to interpret crackles in the time-frequency domain and also determine the optimal number of crackle types and their characteristics using the modeling parameters. Since the crackles are superimposed on background vesicular sounds, a preprocessing method for the elimination of vesicular sounds from the crackle waveform is also proposed for achieving accurate parameterization. The proposed modeling method, i.e. the wavelet network modeling, interprets the transient structure of crackles in the time-frequency space with a small number of components using the time-localization property of wavelets. In modeling analysis, complex Morlet wavelets are selected as transfer functions in the hidden nodes due to both their similarity with the crackle waveforms and their flexibility in the modeling process. Clustering analysis of crackles probes the discrepancies found ...

Yeginer, Mete — Bogazici University


Cross-Lingual Voice Conversion

Cross-lingual voice conversion refers to the automatic transformation of a source speaker’s voice to a target speaker’s voice in a language that the target speaker cannot speak. It involves a set of statistical analysis, pattern recognition, machine learning, and signal processing techniques. This study focuses on the problems related to cross-lingual voice conversion by discussing open research questions, presenting new methods, and performing comparisons with the state-of-the-art techniques. In the training stage, a Phonetic Hidden Markov Model based automatic segmentation and alignment method is developed for cross-lingual applications which support text-independent and text-dependent modes. The vocal tract transformation function is estimated using weighted speech frame mapping in more detail. By adjusting the weights, similarity to the target voice and output quality can be balanced depending on the requirements of the cross-lingual voice conversion application. A context-matching algorithm is developed to reduce ...

Turk, Oytun — Bogazici University


Radial Basis Function Network Robust Learning Algorithms in Computer Vision Applications

This thesis introduces new learning algorithms for Radial Basis Function (RBF) networks. An RBF network is a feed-forward two-layer neural network used for functional approximation or pattern classification applications. The proposed training algorithms are based on robust statistics. Their theoretical performance has been assessed and compared with that of classical algorithms for training RBF networks. The applications of RBF networks described in this thesis consist of simultaneously modeling moving object segmentation and optical flow estimation in image sequences and 3-D image modeling and segmentation. A Bayesian classifier model is used for the representation of the image sequence and 3-D images. This employs an energy based description of the probability functions involved. The energy functions are represented by RBF networks whose inputs are various features drawn from the images and whose outputs are objects. The hidden units embed kernel functions. Each kernel ...

Bors, Adrian G. — Aristotle University of Thessaloniki
