Automated audio captioning with deep learning methods (2024)
Sound Event Detection by Exploring Audio Sequence Modelling
Everyday sounds in real-world environments are a powerful source of information by which humans can interact with their environments. Humans can infer what is happening around them by listening to everyday sounds. At the same time, it is a challenging task for a computer algorithm in a smart device to automatically recognise, understand, and interpret everyday sounds. Sound event detection (SED) is the process of transcribing an audio recording into sound event tags with onset and offset time values. This involves classification and segmentation of sound events in the given audio recording. SED has numerous applications in everyday life which include security and surveillance, automation, healthcare monitoring, multimedia information retrieval, and assisted living technologies. SED is to everyday sounds what automatic speech recognition (ASR) is to speech and automatic music transcription (AMT) is to music. The fundamental questions in designing ...
[Pankajakshan], [Arjun] — Queen Mary University of London
Deep Learning for Audio Effects Modeling
Audio effects modeling is the process of emulating an audio effect unit and seeks to recreate the sound, behaviour and main perceptual features of an analog reference device. Audio effect units are analog or digital signal processing systems that transform certain characteristics of the sound source. These transformations can be linear or nonlinear, time-invariant or time-varying and with short-term and long-term memory. Most typical audio effect transformations are based on dynamics, such as compression; tone such as distortion; frequency such as equalization; and time such as artificial reverberation or modulation based audio effects. The digital simulation of these audio processors is normally done by designing mathematical models of these systems. This is often difficult because it seeks to accurately model all components within the effect unit, which usually contains mechanical elements together with nonlinear and time-varying analog electronics. Most existing ...
Martínez Ramírez, Marco A — Queen Mary University of London
Automated Face Recognition from Low-resolution Imagery
Recently, significant advances in the field of automated face recognition have been achieved using computer vision, machine learning, and deep learning methodologies. However, despite claims of super-human performance of face recognition algorithms on select key benchmark tasks, there remain several open problems that preclude the general replacement of human face recognition work with automated systems. State-of-the-art automated face recognition systems based on deep learning methods are able to achieve high accuracy when the face images they are tasked with recognizing subjects from are of sufficiently high quality. However, low image resolution remains one of the principal obstacles to face recognition systems, and their performance in the low-resolution regime is decidedly below human capabilities. In this PhD thesis, we present a systematic study of modern automated face recognition systems in the presence of image degradation in various forms. Based on our ...
Grm, Klemen — University of Ljubljana
Acoustic Event Detection: Feature, Evaluation and Dataset Design
It takes more time to think of a silent scene, action or event than finding one that emanates sound. Not only speaking or playing music but almost everything that happens is accompanied with or results in one or more sounds mixed together. This makes acoustic event detection (AED) one of the most researched topics in audio signal processing nowadays and it will probably not see a decline anywhere in the near future. This is due to the thirst for understanding and digitally abstracting more and more events in life via the enormous amount of recorded audio through thousands of applications in our daily routine. But it is also a result of two intrinsic properties of audio: it doesn’t need a direct sight to be perceived and is less intrusive to record when compared to image or video. Many applications such ...
Mina Mounir — KU Leuven, ESAT STADIUS
Single-channel source separation for radio-frequency (RF) systems is a challenging problem relevant to key applications, including wireless communications, radar, and spectrum monitoring. This thesis addresses the challenge by focusing on data-driven approaches for source separation, leveraging datasets of sample realizations when source models are not explicitly provided. To this end, deep learning techniques are employed as function approximations for source separation, with models trained using available data. Two problem abstractions are studied as benchmarks for our proposed deep-learning approaches. Through a simplified problem involving Orthogonal Frequency Division Multiplexing (OFDM), we reveal the limitations of existing deep learning solutions and suggest modifications that account for the signal modality for improved performance. Further, we study the impact of time shifts on the formulation of an optimal estimator for cyclostationary Gaussian time series, serving as a performance lower bound for evaluating data-driven methods. ...
Lee, Cheng Feng Gary — Massachusetts Institute of Technology
Time-domain music source separation for choirs and ensembles
Music source separation is the task of separating musical sources from an audio mixture. It has various direct applications including automatic karaoke generation, enhancing musical recordings, and 3D-audio upmixing; but also has implications for other downstream music information retrieval tasks such as multi-instrument transcription. However, the majority of research has focused on fixed stem separation of vocals, drums, and bass stems. While such models have highlighted capabilities of source separation using deep learning, their implications are limited to very few use cases. Such models are unable to separate most other instruments due to insufficient training data. Moreover, class-based separation inherently limits the applicability of such models to be unable to separate monotimbral mixtures. This thesis focuses on separating musical sources without requiring timbral distinction among the sources. Preliminary attempts focus on the separation of vocal harmonies from choral ensembles using ...
Sarkar, Saurjya — Queen Mary University of London
The present doctoral thesis aims towards the development of new long-term, multi-channel, audio-visual processing techniques for the analysis of bioacoustics phenomena. The effort is focused on the study of the physiology of the gastrointestinal system, aiming at the support of medical research for the discovery of gastrointestinal motility patterns and the diagnosis of functional disorders. The term "processing" in this case is quite broad, incorporating the procedures of signal processing, content description, manipulation and analysis, that are applied to all the recorded bioacoustics signals, the auxiliary audio-visual surveillance information (for the monitoring of experiments and the subjects' status), and the extracted audio-video sequences describing the abdominal sound-field alterations. The thesis outline is as follows. The main objective of the thesis, which is the technological support of medical research, is presented in the first chapter. A quick problem definition is initially ...
Dimoulas, Charalampos — Department of Electrical and Computer Engineering, Faculty of Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece
Contributions to Human Motion Modeling and Recognition using Non-intrusive Wearable Sensors
This thesis contributes to motion characterization through inertial and physiological signals captured by wearable devices and analyzed using signal processing and deep learning techniques. This research leverages the possibilities of motion analysis for three main applications: to know what physical activity a person is performing (Human Activity Recognition), to identify who is performing that motion (user identification) or know how the movement is being performed (motor anomaly detection). Most previous research has addressed human motion modeling using invasive sensors in contact with the user or intrusive sensors that modify the user’s behavior while performing an action (cameras or microphones). In this sense, wearable devices such as smartphones and smartwatches can collect motion signals from users during their daily lives in a less invasive or intrusive way. Recently, there has been an exponential increase in research focused on inertial-signal processing to ...
Gil-Martín, Manuel — Universidad Politécnica de Madrid
Some Contributions to Music Signal Processing and to Mono-Microphone Blind Audio Source Separation
For humans, the sound is valuable mostly for its meaning. The voice is spoken language, music, artistic intent. Its physiological functioning is highly developed, as well as our understanding of the underlying process. It is a challenge to replicate this analysis using a computer: in many aspects, its capabilities do not match those of human beings when it comes to speech or instruments music recognition from the sound, to name a few. In this thesis, two problems are investigated: the source separation and the musical processing. The first part investigates the source separation using only one Microphone. The problem of sources separation arises when several audio sources are present at the same moment, mixed together and acquired by some sensors (one in our case). In this kind of situation it is natural for a human to separate and to recognize ...
Schutz, Antony — Eurecome/Mobile
Music Language Models for Automatic Music Transcription
Much like natural language, music is highly structured, with strong priors on the likelihood of note sequences. In automatic speech recognition (ASR), these priors are called language models, which are used in addition to acoustic models and participate greatly to the success of today's systems. However, in Automatic Music Transcription (AMT), ASR's musical equivalent, Music Language Models (MLMs) are rarely used. AMT can be defined as the process of extracting a symbolic representation from an audio signal, describing which notes were played at what time. In this thesis, we investigate the design of MLMs using recurrent neural networks (RNNs) and their use for AMT. We first look into MLM performance on a polyphonic prediction task. We observe that using musically-relevant timesteps results in desirable MLM behaviour, which is not reflected in usual evaluation metrics. We compare our model against benchmark ...
Ycart, Adrien — Queen Mary University of London
Deep learning for semantic description of visual human traits
The recent progress in artificial neural networks (rebranded as “deep learning”) has significantly boosted the state-of-the-art in numerous domains of computer vision offering an opportunity to approach the problems which were hardly solvable with conventional machine learning. Thus, in the frame of this PhD study, we explore how deep learning techniques can help in the analysis of one the most basic and essential semantic traits revealed by a human face, namely, gender and age. In particular, two complementary problem settings are considered: (1) gender/age prediction from given face images, and (2) synthesis and editing of human faces with the required gender/age attributes. Convolutional Neural Network (CNN) has currently become a standard model for image-based object recognition in general, and therefore, is a natural choice for addressing the first of these two problems. However, our preliminary studies have shown that the ...
Antipov, Grigory — Télécom ParisTech (Eurecom)
The increasing use of technological devices and biometric recognition systems in people daily lives has motivated a great deal of research interest in the development of effective and robust systems. However, there are still some challenges to be solved in these systems when Deep Neural Networks (DNNs) are employed. For this reason, this thesis proposes different approaches to address these issues. First of all, we have analyzed the effect of introducing the most widespread DNN architectures to develop systems for face and text-dependent speaker verification tasks. In this analysis, we observed that state-of-the-art DNNs established for many tasks, including face verification, did not perform efficiently for text-dependent speaker verification. Therefore, we have conducted a study to find the cause of this poor performance and we have noted that under certain circumstances this problem is due to the use of a ...
Mingote, Victoria — University of Zaragoza
Deep Learning for Event Detection, Sequence Labelling and Similarity Estimation in Music Signals
When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece – be it for an academic interest in emulating human perception, or for practical applications –, we can thus not directly replicate the steps taken by a human. We can, however, exploit that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations – referred to as deep neural networks – and promises to obtain better results or require less domain knowledge than more traditional techniques. In particular, I employ ...
Schlüter, Jan — Department of Computational Perception, Johannes Kepler University Linz
Speech signals carry important information about a speaker such as age, gender, language, accent and emotional/psychological state. Automatic recognition of speaker characteristics has a wide range of commercial, medical and forensic applications such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. This research aims to develop accurate methods and tools to identify different physical characteristics of the speakers. Due to the lack of required databases, among all characteristics of speakers, our experiments cover gender recognition, age estimation, language recognition and accent/dialect identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state. For speaker characterization, we first convert variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms. This is performed by fitting a probability density function to acoustic ...
Bahari, Mohamad Hasan — KU Leuven
Guitar Tablature Generation with Deep Learning
The burgeoning of deep learning-based music generation has overlooked the potential of symbolic representations tailored for fretted instruments. Guitar tablatures offer an advantageous approach to represent prescriptive information about music performance, often missing from standard MIDI representations. This dissertation tackles a gap in symbolic music generation by developing models that predict both musical structures and expressive guitar performance techniques. We first present DadaGP, a dataset comprising over 25k songs converted from the Guitar Pro tablature format to a dedicated token format suiting sequence models such as the Transformer. To establish a benchmark, we first introduce a baseline unconditional model for guitar tablature generation, by training a Transformer-XL architecture on the DadaGP dataset. We explored various architecture configurations and experimented with two different tokenisation approaches. Delving into controllability of the generative process, we introduce methods for manipulating the output's instrumentation (inst-CTRL) ...
Sarmento, Pedro — Queen Mary University of London
The current layout is optimized for mobile phones. Page previews, thumbnails, and full abstracts will remain hidden until the browser window grows in width.
The current layout is optimized for tablet devices. Page previews and some thumbnails will remain hidden until the browser window grows in width.