Deep Learning Techniques for Visual Counting

The explosion of Deep Learning (DL) added a boost to the already rapidly developing field of Computer Vision to such a point that vision-based tasks are now parts of our everyday lives. Applications such as image classification, photo stylization, or face recognition are nowadays pervasive, as evidenced by the advent of modern systems trivially integrated into mobile applications. In this thesis, we investigated and enhanced the visual counting task, which automatically estimates the number of objects in still images or video frames. Recently, due to the growing interest in it, several Convolutional Neural Network (CNN)-based solutions have been suggested by the scientific community. These artificial neural networks, inspired by the organization of the animal visual cortex, provide a way to automatically learn effective representations from raw visual data and can be successfully employed to address typical challenges characterizing this task, ...

Ciampi Luca — University of Pisa


Automated Face Recognition from Low-resolution Imagery

Recently, significant advances in the field of automated face recognition have been achieved using computer vision, machine learning, and deep learning methodologies. However, despite claims of super-human performance of face recognition algorithms on select key benchmark tasks, there remain several open problems that preclude the general replacement of human face recognition work with automated systems. State-of-the-art automated face recognition systems based on deep learning methods are able to achieve high accuracy when the face images they are tasked with recognizing subjects from are of sufficiently high quality. However, low image resolution remains one of the principal obstacles to face recognition systems, and their performance in the low-resolution regime is decidedly below human capabilities. In this PhD thesis, we present a systematic study of modern automated face recognition systems in the presence of image degradation in various forms. Based on our ...

Grm, Klemen — University of Ljubljana


General Approaches for Solving Inverse Problems with Arbitrary Signal Models

Ill-posed inverse problems appear in many signal and image processing applications, such as deblurring, super-resolution and compressed sensing. The common approach to address them is to design a specific algorithm, or recently, a specific deep neural network, for each problem. Both signal processing and machine learning tactics have drawbacks: traditional reconstruction strategies exhibit limited performance for complex signals, such as natural images, due to the hardness of their mathematical modeling; while modern works that circumvent signal modeling by training deep convolutional neural networks (CNNs) suffer from a huge performance drop when the observation model used in training is inexact. In this work, we develop and analyze reconstruction algorithms that are not restricted to a specific signal model and are able to handle different observation models. Our main contributions include: (a) We generalize the popular sparsity-based CoSaMP algorithm to any signal ...

Tirer, Tom — Tel Aviv University


Deep Learning for Event Detection, Sequence Labelling and Similarity Estimation in Music Signals

When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece – be it for an academic interest in emulating human perception, or for practical applications –, we can thus not directly replicate the steps taken by a human. We can, however, exploit that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations – referred to as deep neural networks – and promises to obtain better results or require less domain knowledge than more traditional techniques. In particular, I employ ...

Schlüter, Jan — Department of Computational Perception, Johannes Kepler University Linz


A Geometric Deep Learning Approach to Sound Source Localization and Tracking

The localization and tracking of sound sources using microphone arrays is a problem that, even if it has attracted attention from the signal processing research community for decades, remains open. In recent years, deep learning models have surpassed the state-of-the-art that had been established by classic signal processing techniques, but these models still struggle with handling rooms with strong reverberations or tracking multiple sources that dynamically appear and disappear, especially when we cannot apply any criteria to classify or order them. In this thesis, we follow the ideas of the Geometric Deep Learning framework to propose new models and techniques that mean an advance of the state-of-the-art in the aforementioned scenarios. As the input of our models, we use acoustic power maps computed using the SRP-PHAT algorithm, a classic signal processing technique that allows us to estimate the acoustic energy ...

Diaz-Guerra, David — University of Zaragoza


Facial Soft Biometrics: Methods, Applications and Solutions

This dissertation studies soft biometrics traits, their applicability in different security and commercial scenarios, as well as related usability aspects. We place the emphasis on human facial soft biometric traits which constitute the set of physical, adhered or behavioral human characteristics that can partially differentiate, classify and identify humans. Such traits, which include characteristics like age, gender, skin and eye color, the presence of glasses, moustache or beard, inherit several advantages such as ease of acquisition, as well as a natural compatibility with how humans perceive their surroundings. Specifically, soft biometric traits are compatible with the human process of classifying and recalling our environment, a process which involves constructions of hierarchical structures of different refined traits. This thesis explores these traits, and their application in soft biometric systems (SBSs), and specifically focuses on how such systems can achieve different goals ...

Dantcheva, Antitza — EURECOM / Telecom ParisTech


Multi-channel EMG pattern classification based on deep learning

In recent years, a huge body of data generated by various applications in domains like social networks and healthcare have paved the way for the development of high performance models. Deep learning has transformed the field of data analysis by dramatically improving the state of the art in various classification and prediction tasks. Combined with advancements in electromyography it has given rise to new hand gesture recognition applications, such as human computer interfaces, sign language recognition, robotics control and rehabilitation games. The purpose of this thesis is to develop novel methods for electromyography signal analysis based on deep learning for the problem of hand gesture recognition. Specifically, we focus on methods for data preparation and developing accurate models even when few data are available. Electromyography signals are in general one-dimensional time-series with a rich frequency content. Various feature sets have ...

Tsinganos, Panagiotis — University of Patras, Greece - Vrije Universiteit Brussel, Belgium


Contributions to Human Motion Modeling and Recognition using Non-intrusive Wearable Sensors

This thesis contributes to motion characterization through inertial and physiological signals captured by wearable devices and analyzed using signal processing and deep learning techniques. This research leverages the possibilities of motion analysis for three main applications: to know what physical activity a person is performing (Human Activity Recognition), to identify who is performing that motion (user identification) or know how the movement is being performed (motor anomaly detection). Most previous research has addressed human motion modeling using invasive sensors in contact with the user or intrusive sensors that modify the user’s behavior while performing an action (cameras or microphones). In this sense, wearable devices such as smartphones and smartwatches can collect motion signals from users during their daily lives in a less invasive or intrusive way. Recently, there has been an exponential increase in research focused on inertial-signal processing to ...

Gil-Martín, Manuel — Universidad Politécnica de Madrid


Discrete-time speech processing with application to emotion recognition

The subject of this PhD thesis is the efficient and robust processing and analysis of the audio recordings that are derived from a call center. The thesis is comprised of two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems that are developed are prerequisite for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. This way the volume of the audio recordings that need to be processed is significantly reduced, while the system is automated. To detect the presence of a dialogue several systems are developed. This is the first effort found in the international literature that the audio channel is exclusively exploited. Also, it is the first time that the speaker utterance ...

Kotti, Margarita — Aristotle University of Thessaloniki


Learning Transferable Knowledge through Embedding Spaces

The unprecedented processing demand, posed by the explosion of big data, challenges researchers to design efficient and adaptive machine learning algorithms that do not require persistent retraining and avoid learning redundant information. Inspired from learning techniques of intelligent biological agents, identifying transferable knowledge across learning problems has been a significant research focus to improve machine learning algorithms. In this thesis, we address the challenges of knowledge transfer through embedding spaces that capture and store hierarchical knowledge. In the first part of the thesis, we focus on the problem of cross-domain knowledge transfer. We first address zero-shot image classification, where the goal is to identify images from unseen classes using semantic descriptions of these classes. We train two coupled dictionaries which align visual and semantic domains via an intermediate embedding space. We then extend this idea by training deep networks that ...

Mohammad Rostami — University of Pennsylvania


Deep Learning for i-Vector Speaker and Language Recognition

Over the last few years, i-vectors have been the state-of-the-art technique in speaker and language recognition. Recent advances in Deep Learning (DL) technology have improved the quality of i-vectors but the DL techniques in use are computationally expensive and need speaker or/and phonetic labels for the background data, which are not easily accessible in practice. On the other hand, the lack of speaker-labeled background data makes a big performance gap, in speaker recognition, between two well-known cosine and Probabilistic Linear Discriminant Analysis (PLDA) i-vector scoring techniques. It has recently been a challenge how to fill this gap without speaker labels, which are expensive in practice. Although some unsupervised clustering techniques are proposed to estimate the speaker labels, they cannot accurately estimate the labels. This thesis tries to solve the problems above by using the DL technology in different ways, without ...

Ghahabi, Omid — Universitat Politecnica de Catalunya


Automatic Speaker Characterization; Identification of Gender, Age, Language and Accent from Speech Signals

Speech signals carry important information about a speaker such as age, gender, language, accent and emotional/psychological state. Automatic recognition of speaker characteristics has a wide range of commercial, medical and forensic applications such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. This research aims to develop accurate methods and tools to identify different physical characteristics of the speakers. Due to the lack of required databases, among all characteristics of speakers, our experiments cover gender recognition, age estimation, language recognition and accent/dialect identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state. For speaker characterization, we first convert variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms. This is performed by fitting a probability density function to acoustic ...

Bahari, Mohamad Hasan — KU Leuven


Voice biometric system security: Design and analysis of countermeasures for replay attacks

Voice biometric systems use automatic speaker verification (ASV) technology for user authentication. Even if it is among the most convenient means of biometric authentication, the robustness and security of ASV in the face of spoofing attacks (or presentation attacks) is of growing concern and is now well acknowledged by the research community. A spoofing attack involves illegitimate access to personal data of a targeted user. Replay is among the simplest attacks to mount - yet difficult to detect reliably and is the focus of this thesis. This research focuses on the analysis and design of existing and novel countermeasures for replay attack detection in ASV, organised in two major parts. The first part of the thesis investigates existing methods for spoofing detection from several perspectives. I first study the generalisability of hand-crafted features for replay detection that show promising results ...

Bhusan Chettri — Queen Mary University of London


Visual ear detection and recognition in unconstrained environments

Automatic ear recognition systems have seen increased interest over recent years due to multiple desirable characteristics. Ear images used in such systems can typically be extracted from profile head shots or video footage. The acquisition procedure is contactless and non-intrusive, and it also does not depend on the cooperation of the subjects. In this regard, ear recognition technology shares similarities with other image-based biometric modalities. Another appealing property of ear biometrics is its distinctiveness. Recent studies even empirically validated existing conjectures that certain features of the ear are distinct for identical twins. This fact has significant implications for security-related applications and puts ear images on a par with epigenetic biometric modalities, such as the iris. Ear images can also supplement other biometric modalities in automatic recognition systems and provide identity cues when other information is unreliable or even unavailable. In ...

Emeršič, Žiga — University of Ljubljana, Faculty of Computer and Information Science


Representation and Metric Learning Advances for Deep Neural Network Face and Speaker Biometric Systems

The increasing use of technological devices and biometric recognition systems in people daily lives has motivated a great deal of research interest in the development of effective and robust systems. However, there are still some challenges to be solved in these systems when Deep Neural Networks (DNNs) are employed. For this reason, this thesis proposes different approaches to address these issues. First of all, we have analyzed the effect of introducing the most widespread DNN architectures to develop systems for face and text-dependent speaker verification tasks. In this analysis, we observed that state-of-the-art DNNs established for many tasks, including face verification, did not perform efficiently for text-dependent speaker verification. Therefore, we have conducted a study to find the cause of this poor performance and we have noted that under certain circumstances this problem is due to the use of a ...

Mingote, Victoria — University of Zaragoza

The current layout is optimized for mobile phones. Page previews, thumbnails, and full abstracts will remain hidden until the browser window grows in width.

The current layout is optimized for tablet devices. Page previews and some thumbnails will remain hidden until the browser window grows in width.