Structured and Sequential Representations For Human Action Recognition (2013)
Toward sparse and geometry adapted video approximations
Video signals are sequences of natural images, where images are often modeled as piecewise-smooth signals. Hence, video can be seen as a 3D piecewise-smooth signal made of piecewise-smooth regions that move through time. Based on the piecewise-smooth model and on related theoretical work on rate-distortion performance of wavelet and oracle based coding schemes, one can better analyze the appropriate coding strategies that adaptive video codecs need to implement in order to be efficient. Efficient video representations for coding purposes require the use of adaptive signal decompositions able to capture appropriately the structure and redundancy appearing in video signals. Adaptivity needs to be such that it allows for proper modeling of signals in order to represent these with the lowest possible coding cost. Video is a very structured signal with high geometric content. This includes temporal geometry (normally represented by motion ...
Divorra Escoda, Oscar — EPFL / Signal Processing Institute
The analysis of audiovisual data aims at extracting high-level information equivalent to what a human can extract. It is considered a fundamental problem that remains unsolved in its general form. Even though the inverse problem, audiovisual (sound and animation) synthesis, is judged easier than analysis, it also remains unsolved. The systematic research on these problems yields solutions that constitute the basis for a great number of continuously developing applications. In this thesis, we examine the two aforementioned fundamental problems. We propose algorithms and models of analysis and synthesis of articulated motion and undulatory (snake) locomotion, using data from video sequences. The goal of this research is the multilevel information extraction from video, like object tracking and activity recognition, and the 3-D animation synthesis in virtual environments based on the results of analysis. An ...
Panagiotakis, Costas — University of Crete
Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech
One desideratum in designing cognitive robots is autonomous learning of communication skills, just like humans. The primary step towards this goal is vocabulary acquisition. Unlike the training procedures of state-of-the-art automatic speech recognition (ASR) systems, vocabulary acquisition cannot rely on prior knowledge of language in the same way. As with infants, the acquisition process should be data-driven with multi-level abstraction and coupled with multi-modal inputs. To avoid lengthy training efforts in a word-by-word interactive learning process, a clever learning agent should be able to acquire vocabularies from continuous speech automatically. The work presented in this thesis is entitled Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech. Enlightened by the extensively studied techniques in ASR, we design computational models to discover and represent vocabularies from continuous speech with little prior knowledge of the language to ...
Sun, Meng — Katholieke Universiteit Leuven
Video Based Detection of Driver Fatigue
This thesis addresses the problem of drowsy driver detection using computer vision techniques applied to the human face. Specifically we explore the possibility of discriminating drowsy from alert video segments using facial expressions automatically extracted from video. Several approaches were previously proposed for the detection and prediction of drowsiness. There has recently been increasing interest in computer vision approaches, which are promising for detecting drowsiness because of their non-invasive nature. Previous studies with vision-based approaches detect driver drowsiness primarily by making prior assumptions about the relevant behavior, focusing on blink rate, eye closure, and yawning. Here we employ machine learning to explore, understand and exploit actual human behavior during drowsiness episodes. We have collected two datasets including facial and head movement measures. Head motion is collected through an accelerometer for the first dataset (UYAN-1) and an ...
Vural, Esra — Sabanci University
Information-Theoretic Measures of Predictability for Music Content Analysis
This thesis is concerned with determining similarity in musical audio, for the purpose of applications in music content analysis. With the aim of determining similarity, we consider the problem of representing temporal structure in music. To represent temporal structure, we propose to compute information-theoretic measures of predictability in sequences. We apply our measures to track-wise representations obtained from musical audio; thereafter we consider the obtained measures predictors of musical similarity. We demonstrate that our approach benefits music content analysis tasks based on musical similarity. For the intermediate-specificity task of cover song identification, we compare contrasting discrete-valued and continuous-valued measures of pairwise predictability between sequences. In the discrete case, we devise a method for computing the normalised compression distance (NCD) which accounts for correlation between sequences. We observe that our measure improves average performance over NCD, for sequential compression algorithms. In ...
Foster, Peter — Queen Mary University of London
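As an illustration of the normalised compression distance (NCD) mentioned in the abstract above, here is a minimal sketch of the standard NCD, not the correlation-aware variant the thesis devises. It assumes zlib as the compressor and byte strings as the discrete-valued sequences; the names `compressed_size`, `ncd`, `melody` and `noise` are hypothetical and chosen only for this example.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Standard NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy sequences: a repeating pattern of note numbers and seeded random bytes.
melody = bytes([60, 62, 64, 65, 67, 69, 71]) * 200
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(len(melody)))

print("melody vs. itself:", ncd(melody, melody))
print("melody vs. noise: ", ncd(melody, noise))
```

A sequence compared with itself should score lower than the same sequence compared with unrelated noise, which is the property that makes the distance usable as a similarity predictor.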
This dissertation deals with distributed processing techniques for parameter estimation and efficient data-gathering in wireless communication and sensor networks. The estimation problem consists in inferring a set of parameters from temporal and spatial noisy observations collected by different nodes that monitor an area or field. The objective is to derive an estimate that is as accurate as the one that would be obtained if each node had access to the information across the entire network. With the aim of enabling an energy-aware and low-complexity distributed implementation of the estimation task, several useful optimization techniques that generally yield linear estimators have been derived in the literature. Up to now, most works have assumed that the nodes are interested in estimating the same vector of global parameters. This scenario can be viewed as a special case of a more general ...
Bogdanovic, Nikola — University of Patras
Sound Event Detection by Exploring Audio Sequence Modelling
Everyday sounds in real-world environments are a powerful source of information by which humans can interact with their environments. Humans can infer what is happening around them by listening to everyday sounds. At the same time, it is a challenging task for a computer algorithm in a smart device to automatically recognise, understand, and interpret everyday sounds. Sound event detection (SED) is the process of transcribing an audio recording into sound event tags with onset and offset time values. This involves classification and segmentation of sound events in the given audio recording. SED has numerous applications in everyday life which include security and surveillance, automation, healthcare monitoring, multimedia information retrieval, and assisted living technologies. SED is to everyday sounds what automatic speech recognition (ASR) is to speech and automatic music transcription (AMT) is to music. The fundamental questions in designing ...
Pankajakshan, Arjun — Queen Mary University of London
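The abstract above describes sound event detection as producing event tags with onset and offset times. The segmentation part of that description can be sketched as follows, assuming (this is not stated in the source) that a classifier has already produced one active/inactive decision per fixed-length frame for a single event class. The function name `frames_to_events` and the parameter `hop_seconds` are hypothetical.

```python
from typing import List, Tuple


def frames_to_events(active: List[bool], hop_seconds: float) -> List[Tuple[float, float]]:
    """Merge runs of active frames into (onset, offset) pairs in seconds."""
    events: List[Tuple[float, float]] = []
    onset = None  # index of the first frame of the current run, if any
    for i, is_active in enumerate(active):
        if is_active and onset is None:
            onset = i  # a new event starts at this frame
        elif not is_active and onset is not None:
            events.append((onset * hop_seconds, i * hop_seconds))
            onset = None  # the event ended just before this frame
    if onset is not None:
        # The recording ended while the event was still active.
        events.append((onset * hop_seconds, len(active) * hop_seconds))
    return events


decisions = [False, True, True, False, True]
print(frames_to_events(decisions, hop_seconds=0.5))
```

Each returned pair is one detected event; a full system would attach a class label to each pair and typically smooth the frame decisions first.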
Search-Based Methods for the Sparse Signal Recovery Problem in Compressed Sensing
The sparse signal recovery problem, which appears not only in compressed sensing but also in other related problems such as sparse overcomplete representations, denoising, sparse learning, etc., has drawn considerable attention in the last decade. The literature contains a vast number of recovery methods, which have been analysed in theoretical and empirical aspects. This dissertation presents novel search-based sparse signal recovery methods. First, we discuss theoretical analysis of the orthogonal matching pursuit algorithm with more iterations than the number of nonzero elements of the underlying sparse signal. Second, best-first tree search is incorporated for sparse recovery by a novel method, whose tractability follows from the properly defined cost models and pruning techniques. The proposed method is evaluated by both theoretical and empirical analyses, which clearly emphasize the improvements in the recovery accuracy. Next, we introduce an iterative two ...
Karahanoglu, Nazim Burak — Sabanci University
Video person recognition strategies using head motion and facial appearance
In this doctoral dissertation, we principally explore the use of the temporal information available in video sequences for person and gender recognition; in particular, we focus on the analysis of head and facial motion, and their potential application as biometric identifiers. We also investigate how to exploit as much video information as possible for the automatic recognition; more precisely, we examine the possibility of integrating the head and mouth motion information with facial appearance into a multimodal biometric system, and we study the extraction of novel spatio-temporal facial features for recognition. We initially present a person recognition system that exploits the unconstrained head motion information, extracted by tracking a few facial landmarks in the image plane. In particular, we detail how each video sequence is firstly pre-processed by semiautomatically detecting the face, and then automatically tracking the facial landmarks over ...
Matta, Federico — Eurécom / Multimedia communications
The increasing use of technological devices and biometric recognition systems in people's daily lives has motivated a great deal of research interest in the development of effective and robust systems. However, there are still some challenges to be solved in these systems when Deep Neural Networks (DNNs) are employed. For this reason, this thesis proposes different approaches to address these issues. First of all, we have analyzed the effect of introducing the most widespread DNN architectures to develop systems for face and text-dependent speaker verification tasks. In this analysis, we observed that state-of-the-art DNNs established for many tasks, including face verification, did not perform efficiently for text-dependent speaker verification. Therefore, we have conducted a study to find the cause of this poor performance and we have noted that under certain circumstances this problem is due to the use of a ...
Mingote, Victoria — University of Zaragoza
Informed spatial filters for speech enhancement
In modern devices which provide hands-free speech capturing functionality, such as hands-free communication kits and voice-controlled devices, the received speech signal at the microphones is corrupted by background noise, interfering speech signals, and room reverberation. In many practical situations, the microphones are not necessarily located near the desired source, and hence, the ratio of the desired speech power to the power of the background noise, the interfering speech, and the reverberation at the microphones can be very low, often around or even below 0 dB. In such situations, the comfort of human-to-human communication, as well as the accuracy of automatic speech recognisers for voice-controlled applications can be significantly degraded. Therefore, effective speech enhancement algorithms are required to process the microphone signals before transmitting them to the far-end side for communication, or before feeding them into a speech recognition ...
Taseska, Maja — Friedrich-Alexander Universität Erlangen-Nürnberg
Deep learning for semantic description of visual human traits
The recent progress in artificial neural networks (rebranded as “deep learning”) has significantly boosted the state-of-the-art in numerous domains of computer vision, offering an opportunity to approach problems which were hardly solvable with conventional machine learning. Thus, in the frame of this PhD study, we explore how deep learning techniques can help in the analysis of two of the most basic and essential semantic traits revealed by a human face, namely, gender and age. In particular, two complementary problem settings are considered: (1) gender/age prediction from given face images, and (2) synthesis and editing of human faces with the required gender/age attributes. The Convolutional Neural Network (CNN) has become a standard model for image-based object recognition in general, and therefore, is a natural choice for addressing the first of these two problems. However, our preliminary studies have shown that the ...
Antipov, Grigory — Télécom ParisTech (Eurecom)
Distributed Stochastic Optimization in Non-Differentiable and Non-Convex Environments
The first part of this dissertation considers distributed learning problems over networked agents. The general objective of distributed adaptation and learning is the solution of global, stochastic optimization problems through localized interactions and without information about the statistical properties of the data. Regularization is a useful technique to encourage or enforce structural properties on the resulting solution, such as sparsity or constraints. A substantial number of regularizers are inherently non-smooth, while many cost functions are differentiable. We propose distributed and adaptive strategies that are able to minimize aggregate sums of objectives. In doing so, we exploit the structure of the individual objectives as sums of differentiable costs and non-differentiable regularizers. The resulting algorithms are adaptive in nature and able to continuously track drifts in the problem; their recursions, however, are subject to persistent perturbations arising from the stochastic nature of ...
Vlaski, Stefan — University of California, Los Angeles
Vision-based human activities recognition in supervised or assisted environment
Human Activity Recognition (HAR) has been a hot research topic in the last decade due to its wide range of applications. Indeed, it has been the basis for the implementation of many computer vision applications, home security, video surveillance, and human-computer interaction. By HAR, we mean tools and systems that detect and recognize actions performed by individuals. With the considerable progress made in sensing technologies, HAR systems shifted from wearable and ambient-based to vision-based. This motivated researchers to propose a large number of vision-based solutions. From another perspective, HAR plays an important role in the health care sector and is involved in the construction of fall detection systems and many smart home-related systems. Fall detection (FD) consists in identifying the occurrence of falls among other daily life activities. This is essential because falling is one of ...
Beddiar, Djamila Romaissa — Université De Larbi Ben M’hidi Oum EL Bouaghi, Algeria
Deep Learning for Event Detection, Sequence Labelling and Similarity Estimation in Music Signals
When listening to music, some humans can easily recognize which instruments play at what time or when a new musical segment starts, but cannot describe exactly how they do this. To automatically describe particular aspects of a music piece – be it for an academic interest in emulating human perception, or for practical applications –, we can thus not directly replicate the steps taken by a human. We can, however, exploit that humans can easily annotate examples, and optimize a generic function to reproduce these annotations. In this thesis, I explore solving different music perception tasks with deep learning, a recent branch of machine learning that optimizes functions of many stacked nonlinear operations – referred to as deep neural networks – and promises to obtain better results or require less domain knowledge than more traditional techniques. In particular, I employ ...
Schlüter, Jan — Department of Computational Perception, Johannes Kepler University Linz
