Perceptually-Based Signal Features for Environmental Sound Classification
This thesis addresses the problem of automatically classifying environmental sounds, i.e., any non-speech, non-music sounds found in the environment. Broadly speaking, such classification requires two main processes: signal feature extraction, which composes representative sound patterns, and a machine learning technique that classifies those patterns. This research focuses on the former, studying signal features that best represent the sound characteristics, since several references identify this as a key issue for robust recognition. Environmental audio signals differ in many respects from speech and music signals, so specific features should be determined and adapted to their characteristics. To this end, new signal features inspired by the human auditory system and the human perception of sound are proposed to improve the representation and classification of environmental sound signals. Firstly, in the spectral signal analysis domain, cepstral coefficients computed with biologically inspired Gammatone filters are proposed and adapted to environmental sound classification, yielding the so-called Gammatone Cepstral Coefficients (GTCC). The experimental results show higher classification rates when GTCC are used instead of the de facto standard Mel Frequency Cepstral Coefficients (MFCC) on every environmental sound set tested. The improvement is attributed to a better representation of spectral detail, especially in low-frequency bands. Secondly, the temporal signal analysis domain is introduced according to the specific characteristics of different environmental sounds.
On the one hand, Gammatone Wavelet (GTW) coefficients are proposed for parameterising surveillance-related sounds, since they merge the optimum spectral analysis of Gammatone filters with the ability of the Wavelet time-frequency transform to capture short, impulsive events. On the other hand, Narrow-Band Autocorrelation Function (NB-ACF) features are proposed for parameterising soundscape signals, since they can account for the complex characteristics of such signals, which are composed of multiple coexisting sound events. In this case, the NB-ACF features are able to represent non-spectrally overlapped sounds thanks to the detailed analysis performed in each spectral band, which consists of parameterising the Autocorrelation Function with five perceptually based descriptors. NB-ACF features, especially when combined with Gammatone filter banks, notably outperform MFCC regardless of the machine learning technique employed. Finally, a particular case is studied: the classification of the environmental noise sources that affect human health and quality of life. Preliminary work points out the difficulty of distinguishing among road vehicle noise sources (car, truck, motorbike). To improve the classification of such noise sources, a hierarchical classification system that takes the different vehicle pass-by phases into account is proposed. These are the perceptually distinguishable phases into which a vehicle pass-by can be divided: approaching, passing by and receding. The proposed scheme, working with Gaussian Mixture Models, yields classification accuracies comparable to those of a traditional approach employing Hidden Markov Models (a machine learning technique that inherently takes the signal's time evolution into account), but at a dramatically lower computational cost.
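
As an illustration of the GTCC idea summarised above (cepstral coefficients obtained from Gammatone filter outputs rather than Mel filters), the following is a minimal sketch and not the implementation used in this thesis. It assumes a frequency-domain approximation of a fourth-order Gammatone magnitude response, centre frequencies equally spaced on the ERB-rate scale, and a plain DCT-II of the log band energies. The function names, the number of bands, the lower frequency limit and the number of coefficients are all hypothetical choices made for the example.

```python
import numpy as np

def erb_rate(f_hz):
    # ERB-rate scale (Glasberg and Moore approximation).
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def inv_erb_rate(e):
    # Inverse of erb_rate: ERB-rate value back to frequency in Hz.
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def gammatone_weights(n_fft, sr, n_bands=32, f_min=50.0, order=4):
    # Assumed approximation: squared magnitude response of a Gammatone
    # filter, (1 + ((f - fc) / b)^2)^(-order), with b = 1.019 * ERB(fc).
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    centres = inv_erb_rate(
        np.linspace(erb_rate(f_min), erb_rate(sr / 2.0), n_bands)
    )
    b = 1.019 * 24.7 * (4.37e-3 * centres + 1.0)
    detuning = (freqs[None, :] - centres[:, None]) / b[:, None]
    return (1.0 + detuning ** 2) ** (-order)

def gtcc(frame, sr, n_bands=32, n_coeffs=13):
    # 1) Power spectrum of the Hamming-windowed frame.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # 2) Energy in each Gammatone band.
    energies = gammatone_weights(len(frame), sr, n_bands) @ power
    # 3) Log compression (small offset avoids log(0)).
    log_energies = np.log(energies + 1e-12)
    # 4) Unnormalised DCT-II to decorrelate the log band energies.
    m = np.arange(n_bands)
    k = np.arange(n_coeffs)
    basis = np.cos(np.pi * np.outer(k, m + 0.5) / n_bands)
    return basis @ log_energies
```

Calling `gtcc(frame, sr)` on a single frame (for example, 512 samples at 16 kHz) returns a vector of `n_coeffs` coefficients. Replacing `gammatone_weights` with a triangular Mel filter bank in the same pipeline would give an MFCC-style feature, which is the comparison the abstract refers to.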
