Geometry-aware sound source localization using neural networks
Sound Source Localization (SSL) is the field within acoustic signal processing that studies methods for estimating the position of one or more active sound sources in space, such as human talkers, using signals captured by one or more microphone arrays. It has many applications, including robot orientation, speech enhancement and speaker diarization. Although signal processing-based algorithms have been the standard choice for SSL over the past decades, deep neural networks have recently achieved state-of-the-art performance for this task. A drawback of most deep learning-based SSL methods is that they require the microphone and room geometry to be matched between training and testing, restricting the practical applications of available models. This is particularly relevant when using Distributed Microphone Arrays (DMAs), whose positions are usually set arbitrarily and may change over time. Flexibility with respect to microphone geometry is also desirable for companies maintaining multiple types of microphone arrays in their product lines, and for smaller companies or practitioners who wish to apply freely available pre-trained, off-the-shelf SSL models to their own applications.

The main contribution of this thesis is a novel class of neural network models, named Neural-SRP, for the tasks of Positional Sound Source Localization (PSSL) and Direction-of-Arrival (DOA) estimation. The method combines concepts from graph neural networks with those of the classical Steered Response Power (SRP) localization method. Unlike current state-of-the-art networks for SSL, Neural-SRP can operate on microphone arrays and rooms of arbitrary geometry while maintaining or improving localization performance.
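For context, the classical SRP approach that Neural-SRP draws on steers a grid of candidate source positions and sums, over all microphone pairs, the generalized cross-correlation with phase transform (GCC-PHAT) evaluated at each candidate's expected time difference of arrival. The sketch below is a minimal, illustrative SRP-PHAT grid search in NumPy; it is not the thesis' implementation, and the function names, grid setup and simulated-delay demo are the author's own assumptions.

```python
import numpy as np

def gcc_phat(x1, x2, fs, n_fft=None):
    """GCC-PHAT between two microphone signals, zero lag centered at n_fft // 2."""
    n = n_fft or 2 * len(x1)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n)
    return np.concatenate((cc[-n // 2:], cc[:n // 2]))  # negative lags first

def srp_phat(signals, mic_pos, grid, fs, c=343.0):
    """Return the grid point maximizing the summed GCC-PHAT steered response."""
    n_mics, n = len(signals), 2 * len(signals[0])
    ccs = {(i, j): gcc_phat(signals[i], signals[j], fs, n)
           for i in range(n_mics) for j in range(i + 1, n_mics)}
    power = np.zeros(len(grid))
    for k, p in enumerate(grid):
        for (i, j), cc in ccs.items():
            # expected TDOA for candidate position p, converted to a lag index
            tdoa = (np.linalg.norm(p - mic_pos[i]) - np.linalg.norm(p - mic_pos[j])) / c
            power[k] += cc[int(round(tdoa * fs)) + n // 2]
    return grid[np.argmax(power)]

# Toy demo: white noise delayed to four microphones by integer-sample propagation times.
rng = np.random.default_rng(0)
fs, c = 16000.0, 343.0
mic_pos = np.array([[0., 0., 0.], [3., 0., 0.], [0., 3., 0.], [3., 3., 0.]])
src = np.array([1.0, 2.0, 0.0])
s = rng.standard_normal(4096)
signals = [np.roll(s, int(round(np.linalg.norm(src - m) / c * fs))) for m in mic_pos]
coords = np.linspace(0.5, 2.5, 5)
grid = np.array([[x, y, 0.0] for x in coords for y in coords])
est = srp_phat(signals, mic_pos, grid, fs, c)
```

A neural variant replaces or augments parts of this pipeline (e.g. the pairwise correlation features and their aggregation over the grid) with learned components, which is what makes geometry-agnostic operation possible.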
