Advances in Audio Decorrelation and Rendering of Spatially Extended Sound Sources
The aim of spatial audio technologies, as used, e.g., in virtual and augmented reality applications, is to provide the user with an immersive and plausible listening experience. The overall goal is to render the presented three-dimensional sound scenes realistically in a perceptual sense, either over headphones or using a multi-channel loudspeaker setup. Besides good sound quality, it is essential to consider relevant spatial attributes of the presented sound scenes. One important aspect is the localization of individual sound sources. In addition, other perceptual aspects need to be considered, including the perceived spatial extent (i.e., "size") of a sound source and the perceptual impression of the surrounding environment. From a perceptual point of view, the degree of correlation between the sounds received at the two ears is an important factor influencing both the perceived spatial extent of a sound source and the impression of the surrounding environment. A low correlation is typically associated with an increased size of the auditory event and an enhanced sense of envelopment. This perceptual relevance makes audio decorrelation an important tool in spatial audio rendering for controlling the spatial perception of the sound image.

This thesis investigates the suitability of neural networks for the task of audio decorrelation. In addition, methods for binaural rendering of spatially extended sound sources (SESSs) are developed that employ audio decorrelation techniques.

The first part of this thesis deals with neural network-based approaches to audio decorrelation. Since neural networks have not previously been applied to the task of audio decorrelation, we first provide a proof of concept. To this end, we propose a convolutional neural network (CNN) architecture that is trained to mimic the behavior of a state-of-the-art reference decorrelator.
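The decorrelation task described above can be illustrated with a minimal sketch. The following toy decorrelator (not the proposed CNN or the reference method; function names are illustrative) randomizes the phase spectrum of the input while keeping its magnitude spectrum, so the output sounds similar to the input but their zero-lag correlation is close to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

def decorrelate_random_phase(x):
    """Toy decorrelator: keep the magnitude spectrum of x but
    randomize the phase, yielding a perceptually similar signal
    that is largely uncorrelated with the input."""
    X = np.fft.rfft(x)
    phase = rng.uniform(-np.pi, np.pi, size=X.shape)
    phase[0] = 0.0          # keep the DC bin real-valued
    if len(x) % 2 == 0:
        phase[-1] = 0.0     # keep the Nyquist bin real-valued
    return np.fft.irfft(np.abs(X) * np.exp(1j * phase), n=len(x))

def correlation(x, y):
    """Normalized cross-correlation coefficient at zero lag."""
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

x = rng.standard_normal(16384)
y = decorrelate_random_phase(x)

print(correlation(x, y))  # typically close to 0
# The magnitude spectra of input and output are identical,
# so the timbre is preserved.
print(np.allclose(np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(y))))
```

A practical decorrelator must additionally preserve transients and temporal envelopes, which this global phase randomization does not address; it only demonstrates the basic correlation/timbre trade-off.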
By means of a formal listening test, we show that the output of the proposed method is perceptually similar to that of the reference decorrelator. As a next step, a reference-free method based on generative adversarial networks (GANs) is developed. The same CNN architecture is employed for the generator network. The training objective is defined directly with respect to the input signal and consists of several individual loss terms that control both the input-output correlation and the output signal quality. The proposed reference-free approach allows the training procedure to be tailored specifically to the desired output signal properties. Finally, the proposed GAN-based audio decorrelation method is extended to provide a multi-channel output signal, as required in the context of multi-channel spatial audio rendering. A separate generator network is employed for each output channel. All generator networks are optimized jointly to obtain output channels that are mutually uncorrelated and exhibit both a low correlation and a high perceptual similarity to the input signal.

The second part of this thesis introduces methods for binaural rendering of SESSs that are based on audio decorrelation techniques. SESSs can be characterized by their radiation behavior: while homogeneous SESSs emit sound with constant radiation characteristics over their extent, heterogeneous SESSs exhibit a position-dependent radiation behavior. First, a method for efficient rendering of homogeneous SESSs is introduced. By modeling the homogeneous SESS as an incoherently extended sound source with position-independent energy and spectral content, a number of target auditory cues are determined. A binaural output signal with the desired properties is then synthesized by mixing two decorrelated input signals, which can be generated from a single-channel input signal using a single decorrelation filter.
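The principle behind mixing two decorrelated signals into a pair with a prescribed inter-channel correlation can be sketched as follows. This is a simplified illustration of the general technique, not the proposed rendering method: mixing two equal-power, mutually uncorrelated signals with gains cos(t) and ±sin(t) yields a two-channel signal whose correlation is cos²(t) − sin²(t) = cos(2t), so a target correlation ρ is obtained with t = arccos(ρ)/2:

```python
import numpy as np

rng = np.random.default_rng(1)

def mix_to_target_correlation(s1, s2, rho):
    """Mix two mutually uncorrelated, equal-power signals into a
    two-channel signal with inter-channel correlation rho.
    With L = cos(t)*s1 + sin(t)*s2 and R = cos(t)*s1 - sin(t)*s2,
    the correlation is cos(t)^2 - sin(t)^2 = cos(2t)."""
    t = 0.5 * np.arccos(rho)
    left = np.cos(t) * s1 + np.sin(t) * s2
    right = np.cos(t) * s1 - np.sin(t) * s2
    return left, right

def correlation(x, y):
    """Normalized cross-correlation coefficient at zero lag."""
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

# Stand-ins for a mono signal and its decorrelated copy.
s1 = rng.standard_normal(100000)
s2 = rng.standard_normal(100000)

left, right = mix_to_target_correlation(s1, s2, rho=0.4)
print(correlation(left, right))  # close to the target 0.4
```

In the actual rendering method, such mixing gains are derived per frequency band from the target auditory cues of the modeled extended source rather than from a single global correlation value.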
Compared to a direct implementation of the rendering model, the proposed approach offers reduced computational complexity and relaxes the requirements on the employed decorrelation filters. Second, a method for binaural rendering of heterogeneous SESSs is proposed. The input to the algorithm is a two-channel signal that provides information about the position-dependent radiation characteristics of the sound source. By extending the homogeneous SESS rendering model to take the position-dependent energy of the sound source into account, a heterogeneous SESS rendering model is defined. Based on this rendering model, the target covariance matrix of the binaural output signal is determined. Using an optimal mixing approach previously proposed in the literature, a binaural output signal with the desired properties is obtained while preserving the spatial characteristics encoded in the two-channel input signal. A formal listening test demonstrates that the output of the proposed method comes close to the simulated binaural reference signal in terms of spatial impression and overall audio quality. Finally, the suitability of the proposed GAN-based audio decorrelation method for the developed homogeneous SESS rendering method is investigated. To improve the overall audio quality of the decorrelated stereo signal that serves as the basis for the homogeneous SESS rendering method, an additional loss term is introduced to minimize spectral magnitude differences between the channels of the decorrelated stereo signal.
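The covariance-matching idea behind such mixing approaches can be sketched in a simplified form. The toy version below matches an empirical input covariance to a target covariance via Cholesky factors (M = L_y L_x⁻¹ gives M C_x Mᵀ = C_y); the optimal mixing approach from the literature additionally selects, among all matching mixing matrices, the one that alters the signals least, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(2)

def covariance_matching_mix(x, c_target):
    """Simplified covariance matching: find a mixing matrix M with
    M @ C_x @ M.T == c_target using Cholesky factors, then apply it.
    (A toy illustration; the optimal mixing method in the literature
    also minimizes the deviation of M from the identity.)"""
    c_x = (x @ x.T) / x.shape[1]      # empirical input covariance
    l_x = np.linalg.cholesky(c_x)
    l_y = np.linalg.cholesky(c_target)
    m = l_y @ np.linalg.inv(l_x)      # M = L_y @ inv(L_x)
    return m @ x

# Two decorrelated input channels; target: a correlated binaural pair.
x = rng.standard_normal((2, 50000))
c_target = np.array([[1.0, 0.6],
                     [0.6, 1.0]])

y = covariance_matching_mix(x, c_target)
print(np.round((y @ y.T) / y.shape[1], 3))  # matches c_target
```

In the rendering context, this matching is performed per time-frequency tile, with the target covariance derived from the SESS rendering model.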
