Optimization and evaluation of a virtual artificial head for individual dynamic spatial sound reproduction over headphones

The ability of humans to perceive sound spatially is based on binaural hearing, i.e. on the signals arriving at the two ears, which supply the listener with important spatial and spectral cues. The aim of binaural technology is to capture and reproduce the sound field in such a way that these cues are preserved. A well-known drawback of using artificial heads for this purpose is that their anthropometric dimensions differ from those of individual listeners. When playing back the recorded signals over headphones, the non-individual design of artificial heads may lead to localization ambiguities such as front-back reversals and sound being perceived inside the head. Moreover, it is hardly possible to achieve dynamic signal playback that accounts for the listener’s head movements. As an alternative, it has been proposed to use a Virtual Artificial Head (VAH), which is a microphone array where spectral weights are applied to the microphone signals, aiming at synthesizing the directivity pattern of Head Related Transfer Functions (HRTFs). By adjusting the spectral weights to the HRTFs of individual listeners, the signals recorded with a VAH can be individualized post-hoc for different listeners. In addition, the spectral weights can be adapted to account for the listener’s head movements during signal playback. The aim of this thesis is to improve the performance of a state-of-the-art VAH approach for synthesizing individual HRTF directivity patterns and to evaluate it for situations which have not been considered before. The first focus is to improve the horizontal spatial resolution of the VAH synthesis using a limited number of microphones. The second focus is to investigate the impact of the microphone array topology on the VAH performance in the horizontal plane. The third focus is to evaluate the VAH approach in dynamic auralizations for horizontal and non-horizontal sources, in both anechoic and reverberant environments.
First, we propose a new constrained optimization method to calculate the spectral weights, which makes it possible to increase the spatial resolution of the VAH synthesis in the horizontal plane using a limited number of microphones. In addition to imposing a constraint on the mean White Noise Gain (WNG) to increase robustness, we propose to impose constraints on the monaural spectral error, referred to as spectral distortion, at a high number of directions. For a simulated planar microphone array with 24 microphones, we show that the frequency range for which the synthesis accuracy can be considered acceptable can be increased from 2 kHz to 5 kHz compared to imposing only the mean WNG constraint. The VAH synthesis with the additional spectral distortion constraints is also shown to perceptually outperform the synthesis where only the mean WNG constraint is imposed.
Second, based on simulations with four different microphone array topologies, we investigate the impact of array extension and microphone distribution on the VAH performance. While smaller inter-microphone distances make it possible to satisfy the spectral distortion constraints at higher frequencies, they may cause difficulties in satisfying the mean WNG constraint at low and mid frequencies. For an array topology combining dense and sparse inter-microphone distances, we show that the mean WNG and spectral distortion constraints can be satisfied for frequencies up to 8 kHz without deteriorating the phase accuracy at low frequencies. In addition, the binaural signals generated using the mixed array topology result in the best perceptual ratings compared to the other considered topologies, which result in either more high-frequency spectral distortion or more low-frequency phase inaccuracy.
Third, we investigate the performance of the VAH approach for dynamic auralizations with speech signals in two studies, both considering sources in and outside the horizontal plane. Individual Binaural Room Impulse Responses (BRIRs) for different head orientations are synthesized for two VAHs, i.e. a planar array with 24 microphones and a three-dimensional array with 31 microphones. In the first study, we evaluate dynamic auralizations with the synthesized BRIRs for the VAH with 24 microphones in comparison to real (visible) sound source presentations. We show that close-to-reality dynamic auralizations with speech signals can be achieved in both a reverberant and an anechoic environment. In the second study, we evaluate the localization performance of virtual sources generated with both VAHs in the absence of visual cues and in comparison to real hidden sound sources. We show that even in the absence of visual cues, virtual sources generated with both VAHs can be localized with an accuracy similar to that of real sources with respect to azimuth, externalization and the occurrence of front-back reversals. Interestingly, including only horizontal directions in the calculation of the spectral weights results in a better localization performance compared to including horizontal and non-horizontal directions. Moreover, localization experiments with and without head tracking show the importance of the dynamic presentation for the localization accuracy of virtual sound sources generated with the VAHs. Although individualization is an important capability of the VAH approach, both studies show that the possibility of presenting binaural signals dynamically is the main advantage of the VAH approach over conventional artificial heads.
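To make the roles of the spectral weights, the mean WNG and the spectral distortion more concrete, the following minimal Python sketch computes weights for a single frequency and evaluates both quantities. It is not the constrained optimization method of the thesis: it substitutes a plain regularized least-squares fit, in which the regularization parameter `mu` trades synthesis accuracy against robustness. The array geometry, the free-field steering vectors, the toy target pattern (not a measured HRTF), the exact metric definitions and all function names are assumptions made for illustration only.

```python
import numpy as np

def steering_matrix(mic_xy, azimuths, freq, c=343.0):
    """Free-field plane-wave steering vectors, shape (n_mics, n_dirs).
    Assumption: ideal omnidirectional microphones, no scattering."""
    k = 2 * np.pi * freq / c
    unit = np.stack([np.cos(azimuths), np.sin(azimuths)])  # (2, n_dirs)
    return np.exp(1j * k * (mic_xy @ unit))

def regularized_ls_weights(D, target, mu):
    """Minimize mean_k |w^H d_k - target_k|^2 + mu * ||w||^2 over w."""
    n_mics, n_dirs = D.shape
    Q = D @ D.conj().T / n_dirs      # spatial correlation matrix
    a = D @ target.conj() / n_dirs   # cross-correlation with the target
    return np.linalg.solve(Q + mu * np.eye(n_mics), a)

def mean_wng_db(w, D):
    """Mean white noise gain in dB (assumed definition):
    mean synthesized power over directions divided by ||w||^2."""
    synth = w.conj() @ D
    return 10 * np.log10(np.mean(np.abs(synth) ** 2) / np.real(np.vdot(w, w)))

def spectral_distortion_db(w, D, target):
    """Per-direction magnitude error in dB between synthesis and target."""
    synth = w.conj() @ D
    return 10 * np.log10(np.abs(synth) ** 2 / np.abs(target) ** 2)

# Hypothetical setup: 24 microphones scattered over a 20 cm x 20 cm plane,
# 72 horizontal directions (5 degree steps), evaluated at 1 kHz.
rng = np.random.default_rng(0)
mic_xy = rng.uniform(-0.1, 0.1, size=(24, 2))
azimuths = np.deg2rad(np.arange(0, 360, 5))
freq = 1000.0
D = steering_matrix(mic_xy, azimuths, freq)

# Toy "left ear" target: smooth level difference plus a direction-dependent
# delay. This only stands in for an individually measured HRTF.
k = 2 * np.pi * freq / 343.0
lateral = np.cos(azimuths - np.pi / 2)
target = (1 + 0.5 * lateral) * np.exp(1j * k * 0.09 * lateral)

for mu in (1e-3, 1e-1, 1.0):
    w = regularized_ls_weights(D, target, mu)
    sd = spectral_distortion_db(w, D, target)
    print(f"mu={mu:g}: mean WNG = {mean_wng_db(w, D):.1f} dB, "
          f"max |SD| = {np.max(np.abs(sd)):.2f} dB")
```

Increasing `mu` shrinks the weight norm, which generally makes the synthesis less sensitive to microphone noise and positioning errors, at the cost of a larger fitting error. A constrained formulation such as the one described above would instead bound the mean WNG and the spectral distortion explicitly.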

File Type: pdf
File Size: 10 MB
Publication Year: 2021
Author: Mina Fallahi
Supervisors: Matthias Blau, Simon Doclo
Institution: University of Oldenburg, Germany
Keywords: virtual artificial head, head related transfer functions, binaural technology, HRTF synthesis, directivity patterns, beamformer optimization, dynamic auralization