Spatial Room Impulse Response Processing for Virtual Acoustics
Augmented reality (AR) and telepresence systems aim to enhance the real world with virtual elements that blend convincingly into the surrounding space. Creating virtual sound sources in this context requires presenting perceptually valid head-related and room-acoustic cues to the listener to enable a realistic spatial impression and a coherent match between the virtual acoustics and those of the physical environment. In practical AR systems, the acoustic characteristics of the environment must be estimated from available sensor signals, and the virtual source must be rendered through acoustically transparent headphones that preserve natural sounds from the physical environment. This thesis addresses both stages of this virtual acoustic processing chain: estimation and rendering. Central to both are spatial room impulse responses (SRIRs), which describe the linear, time-invariant, and directional properties of the acoustic transfer path between a source and a receiver in an environment.
The thesis first introduces a general microphone array signal model that separates room- and array-dependent contributions using spherical or circular harmonic representations. Building on this model, a blind SRIR estimation framework is proposed that reformulates blind multichannel system identification as an informed problem through the estimation of a pseudo-reference signal. Motivated by practical AR systems that often rely on wearable devices such as head-mounted displays or smartglasses, the thesis then specifically considers microphone arrays in motion.
The second part of the thesis focuses on the binaural rendering of estimated SRIRs for headphone reproduction. An array-aware end-to-end magnitude least-squares renderer is proposed to mitigate spatio-spectral coloration caused by limited spatial sampling and regularization. As an alternative to direct rendering, the thesis investigates the separation of direct sound and early reflections from an SRIR, a common processing step in parametric SRIR-based rendering that can facilitate virtual acoustic reproduction with increased directional sharpness. Two approaches are compared: one based on a physical array signal model and another based on subspace decomposition.
Together, these contributions advance practical SRIR estimation and rendering for virtual acoustics and provide foundations for robust, wearable, and perceptually convincing augmented and virtual reality audio systems.
