Flexible Multi-Microphone Acquisition and Processing of Spatial Sound Using Parametric Sound Field Representations

This thesis deals with the efficient and flexible acquisition and processing of spatial sound using multiple microphones. In spatial sound acquisition and processing, we use multiple microphones to capture the sound of multiple sources that are simultaneously active at a reverberant recording side, and we process the sound depending on the application at the application side. Typical applications include source extraction, immersive spatial sound reproduction, and speech enhancement.

Flexible sound acquisition and processing means that we can capture the sound with almost arbitrary microphone configurations without constraining the application at the application side. We can therefore realize and adjust the different applications independently of the microphone configuration used at the recording side. For example, in spatial sound reproduction, where we aim at reproducing the sound such that the listener perceives the same impression as if he or she were present at the recording side, the listener can freely adjust the loudspeaker setup at the far-end side independently of how the sound was recorded at the near-end side. In source extraction, where we aim at extracting sounds from specific preferred directions while attenuating interfering sounds from other directions, the user at the application side can freely adjust the preferred direction and extract the sounds with arbitrary spatial responses, which can be adjusted in real time.

Efficient sound acquisition and processing means that we need to transmit only a few audio signals, compared to the number of microphones used, from the recording side to the application side (e.g., via network or storage media) while still being able to realize the different applications with the flexibility mentioned before. This implies that the recording side carries most of the computational load, which enables low-power and battery-driven devices at the application side.
Alternatively, when the computational complexity at the recording side is heavily restricted, we can transmit the microphone signals to the application side at the expense of a higher required bandwidth or storage capacity.

To realize the efficient and flexible sound acquisition and processing, we use a parametric description of the spatial sound. We assume that for each time and frequency, the sound field at the recording location can be decomposed into a sum of a few direct sound components plus a diffuse sound component, where the direct components model the direct sound of the sources while the diffuse component models the reverberation. In contrast to state-of-the-art (SOA) approaches in parametric sound processing, we consider multiple direct components per time and frequency to reduce the model violations that strongly limit the performance of the SOA approaches. The direct sounds together with the diffuse sound and parametric side information, namely the Direction-of-Arrival (DOA) of the direct sounds, form a general and compact description of the spatial sound which can be efficiently transmitted and from which we can realize the different applications mentioned before.

The estimation of the multiple direct sounds and the diffuse sound represents one major part of this thesis. The direct sound extraction is carried out using classical single-channel or multi-channel filters. However, these filters are computed using instantaneous information on the underlying parametric sound field model, such as the instantaneous DOA or Diffuse-to-Noise Ratio (DNR). Incorporating this information allows us to obtain filters with the desired spatial response that adapt quickly to changes in the acoustic scene, which is paramount in our applications where multiple sources are active at the same time in a reverberant environment.
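
As an illustration of the signal model and of a filter driven by instantaneous parametric information, the following is a minimal sketch for a single time-frequency bin. It is not taken from the thesis: the choice of a linearly constrained minimum variance (LCMV) beamformer as the "classical multi-channel filter", the uniform linear array, the spherically isotropic diffuse field, and all names and numeric values (`steering_vector`, `diffuse_coherence`, `informed_lcmv`, 4 microphones, 3 cm spacing, 1 kHz) are assumptions made for the example. The DOAs and the diffuse and noise powers are taken as given here, whereas in practice they would have to be estimated.

```python
import numpy as np

C = 343.0  # speed of sound in m/s (assumed value)


def steering_vector(doa_rad, num_mics, spacing_m, freq_hz):
    """Plane-wave phase shifts on a uniform linear array, relative to microphone 0."""
    delays = np.arange(num_mics) * spacing_m * np.cos(doa_rad) / C
    return np.exp(-2j * np.pi * freq_hz * delays)


def diffuse_coherence(num_mics, spacing_m, freq_hz):
    """Spatial coherence matrix of a spherically isotropic diffuse field (assumed model)."""
    pos = np.arange(num_mics) * spacing_m
    dist = np.abs(pos[:, None] - pos[None, :])
    # np.sinc(x) = sin(pi x) / (pi x), so this equals sin(k r) / (k r) with k = 2 pi f / c
    return np.sinc(2.0 * freq_hz * dist / C)


def informed_lcmv(doas_rad, gains, diffuse_psd, noise_psd, num_mics, spacing_m, freq_hz):
    """LCMV weights for one time-frequency bin.

    The constraints impose the response `gains[l]` on the direct sound arriving
    from `doas_rad[l]`; the remaining degrees of freedom minimize the power of
    the diffuse sound plus noise at the filter output.
    """
    # Steering matrix A (num_mics x num_direct_sounds), built from the DOAs
    A = np.stack(
        [steering_vector(p, num_mics, spacing_m, freq_hz) for p in doas_rad], axis=1
    )
    # PSD matrix of the undesired signal: diffuse sound plus uncorrelated noise
    phi_u = diffuse_psd * diffuse_coherence(num_mics, spacing_m, freq_hz)
    phi_u = phi_u + noise_psd * np.eye(num_mics)
    # w = Phi_u^{-1} A (A^H Phi_u^{-1} A)^{-1} g
    phi_inv_A = np.linalg.solve(phi_u, A)
    g = np.asarray(gains, dtype=complex)
    w = phi_inv_A @ np.linalg.solve(A.conj().T @ phi_inv_A, g)
    return w, A


# Two direct sounds in one bin: keep the first (gain 1), cancel the second (gain 0)
doas = np.deg2rad([60.0, 120.0])
w, A = informed_lcmv(doas, gains=[1.0, 0.0], diffuse_psd=1.0, noise_psd=0.1,
                     num_mics=4, spacing_m=0.03, freq_hz=1000.0)

s = np.array([1.0 + 0.0j, 0.5j])  # direct sound amplitudes in this bin
x = A @ s                         # microphone signals without diffuse sound or noise
y = np.vdot(w, x)                 # filter output w^H x, approximately s[0]
```

Because the weights are recomputed per time and frequency from the current DOAs and powers, changing `gains` or the DOAs changes the spatial response immediately, which is the kind of fast adaptation the text refers to.
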
The diffuse sound extraction in the presence of multiple direct sounds has received little attention in the literature, and only a few single-channel filters are available. Therefore, we develop different optimal single-channel and multi-channel filters which allow us to accurately extract the diffuse sound while reducing the direct sounds and noise. These filters allow us to realize applications where an immersive and natural sound reproduction is highly desired.

Computing the different filters requires estimating specific parameters of the underlying sound field model. These parameters include the number of sources and their DOAs, the direct and diffuse Power Spectral Densities (PSDs), or the DNR and Signal-to-Diffuse Ratio (SDR). The estimation of these parameters represents a second major part of the thesis. The proposed estimators can be efficiently implemented in our parametric framework and provide a higher accuracy than related SOA approaches.

The last part of the thesis deals with the applications that can be realized with the parametric representation of the spatial sound. We discuss the application to source extraction, immersive spatial sound reproduction, and acoustical zooming. This part of the thesis also contains an extensive evaluation of the different estimators and filters based on simulations and measured data, including listening tests. The experimental results show that with the proposed estimators and filters we can outperform SOA approaches while retaining a similar efficiency and flexibility. This enables a wide variety of applications on upcoming devices such as modern mobile phones, tablets, or television screens, which nowadays are equipped with multiple microphones and connected via network.

File Type: pdf
File Size: 6 MB
Publication Year: 2015
Author: Thiergart, Oliver
Supervisors: Emanuël Habets
Institution: Friedrich-Alexander-Universität Erlangen-Nürnberg
Keywords: acoustic signal processing, acquisition, reproduction, speech enhancement, source separation, dereverberation, noise reduction