DNN-Based Own Voice Reconstruction for Hearables with an In-Ear Microphone
In recent years, hearable technology has advanced rapidly, leading to widespread daily use in challenging acoustic environments. As the popularity of hearables has grown, so has the demand for high-quality speech communication. Although hearables can capture the user’s own voice with outer microphones, recordings made in noisy conditions typically require processing to enhance speech quality, which can be challenging at high noise levels. Many modern hearables also include an in-ear microphone, which is more robust to environmental noise than the outer microphones because the device partially occludes the ear canal. However, in-ear own voice recordings exhibit characteristic distortions, such as low-frequency amplification and band limitation, which vary strongly across individuals, change during speech production, and depend on device properties. These effects need to be taken into account when using an in-ear microphone for own voice capture.
The main objective of this thesis is to develop and evaluate causal deep neural network (DNN)-based own voice reconstruction (OVR) approaches that estimate clean broadband speech from noisy outer and in-ear microphone signals. Achieving this objective requires addressing several key challenges: understanding the unique distortions affecting in-ear own voice recordings, reducing the training data requirements of DNN-based OVR systems, meeting realistic computational complexity constraints, identifying suitable objective metrics for OVR performance that correlate well with subjective quality ratings, and investigating the benefits of personalizing OVR systems to individual talkers.
As a first contribution, we propose a phoneme-dependent model of the time-varying relationship between own voice signals recorded by an outer and an in-ear microphone, which we refer to as own voice transfer characteristics. Specifically, the model represents the own voice transfer characteristics as a set of linear time-invariant relative transfer functions, one for each phoneme. Experimental results on recorded own voice signals from 18 talkers demonstrate that the proposed (time-varying) phoneme-dependent model predicts in-ear own voice signals up to 50% more accurately than time-invariant models. While individual models yield lower prediction errors for matched talkers than talker-averaged models, talker-averaged models generalize better to unseen talkers.
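To make the idea of one relative transfer function (RTF) per phoneme concrete, the following is a minimal sketch, not the estimator used in the thesis. It assumes that the outer and in-ear signals are available as STFT matrices, that a phoneme label per frame is given (for example from a forced aligner), and that each RTF is fitted by a per-bin least-squares estimate. The function name `estimate_phoneme_rtfs` and all parameter names are hypothetical.

```python
import numpy as np


def estimate_phoneme_rtfs(outer_stft, inear_stft, frame_phonemes, eps=1e-12):
    """Fit one RTF per phoneme and frequency bin (illustrative assumption).

    outer_stft, inear_stft: complex arrays of shape (frames, bins).
    frame_phonemes: one phoneme label per frame (length = frames).
    Returns a dict mapping phoneme label -> complex RTF of shape (bins,).
    """
    labels = np.asarray(frame_phonemes)
    rtfs = {}
    for phoneme in np.unique(labels):
        mask = labels == phoneme
        x = outer_stft[mask]   # outer-microphone frames of this phoneme
        y = inear_stft[mask]   # matching in-ear frames
        # Least-squares solution of y ~ h * x for each frequency bin.
        cross = np.sum(np.conj(x) * y, axis=0)
        power = np.sum(np.abs(x) ** 2, axis=0)
        rtfs[str(phoneme)] = cross / (power + eps)
    return rtfs
```

Under these assumptions, a time-invariant baseline corresponds to assigning the same label to every frame, which yields a single RTF for the whole recording.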
As a second contribution, we propose data augmentation techniques for training multi-channel DNN-based OVR systems that jointly process the outer and in-ear microphone signals.
The proposed augmentation technique, based on the phoneme-dependent own voice transfer characteristics model, enables the simulation of a large amount of in-ear own voice signals from a clean speech dataset, while requiring only a small amount of recorded own voice signals to identify the transfer characteristics model.
Experimental results for signal-to-noise ratios between -10 dB and 10 dB at the outer microphone show that an OVR system trained with phoneme-dependent individual augmentation followed by fine-tuning with recorded signals achieves the best performance, with an average PESQ improvement of 1.3 compared to the noisy outer microphone signal. This performance gain is maintained even when only a few minutes of recorded own voice signals per talker are available to identify the transfer characteristics model. In addition, to meet realistic computational constraints, we investigate low-complexity variants of the proposed DNN-based OVR system (down to 13k parameters), and show that these variants outperform baseline OVR systems at comparable complexity.
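The augmentation step described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the thesis implementation: it assumes clean speech is given as a complex STFT, that a phoneme label per frame is available, and that applying the phoneme's RTF as a frame-wise multiplication in the STFT domain is accurate enough. Transitions between phonemes are not smoothed here. The function name `simulate_inear_stft` and the `fallback` parameter are hypothetical.

```python
import numpy as np


def simulate_inear_stft(clean_stft, frame_phonemes, rtfs, fallback=None):
    """Simulate an in-ear own voice STFT from clean speech (sketch).

    clean_stft: complex array of shape (frames, bins).
    frame_phonemes: one phoneme label per frame.
    rtfs: dict mapping phoneme label -> complex RTF of shape (bins,).
    fallback: RTF used for labels missing from `rtfs` (hypothetical option).
    """
    clean = np.asarray(clean_stft, dtype=complex)
    simulated = np.empty_like(clean)
    for t, phoneme in enumerate(frame_phonemes):
        rtf = rtfs.get(phoneme, fallback)
        if rtf is None:
            raise KeyError(f"no RTF available for phoneme {phoneme!r}")
        # Frame-wise multiplication: each frame is filtered by its phoneme's RTF.
        simulated[t] = rtf * clean[t]
    return simulated
```

In such a setup, only the RTFs need to be identified from recorded own voice signals, while the clean speech can come from any phoneme-aligned speech dataset.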
As a third contribution, we investigate personalization of OVR systems to individual talkers using two approaches: training-based personalization and enrollment-based personalization. Results from a listening test show that generic (non-personalized) OVR systems substantially improve subjective quality compared to unprocessed noisy outer microphone signals, with an average improvement of 50 MUSHRA points, and that personalization provides an additional benefit of up to 5 points for some talkers. A correlation analysis between objective metrics and subjective quality ratings indicates that the intrusive ESTOI metric and the non-intrusive LEAP metric are particularly suitable for assessing OVR performance. For the proposed enrollment-based personalization, an enrollment utterance of the talker recorded with the in-ear microphone is required. Experiments on the Vibravox dataset show that enrollment-based personalization is very effective in scenarios with competing talkers, achieving up to 10 dB SI-SDR improvement over unprocessed signals, and remains robust under dataset mismatch.
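For readers unfamiliar with the SI-SDR improvement quoted above, the following sketch shows how such a figure is commonly computed from time-domain signals. It uses the widely used scale-invariant SDR definition, with mean removal omitted for brevity. Whether the thesis evaluation uses exactly this variant is an assumption, and the function names are hypothetical.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (common definition)."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))


def si_sdr_improvement(processed, unprocessed, reference):
    """SI-SDR gain of the processed signal over the unprocessed signal, in dB."""
    return si_sdr(processed, reference) - si_sdr(unprocessed, reference)
```

Here `reference` would be the clean own voice signal, `unprocessed` the noisy microphone signal, and `processed` the OVR output.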
In summary, this thesis demonstrates that an OVR system combining an outer and an in-ear microphone can be trained with a small amount of recorded own voice signals by using the proposed phoneme-dependent own voice transfer characteristics models, enabling high-quality OVR for hearables in noisy environments. This is verified by objective metrics and the results of a subjective listening test.
