Deep Neural Network-based Approaches for Single-channel Speaker-conditioned Target Speaker Extraction
In everyday communication scenarios, such as meetings and social gatherings, undesired interfering speakers and background noise often degrade the quality and intelligibility of the target speaker's speech. Various approaches have been developed to address this issue, such as blind source separation and speaker-conditioned target speaker extraction (SC-TSE). SC-TSE algorithms aim to extract the desired speaker from the mixture by utilizing auxiliary information about the target speaker, such as reference speech, visual information, directional information, or speaker activity. A typical SC-TSE system consists of a speaker embedder network and a speaker separator network. The speaker embedder network generates target speaker-specific discriminative features from the auxiliary information, which guide the speaker separator network in extracting the target speaker from the mixture. The aim of this thesis is to develop and evaluate novel deep neural network (DNN)-based architectures that enhance the reliability, efficiency, and robustness of single-channel SC-TSE algorithms using reference speech as auxiliary information.
First, we propose three novel variants of long short-term memory (LSTM) cells for target speaker extraction in the time-frequency domain. These customized LSTM cells are designed specifically for the SC-TSE task by optimizing how target speaker information is retained and updated within the cell. The first variant customizes only the forget gate, enabling selective retention of target speaker information while disregarding information from other sources in the mixture. The second variant extends the first by customizing both the input and forget gates, enhancing the update mechanism of the cell state to reinforce the retention of target speaker-specific features. The third variant introduces an additional auxiliary-modulation gate within the LSTM cell, designed to dynamically learn both long-term and short-term discrimination of speaker-specific features. Experimental results on various mixture types show that all proposed LSTM cell variants outperform standard LSTM cells in both unidirectional and bidirectional modes. The best performance is obtained with the auxiliary-gated LSTM cells, which yield scale-invariant signal-to-distortion ratio (SI-SDR) improvements of up to 1.14 dB (unidirectional mode) and 1.09 dB (bidirectional mode) over standard LSTM cells.
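For readers unfamiliar with the metric, the following is a minimal sketch of how SI-SDR and an SI-SDR improvement are commonly computed. It assumes the usual definition, in which the estimate is projected onto the clean reference so that rescaling the estimate does not change the score. The function names `si_sdr` and `si_sdr_improvement` and the `eps` stabilizer are illustrative choices, not taken from the thesis.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Optimal scaling factor: projection of the estimate onto the reference.
    ref_energy = sum(r * r for r in ref)
    alpha = sum(e * r for e, r in zip(est, ref)) / (ref_energy + eps)
    target = [alpha * r for r in ref]
    # Everything in the estimate that is not the scaled reference is distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(d * d for d in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """Gain in SI-SDR (dB) of the processed estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy example: the residual is orthogonal to the reference in both signals.
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [2.0, 0.0, 0.0, -2.0]     # reference + strong interference
estimate = [1.1, -0.9, 0.9, -1.1]   # reference + weak residual
print(round(si_sdr_improvement(estimate, mixture, reference), 2))  # approximately 20.0
```

In this toy case the mixture scores about 0 dB and the estimate about 20 dB, so the improvement is about 20 dB. Multiplying `estimate` by any nonzero constant leaves the score unchanged, which is the property that makes the metric scale-invariant.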
Second, we propose two conformer-based architectures for target speaker extraction in the time domain. The first proposed architecture, Conformer-FFN, uses stacks of conformer and external feed-forward blocks, aiming to exploit both local and global context features using the conformer blocks while reducing the overall number of parameters using the external feed-forward blocks. The second proposed architecture, TCN-Conformer, uses stacks of temporal convolutional network (TCN) and conformer blocks, aiming to first extract local context features using the TCN blocks and then exploit both local and global context features using the conformer blocks. Experimental results on various mixture types show that the proposed TCN-Conformer system outperforms both the TCN-based baseline system and the proposed Conformer-FFN system. The best performance is obtained with four stacks of TCN and conformer blocks, which yield SI-SDR improvements of up to 2.64 dB over the TCN-based baseline and up to 3.44 dB over the Conformer-FFN system. To make the proposed TCN-Conformer system more suitable for real-time target speaker extraction, we replace the traditional multi-head self-attention (MHSA) in each conformer block of the speaker separator network with linear MHSA. Experimental results show that the TCN-Conformer system using linear MHSA outperforms the TCN-Conformer system using traditional MHSA, while significantly reducing the computational cost and the real-time factor. In addition, we show that multi-condition training increases the robustness against background noise, reverberation, and intrinsic variability (emotions) in the reference speech of the target speaker.
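To illustrate why a linear form of self-attention reduces the computational cost, the sketch below implements one common kernelized formulation for a single head in the non-causal case. This abstract does not specify which linear MHSA variant is used, so the feature map `elu(x) + 1` and the function names `feature_map` and `linear_attention` are assumptions made purely for illustration. The key point is that the key-value summary is computed once and reused for every query, so the cost grows linearly rather than quadratically with the sequence length.

```python
import math


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values):
    """Single-head, non-causal kernelized attention.

    queries, keys: lists of T vectors of size d.
    values: list of T vectors of size d_v.
    Returns a list of T output vectors of size d_v.
    """
    d = len(keys[0])
    d_v = len(values[0])
    phi_keys = [[feature_map(x) for x in k] for k in keys]

    # Summaries over the whole sequence, computed once:
    #   kv[i][j] = sum_t phi(k_t)[i] * v_t[j]
    #   k_sum[i] = sum_t phi(k_t)[i]
    kv = [[0.0] * d_v for _ in range(d)]
    k_sum = [0.0] * d
    for phi_k, v in zip(phi_keys, values):
        for i in range(d):
            k_sum[i] += phi_k[i]
            for j in range(d_v):
                kv[i][j] += phi_k[i] * v[j]

    # Each query only touches the fixed-size summaries, not all T keys.
    outputs = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        norm = sum(phi_q[i] * k_sum[i] for i in range(d))
        outputs.append(
            [sum(phi_q[i] * kv[i][j] for i in range(d)) / norm for j in range(d_v)]
        )
    return outputs
```

Because the feature map is strictly positive, each output is a normalized, positively weighted average of the value vectors, as in softmax attention, but no T-by-T attention matrix is ever formed.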
Third, we subjectively evaluate the performance of two SC-TSE algorithms in listening tests with normal-hearing (NH) and hearing-impaired (HI) listeners. The first algorithm (Algo-1) performs target speaker extraction using a real-valued mask in the time-frequency domain, whereas the second algorithm (Algo-2) performs target speaker extraction in the time domain using the proposed TCN-Conformer architecture. Both algorithms are evaluated in challenging acoustic scenarios with up to two interfering speakers using three subjective evaluation methods: paired comparison, speech recognition thresholds (SRTs), and categorically scaled perceived listening effort. The results with fifteen NH and fifteen HI listeners show that Algo-2 significantly reduces listening effort, improves speech intelligibility, and is preferred over both the unprocessed mixtures and Algo-1. Moreover, HI listeners benefit more than NH listeners. For example, listening effort is reduced by 7-8 effort scale categorical units (ESCU) for HI listeners, compared to 4-5 ESCU for NH listeners. For HI listeners with symmetric mild-to-moderate hearing loss, the results also suggest that hearing loss compensation is not necessary to obtain an algorithm benefit.
