Some Contributions to Machine Learning-based System Identification and Speech Enhancement for Nonlinear Acoustic Echo Control
Given the widespread use of miniaturized audio interfaces, echo control systems face growing challenges in handling the wide variety of acoustic conditions that such interfaces encounter. This motivates the use of sophisticated machine learning-based techniques to overcome the limitations of conventional methods. The contributions of this thesis can be outlined by decomposing the task of nonlinear acoustic echo control into two subtasks: Nonlinear Acoustic Echo Cancellation (NAEC) and Acoustic Echo Suppression (AES). In particular, by formulating the single-channel NAEC model-adaptation task as a Bayesian recursive filtering problem, an evolutionary resampling strategy for particle filtering is proposed. The resulting Elitist Resampling Particle Filter (ERPF) is shown experimentally to be an efficient and high-performing approach that can be extended to address challenging conditions such as non-stationary interferers. The fundamental problem of nonlinear model design is addressed by proposing a novel Artificial Neural Network (ANN)-based approach, denoted the Adaptive Filtering-Inspired (AFI) ANN, that learns the optimal nonlinear basis functions to approximate the underlying nonlinear system. Using transfer learning, the learned basis functions are incorporated into conventional nonlinear models. The AFI ANNs are shown to yield consistently better echo cancellation performance than their conventional alternatives for both synthetic and real-world recordings. Extending the ERPF to multichannel nonlinear models enables the adaptation of Nonlinear-in-the-Parameters (NIP) Multiple-Input/Multiple-Output (MIMO) echo path models. This extension is realized using a cooperative strategy that exploits the redundancy in the multichannel system identification problem and enables a geometrically informed approach that can utilize the microphone and loudspeaker array geometries.
The resulting Cooperative Multichannel Elitist Resampling Particle Filter (CM-ERPF) is evaluated on both synthetic and real-world nonlinearities, where it exhibits better performance at a lower computational complexity than conventional methods. Finally, for AES, a complex-valued Deep Neural Network (DNN) architecture (denoted the CPF) is proposed to estimate a complex-valued mask that extracts the desired near-end speech signal. By utilizing complex-valued neural modules, the network can process and exploit complex-valued patterns and features such as complex-valued spectrograms. This results in speech signal estimates with minimal distortion and better overall quality compared with conventional counterparts. The CPF performance is confirmed for both real-world and synthetic signals that include, e.g., nonlinear distortions and near-end interfering noise sources.
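
To illustrate what a complex-valued mask offers over a real-valued one, the following minimal sketch applies both kinds of mask to a few toy time-frequency bins. It is not the CPF: in the thesis the mask is estimated by a DNN, whereas here an oracle mask computed from the known near-end signal stands in for the network output. All function names and the toy bin values are assumptions made purely for illustration.

```python
import cmath


def apply_mask(mask, mixture):
    """Multiply each time-frequency bin of the mixture by its mask value."""
    return [m * y for m, y in zip(mask, mixture)]


def oracle_complex_mask(target, mixture, eps=1e-12):
    """Oracle complex mask S/Y (stand-in for a network estimate)."""
    return [s / y if abs(y) > eps else 0j for s, y in zip(target, mixture)]


def oracle_magnitude_mask(target, mixture, eps=1e-12):
    """Oracle real-valued mask |S|/|Y|; it cannot alter the mixture phase."""
    return [abs(s) / abs(y) if abs(y) > eps else 0.0
            for s, y in zip(target, mixture)]


# Toy complex spectrogram bins (hypothetical values): near-end speech and echo.
near_end = [cmath.rect(1.0, 0.3), cmath.rect(0.5, -1.2), cmath.rect(0.8, 2.0)]
echo = [cmath.rect(0.7, 1.9), cmath.rect(0.9, 0.4), cmath.rect(0.2, -0.5)]
mixture = [s + e for s, e in zip(near_end, echo)]  # microphone signal

complex_est = apply_mask(oracle_complex_mask(near_end, mixture), mixture)
magnitude_est = apply_mask(oracle_magnitude_mask(near_end, mixture), mixture)

# Largest per-bin deviation from the true near-end signal for each mask type.
complex_err = max(abs(a - b) for a, b in zip(complex_est, near_end))
magnitude_err = max(abs(a - b) for a, b in zip(magnitude_est, near_end))

print(f"complex mask error:   {complex_err:.2e}")
print(f"magnitude mask error: {magnitude_err:.2e}")
```

In this toy setting the complex mask recovers the near-end bins up to rounding error, while the magnitude mask matches only their magnitudes and keeps the phase of the mixture, which leaves a residual error.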
