Perceptually Motivated Speech Enhancement

Speech Enhancement (SE) is a vital technology for online human communication. Applications of Deep Neural Network (DNN) technologies in concert with traditional signal processing approaches to the task have revolutionised both the research and implementation of SE in recent years. However, the training objectives of these Neural Network Speech Enhancement (NNSE) systems generally do not consider the psychoacoustic processing which occurs in the human auditory system. As a result, enhanced audio can often contain auditory artefacts which degrade the perceptual quality or intelligibility of the speech. To overcome this, systems which directly incorporate psychoacoustically motivated measures into the training objectives of NNSE systems have been proposed. A key development in speech audio processing in recent years is the emergence of Self-Supervised Speech Representation (SSSR) models. These are powerful foundational DNN models which can be utilised for a number of more specific speech processing tasks, such as speech recognition and emotion detection as well as SE. Finally, the methods of evaluation of SE systems have been revolutionised by DNN technology, namely through the creation of systems which are able to directly predict Mean Opinion Score (MOS) ratings of Speech Quality (SQ) or Speech Intelligibility (SI) derived from human listening tests. This thesis aims to investigate these three areas: psychoacoustic training objectives of NNSE, the incorporation of SSSR features and the prediction of human-derived labels of speech directly from audio signals. Further, the intersection of these areas and the combined use of techniques from them will be investigated. A widely adopted approach for psychoacoustically motivated NNSE training is the MetricGAN framework. Here, an NNSE network is trained as a generator adversarially, that is, in competition with a metric prediction discriminator. 
The discriminator is tasked with predicting the score assigned to the input audio by a metric function, which is typically non-differentiable and thus cannot be used directly as a loss function, while the generator uses the discriminator's inference to obtain a loss value for its outputs. While MetricGAN has proved effective and is becoming a widely adopted technique, there is scope to improve it in several areas. Several of the contributions of this thesis relate to these improvements, including the introduction of an additional DNN tasked with improving the range of inputs to the metric prediction discriminator, changes to the Neural Network (NN) structure of both components and the prediction of non-intrusive measures, among others. A key finding of this work is that perceptually motivated NNSE systems tend to overfit towards the target perceptual metric, resulting in degraded "real world" enhancement performance. The concept of metric prediction is further developed into systems proposed for the related task of DNN-based human MOS prediction. This can be done intrusively, meaning that the system has access to a non-distorted version of the signal under test as a reference, or non-intrusively, meaning that only the signal under test is available. Here, human labels of SQ or SI are directly predicted from the audio signal stimulus. SI prediction is mainly investigated in the domain of hearing aid SE system evaluation in this work. State-of-the-art performance is achieved by SQ prediction systems developed and presented in this work. Two novel applications of SSSR are presented. Firstly, they are used as feature space representations in the loss functions of NNSE systems. It is found that using the outputs of earlier intermediate DNN layers is particularly effective in this application, and a strong correlation between the SSSR distance measure and psychoacoustic metrics and MOS labels is shown. 
Secondly, SSSR models are proposed for use as feature extractors for the discriminator DNN components of the MetricGAN framework, as well as for MOS estimators.
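
The MetricGAN training signal described in the abstract can be illustrated with a minimal sketch. It assumes the squared-error objectives commonly associated with MetricGAN and a metric normalised to the range [0, 1]; neither detail is stated in the abstract. The names `toy_metric`, `discriminator_loss` and `generator_loss` are hypothetical, and `toy_metric` is only a stand-in for a real perceptual metric. In an actual system both networks would be DNNs and the generator loss would be backpropagated through the discriminator; here the functions only compute the loss values.

```python
from typing import Callable, Sequence

Signal = Sequence[float]
# A scorer takes (signal under test, clean reference) and returns a score.
Scorer = Callable[[Signal, Signal], float]


def toy_metric(enhanced: Signal, clean: Signal) -> float:
    """Hypothetical stand-in for a non-differentiable metric, scaled to (0, 1]."""
    err = sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)
    return 1.0 / (1.0 + err)


def discriminator_loss(d: Scorer, enhanced: Signal, clean: Signal,
                       metric: Scorer) -> float:
    """Squared error between the discriminator's predictions and true scores."""
    # The clean reference compared with itself should receive the top score.
    clean_term = (d(clean, clean) - 1.0) ** 2
    # The enhanced signal should receive the score the metric assigns to it.
    enhanced_term = (d(enhanced, clean) - metric(enhanced, clean)) ** 2
    return clean_term + enhanced_term


def generator_loss(d: Scorer, enhanced: Signal, clean: Signal,
                   target: float = 1.0) -> float:
    """The generator is rewarded when the discriminator predicts the target score."""
    return (d(enhanced, clean) - target) ** 2


if __name__ == "__main__":
    clean = [1.0, 1.0]
    enhanced = [0.0, 0.0]  # a deliberately poor "enhancement"
    # Use the metric itself as an idealised, perfectly trained discriminator.
    print(discriminator_loss(toy_metric, enhanced, clean, toy_metric))
    print(generator_loss(toy_metric, enhanced, clean))
```

Because the generator only ever sees the discriminator's estimate of the metric, it can learn to exploit that estimate, which is consistent with the overfitting towards the target metric reported above.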

File Type: pdf
File Size: 28 MB
Publication Year: 2025
Author: Close, George
Supervisors: Stefan Goetze, Thomas Hain
Institution: University of Sheffield
Keywords: speech enhancement, neural networks, artificial intelligence, speech quality, speech intelligibility