Representation and Metric Learning Advances for Deep Neural Network Face and Speaker Biometric Systems

The increasing use of technological devices and biometric recognition systems in people's daily lives has motivated a great deal of research interest in the development of effective and robust systems. However, several challenges remain when Deep Neural Networks (DNNs) are employed in these systems, and this thesis proposes different approaches to address them. First, we analyzed the effect of introducing the most widespread DNN architectures into systems for face and text-dependent speaker verification tasks. In this analysis, we observed that state-of-the-art DNNs established for many tasks, including face verification, did not perform well for text-dependent speaker verification. We therefore conducted a study to find the cause of this poor performance and found that, under certain circumstances, it stems from the use of a global average layer as the pooling mechanism in DNN architectures. Since the order of the phonetic information is relevant in the text-dependent speaker verification task, global average pooling discards this order and the verification error metrics degrade. Hence, the first approach proposed in this thesis is an alignment mechanism that replaces global average pooling. This mechanism preserves the temporal structure and encodes the utterance and speaker identity in a supervector; different types of alignment approaches, such as Hidden Markov Models (HMMs) or Gaussian Mixture Models (GMMs), can be used. Moreover, during the development of this mechanism, we also noted that the lack of large training databases is another important obstacle to creating these systems. Therefore, we have also introduced a new architecture philosophy based on the Knowledge Distillation (KD) approach.
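To illustrate how a GMM-based alignment can encode an utterance into a supervector, the following is a minimal NumPy sketch; the function name, the diagonal-covariance simplification, and the use of posterior-weighted component means are our assumptions for illustration, not the thesis's actual implementation:

```python
import numpy as np

def gmm_supervector(features, means, covs, weights):
    """Align frame-level features to GMM components and stack the
    posterior-weighted per-component means into one supervector.

    features: (n_frames, dim), means/covs: (n_comp, dim) with diagonal
    covariances, weights: (n_comp,) mixture weights.
    """
    n_frames, dim = features.shape
    n_comp = means.shape[0]

    # Per-component diagonal-Gaussian log-likelihood of each frame.
    log_lik = np.empty((n_frames, n_comp))
    for c in range(n_comp):
        diff = features - means[c]
        log_lik[:, c] = (np.log(weights[c])
                         - 0.5 * np.sum(np.log(2.0 * np.pi * covs[c]))
                         - 0.5 * np.sum(diff ** 2 / covs[c], axis=1))

    # Responsibilities (posteriors) via a stable softmax over components.
    log_lik -= log_lik.max(axis=1, keepdims=True)
    post = np.exp(log_lik)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth- and first-order statistics per component.
    n_c = post.sum(axis=0)                       # (n_comp,)
    f_c = post.T @ features                      # (n_comp, dim)
    comp_means = f_c / np.maximum(n_c, 1e-8)[:, None]

    # Concatenate the aligned component means into a supervector.
    return comp_means.reshape(-1)                # (n_comp * dim,)
```

Unlike global average pooling, which collapses all frames into a single mean, this alignment keeps one slot per component, so the phonetic structure of the utterance is reflected in the resulting vector.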
This architecture, known as the teacher-student architecture, provides robustness during the training process and against the overfitting that can arise from the lack of data. In this part, another alternative approach, based on Multi-head Self-Attention (MSA), is proposed to focus on the relevant frames of the sequence and maintain the phonetic information. The architecture built on MSA layers also introduces phonetic embeddings and memory layers to improve the discrimination between speakers and utterances. Moreover, to complete the architecture with the previous techniques, another approach incorporates two learnable vectors called class and distillation tokens. Using these tokens during training, temporal information is kept and encoded into the tokens, so that a global utterance descriptor similar to the supervector is obtained. Apart from the above approaches for obtaining robust representations, the other main part of this thesis has focused on introducing new loss functions to train DNN architectures. Traditional loss functions have provided reasonably good results for many tasks, but they are not usually designed to optimize the goal task. For this reason, we have proposed several new loss functions, based on the final verification metrics, as objectives for training DNN architectures. The first approach developed in this part is inspired by the Area Under the ROC Curve (AUC): we present a differentiable approximation of this metric, called the aAUC loss, to successfully train a triplet neural network as a back-end. However, this back-end requires a careful selection of the training data, which involves a high computational cost. Therefore, we have developed several approaches that retain a loss function oriented to the goal task while keeping the efficiency and speed of multi-class training.
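The core idea behind a differentiable AUC approximation is to replace the non-differentiable step comparison between every target/non-target score pair with a smooth sigmoid. The sketch below illustrates this general idea; the function names and the `delta` steepness parameter are our assumptions, not the thesis's exact formulation of the aAUC loss:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aauc_loss(target_scores, nontarget_scores, delta=10.0):
    """Differentiable AUC approximation.

    AUC counts the fraction of (target, non-target) score pairs that are
    correctly ordered, i.e. mean of 1[s_t > s_n]. Replacing the step
    function with sigmoid(delta * (s_t - s_n)) yields a smooth surrogate
    that gradient descent can optimize.
    """
    diffs = target_scores[:, None] - nontarget_scores[None, :]
    approx_auc = sigmoid(delta * diffs).mean()
    return 1.0 - approx_auc   # minimizing the loss maximizes the surrogate AUC
```

With well-separated scores the surrogate AUC approaches 1 and the loss approaches 0; the `delta` parameter controls how closely the sigmoid tracks the original step function.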
To implement these approaches, differentiable approximations of the Detection Cost Function (aDCF) and the Cost of Log-Likelihood Ratio (CLLR) verification metrics have been employed as training objectives. By optimizing DNN architectures to minimize these loss functions, the system learns to reduce the errors in the decisions and scores it produces. These approaches have also shown a better ability to learn general representations than training with other traditional loss functions. Finally, we have also proposed a new, straightforward back-end that exploits the information learned by the weight matrix of the last layer of the DNN architecture during training with the aDCF loss. Using this matrix, an enrollment model with a learnable vector is trained for each enrollment identity to perform the verification process.
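The same smoothing idea applies to the Detection Cost Function, which weights the miss and false-alarm rates at a decision threshold. The sketch below shows one way to make the DCF differentiable by approximating each error-counting indicator with a sigmoid; the parameter names, default costs, and the fixed-threshold simplification are our illustrative assumptions, not the thesis's exact aDCF formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adcf_loss(target_scores, nontarget_scores, threshold=0.0,
              c_miss=1.0, c_fa=1.0, p_target=0.01, alpha=10.0):
    """Differentiable approximation of the Detection Cost Function.

    DCF = c_miss * p_target * P_miss + c_fa * (1 - p_target) * P_fa,
    where P_miss counts target scores below the threshold and P_fa counts
    non-target scores above it. Both indicator functions are replaced by
    sigmoids of steepness alpha so the cost can be minimized by gradient.
    """
    p_miss = sigmoid(alpha * (threshold - target_scores)).mean()
    p_fa = sigmoid(alpha * (nontarget_scores - threshold)).mean()
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
```

Training against this surrogate directly penalizes the two error types that the final evaluation metric measures, rather than a proxy such as classification cross-entropy.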

File Type: pdf
File Size: 8 MB
Publication Year: 2022
Author: Mingote, Victoria
Supervisor: Antonio Miguel
Institution: University of Zaragoza
Keywords: Biometric Systems, Speaker Verification, Face Verification, Metric Learning, Representation Learning, Deep Neural Networks, Artificial Intelligence, Advanced Machine Learning, Signal Processing