Decision threshold estimation and model quality evaluation techniques for speaker verification
The number of biometric applications has grown considerably in the last few years. In this context, automatic person recognition based on physical traits such as fingerprints, face, voice or iris plays an important role. Users increasingly demand this type of application, and the technology already seems mature. People look for security, low cost and accuracy but, at the same time, many other factors related to biometric applications are growing in importance. Intrusiveness is undoubtedly a key factor when deciding which biometric to use for an application. In this respect, speaker recognition is well suited because voice is the natural way of communicating, can be used remotely and is low-cost. Automatic speaker recognition is commonly used in telephone applications, although it can also be used in physical access control or in forensics. Speaker verification and speaker identification have several stages. The first is the parameterization of the voice signal, in which the signal is processed so that it can be modeled or compared. Next comes model estimation, when training, or the decision stage, when making a comparison. This PhD thesis focuses on the training and decision stages of a speaker verification system. In this kind of system, the result of comparing an utterance with a model depends on the decision threshold: the speaker is accepted if the obtained score is above the threshold and rejected if it is below. In addition, the quality of the utterances used to train the model has a strong influence on performance, so the detection of low-quality utterances is also studied in this thesis. In real applications, it is common to have only a small amount of data to estimate the model and the decision threshold. Furthermore, impostor material is often unavailable, which is a further drawback.
The lack of data means that low-quality utterances or background noise have a great impact on performance. In this thesis, a new speaker-dependent threshold estimation method based only on client data and a method to detect outliers are introduced. Furthermore, new quality evaluation methods are proposed. One interesting way of determining the quality of the utterances consists of detecting quality on-line, during training. With this method, low-quality utterances can be automatically replaced by new ones from the same speaker within the same training session. In order to test the proposed algorithms and methods, a speaker recognition database has been recorded. It is a multi-session database in Spanish with 184 speakers. It is called BioTech and has been specially designed for speaker recognition. Finally, a case study of a real speaker verification application is presented, in which some of the techniques developed in this thesis have been used. The application consists of remote certificate revocation by voice.
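
As an illustration of the ideas summarized above, the following Python sketch shows one possible way to estimate a speaker-dependent threshold from client scores only, after discarding outlying scores. It is not the method developed in the thesis, whose exact formulation is not given in this abstract. The names prune_outliers, estimate_threshold and verify, the rule "mean minus alpha times standard deviation", and the parameters alpha and k are all assumptions made for this example.

```python
from statistics import mean, pstdev


def prune_outliers(scores, k=2.0):
    """Drop scores lying more than k standard deviations from the mean.

    Hypothetical outlier rule, chosen only for illustration.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are identical, so there is nothing to prune.
        return list(scores)
    return [s for s in scores if abs(s - mu) <= k * sigma]


def estimate_threshold(client_scores, alpha=1.0, k=2.0):
    """Estimate a speaker-dependent threshold from client scores only.

    Assumed rule: threshold = mean - alpha * std of the retained scores.
    No impostor scores are needed.
    """
    kept = prune_outliers(client_scores, k)
    return mean(kept) - alpha * pstdev(kept)


def verify(score, threshold):
    """Accept the speaker if the score is above the threshold."""
    return score > threshold


if __name__ == "__main__":
    # Made-up scores of one client's training utterances against the
    # client's own model; the last one mimics a low-quality utterance.
    training_scores = [2.0, 2.1, 1.9, 2.0, 2.05, 1.95, -3.0]
    threshold = estimate_threshold(training_scores)
    print("threshold:", round(threshold, 3))
    print("accept 2.0:", verify(2.0, threshold))
    print("accept 0.5:", verify(0.5, threshold))
```

In this sketch the outlying score is removed before the threshold is computed, so a single poor utterance does not pull the threshold down. Larger values of alpha make the system more permissive, smaller values make it stricter.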
