Forensic Evaluation of the Evidence Using Automatic Speaker Recognition Systems
This Thesis focuses on the use of automatic speaker recognition systems for forensic identification, a field known as forensic automatic speaker recognition. More generally, forensic identification aims at individualization, defined as the certainty of distinguishing an object or person from any other in a given population. This objective is pursued through the analysis of the forensic evidence, understood as the comparison between two samples of material, such as glass, blood or speech. An automatic speaker recognition system can be used to perform such a comparison between recovered speech material of questioned origin (e.g., an incriminating wiretap) and control speech material coming from a suspect (e.g., recordings acquired in police facilities). However, the evaluation of such evidence is far from trivial. In fact, the presentation of forensic evidence in a court of law is currently the subject of intense debate in many scientific and legal fora. The American Daubert rules for the admissibility of scientific evidence in trials, together with evidence of critical errors in positive identification reports from disciplines assumed to be error-free, have fostered the discussion. From this debate, DNA profiling emerges as a model of a scientifically defensible approach to forensic identification, as it meets the most stringent court admissibility requirements, which demand scientific evaluation of the evidence and testability of procedures. In this Thesis we take such requirements into account in order to adapt forensic automatic speaker recognition to what has been dubbed the coming paradigm shift in forensic identification science. We begin by reviewing related work in the literature on automatic speaker recognition and forensic evaluation of the evidence. Then, the experimental framework used in this Thesis is described in detail.
The widely accepted Speaker Recognition Evaluations (SRE) conducted by the American National Institute of Standards and Technology (NIST) are adopted as the experimental set-up for this Thesis. The databases used in these protocols are challenging corpora that present many different variability factors, simulating the typical conditions of lawful recordings in telephone networks. As a contribution of this Thesis, a hierarchical methodology for forensic automatic speaker recognition is proposed. This methodology constitutes a powerful tool for practitioners, as it allows transparent and testable forensic identification using typical score-based automatic speaker recognition systems. We then identify the main factors affecting the proposed methodology. First, the elements of the \emph{coming paradigm shift} are analyzed. Then, the common procedures accepted in forensic automatic speaker recognition are identified. Taking all these factors into account, we define the hierarchical methodology, which consists of three levels of abstraction: the discrimination level, the presentation level and the forensic level. The Dissertation then describes each of the levels that compose the proposed hierarchical methodology. First, the discrimination level is addressed. The aim at this level is to yield a discriminating score that helps determine whether the speech coming from the suspect and the questioned recording come from the same source. Since discrimination has been the aim of automatic speaker recognition over the last decades, we define the performance of the score following the literature in the field. Moreover, we review and experimentally compare several widely used techniques from the literature for improving the discriminating power of a score set, namely score normalization, session variability compensation and fusion of systems.
A novel score normalization technique, namely KL-T-Norm, is presented as a contribution. We experimentally demonstrate that KL-T-Norm improves on the discriminating power of other popular score normalization techniques such as T-Norm, while also improving on their computational efficiency. Next, the presentation level is introduced. The aim at this level is to transform the input score into a likelihood ratio ($LR$) as a measure of the weight of the evidence, interpreted as the degree of support that the evidence lends to any of the hypotheses present in the case. This methodology, popularized by DNA profiling, is probabilistic and data-driven, and it allows the weight of the evidence to be included in a logical way in the inferential process of a forensic case. A definition of the accuracy of the evidence evaluation process is then given, introducing the important concept of calibration. Then, a novel assessment methodology based on information theory is reported, in which the accuracy of the LR values is expressed as an information-theoretical magnitude, namely the empirical cross-entropy (ECE). Also at the presentation level, a comparative study of different LR computation techniques is presented. Among them, we propose a novel method of generative suspect-adapted LR computation. The study shows that the proposed technique improves the discrimination and the calibration of the input scores by exploiting the specificities of a given suspect. The proposed technique is also robust to scarcity of control speech material, a problem often found in forensic casework. The presentation level concludes with an alternative configuration of the proposed methodology that accommodates non-score-based LR computation techniques, which are common in other forensic areas and have recently been proposed for automatic speaker recognition. Finally, the last level in the hierarchy is described, namely the forensic level.
The aim at this level is to consider the demands of the court and the requirements of the coming paradigm shift in forensic science in order to properly report the weight of the evidence and its accuracy. Two experimental examples illustrate the reporting and presentation of the results of evidence evaluation by means of the proposed information-theoretical assessment methodology. One of these examples has been built using the database and systems employed by the Spanish Guardia Civil in real forensic casework. The chapter ends by demonstrating the suitability of the proposed methodology for other forensic disciplines, by means of an experimental example of LR-based evidence evaluation using glass and paint analysis.
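
As an illustration of the information-theoretical assessment mentioned above, the following minimal sketch computes an empirical cross-entropy (ECE) value from two sets of LR values. It assumes the commonly used formulation in which ECE is the prior-weighted average logarithmic cost of the posterior probabilities implied by each LR; the function name, argument names and default prior are hypothetical and are not taken from the Thesis.

```python
import math


def empirical_cross_entropy(lrs_same, lrs_diff, prior_same=0.5):
    """Return the ECE (in bits) of a set of likelihood ratios.

    lrs_same   -- LR values from comparisons known to be same-source
    lrs_diff   -- LR values from comparisons known to be different-source
    prior_same -- assumed prior probability of the same-source hypothesis
    """
    if not 0.0 < prior_same < 1.0:
        raise ValueError("prior_same must lie strictly between 0 and 1")
    if not lrs_same or not lrs_diff:
        raise ValueError("both LR sets must be non-empty")

    prior_odds = prior_same / (1.0 - prior_same)

    # Cost of same-source trials: large when the LR is small (misleading).
    same_cost = sum(
        math.log2(1.0 + 1.0 / (lr * prior_odds)) for lr in lrs_same
    ) / len(lrs_same)

    # Cost of different-source trials: large when the LR is big (misleading).
    diff_cost = sum(
        math.log2(1.0 + lr * prior_odds) for lr in lrs_diff
    ) / len(lrs_diff)

    return prior_same * same_cost + (1.0 - prior_same) * diff_cost
```

Under this formulation, a system that always reports LR = 1 leaves the prior unchanged and scores exactly the prior entropy (1 bit for equal priors). LR values that strongly support the correct hypothesis push the ECE towards zero, while strongly misleading LR values push it above the prior entropy.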
