Digital Audio Processing Methods for Voice Pathology Detection
Voice pathology encompasses a diverse range of disorders affecting vocal quality and production. Applying audio machine learning to voice pathology classification offers a promising approach to diagnosing these disorders, yet despite extensive research in the area, classifiers still struggle to adapt and generalize effectively. This thesis aims to close that gap by contributing new insights and methods. The research provides a comprehensive exploration of automatic voice pathology classification, focusing on challenges such as limited data and on the potential of integrating multiple modalities to improve diagnostic accuracy and adaptability. To strengthen the generalization and flexibility of the classifiers across diverse types of voice disorders, the study draws on multiple datasets and pathology types, covering disorders such as functional dysphonia, phonotrauma, laryngeal neoplasm, unilateral vocal fold paralysis, and COVID-19-related vocal conditions. Experiments span the Far Eastern Memorial Hospital, Saarbruecken Voice Database, Virufy, Coswara, COVID-19, and SPRsound datasets, each representing distinct voice and respiratory sounds and pathology types, and cover diverse signal types, including sustained vowels, speech, cough, breathing, and electroglottographic (EGG) signals. Throughout the design, implementation, and evaluation of the classifiers, the work focuses on feature extraction, the design of deep learning architectures, and augmentation techniques tailored to voice pathology data.
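One of the augmentation techniques used in this work, colored noise injection, can be sketched in a few lines of numpy. This is a minimal illustration rather than the thesis implementation: the spectral-shaping approach, the SNR-based mixing, and all function names are assumptions made for the example.

```python
import numpy as np

def colored_noise(n_samples, alpha, rng=None):
    """Generate zero-mean, unit-variance noise with a 1/f^alpha power
    spectrum: alpha = 0 -> white, 1 -> pink, 2 -> brownian-like."""
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples)
    shape = np.zeros_like(freqs)              # DC bin stays zero (zero-mean noise)
    shape[1:] = freqs[1:] ** (-alpha / 2.0)   # amplitude scales as f^(-alpha/2)
    noise = np.fft.irfft(spectrum * shape, n=n_samples)
    return noise / np.std(noise)

def add_noise_at_snr(signal, alpha, snr_db, rng=None):
    """Mix colored noise into `signal` at the requested signal-to-noise
    ratio (in dB), a common way to control augmentation strength."""
    noise = colored_noise(len(signal), alpha, rng)
    noise_power = np.mean(signal ** 2) / (10.0 ** (snr_db / 10.0))
    return signal + noise * np.sqrt(noise_power)
```

An "ensemble" of colored noise, as described later in the findings, would amount to drawing `alpha` from a set of values (e.g. white, pink, brown) across augmented copies of each recording.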
As a result of this process, this dissertation introduces five computational models, each designed to address a specific challenge in voice pathology classification and COVID-19 detection:

- Fusion of Medical Data: This model integrates medical data with audio recordings within a modular deep learning framework. By combining relevant medical descriptors with acoustic features, it enhances classification accuracy.
- Augmentation and Variable-Length Processing: This architecture employs augmentation techniques such as colored noise injection and variable-length segmentation. These methods enable the model to handle recordings of varying durations and address the challenge of data scarcity.
- Incorporation of Electroglottographic Data: This model integrates EGG signals with audio data and medical descriptors within a unified deep learning framework, enhancing classification accuracy by leveraging additional physiological information.
- Attention-Guided Multimodal Architecture: Using an attention mechanism, this architecture dynamically selects the most relevant audio modality (e.g., respiratory sounds, vowels) for each classification decision. This approach is particularly useful when not all types of recordings are available.
- Fully Convolutional Network for Respiratory Sound Classification: This model introduces a fully convolutional network that processes audio signals of arbitrary duration without segmentation or padding. It is specifically designed for the classification of respiratory sounds.

These models collectively advance diagnostic accuracy, adaptability, and generalization in voice pathology and respiratory sound classification.

Key Findings

The key findings of the thesis can be summarized as follows:

- Processing audio recordings as 2D, single-channel images through convolutional neural networks yields superior classification performance.
- The most effective audio feature vector for voice disorder classification combines Mel-Frequency Cepstral Coefficients (MFCCs), fundamental frequency, and perturbation measures such as jitter and the Harmonics-to-Noise Ratio (HNR).
- Incorporating medical data and demographic parameters into the voice disorder classification system significantly enhances accuracy.
- Integrating electroglottographic data into a trimodal architecture alongside medical and audio parameters further improves system accuracy.
- A fully convolutional network capable of handling recordings of arbitrary duration achieves higher classification performance than conventional convolutional networks.
- A variable-length segmentation algorithm tailored to the duration of each audio recording constitutes a novel augmentation technique for voice pathology data.
- Injecting an ensemble of colored noise is an effective data augmentation technique for voice pathology classification.
- Combining multiple audio sounds (respiratory sounds, voice, speech) improves system accuracy for COVID-19 detection.
- An attention-guided mechanism for modality weighting improves classifier accuracy and adaptability, particularly on datasets where not all types of recordings are available.
- In respiratory sound classification, a deep learning architecture built on a fully convolutional neural network increases accuracy by processing recordings without segmentation.
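The duration-invariance of a fully convolutional network comes from replacing flattening (which fixes the input size) with global pooling over time. The idea can be sketched with a one-layer numpy toy model; the architecture, filter sizes, and function names here are illustrative assumptions, not the thesis network.

```python
import numpy as np

def conv1d_bank(x, kernels):
    """Apply a bank of 1-D filters (shape [n_filters, k]) to signal x
    in 'valid' mode; output shape is [n_filters, time]."""
    return np.stack([np.convolve(x, k[::-1], mode="valid") for k in kernels])

def fcn_embed(x, kernels):
    """Conv -> ReLU -> global average pooling over the time axis.

    The pooled vector has one value per filter, so its size does not
    depend on the input duration -- no segmentation or padding needed."""
    feats = np.maximum(conv1d_bank(x, kernels), 0.0)  # ReLU
    return feats.mean(axis=1)                         # pool away the time axis

rng = np.random.default_rng(0)
kernels = rng.standard_normal((8, 32))
short = fcn_embed(rng.standard_normal(1_000), kernels)
long_ = fcn_embed(rng.standard_normal(50_000), kernels)
# both embeddings have shape (8,) regardless of recording duration
```

A conventional CNN with a dense head would instead require every input (and thus every recording segment) to have the same length.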

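The attention-guided modality weighting described above can be illustrated with a small numpy sketch: each available modality embedding receives a learned relevance score, the scores are normalized with a softmax, and the embeddings are combined as a weighted sum. The scoring scheme, dimensions, and names below are assumptions for illustration, not the architecture used in the thesis.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def fuse_modalities(embeddings, score_weights, available):
    """Attention-style fusion over whichever modalities are present.

    embeddings:    dict modality -> embedding vector (all same size)
    score_weights: dict modality -> learned scoring vector
    available:     modalities recorded for this subject; missing ones
                   are simply excluded, so the same model still works
                   when, e.g., only cough recordings exist."""
    scores = np.array([embeddings[m] @ score_weights[m] for m in available])
    weights = softmax(scores)
    return sum(w * embeddings[m] for w, m in zip(weights, available))

rng = np.random.default_rng(1)
emb = {m: rng.standard_normal(4) for m in ["cough", "breath", "vowel"]}
sw = {m: rng.standard_normal(4) for m in emb}
fused_full = fuse_modalities(emb, sw, ["cough", "breath", "vowel"])
fused_partial = fuse_modalities(emb, sw, ["cough", "vowel"])  # breath missing
```

Because the softmax is taken only over the available modalities, the fused vector always has the same dimensionality, which is what makes the classifier usable on datasets where some recording types are absent.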