Interpretable Machine Learning for Machine Listening
Recent years have witnessed significant interest in interpretable machine learning (IML) research that develops techniques to analyse machine learning (ML) models. Understanding ML models is essential to gain trust in their predictions and to improve datasets, model architectures and training techniques. The majority of effort in IML research has been in analysing models that classify images or structured data, and comparatively less work exists that analyses models for other domains. This research focuses on developing novel IML methods and on extending existing methods to understand machine listening models that analyse audio. In particular, this thesis reports the results of three studies that apply three different IML methods to analyse five singing voice detection (SVD) models that predict singing voice activity in musical audio excerpts. The first study introduces SoundLIME (SLIME), a method to generate temporal, spectral or time-frequency explanations for predictions of any machine listening model. The study involves applying SLIME to analyse the trustworthiness of three SVD models for some carefully selected instances. Results indicate that SLIME effectively identifies that the binary decision tree model is untrustworthy and may not generalise. Moreover, the study analyses the behaviour of SLIME for two input parameters and the results suggest that the choice of suitable values for those parameters is essential to generate reliable explanations from SLIME. The second study introduces a novel method to perform activation maximisation (AM), a technique that synthesises examples that maximally activate the components (neurons, layers) of a deep neural network (DNN). The method uses a generative adversarial network as a prior in the AM pipeline. The study involves applying the method to synthesise examples for understanding two DNN-based SVD models. 
Examples that the method synthesises for the output layer neurons in both models exhibit the presence of vocal and non-vocal characteristics for their respective inputs, suggesting that those neurons have learnt to detect high-level class concepts. The study also introduces and demonstrates a method for quantitatively selecting suitable values for AM hyper-parameters. The observation about the presence of class characteristics in the synthesised examples is further supported by the results of an online perceptual study involving 23 participants. The third study demonstrates that feature inversion, a method to invert features (handcrafted or learned) back to the input space, is an effective method for explaining DNN predictions. The study also involves applying feature inversion to understand the features that each layer of the DNN-based SVD model preserves. The qualitative analysis of inverted representations corresponding to the deepest hidden layer suggests that the representations corresponding to the vocal and non-vocal excerpts contain energy mostly in the higher and lower frequency regions, respectively. In conclusion, this thesis contributes to IML research by developing novel post-hoc analysis methods and to machine listening research by providing effective tools for investigating and understanding machine listening models. Hopefully, insights from model analysis will assist in developing trustworthy ML models with better generalisation capabilities.
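
As a rough illustration of the kind of temporal explanation described for the first study, the sketch below applies a generic LIME-style procedure to a one-dimensional audio signal: it silences random subsets of time segments, queries the model on each perturbed input, and fits a proximity-weighted linear model whose coefficients act as per-segment importances. This is not the SLIME implementation from the thesis. The function `explain_temporal`, the parameters `n_segments`, `n_samples` and `kernel_width`, and the toy model are all hypothetical names chosen for illustration, and these parameters are not necessarily the two input parameters analysed in the study.

```python
import numpy as np


def explain_temporal(predict_fn, audio, n_segments=10, n_samples=500,
                     kernel_width=0.25, seed=0):
    """Return one importance weight per temporal segment (LIME-style sketch).

    predict_fn is any callable mapping a 1-D float array to a scalar score.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Split the sample indices into contiguous temporal segments.
    segments = np.array_split(np.arange(len(audio)), n_segments)
    # Binary masks: 1 keeps a segment, 0 replaces it with silence.
    masks = rng.integers(0, 2, size=(n_samples, n_segments))
    masks[0] = 1  # always include the unperturbed input
    preds = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = audio.copy()
        for seg, keep in zip(segments, mask):
            if not keep:
                perturbed[seg] = 0.0
        preds[i] = predict_fn(perturbed)
    # Proximity kernel: perturbations closer to the original weigh more.
    distance = 1.0 - masks.mean(axis=1)
    sample_weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Weighted least squares with an intercept column.
    design = np.hstack([masks, np.ones((n_samples, 1))])
    scale = np.sqrt(sample_weights)
    coef, *_ = np.linalg.lstsq(design * scale[:, None], preds * scale,
                               rcond=None)
    return coef[:-1]  # drop the intercept


# Toy stand-in for a machine listening model: it only "listens" to
# samples 300-399, i.e. the fourth of ten equal segments.
def toy_model(x):
    return float(np.mean(x[300:400] ** 2))


audio = np.random.default_rng(1).standard_normal(1000)
segment_weights = explain_temporal(toy_model, audio)
print(int(np.argmax(segment_weights)))
```

Because the toy model depends only on the fourth segment, the fitted weights should concentrate there. With a real model, the resulting explanation can change with the number of segments and the number of perturbed samples, which is in line with the abstract's remark that suitable parameter values are essential for reliable explanations.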
