Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech
One desideratum in designing cognitive robots is the autonomous learning of communication skills, as humans do. The primary step towards this goal is vocabulary acquisition. Unlike the training procedures of state-of-the-art automatic speech recognition (ASR) systems, vocabulary acquisition cannot rely on prior knowledge of the language in the same way. As infants do, the acquisition process should be data-driven, involve multi-level abstraction and be coupled with multi-modal inputs. To avoid lengthy training efforts in a word-by-word interactive learning process, a clever learning agent should be able to acquire vocabulary from continuous speech automatically. The work presented in this thesis is entitled \emph{Constrained Non-negative Matrix Factorization for Vocabulary Acquisition from Continuous Speech}. Informed by extensively studied ASR techniques, we design computational models that discover and represent vocabulary from continuous speech with little prior knowledge of the language to be learned. Starting from a recently proposed non-negative matrix factorization (NMF) approach to vocabulary acquisition, this work targets vocabulary representations with high accuracy and a fast learning rate. NMF learning discovers repeated words in spoken data represented as a bag-of-features (BoF) and provides a BoF description of the vocabulary. This thesis advances the state of the art in this area in three respects: (1) Accuracy improvements of NMF-based word discovery and of subsequent ASR through soft vector quantization, multiple codebooks, multiple time scales and multiple contextual dependencies. Experiments show that the obtained accuracies approach those of hidden Markov models (HMMs) trained on transcribed data. However, these improvements of the NMF model come at the expense of high computational complexity and require sufficient labeled data as supervision. 
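The core NMF step can be illustrated with a minimal sketch: a non-negative data matrix of BoF counts (acoustic features per utterance) is factorized into word patterns and word activations using the standard multiplicative updates for the KL divergence. This is a generic illustration under simplifying assumptions, not the thesis's exact formulation; the function name and matrix shapes are chosen for the example.

```python
import numpy as np

def nmf_kl(V, K, n_iter=500, seed=0):
    """Sketch of NMF word discovery on a bag-of-features matrix.

    V : non-negative (features x utterances) co-occurrence counts.
    K : number of word patterns to discover.
    Returns W (one BoF word pattern per column) and H (word
    activations per utterance), using Lee-Seung multiplicative
    updates for the KL divergence D(V || WH)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        R = V / (W @ H + eps)                # ratio of data to reconstruction
        W *= (R @ H.T) / (H.sum(axis=1) + eps)
        # Rescale so each word pattern sums to one; H absorbs the scale,
        # leaving the product W @ H unchanged.
        s = W.sum(axis=0) + eps
        W /= s
        H *= s[:, None]
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
    return W, H
```

In this reading, each column of W is a distribution over acoustic features that tends to recur together, i.e. a candidate word, and H says how strongly each candidate word is active in each utterance.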
These concerns are addressed in contributions (2) and (3) below. (2) The NMF method does not constrain its BoF word representations to be generated by actual data sequences. To implement this constraint, the NMF problem is regularized by expressing graph adjacency between the features of the BoF description. In this context, “adjacency” means “temporal proximity”, but the method has also been applied successfully to bag-of-feature representations with spatial proximity in image processing. The regularization guides NMF towards detecting more realistic patterns in the temporal domain. The method demonstrates superior performance over baselines on speech, image and document data sets. (3) The BoF word representations learned by NMF map acoustic features directly to word activations. With non-negative matrix tri-factorization (NMTF), sub-word units are learned, which improves the representation accuracy of the vocabulary and increases the learning rate in the sense that fewer labeled examples are required. The discovered sub-word units are shown to be closely related to HMM states. Therefore, NMF and NMTF are integrated into a non-negative Tucker decomposition (NTD) for unsupervised learning of HMMs. Joint training techniques for NTD and HMMs are proposed for unsupervised sequential pattern discovery. With a small amount of labeled data linking the obtained sequential patterns to grounding labels, the model can act as a speech recognizer, outperforming both unsupervised EM training of HMMs and the NMF model. Apart from the improvements in accuracy and learning rate, vocabulary acquisition now results in an HMM, which greatly simplifies the decoding of new utterances in terms of the acquired vocabulary. As with contribution (2), the resulting method for unsupervised HMM learning is formulated generically, with possible applications outside the domain of speech processing.
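The graph-regularization idea of contribution (2) can be sketched generically: a Laplacian penalty $\mathrm{tr}(W^\top L W)$ on the word patterns encourages features that are adjacent in the graph (here, temporally proximate) to load on the same patterns. The sketch below uses the Frobenius-norm variant with standard graph-regularized multiplicative updates; the adjacency matrix, regularization weight and function name are illustrative assumptions, not the thesis's exact model.

```python
import numpy as np

def graph_nmf(V, K, A, lam=0.01, n_iter=500, seed=0):
    """Sketch of graph-regularized NMF.

    V   : non-negative (features x utterances) BoF matrix.
    A   : symmetric non-negative feature-adjacency matrix (F x F),
          encoding e.g. temporal proximity between features.
    lam : weight of the Laplacian penalty tr(W^T L W), L = D - A.

    Minimizes ||V - W H||_F^2 + lam * tr(W^T L W) with
    multiplicative updates (GNMF-style, applied to W)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    D = np.diag(A.sum(axis=1))               # degree matrix of the graph
    eps = 1e-12
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # The penalty gradient lam * L W = lam * (D - A) W splits into a
        # positive part (D W, denominator) and negative part (A W, numerator).
        W *= (V @ H.T + lam * (A @ W)) / (W @ H @ H.T + lam * (D @ W) + eps)
    return W, H
```

With lam set to zero this reduces to plain NMF; increasing it trades reconstruction accuracy for smoothness of the patterns along the graph, which is what pulls the learned word representations towards realistic contiguous sequences.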
