Sparse modeling of neural network posterior probabilities for exemplar-based speech recognition

Bibliographic Details
Published in: Speech Communication, 2016-02, Vol. 76, pp. 230–244
Main Authors: Dighe, Pranay; Asaei, Afsaneh; Bourlard, Hervé
Format: Article
Language: English
Description
Summary:
Highlights:
• Automatic speech recognition can be cast as a realization of compressive sensing.
• Posterior probabilities are suitable features for exemplar-based sparse modeling.
• Posterior-based sparse representation meets the statistical speech recognition formalism.
• Dictionary learning reduces the required collection size of exemplars and improves performance.
• Collaborative hierarchical sparsity exploits temporal information in continuous speech.

In this paper, a compressive sensing (CS) perspective on exemplar-based speech processing is proposed. Relying on an analytical relationship between the CS formulation and statistical speech recognition (hidden Markov models, HMMs), the automatic speech recognition (ASR) problem is cast as recovery of a high-dimensional sparse word representation from the observed low-dimensional acoustic features. The acoustic features are exemplars obtained from (deep) neural network sub-word conditional posterior probabilities. Low-dimensional word manifolds are learned from these sub-word posterior exemplars and exploited to construct a linguistic dictionary for sparse representation of word posteriors. Dictionary learning proves to be a principled way to alleviate the need for the huge collections of exemplars required by conventional exemplar-based approaches, while still improving performance. Context appending and collaborative hierarchical sparsity are used to exploit the sequential and group structure underlying the word sparse representation. This formulation leads to a posterior-based sparse modeling approach to speech recognition. The potential of the approach is demonstrated on isolated-word (PhoneBook corpus) and continuous-speech (Numbers corpus) recognition tasks.
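As a rough illustration of the sparse-recovery idea in the abstract (not the authors' implementation), the Python sketch below learns a small "linguistic dictionary" D from synthetic sub-word posterior exemplars and then recovers a sparse word-activation vector w for a test frame z by solving z ≈ Dᵀw under an l1 penalty. scikit-learn's generic DictionaryLearning and Lasso stand in for the paper's specific dictionary learning and collaborative hierarchical sparsity methods; all variable names, shapes, hyperparameters, and the synthetic data are illustrative assumptions.

import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for sub-word posterior exemplars: each row is a
# posterior vector over n_subwords classes (Dirichlet rows sum to 1).
n_frames, n_subwords, n_words = 200, 40, 10
exemplars = rng.dirichlet(np.ones(n_subwords), size=n_frames)

# Learn a small "linguistic dictionary" whose atoms model word
# manifolds in the sub-word posterior space (generic stand-in for
# the paper's dictionary learning).
dict_learner = DictionaryLearning(n_components=n_words, alpha=0.1,
                                  max_iter=200, random_state=0)
dict_learner.fit(exemplars)
D = dict_learner.components_              # shape: (n_words, n_subwords)

# Sparse recovery: express a test posterior frame z as z ~ D^T w with
# w sparse; the l1 penalty plays the role of the CS sparsity constraint.
test_frame = exemplars[0]
lasso = Lasso(alpha=0.01, positive=True)  # non-negativity is an extra assumption
lasso.fit(D.T, test_frame)
word_activations = lasso.coef_            # sparse word-activation estimate

print("active word atoms:", np.nonzero(word_activations)[0])

The nonzero entries of word_activations play the role of the sparse word-posterior support; in the paper's continuous-speech setting, context appending and group (collaborative hierarchical) sparsity over frame sequences would replace the plain per-frame Lasso used here.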
ISSN: 0167-6393
EISSN: 1872-7182
DOI: 10.1016/j.specom.2015.06.002