Sparse modeling of neural network posterior probabilities for exemplar-based speech recognition
Published in: Speech Communication, 2016-02, Vol. 76, pp. 230-244
Main Authors: , ,
Format: Article
Language: English
Summary:
• Automatic speech recognition can be cast as a realization of compressive sensing.
• Posterior probabilities are suitable features for exemplar-based sparse modeling.
• Posterior-based sparse representation meets the statistical speech recognition formalism.
• Dictionary learning reduces the required collection of exemplars and improves performance.
• Collaborative hierarchical sparsity exploits temporal information in continuous speech.
In this paper, a compressive sensing (CS) perspective on exemplar-based speech processing is proposed. Relying on an analytical relationship between the CS formulation and statistical speech recognition (hidden Markov models, HMMs), the automatic speech recognition (ASR) problem is cast as the recovery of a high-dimensional sparse word representation from observed low-dimensional acoustic features. The acoustic features are exemplars obtained from (deep) neural network sub-word conditional posterior probabilities. Low-dimensional word manifolds are learned from these sub-word posterior exemplars and exploited to construct a linguistic dictionary for sparse representation of word posteriors. Dictionary learning proves to be a principled way to alleviate the need for the huge collection of exemplars required by conventional exemplar-based approaches, while still improving performance. Context appending and collaborative hierarchical sparsity are used to exploit the sequential and group structure underlying the word sparse representation. This formulation leads to a posterior-based sparse modeling approach to speech recognition. The potential of the proposed approach is demonstrated on isolated word (Phonebook corpus) and continuous speech (Numbers corpus) recognition tasks.
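As a concrete illustration of the sparse recovery step the abstract describes, the following is a minimal Python sketch, not the paper's implementation: it recovers a sparse combination of posterior exemplars from a synthetic test vector via l1-regularized least squares. The dictionary D, test vector y, dimensions, and regularizer alpha are all illustrative stand-ins, and the paper's collaborative hierarchical (group) sparsity is not modeled here.

```python
# Hedged sketch of the CS view of ASR: a test sub-word posterior vector y
# is approximated as a sparse, nonnegative combination of columns of a
# dictionary D whose columns are sub-word posterior exemplars.
# All sizes and data below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

n_subword = 50     # dimension of a sub-word posterior feature (e.g., DNN phone posteriors)
n_exemplars = 200  # exemplars pooled over all words (blocks of columns per word)

# Each dictionary column is normalized to be a probability vector.
D = rng.random((n_subword, n_exemplars))
D /= D.sum(axis=0, keepdims=True)

# Synthesize a test frame as a sparse mixture of a few exemplars of one "word".
x_true = np.zeros(n_exemplars)
x_true[10:13] = [0.5, 0.3, 0.2]
y = D @ x_true

# l1-regularized recovery: min_x ||y - D x||^2 + alpha * ||x||_1, with x >= 0
# (posteriors are nonnegative, so the positivity constraint is natural).
lasso = Lasso(alpha=1e-4, positive=True, max_iter=10_000)
lasso.fit(D, y)
x_hat = lasso.coef_

# Active coefficients should concentrate in one word's block of exemplars.
print("active exemplars:", np.nonzero(x_hat > 1e-3)[0])
```

To reproduce the grouping behavior the abstract attributes to collaborative hierarchical sparsity, the plain Lasso step would be replaced by a group-sparse solver operating on per-word blocks of dictionary columns.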
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2015.06.002