Generalized Hough Transform for Speech Pattern Classification

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015-11, Vol. 23 (11), p. 1963-1972
Main Authors: Dennis, Jonathan; Tran, Huy Dat; Li, Haizhou
Format: Article
Language: English
Description
Summary: While typical hybrid neural network architectures for automatic speech recognition (ASR) use a context window of frame-based features, this may not be the best approach to capture the wider temporal context, which contains phonetic and linguistic information that is equally important. In this paper, we introduce a system that integrates both the spectral and geometrical shape information from the acoustic spectrum, inspired by research in the field of machine vision. In particular, we focus on the Generalized Hough Transform (GHT), which is a sophisticated technique that can model the geometrical distribution of speech information over the wider temporal context. To integrate the GHT as part of a hybrid-ASR system, we propose to use a neural network, with features derived from the probabilistic Hough voting step of the GHT, to implement an improved version of the GHT where the output of the network represents the conventional target class posteriors. A major advantage of our approach is that each step of the GHT is highly interpretable, particularly compared to deep neural network (DNN) systems which are commonly treated as powerful black-box classifiers that give little insight into how the output is achieved. Experiments are carried out on two speech pattern classification tasks. The first is the TIMIT phoneme classification, which demonstrates the performance of the approach on a standard ASR task. The second is a spoken word recognition challenge, which highlights the flexibility of the approach to capture phonetic information within a longer temporal context.
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2015.2459599