Loading…

Cascaded Encoders for Unifying Streaming and Non-Streaming ASR

End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in...

Full description

Saved in:
Bibliographic Details
Main Authors: Narayanan, Arun, Sainath, Tara N., Pang, Ruoming, Yu, Jiahui, Chiu, Chung-Cheng, Prabhavalkar, Rohit, Variani, Ehsan, Strohman, Trevor
Format: Conference Proceeding
Language:English
Subjects:
Online Access:Request full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoders. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode either using the output of the streaming or the non-streaming encoder. Results show that this model achieves similar word error rates (WER) as a standalone streaming model when operating in streaming mode, and obtains 10% - 27% relative improvement when operating in non-streaming mode. Our results also show that the proposed approach outperforms existing E2E two-pass models, especially on long-form speech.
ISSN:2379-190X
DOI:10.1109/ICASSP39728.2021.9414607