Jointly learning to align and transcribe using attention-based alignment and uncertainty-to-weigh losses
Main Authors:
Format: Conference Proceeding
Language: English
Online Access: Request full text
Summary: End-to-end Automatic Speech Recognition (ASR) models with attention, especially joint Connectionist Temporal Classification (CTC) and attention encoder-decoder models, have shown promising results. In this joint CTC and attention framework, misalignment of attention with the ground truth is not penalised, as the focus is on optimising only the CTC and attention cost functions. In this paper, a loss function that additionally minimizes alignment errors is introduced. This function is expected to enable the ASR system to attend to the right part of the input sequence and, in turn, minimize both alignment and transcription errors. We also implement a dynamic weighting of the losses corresponding to the CTC, attention, and alignment tasks. We demonstrate that in many cases the proposed framework yields better performance and faster convergence. We show results on two datasets, TIMIT and LibriSpeech 100 hours, for the phone recognition task, taking the reference alignments from a previously trained monophone Gaussian Mixture Model-Hidden Markov Model (GMM-HMM).
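The "uncertainty-to-weigh losses" idea mentioned in the title and summary can be sketched as follows. This is a minimal illustration of homoscedastic-uncertainty loss weighting (each task loss is scaled by a learnable log-variance term), not the authors' actual code; the function name, the three example loss values, and the zero initialisation are all assumptions made for the example.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-task losses with learned uncertainty weights.

    Each task i contributes exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2)
    is a learnable scalar. A large s_i down-weights that task's loss but pays
    a regularisation penalty of s_i, so the weights cannot collapse to zero.
    """
    total = 0.0
    for loss, s in zip(losses, log_vars):
        total += math.exp(-s) * loss + s
    return total

# Hypothetical per-task losses for the CTC, attention, and alignment objectives.
ctc_loss, att_loss, align_loss = 2.0, 1.5, 0.8
log_vars = [0.0, 0.0, 0.0]  # s_i = 0 gives equal unit weights at initialisation
total = uncertainty_weighted_loss([ctc_loss, att_loss, align_loss], log_vars)
# With all s_i = 0 this is just the plain sum: 2.0 + 1.5 + 0.8 = 4.3
```

In a real training loop the `log_vars` would be registered as trainable parameters and updated by the same optimiser as the network weights, so the balance between the CTC, attention, and alignment terms adapts during training.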
ISSN: 2474-915X
DOI: 10.1109/SPCOM50965.2020.9179519