Jointly learning to align and transcribe using attention-based alignment and uncertainty-to-weigh losses
Main Authors:
Format: Conference Proceeding
Language: English
Online Access: Request full text
Summary: End-to-end Automatic Speech Recognition (ASR) models with attention, especially joint Connectionist Temporal Classification (CTC) and attention encoder-decoder models, have shown promising results. In this joint CTC and attention framework, misalignment of attention with the ground truth is not penalised, as the focus is on optimising only the CTC and attention cost functions. In this paper, a loss function that additionally minimizes alignment errors is introduced. This function is expected to enable the ASR system to attend to the right part of the input sequence and, in turn, minimize both alignment and transcription errors. We also implement a dynamic weighting of the losses corresponding to the CTC, attention, and alignment tasks. We demonstrate that in many cases the proposed framework yields better performance and faster convergence. We show results on two datasets, TIMIT and LibriSpeech 100 hours, for the phone recognition task, taking the reference alignments from a previously trained monophone Gaussian Mixture Model-Hidden Markov Model (GMM-HMM).
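The "uncertainty-to-weigh losses" idea mentioned in the title and summary can be sketched as follows. This is a minimal illustration of homoscedastic-uncertainty loss weighting (each task loss is scaled by a learnable log-variance term), not the authors' actual code; the function name, the three example loss values, and the zero initialisation are all assumptions made for the example.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-task losses with learned uncertainty weights.

    Each task i contributes exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2)
    is a learnable scalar. A large s_i down-weights that task's loss but pays
    a regularisation penalty of s_i, so the weights cannot collapse to zero.
    """
    total = 0.0
    for loss, s in zip(losses, log_vars):
        total += math.exp(-s) * loss + s
    return total

# Hypothetical per-task losses for the CTC, attention, and alignment objectives.
ctc_loss, att_loss, align_loss = 2.0, 1.5, 0.8
log_vars = [0.0, 0.0, 0.0]  # s_i = 0 gives equal unit weights at initialisation
total = uncertainty_weighted_loss([ctc_loss, att_loss, align_loss], log_vars)
# With all s_i = 0 this is just the plain sum: 2.0 + 1.5 + 0.8 = 4.3
```

In a real training loop the `log_vars` would be registered as trainable parameters and updated by the same optimiser as the network weights, so the balance between the CTC, attention, and alignment terms adapts during training.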
ISSN: 2474-915X
DOI: 10.1109/SPCOM50965.2020.9179519