
ESAformer: Enhanced Self-Attention for Automatic Speech Recognition

Bibliographic Details
Published in:	IEEE Signal Processing Letters, 2024, Vol. 31, pp. 471-475
Main Authors:	Li, Junhua; Duan, Zhikui; Li, Shiren; Yu, Xinmei; Yang, Guangguang
Format:	Article
Language:	English
Subjects:	Attention; Automatic speech recognition; Convolution; Datasets; enhanced self-attention; Feature extraction; Logic gates; multi-order interaction; Recursion; Speech recognition; Tensors; Testing; Training; transformer; Transformers; Voice recognition

Description
Summary:	This letter proposes an Enhanced Self-Attention (ESA) module for feature extraction. The ESA combines recursive gated convolution with the self-attention mechanism: the former captures multi-order feature interactions, and the latter extracts global features. The letter also explores where in the network the ESA is best inserted. Here, the ESA is embedded into the encoder layer of the Transformer network for automatic speech recognition (ASR), and the resulting model is named ESAformer. The ESAformer is validated on three datasets: Aishell-1, HKUST, and WSJ. Experimental results show that, compared with the Transformer network, it improves CER by 0.8% on Aishell-1, CER by 1.2% on HKUST, and WER by 0.7%/0.4% on WSJ.
ISSN:	1070-9908 1558-2361
DOI:	10.1109/LSP.2024.3358754
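
Note on metrics:	The summary reports results as character error rate (CER) and word error rate (WER). As a reading aid, the following is a minimal sketch of how these metrics are conventionally computed: the edit distance between reference and hypothesis, divided by the reference length. It is not taken from the letter; the function names (edit_distance, error_rate, cer, wer) and the whitespace handling are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences.

    Substitutions, insertions, and deletions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref, hyp):
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


def wer(ref, hyp):
    """Word error rate over whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(cer("abcd", "abed"))
```

The summary does not say whether the reported improvements are absolute or relative reductions in these rates, so the sketch covers only the metric itself.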