
A multi-modal fusion framework for continuous sign language recognition based on multi-layer self-attention mechanism

Bibliographic Details
Published in: Journal of Intelligent & Fuzzy Systems, 2022-01, Vol. 43 (4), p. 4303-4316
Main Authors: Xue, Cuihong, Yu, Ming, Yan, Gang, Qin, Mengxian, Liu, Yuehao, Jia, Jingli
Format: Article
Language: English
Description
Summary: Some of the existing continuous sign language recognition (CSLR) methods require alignment. However, alignment is time-consuming, breaks the continuity of the frame sequence, and affects the subsequent stages of CSLR. In this paper, we propose a multi-modal network framework for CSLR based on a multi-layer self-attention mechanism. We propose a 3D convolutional residual neural network (CR3D) and a multi-layer self-attention network (ML-SAN) for the feature extraction stage. The CR3D obtains short-term spatiotemporal features from the RGB and optical-flow image streams, whereas the ML-SAN uses a bidirectional gated recurrent unit (BGRU) to model long-term sequence relationships and a multi-layer self-attention mechanism to learn the internal relationships within sign language sequences. For the performance optimization stage, we propose a cross-modal spatial mapping loss function, which improves the precision of CSLR by exploiting the spatial similarity between the video and text domains. Experiments were conducted on two test datasets: the RWTH-PHOENIX-Weather multi-signer dataset and a Chinese SL (CSL) dataset. The results show that the proposed method achieves state-of-the-art recognition performance on the two datasets, with a word error rate (WER) of 24.4% and an accuracy of 14.42%, respectively.
ISSN: 1064-1246
1875-8967
DOI: 10.3233/JIFS-211697
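
The abstract above outlines a sequence-modelling stage (BGRU plus stacked self-attention over fused clip-level features) and a cross-modal spatial mapping loss. The following is a minimal PyTorch sketch of that idea only, based solely on the abstract: the class names, layer counts, dimensions, vocabulary size, and the cosine-similarity form of the cross-modal loss are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of the ML-SAN idea described in the abstract:
# a bidirectional GRU models long-term temporal dependencies over
# clip-level features, and a stack of self-attention layers models
# relationships within the sign sequence. All hyperparameters are
# illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLayerSelfAttention(nn.Module):
    """Stack of Transformer-style self-attention encoder layers."""

    def __init__(self, dim: int, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) sequence of clip-level features
        return self.encoder(x)


class MLSAN(nn.Module):
    """BGRU + multi-layer self-attention over fused clip features."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256, vocab: int = 1296):
        super().__init__()
        self.bgru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = MultiLayerSelfAttention(dim=2 * hidden)
        self.classifier = nn.Linear(2 * hidden, vocab)  # per-step gloss logits

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, time, feat_dim), e.g. fused CR3D features
        # from the RGB and optical-flow streams (fusion not shown here).
        seq, _ = self.bgru(clip_feats)
        seq = self.attn(seq)
        return self.classifier(seq)  # (batch, time, vocab)


def cross_modal_spatial_loss(video_emb: torch.Tensor,
                             text_emb: torch.Tensor) -> torch.Tensor:
    # Hypothetical illustration of a cross-modal mapping objective:
    # pull each pooled video representation toward the embedding of its
    # gloss sequence via cosine distance. The paper's exact formulation
    # is not given in the abstract.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return (1.0 - (v * t).sum(dim=-1)).mean()


if __name__ == "__main__":
    model = MLSAN()
    dummy = torch.randn(2, 40, 512)  # 2 videos, 40 clips, 512-d features
    logits = model(dummy)
    print(logits.shape)  # torch.Size([2, 40, 1296])
```

In this sketch the self-attention stack operates on the BGRU outputs so that each time step can attend to the whole sequence; in practice the logits would typically be trained with a CTC-style objective alongside the cross-modal term, but that pairing is an assumption rather than something stated in the abstract.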