Multi-view motion modelled deep attention networks (M2DA-Net) for video based sign language recognition
Published in: Journal of Visual Communication and Image Representation, 2021-07, Vol. 78, p. 103161, Article 103161
Main Authors:
Format: Article
Language: English
Summary: Video-based sign language recognition (SLR) has been studied extensively with deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). A multi-view attention mechanism combined with CNNs is an appealing way to make the machine interpretation process robust to finger self-occlusions. The proposed multi-stream CNN mixes spatial and motion-modelled video sequences to create a low-dimensional feature vector at multiple stages of the CNN pipeline, recasting the view-invariance problem as a video classification problem solved with attention-model CNNs. For better network performance during training, signs are learned through a motion attention network that focuses on the parts of the sequence that contribute most, and the resulting view-based features are paired and pooled by a trainable view pair pooling network (VPPN). The VPPN pairs views to produce maximally distributed, discriminative features from all the views for improved sign recognition. The results show increased recognition accuracies on 2D video sign language datasets. Since there is no multi-view sign language dataset other than ours, similar results were also obtained on benchmark action datasets such as NTU RGB+D, MuHAVi, WEIZMANN and NUMA.
• Multi-view sign language recognition with deep learning.
• Motion-based attention model for accurate spatial movement identification.
• View pair pooling network to learn multiple paired views during training.
• An end-to-end trainable multi-view sign language learning framework.
• Results are encouraging for developing view-invariant sign language recognition.
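This record does not include the paper's implementation, but a minimal sketch of the view pair pooling idea described in the summary could look like the PyTorch module below. The class name ViewPairPooling, the concatenate-then-project pair descriptor, and the max-pooling over pairs are illustrative assumptions, not the authors' actual VPPN design.

```python
# Illustrative sketch (assumption, not the paper's VPPN): fuse CNN features
# from multiple camera views by forming all view pairs, projecting each pair,
# and max-pooling the pair descriptors into one multi-view feature.
import itertools

import torch
import torch.nn as nn


class ViewPairPooling(nn.Module):
    """Fuses per-view feature vectors over all pairs of views."""

    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        # Shared projection applied to every concatenated view pair.
        self.pair_proj = nn.Sequential(
            nn.Linear(2 * feat_dim, out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, feat_dim), e.g. per-view CNN embeddings.
        num_views = view_feats.size(1)
        pair_descs = []
        for i, j in itertools.combinations(range(num_views), 2):
            pair = torch.cat([view_feats[:, i], view_feats[:, j]], dim=-1)
            pair_descs.append(self.pair_proj(pair))
        # (batch, num_pairs, out_dim) -> max over pairs gives one fused feature.
        return torch.stack(pair_descs, dim=1).max(dim=1).values


if __name__ == "__main__":
    # Example: a batch of 4 clips seen from 3 camera views, 512-d features per view.
    vpp = ViewPairPooling(feat_dim=512, out_dim=256)
    fused = vpp(torch.randn(4, 3, 512))
    print(fused.shape)  # torch.Size([4, 256])
```

Max-pooling over all view pairs yields a single fused descriptor regardless of how many cameras are available, which is one plausible reading of the "maximally distributed, discriminative features from all the views" described in the summary.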
ISSN: 1047-3203, 1095-9076
DOI: 10.1016/j.jvcir.2021.103161