MIPA-ResGCN: a multi-input part attention enhanced residual graph convolutional framework for sign language recognition

Bibliographic Details
Published in:	Computers & electrical engineering 2023-12, Vol.112, p.109009, Article 109009
Main Authors:	Naz, Neelma; Sajid, Hasan; Ali, Sara; Hasan, Osman; Ehsan, Muhammad Khurram
Format:	Article
Language:	English
Subjects:	Multi input architecture; Part attention; Pose sequence modeling; ResGCN; Sign language recognition; Visualization

Description
Summary:	Sign language (SL) is used as the primary mode of communication by individuals who experience deafness and speech disorders. However, SL creates an inordinate communication barrier, as most people are not acquainted with it. To solve this problem, many technological solutions using wearable devices, video, and depth cameras have been put forth. The ubiquitous nature of cameras in contemporary devices has resulted in the emergence of sign language recognition (SLR) using video sequences as a viable and unobtrusive substitute. Nonetheless, SLR methods based on visual features, commonly known as appearance-based methods, present notable computational complexity. In response to these challenges, this study introduces an accurate and computationally efficient pose-based approach for SLR. Our proposed approach comprises three key stages: pose extraction, handcrafted feature generation, and feature space mapping and recognition. Initially, an efficient off-the-shelf pose extraction algorithm is employed to extract pose information for various body parts of a subject captured in a video. Then, a multi-input stream is generated from handcrafted features, i.e., joints, bone lengths, and bone angles. Finally, an efficient and lightweight residual graph convolutional network (ResGCN) with a novel part attention mechanism is proposed to encode the body's spatial and temporal information in a compact feature space and recognize the signs performed. In addition to enabling effective learning during model training and offering cutting-edge accuracy, the proposed model significantly reduces computational complexity. Our proposed method is assessed on five challenging SL datasets, WLASL-100, WLASL-300, WLASL-1000, LSA-64, and MINDS-Libras, achieving state-of-the-art (SOTA) accuracies of 83.33 %, 72.90 %, 64.92 %, 100 ± 0 %, and 96.70 ± 1.07 %, respectively. Compared to previous approaches, we achieve superior performance while incurring a lower computational cost.
ISSN:	0045-7906 1879-0755
DOI:	10.1016/j.compeleceng.2023.109009
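
The handcrafted feature stage named in the summary (joints, bone lengths, and bone angles feeding a multi-input stream) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the skeleton layout `BONES`, the function names, the use of 2D coordinates, and the definition of a bone angle as the bone's orientation relative to the x-axis are all hypothetical, since the record does not specify how the paper defines these features.

```python
import math

# Hypothetical 5-joint chain given as (child, parent) index pairs.
# This is NOT the skeleton layout used in the paper.
BONES = [(1, 0), (2, 1), (3, 2), (4, 3)]


def bone_features(joints, bones=BONES):
    """Return one (length, angle) pair per bone for a single frame.

    joints: list of (x, y) joint coordinates.
    The angle is the bone's orientation relative to the x-axis, in radians
    (an assumed definition chosen for illustration).
    """
    feats = []
    for child, parent in bones:
        dx = joints[child][0] - joints[parent][0]
        dy = joints[child][1] - joints[parent][1]
        feats.append((math.hypot(dx, dy), math.atan2(dy, dx)))
    return feats


def multi_input_streams(sequence, bones=BONES):
    """Build three per-frame streams from a pose sequence.

    sequence: list of frames, each a list of (x, y) joint coordinates.
    Returns a dict with 'joints', 'bone_lengths', and 'bone_angles'.
    """
    lengths, angles = [], []
    for frame in sequence:
        feats = bone_features(frame, bones)
        lengths.append([length for length, _ in feats])
        angles.append([angle for _, angle in feats])
    return {
        "joints": [list(frame) for frame in sequence],
        "bone_lengths": lengths,
        "bone_angles": angles,
    }


if __name__ == "__main__":
    frame = [(0, 0), (3, 4), (3, 5), (4, 5), (4, 4)]
    streams = multi_input_streams([frame, frame])
    print(streams["bone_lengths"][0])
    print(streams["bone_angles"][0])
```

In a pipeline like the one summarized above, each of the three streams would presumably be passed to its own input branch of the network before fusion; how the paper actually normalizes and combines them is not stated in this record.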