Learning 3D Skeletal Representation From Transformer for Action Recognition
Published in: IEEE Access, 2022, Vol. 10, pp. 67541-67550
Main Authors:
Format: Article
Language: English
Summary: Skeleton-based human action recognition has attracted significant interest due to its simplicity and good accuracy. Diverse end-to-end trainable frameworks based on skeletal representations have been proposed to better map those representations to human action classes. Most skeleton-based approaches rely on skeletons that are heuristically pre-defined by commercial sensors. However, it has not been confirmed that sensor-captured skeletons are the best representation of the human body for action recognition, even though a dedicated representation is generally required for strong performance on downstream tasks such as action recognition. In this paper, we address this issue by explicitly learning the skeletal representation in the context of the human action recognition task. We first reconstruct 3D meshes of human bodies from RGB videos. We then employ a transformer architecture to sample the most informative skeletal representation from the reconstructed 3D meshes, considering the inner and inter structural relationships of the 3D meshes and sensor-captured skeletons. Experimental results on challenging human action recognition benchmarks (the SYSU and UTD-MHAD datasets) show the superiority of our learned skeletal representation over sensor-captured skeletons for the action recognition task.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3185058
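The sampling step described in the summary — selecting an informative skeletal representation from reconstructed mesh vertices — can be sketched as a single cross-attention operation, where learned per-joint queries attend over the 3D vertices and each joint position is produced as an attention-weighted combination of vertices. The sketch below is a simplified NumPy illustration under assumed shapes; the function names, dimensions, and single-layer design are our assumptions for exposition, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sample_joints(verts, queries, w_k, w_q):
    """Attention-weighted sampling of joint positions from mesh vertices.

    verts:   (V, 3) 3D mesh vertex coordinates (assumed already reconstructed)
    queries: (J, d) learned per-joint query embeddings (illustrative; the paper
             trains its sampling end-to-end with the recognition objective)
    w_k:     (3, d) key projection applied to vertex coordinates
    w_q:     (d, d) query projection
    Returns (J, 3): one attended 3D position per joint.
    """
    keys = verts @ w_k                                    # (V, d)
    scores = (queries @ w_q) @ keys.T / np.sqrt(keys.shape[1])  # (J, V)
    attn = softmax(scores)                                # rows sum to 1
    return attn @ verts                                   # convex combination of vertices

rng = np.random.default_rng(0)
V, J, d = 100, 20, 16                # 100 mesh vertices -> 20 sampled "joints"
verts = rng.normal(size=(V, 3))
joints = sample_joints(verts,
                       rng.normal(size=(J, d)),
                       rng.normal(size=(3, d)),
                       rng.normal(size=(d, d)))
print(joints.shape)  # (20, 3)
```

Because each attention row is non-negative and sums to one, every sampled joint lies inside the bounding box of the mesh vertices, which is a convenient property for a learned skeleton: the representation stays anchored to the reconstructed body surface.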