Temporal-masked skeleton-based action recognition with supervised contrastive learning

Bibliographic Details
Published in: Signal, Image and Video Processing, 2023-07, Vol. 17 (5), p. 2267-2275
Main Authors: Zhao, Zhifeng, Chen, Guodong, Lin, Yuxiang
Format: Article
Language:English
Summary: Recent years have seen a resurgence of self-supervised learning for visual representation, driven by contrastive learning and masked image modeling. Existing self-supervised methods for skeleton-based action recognition typically learn only the feature invariance of the data through contrastive learning. In this paper, we propose a contrastive learning method combined with a temporal-masking mechanism for skeleton sequences, which encourages the network to learn action representations beyond feature invariance, e.g., occlusion invariance, by implicitly reconstructing the masked sequences. However, direct masking destroys the feature consistency of the samples, so we propose Supervised Positive Sample Mining and a self-attention module for embeddings to improve the generalization of the model. First, supervised contrastive learning improves the robustness of the model by using prior knowledge from labels. Second, to keep excessive masking from preventing the model from learning the correct occlusion invariance, a self-attention mechanism is applied, which further separates the action classes in the feature space. Results under various experimental protocols on the NTU 60, NTU 120, and PKU-MMD datasets demonstrate the advantages of our method, which outperforms existing state-of-the-art contrastive methods. Code is available at https://github.com/ZZFCV/SASOiCLR.
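
To make the two core ideas in the summary concrete, below is a minimal PyTorch sketch (not the authors' released code, which is at the GitHub link above) of temporal frame masking for a skeleton sequence and a SupCon-style supervised contrastive loss that mines positives by label. The (T, V, C) tensor layout, the mask ratio of 0.3, the temperature of 0.1, and the encoder placeholder are illustrative assumptions; the paper's self-attention module for embeddings is omitted here.

    import torch
    import torch.nn.functional as F

    def temporal_mask(seq: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
        """Zero out a random subset of frames to simulate temporal occlusion.

        seq: (T, V, C) tensor of T frames, V joints, C channels (e.g., x/y/z).
        The mask_ratio of 0.3 is an illustrative choice, not the paper's setting.
        """
        t = seq.size(0)
        num_masked = max(1, int(t * mask_ratio))
        idx = torch.randperm(t)[:num_masked]   # frames to occlude
        masked = seq.clone()
        masked[idx] = 0.0                      # masked frames are zeroed
        return masked

    def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
        """Supervised contrastive loss in the style of Khosla et al. (2020).

        z: (N, D) embeddings; labels: (N,) class ids. Samples sharing a label
        are treated as positives for each anchor, using label prior knowledge.
        """
        z = F.normalize(z, dim=1)
        sim = (z @ z.t()) / tau                # (N, N) similarity logits
        n = z.size(0)
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
        sim = sim.masked_fill(self_mask, float('-inf'))   # exclude self-pairs
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        # average log-likelihood of positives per anchor; zero out non-positives
        # (also removes the -inf diagonal) and skip anchors with no positives
        pos_counts = pos_mask.sum(dim=1)
        valid = pos_counts > 0
        per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1)
        return (per_anchor[valid] / pos_counts[valid]).mean()

In a training loop one would generate a masked view of each sequence with temporal_mask, embed both views with a shared skeleton encoder (e.g., an ST-GCN backbone), and apply supcon_loss to the concatenated embeddings and labels. That wiring is a sketch under the stated assumptions, not the paper's exact pipeline.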
ISSN:1863-1703
1863-1711
DOI:10.1007/s11760-022-02442-6