
Cross-Modality Compensation Convolutional Neural Networks for RGB-D Action Recognition

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2022-03, Vol. 32 (3), p. 1498-1509
Main Authors: Cheng, Jun, Ren, Ziliang, Zhang, Qieshi, Gao, Xiangyang, Hao, Fusheng
Format: Article
Language:English
Description
Summary: RGB-D-based human action recognition has attracted much attention recently because it provides more complementary information than a single modality. However, it is difficult for the two modalities to effectively learn spatial-temporal information from each other. To facilitate information interaction between different modalities, a cross-modality compensation convolutional neural network (ConvNet) is proposed for human action recognition, which enhances discriminative ability by jointly learning compensation features from the RGB and depth modalities. Moreover, we design a cross-modality compensation block (CMCB) to extract compensation features from the RGB and depth modalities. Specifically, the CMCB is incorporated into two typical network architectures, ResNet and VGG, to verify its ability to improve model performance. The proposed architecture has been evaluated on three challenging datasets: NTU RGB+D 120, THU-READ, and PKU-MMD. We experimentally verify that the proposed model with the CMCB is effective for different input types, such as pairs of raw images and dynamic images constructed from entire RGB-D sequences, and the results show that the proposed framework achieves state-of-the-art performance on all three datasets.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2021.3076165
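
The abstract describes the cross-modality compensation block (CMCB) only at a high level, so the following is a minimal, hypothetical PyTorch sketch of what such a block could look like, assuming a residual-style exchange in which each stream receives a projected copy of the other stream's feature map. The class name, 1x1-convolution projection, and channel layout are illustrative assumptions, not the authors' published implementation.

# Hypothetical sketch of a cross-modality compensation block.
# The record does not give the paper's exact design; this only illustrates
# the general idea of letting RGB and depth ConvNet streams exchange
# learned compensation features at an intermediate layer.
import torch
import torch.nn as nn


class CrossModalityCompensationBlock(nn.Module):
    """Exchanges compensation features between an RGB and a depth stream.

    Each modality's feature map is projected with a 1x1 convolution and
    added to the other modality's features, so each stream can exploit
    complementary cues from its counterpart (an assumed structure, not
    the authors' exact CMCB).
    """

    def __init__(self, channels: int):
        super().__init__()
        self.rgb_to_depth = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.depth_to_rgb = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor):
        # Residual-style compensation: each stream receives a projected
        # copy of the other stream's feature map before the next stage.
        rgb_out = self.relu(rgb_feat + self.depth_to_rgb(depth_feat))
        depth_out = self.relu(depth_feat + self.rgb_to_depth(rgb_feat))
        return rgb_out, depth_out


if __name__ == "__main__":
    # Toy usage with intermediate ResNet-style feature maps from both streams.
    block = CrossModalityCompensationBlock(channels=256)
    rgb = torch.randn(2, 256, 28, 28)
    depth = torch.randn(2, 256, 28, 28)
    rgb_c, depth_c = block(rgb, depth)
    print(rgb_c.shape, depth_c.shape)  # both torch.Size([2, 256, 28, 28])

Such a block could, in principle, be inserted after any stage of a two-stream ResNet or VGG backbone, which matches the abstract's claim that the CMCB is evaluated inside both architectures; where and how often it is inserted in the actual paper is not stated in this record.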