Skeleton-based human action recognition with sequential convolutional-LSTM networks and fusion strategies
Published in: Journal of Ambient Intelligence and Humanized Computing, 2022-08, Vol. 13 (8), pp. 3729-3746
Format: Article
Language: English
Summary: Human action recognition from skeleton data has drawn much attention from researchers due to the availability of thousands of real videos that pose many challenges. Existing works attempted to model the spatial characteristics and temporal dependencies of 3D joints using dynamic time warping, hand-crafted features, and spatial co-occurrence features. However, the representation derived from the spatial stream overemphasizes the temporal information and thus yields limited expressive power. Some studies use skeleton sequences as frames to enhance the expressive power of the representations but lose generalization capability because the derived temporal smoothness is specific to a particular dataset. The proposed work uses joint distance maps as a base representation that encodes spatial and temporal information into color texture images. We increase the expressive power by extracting feature maps from networks pre-trained on ImageNet to diversify the texture representation, and we propose a network architecture that models the temporal dependency explicitly. We also explore various fusion strategies to generate diverse representations from the feature maps of the pre-trained networks. The experimental results show that the proposed method achieves the best recognition accuracy when using decision-level fusion with meta-learners (Random Forest). The analysis also reveals that feature-level fusion yields a relatively good trade-off, i.e., recognition performance on par with some decision-level fusion strategies while having fewer tunable parameters. Extensive experimental results and comparative analysis on three benchmark datasets show that the proposed representation and network not only yield better recognition accuracy but also exhibit stronger generalization capability across multiple datasets.
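As a rough illustration of the joint-distance-map base representation described in the summary, the sketch below computes pairwise 3D joint distances per frame and maps them to a color texture image that a pre-trained CNN could consume. The array shapes, the "jet" colormap, and the function name joint_distance_map are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a joint distance map (JDM): pairwise 3D joint distances
# per frame, stacked over time, then color-coded as an RGB texture image.
import numpy as np
from matplotlib import cm


def joint_distance_map(skeleton, colormap=cm.jet):
    """skeleton: (T, J, 3) array of T frames with J 3D joints.
    Returns an (n_pairs, T, 3) uint8 color texture image."""
    T, J, _ = skeleton.shape
    iu, ju = np.triu_indices(J, k=1)  # indices of all unique joint pairs
    # Euclidean distance for every joint pair in every frame -> (T, n_pairs)
    dists = np.linalg.norm(skeleton[:, iu, :] - skeleton[:, ju, :], axis=-1)
    # Normalize to [0, 1] so the whole sequence shares one color scale
    dists = (dists - dists.min()) / (dists.max() - dists.min() + 1e-8)
    # Rows = joint pairs, columns = time; the colormap encodes distance as RGB
    rgb = colormap(dists.T)[..., :3]  # (n_pairs, T, 3), drop alpha channel
    return (rgb * 255).astype(np.uint8)


if __name__ == "__main__":
    # Example: a random 60-frame, 25-joint sequence (NTU-style skeleton layout)
    seq = np.random.rand(60, 25, 3)
    img = joint_distance_map(seq)
    print(img.shape)  # (300, 60, 3); would be resized to fit a pre-trained CNN
```

In this sketch the resulting texture image would then be fed to an ImageNet pre-trained backbone to extract feature maps, which the paper further processes with a sequential convolutional-LSTM network and fusion strategies.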
ISSN: 1868-5137, 1868-5145
DOI: 10.1007/s12652-022-03848-3