Combine multi-order representation learning and frame optimization learning for skeleton-based action recognition
Published in: | Digital signal processing 2025-01, Vol.156, p.104823, Article 104823 |
---|---|
Format: | Article |
Language: | English |
Summary: | Skeleton-based action recognition has broad application prospects in fields such as virtual reality. Currently, the most popular approach is to employ Graph Convolutional Networks (GCNs) or Hypergraph Convolutional Networks (HGCNs) for this task. However, GCN-based methods may rely heavily on the physical connectivity between joints while failing to capture higher-order information about interactions among distant joints, and HGCN-based methods usually introduce unnecessary noise when capturing low-order information from skeleton structures with simple topology. In addition, current methods do not handle redundant frames and confusing frames well. These limitations hinder improvements in recognition accuracy. In this paper, we propose a novel network, called Hyper-Net, which combines multi-order representation learning and frame optimization learning for skeleton-based action recognition. Specifically, the proposed Hyper-Net contains Temporal-Channel Aggregation Graph Convolution (TCA-GC), Spatial-Temporal Aggregation Hypergraph Convolution (STA-HC) and Frame Optimization Learning (F-OL) modules. The TCA-GC aggregates low-order, local information from simple joint and bone topologies across different temporal and channel dimensions. The STA-HC captures high-order, global information from complex motion streams and addresses the problem of spatial-temporal weight imbalance. The F-OL adaptively extracts key frames and distinguishes confusing frames, improving the network's ability to recognize confusing actions. Extensive experiments are conducted on the NTU RGB+D, NTU RGB+D 120 and NW-UCLA datasets for the action recognition task. The experimental results demonstrate the superiority and effectiveness of the proposed network.
• A novel network combining graph convolution and hypergraph convolution is proposed.
• A new hypergraph convolution, STA-HC, dynamically captures global motion features.
• A Frame Optimization Learning module improves the recognition of confusing actions.
• The experimental results demonstrate the effectiveness of the method. |
---|---|
ISSN: | 1051-2004 |
DOI: | 10.1016/j.dsp.2024.104823 |
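
The summary contrasts low-order graph convolution over bone connectivity with high-order hypergraph convolution over groups of joints. The sketch below illustrates that general distinction only. It is not the paper's TCA-GC or STA-HC: it uses the textbook symmetric-normalized graph convolution and a generic incidence-matrix hypergraph convolution, and the five-joint skeleton, the hyperedge grouping, and the function names `graph_conv` and `hypergraph_conv` are all assumptions made for illustration.

```python
import numpy as np


def graph_conv(X, A, Theta):
    """Generic GCN-style layer: D^-1/2 (A + I) D^-1/2 X Theta."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees (always >= 1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ Theta


def hypergraph_conv(X, H, Theta, w=None):
    """Generic hypergraph layer: Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta.

    H is the incidence matrix (joints x hyperedges); every joint must
    belong to at least one hyperedge so that all degrees are non-zero.
    """
    w = np.ones(H.shape[1]) if w is None else w
    Dv = (H * w).sum(axis=1)                  # weighted vertex degrees
    De = H.sum(axis=0)                        # hyperedge sizes
    Dv_inv_sqrt = np.diag(Dv ** -0.5)
    return Dv_inv_sqrt @ H @ np.diag(w / De) @ H.T @ Dv_inv_sqrt @ X @ Theta


# Hypothetical toy skeleton: 0 torso, 1 left hand, 2 right hand,
# 3 left foot, 4 right foot. Bones connect each limb end to the torso.
A = np.zeros((5, 5))
for i in (1, 2, 3, 4):
    A[0, i] = A[i, 0] = 1.0

# Hypothetical hyperedges: {hands}, {feet}, {whole body}.
H = np.array([
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
], dtype=float)

X = np.eye(5)        # one-hot joint features, so outputs expose the mixing weights
Theta = np.eye(5)    # identity projection, to isolate the aggregation step

g_out = graph_conv(X, A, Theta)
h_out = hypergraph_conv(X, H, Theta)
```

With these toy inputs, `g_out[1, 2]` is zero because the two hands share no bone, while `h_out[1, 2]` is positive because they share a hyperedge. This is the kind of distant-joint interaction that a single graph-convolution layer over physical connectivity does not capture.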