Heterogeneous Graph Network for Action Detection
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-09, Vol. 34 (9), pp. 7962-7974
Main Authors: , , , ,
Format: Article
Language: English
Summary: Spatio-temporal action detection is a fundamental task that detects persons in videos and recognizes their actions. It requires reasoning about the spatial-temporal interactions between persons and their surroundings. Recently, researchers have introduced additional modalities, which places higher demands on a method's reasoning capability, yet a method capable of holistic reasoning is still lacking. To this end, we propose a heterogeneous graph network that reasons about spatial-temporal interactions among different types of nodes (video entities) and edges (inter-entity relations). Concretely, it comprises spatial and temporal graphs that are alternately updated. The spatial graph contains nodes for person appearance, person pose, object appearance, and hand interaction; the temporal graph contains person nodes at different moments. For information aggregation, we propose a person-centric heterogeneous graph reasoning algorithm, which introduces heterogeneity into the graphs through node-type-specific projections and modulated edge-type-specific representations. We find that introducing heterogeneity enriches the model's ability to understand multiple modalities, which facilitates better parsing of complex semantic relations in videos and potentially enables further mining of spatial-temporal interactions between entities. Experimental results on four public datasets demonstrate the superiority of our method. Code is available at https://github.com/actiondetection.
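The person-centric reasoning described in the summary (node-type-specific projections combined with edge-type-specific modulation, aggregated into person nodes) can be sketched roughly as follows. This is a minimal NumPy illustration of the general idea, not the paper's implementation: the feature dimensions, the softmax-attention aggregation, and the residual update are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (hypothetical)

# Node features by type, matching the spatial-graph node types named in
# the summary; node counts here are arbitrary placeholders.
nodes = {
    "person": rng.standard_normal((2, D)),   # person appearance
    "pose":   rng.standard_normal((2, D)),   # person pose
    "object": rng.standard_normal((3, D)),   # object appearance
    "hand":   rng.standard_normal((2, D)),   # hand interaction
}

# Node-type-specific projections: one weight matrix per node type.
W = {t: rng.standard_normal((D, D)) / np.sqrt(D) for t in nodes}

# Edge-type-specific modulation: one learned vector per relation type.
edge_types = ["person-pose", "person-object", "person-hand"]
mod = {e: rng.standard_normal(D) for e in edge_types}

def aggregate_to_persons(nodes, W, mod):
    """Person-centric update: project each neighbor type with its own
    weights, modulate messages by edge type, weight them with a softmax
    over all neighbors, and add the aggregate back to the person nodes."""
    persons = nodes["person"] @ W["person"]
    msgs, scores = [], []
    for etype in edge_types:
        ntype = etype.split("-")[1]
        m = (nodes[ntype] @ W[ntype]) * mod[etype]  # project + modulate
        msgs.append(m)
        scores.append(persons @ m.T)                # person-neighbor affinity
    msgs = np.concatenate(msgs, axis=0)             # (N_neighbors, D)
    scores = np.concatenate(scores, axis=1)         # (N_persons, N_neighbors)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # row-wise softmax
    return nodes["person"] + attn @ msgs            # residual update

updated = aggregate_to_persons(nodes, W, mod)
print(updated.shape)  # (2, 8)
```

In this sketch, heterogeneity enters in exactly the two places the summary names: each node type gets its own projection `W[t]`, and each edge type modulates its messages with `mod[e]` before aggregation. Alternating this spatial step with a temporal step over person nodes at different moments would follow the same pattern.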
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3383477