
Heterogeneous Graph Network for Action Detection

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-09, Vol. 34 (9), pp. 7962-7974
Main Authors: Zhao, Yisheng; Zhu, Huaiyu; Huan, Ruohong; Bao, Yaoqi; Pan, Yun
Format: Article
Language: English
Summary: Spatio-temporal action detection is a fundamental task that detects persons in videos and recognizes their actions. It requires reasoning about the spatial-temporal interactions between persons and their surroundings. Recently, researchers have introduced additional modalities, which places higher demands on a method's reasoning capability, yet a method capable of holistic reasoning is still lacking. To this end, we propose a heterogeneous graph network that reasons about the spatial-temporal interactions among different types of nodes (video entities) and edges (inter-entity relations). Concretely, it comprises spatial and temporal graphs that are updated alternately. The spatial graph contains nodes for person appearance, person pose, object appearance, and hand interaction, while the temporal graph contains person nodes at different moments. For information aggregation, we propose a person-centric heterogeneous graph reasoning algorithm, which introduces heterogeneity into the graphs through node-type-specific projections and modulated edge-type-specific representations. We find that introducing heterogeneity enriches the model's ability to understand multiple modalities, which facilitates better parsing of complex semantic relations in videos and may enable further mining of spatial-temporal interactions between entities. Experimental results on four public datasets demonstrate the superiority of our method. Code is available at https://github.com/actiondetection.
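
The summary describes a person-centric update that combines node-type-specific projections with edge-type-specific modulation. The sketch below is not the authors' implementation (that is in the linked repository); it is a minimal PyTorch illustration under assumed details: the class name HeteroPersonUpdate, the fixed feature dimension, the per-type gating vectors, and the attention-weighted aggregation are all hypothetical placeholders for the paper's actual reasoning rule. Only the four spatial node types are taken from the summary itself.

    # Hypothetical sketch (not the paper's code): person-centric heterogeneous
    # message passing with node-type-specific projections and edge-type-specific
    # modulation, assuming toy inputs and a shared feature dimension.
    import torch
    import torch.nn as nn

    # Node types named in the paper's spatial graph.
    NODE_TYPES = ["person_app", "person_pose", "object_app", "hand"]

    class HeteroPersonUpdate(nn.Module):  # hypothetical name
        def __init__(self, dim: int = 64):
            super().__init__()
            # Node-type-specific projections into a shared space.
            self.proj = nn.ModuleDict({t: nn.Linear(dim, dim) for t in NODE_TYPES})
            # Edge-type-specific modulation: one learned gate per edge type,
            # identified here by the source node type (an assumption).
            self.gate = nn.ParameterDict(
                {t: nn.Parameter(torch.ones(dim)) for t in NODE_TYPES})
            self.out = nn.Linear(dim, dim)

        def forward(self, person: torch.Tensor,
                    neighbors: dict[str, torch.Tensor]) -> torch.Tensor:
            # person: (dim,) feature of the central person node.
            # neighbors: node type -> (num_nodes, dim) features of that type.
            query = self.proj["person_app"](person)
            messages = []
            for t, feats in neighbors.items():
                projected = self.proj[t](feats)          # type-specific projection
                attn = torch.softmax(projected @ query, dim=0)
                agg = (attn.unsqueeze(-1) * projected).sum(0)  # weighted sum
                messages.append(self.gate[t] * agg)      # edge-type modulation
            # Residual update of the person node.
            return self.out(person + torch.stack(messages).sum(0))

    # Toy usage with random features.
    torch.manual_seed(0)
    layer = HeteroPersonUpdate(dim=64)
    person = torch.randn(64)
    neighbors = {
        "person_pose": torch.randn(1, 64),
        "object_app": torch.randn(3, 64),
        "hand": torch.randn(2, 64),
    }
    updated = layer(person, neighbors)
    print(updated.shape)  # torch.Size([64])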
ISSN: 1051-8215; 1558-2205
DOI: 10.1109/TCSVT.2024.3383477