Dynamic Pathway for Query-Aware Feature Learning in Language-Driven Action Localization
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 7451-7461
Main Authors:
Format: Article
Language: English
Subjects:
Summary: Language-driven action localization aims to locate the segment of an untrimmed video that is semantically relevant to an input language query. The task is challenging because language queries describe diverse actions with different motion characteristics and semantic granularities. Some actions, such as "the person takes off their shoes, and goes to the door", are characterized by complex motion relationships, while others, such as "a person is standing holding a mirror in one hand", are distinguished by salient body postures. In this paper, we propose a dynamic pathway between an exploitation module and an exploration module for query-aware feature learning that handles this diversity of actions. The exploitation module works in a coarse-to-fine manner: it first learns features of general motion relationships to locate a coarse segment of the target action, and then learns features of subtle motion changes to predict refined action boundaries. The exploration module works in a point-to-area diffusion fashion: it first learns features of sub-action patterns to find the salient postures of the target action, and then learns features of temporal dependency to expand the posture frames into the full action segment. The two modules are selected dynamically and adaptively to learn comprehensive representations of diverse actions and improve localization accuracy. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method outperforms existing methods.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2024.3368919
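
The summary above describes a query-conditioned choice between an exploitation branch (coarse-to-fine motion modeling) and an exploration branch (point-to-area diffusion from salient posture frames). The following is a minimal sketch of that routing idea only; it is not the authors' implementation, and all module names, layer choices, tensor shapes, and the soft gating scheme are illustrative assumptions.

```python
# Sketch of a query-aware dynamic pathway between two feature-learning branches.
# Everything here (branch internals, soft gate, dimensions) is an assumption made
# to illustrate the idea in the summary, not the method published in the paper.
import torch
import torch.nn as nn


class ExploitationBranch(nn.Module):
    """Stand-in for coarse-to-fine motion-relationship modeling."""
    def __init__(self, dim):
        super().__init__()
        self.coarse = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fine = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, video_feat):                          # (B, T, D)
        x = self.coarse(video_feat)                         # general motion relationships
        return self.fine(x.transpose(1, 2)).transpose(1, 2)  # subtle motion changes


class ExplorationBranch(nn.Module):
    """Stand-in for point-to-area diffusion from salient posture frames."""
    def __init__(self, dim):
        super().__init__()
        self.posture_score = nn.Linear(dim, 1)
        self.diffuse = nn.GRU(dim, dim, batch_first=True)

    def forward(self, video_feat):                          # (B, T, D)
        w = torch.softmax(self.posture_score(video_feat), dim=1)  # salient frames
        out, _ = self.diffuse(w * video_feat)               # expand along time
        return out


class DynamicPathway(nn.Module):
    """Query-conditioned soft gate mixing the two branches (an assumed form of
    the adaptive selection; the paper's exact mechanism may differ)."""
    def __init__(self, dim):
        super().__init__()
        self.exploit = ExploitationBranch(dim)
        self.explore = ExplorationBranch(dim)
        self.gate = nn.Linear(dim, 2)

    def forward(self, video_feat, query_feat):              # (B, T, D), (B, D)
        g = torch.softmax(self.gate(query_feat), dim=-1)    # (B, 2) routing weights
        f1 = self.exploit(video_feat)
        f2 = self.explore(video_feat)
        return g[:, 0, None, None] * f1 + g[:, 1, None, None] * f2


if __name__ == "__main__":
    model = DynamicPathway(dim=256)
    fused = model(torch.randn(2, 64, 256), torch.randn(2, 256))
    print(fused.shape)  # torch.Size([2, 64, 256])
```

A soft gate is used here so the sketch stays differentiable end to end; a hard (top-1) selection per query, as the word "selected" in the summary may imply, would instead require a straight-through or reinforcement-style estimator.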