Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention

Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 11204-11218
Main Authors: Xiong, Zeyu, Liu, Daizong, Fang, Xiang, Qu, Xiaoye, Dong, Jianfeng, Zhu, Jiahao, Tang, Keke, Zhou, Pan
Format: Article
Language: English
Description
Summary: Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed video that semantically corresponds to a given natural language query. Many existing methods extract frame-grained features using pre-trained 2D or 3D convolution networks, but they often fail to capture subtle differences between ambiguous adjacent frames. Although some recent approaches incorporate object-grained features using Faster R-CNN to capture more fine-grained details, they are still primarily based on feature enhancement and lack the spatio-temporal modeling needed to explore the semantics of the core persons/objects. To model the behavior of the core target, in this paper we propose a new perspective for addressing the VSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal features. Specifically, we introduce the Video Sentence Tracker with Memory Network and Masked Attention (VSTMM), which comprises a cross-modal targets generator for producing multi-modal templates and the search space, a memory-based tracker that dynamically tracks multi-modal targets using a memory network to record the targets' behaviors, and a masked attention localizer that learns local shared features between frames and eliminates interference from long-term dependencies, improving accuracy when localizing the moment. To evaluate the performance of VSTMM, we conducted extensive experiments and comparisons with state-of-the-art methods on three challenging benchmarks: Charades-STA, ActivityNet Captions, and TACoS. Without bells and whistles, VSTMM achieves leading performance while running at real-time speed.
ISSN: 1520-9210
1941-0077
DOI: 10.1109/TMM.2024.3453062
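
Note on the masked attention idea: the abstract describes a masked attention localizer that learns local shared features between frames and suppresses interference from long-term dependencies. The paper's actual implementation is not part of this record; the sketch below only illustrates the general notion of attention restricted to a local temporal window. The module name, window size, and feature dimensions are illustrative assumptions, not the authors' design.

    # Minimal sketch of local (window-restricted) masked self-attention over
    # frame features. All names and hyperparameters here are assumptions for
    # illustration; this is not the VSTMM implementation.
    import torch
    import torch.nn as nn

    class LocalMaskedAttention(nn.Module):
        """Self-attention over frame features in which each frame may only
        attend to frames within a fixed temporal window, masking out
        long-range dependencies."""

        def __init__(self, d_model: int = 512, n_heads: int = 8, window_size: int = 5):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.window_size = window_size

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            # frame_feats: (batch, num_frames, d_model)
            t = frame_feats.size(1)
            idx = torch.arange(t, device=frame_feats.device)
            # Boolean mask where True marks pairs of frames that are farther
            # apart than window_size and therefore may not attend to each other.
            mask = (idx[None, :] - idx[:, None]).abs() > self.window_size
            out, _ = self.attn(frame_feats, frame_feats, frame_feats, attn_mask=mask)
            return out

    # Example usage with random tensors standing in for frame-level features.
    if __name__ == "__main__":
        feats = torch.randn(2, 64, 512)      # 2 clips, 64 frames, 512-d features
        localizer = LocalMaskedAttention()
        local_feats = localizer(feats)       # same shape, locally attended
        print(local_feats.shape)             # torch.Size([2, 64, 512])

The masking keeps each frame's representation dependent only on its temporal neighborhood, which is one common way to realize the "local shared features between frames" behavior the abstract attributes to the localizer.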