Long Term Memory-Enhanced Via Causal Reasoning for Text-To-Video Retrieval

Bibliographic Details
Main Authors: Cheng, Dingxin, Kong, Shuhan, Wang, Wenyu, Qu, Meixia, Jiang, Bin
Format: Conference Proceeding
Language: English
Description
Summary: The text-to-video retrieval (T2VR) task aims to retrieve videos that are semantically relevant to a given query text from a large collection of unlabeled videos. Most existing methods adopt a representation-encoding strategy that attends only to limited contextual information and lacks the ability to exploit the long-term memory of representation sequences. They also ignore the semantic causal influence of predecessors on successors within a sequence. To tackle these issues, we propose a new long-term memory-enhanced method via causal reasoning that better learns the feature sequences of video and text. First, we design semantic causal reasoning so that video and text can adaptively capture the full-memory contextual causal relations of their respective feature sequences and enhance the consistency of their semantic relations. Second, we perform key-feature reweighting in the memory spaces of video and text, respectively, so that key information is emphasized. Finally, extensive experiments on three public datasets, i.e., MSR-VTT, VATEX, and TGIF, demonstrate the effectiveness of the proposed method.
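
This record contains no implementation details, but the two ideas the summary names can be illustrated with a minimal sketch, assuming PyTorch: a causally masked self-attention over a feature sequence (so each position sees the full memory of its predecessors but no successors), followed by a learned gate that reweights key features. All names here (CausalMemoryEncoder, gate, the chosen dimensions) are hypothetical illustrations, not the paper's actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalMemoryEncoder(nn.Module):
    """Hypothetical sketch: causal self-attention over a video-frame or
    text-token feature sequence, plus key-feature reweighting."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Scalar gate per position: scores how "key" each feature is.
        self.gate = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) feature sequence.
        seq_len = x.size(1)
        # Upper-triangular mask blocks attention to future positions,
        # encoding the predecessor -> successor causal direction.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        ctx, _ = self.attn(x, x, x, attn_mask=causal_mask)
        ctx = self.norm(x + ctx)  # residual + norm over causal context
        # Reweight: softmax over the sequence highlights key features.
        weights = F.softmax(self.gate(ctx), dim=1)  # (batch, seq_len, 1)
        return ctx * weights * seq_len  # rescale so weights average near 1

encoder = CausalMemoryEncoder()
frames = torch.randn(2, 16, 512)  # e.g., 16 frame features per video
out = encoder(frames)             # (2, 16, 512)

In this reading, the same module would be applied to the video branch and the text branch separately, matching the summary's statement that reweighting is performed in each modality's memory space.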
ISSN:2379-190X
DOI:10.1109/ICASSP48485.2024.10448201