Transferable dual multi-granularity semantic excavating for partially relevant video retrieval

Bibliographic Details
Published in:	Image and vision computing 2024-09, Vol.149, p.105168, Article 105168
Main Authors:	Cheng, Dingxin; Kong, Shuhan; Jiang, Bin; Guo, Qiang
Format:	Article
Language:	English
Subjects:	Partially relevant video retrieval; Semantic excavating; Transferable; Video-text retrieval

Description
Summary:	Partially Relevant Video Retrieval (PRVR) aims to retrieve videos that are partially relevant to a query from a large collection of unlabeled, untrimmed videos, and is formulated as a multiple instance learning problem. The challenge of PRVR is that it operates on untrimmed videos, which are much closer to real-world conditions. Existing methods excavate video-text semantic consistency information insufficiently and lack the capacity to highlight the semantics of key representations. To tackle these issues, we propose a transferable dual multi-granularity semantic excavating network, called T-D3N, that focuses on enhancing the learning of dual-modal representations. Specifically, we first introduce a novel transferable textual semantic learning strategy by designing an Adaptive Multi-scale Semantic Mining (AMSM) component to excavate significant textual semantics from multiple perspectives. Second, T-D3N distinguishes feature differences from a frame-wise perspective to better perform contrastive learning between positive and negative samples in the video feature domain, which further separates positive from negative samples and improves the probability that positive samples are retrieved by the query. Finally, our model constructs multi-grained video temporal dependencies and conducts cross-grained core feature perception, which enables more sufficient multimodal interactions. Extensive experiments on three benchmarks, i.e., ActivityNet Captions, Charades-STA, and TVR, show that T-D3N achieves state-of-the-art results. Furthermore, we confirm that our model is transferable to a broad range of multimodal tasks such as T2VR, VMR, and MMSum.
• A dual multi-granularity semantic excavating network (T-D3N) is designed for PRVR.
• A novel transferable Adaptive Multi-scale Semantic Mining (AMSM) strategy is proposed for the textual modality.
• Extensive experimental results on three PRVR datasets demonstrate the validity of T-D3N.
• Extensive experiments on a broad range of multimodal tasks demonstrate the transferability of AMSM.
ISSN:	0262-8856
DOI:	10.1016/j.imavis.2024.105168