Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection
Format: Conference Proceeding
Language: English
Summary: Open Vocabulary Object Detection (OVD) aims to detect objects from novel classes described by text inputs, generalizing from the classes seen during training. Existing methods mainly focus on transferring knowledge from large Vision-Language Models (VLMs) to detectors through knowledge distillation. However, these approaches adapt poorly to diverse classes and struggle to align image-level pre-training with region-level detection, which impedes effective knowledge transfer. Motivated by prompt tuning, we propose scene-adaptive and region-aware multi-modal prompts that address these issues by adapting class-aware knowledge from the VLM to the detector at the region level. Specifically, to enhance adaptability to diverse classes, we design a scene-adaptive prompt generator that considers both the commonality and the diversity of class distributions from a scene perspective, and formulate a novel selection mechanism that acquires common knowledge shared across all classes as well as insights specific to each scene. Meanwhile, to bridge the gap between the pre-trained model and the detector, we present a region-aware multi-modal alignment module, which employs the region prompt to incorporate positional information for feature distillation and integrates textual prompts to align visual and linguistic representations. Extensive experimental results demonstrate that the proposed method significantly outperforms state-of-the-art models on the OV-COCO and OV-LVIS datasets, surpassing the current best method by 3.0% mAP and 4.6% APr.
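The record only summarizes the method at a high level. As a rough illustration of the two components described in the summary, here is a minimal PyTorch-style sketch; all class names, tensor shapes, and loss choices (SceneAdaptivePromptGenerator, RegionAwareAlignment, the MSE distillation term) are assumptions made for illustration, not the paper's released implementation.

```python
# Illustrative sketch only: shapes and module names are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneAdaptivePromptGenerator(nn.Module):
    """Combines a shared (common) prompt with scene-specific prompts selected
    softly from a learned pool, reflecting the commonality/diversity idea."""
    def __init__(self, dim=512, num_scenes=8, prompt_len=4):
        super().__init__()
        self.common_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.scene_prompts = nn.Parameter(torch.randn(num_scenes, prompt_len, dim) * 0.02)
        self.selector = nn.Linear(dim, num_scenes)  # scores scenes from a global image feature

    def forward(self, image_feat):  # image_feat: (B, dim)
        weights = F.softmax(self.selector(image_feat), dim=-1)             # (B, num_scenes)
        scene = torch.einsum('bs,spd->bpd', weights, self.scene_prompts)   # (B, prompt_len, dim)
        common = self.common_prompt.unsqueeze(0).expand_as(scene)
        return common + scene  # scene-adaptive prompt tokens

class RegionAwareAlignment(nn.Module):
    """Injects box positions into region features (a 'region prompt'), distills
    against frozen VLM region features, and aligns regions with text embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, region_feat, boxes, vlm_region_feat, text_embed):
        # region_feat: (B, R, dim), boxes: (B, R, 4) normalized, text_embed: (C, dim)
        region = region_feat + self.pos_mlp(boxes)  # add positional cue to region features
        distill_loss = F.mse_loss(F.normalize(region, dim=-1),
                                  F.normalize(vlm_region_feat, dim=-1))
        logits = F.normalize(region, dim=-1) @ F.normalize(text_embed, dim=-1).t()
        return logits, distill_loss  # class logits per region, distillation loss
```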
ISSN: 2575-7075
DOI: 10.1109/CVPR52733.2024.01584