Loading…

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-v...

Full description

Saved in:

Bibliographic Details
Published in:	arXiv.org 2024-12
Main Authors:	Cijo Jose, Moutakanni, Théo, Kang, Dahyun, Baldassarre, Federico, Darcet, Timothée, Hu, Xu, Li, Daniel, Szafraniec, Marc, Ramamonjisoa, Michaël, Oquab, Maxime, Siméoni, Oriane, Vo, Huy V, Labatut, Patrick, Bojanowski, Piotr
Format:	Article
Language:	English
Subjects:	Alignment Coders Image segmentation Performance enhancement Semantic segmentation Vision Visual tasks
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method, named dino.txt, unlocks this new ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. With these, we successfully train a CLIP-like model with only a fraction of the computational cost compared to CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
ISSN:	2331-8422