Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Zero-shot referring expression comprehension aims to localize bounding boxes in an image that correspond to provided textual prompts, which requires: (i) fine-grained disentanglement of complex visual scenes and textual contexts, and (ii) the capacity to understand relationships among disentangled ent...
Published in: | arXiv.org 2024-04 |
---|---|
Main Authors: | Han, Zeyu ; Zhu, Fangrui ; Lao, Qianru ; Jiang, Huaizu |
Format: | Article |
Language: | English |
Subjects: | Datasets ; Similarity |
Online Access: | Get full text |
cited_by | |
---|---|
cites | |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Han, Zeyu ; Zhu, Fangrui ; Lao, Qianru ; Jiang, Huaizu |
description | Zero-shot referring expression comprehension aims to localize bounding boxes in an image that correspond to provided textual prompts, which requires: (i) fine-grained disentanglement of complex visual scenes and textual contexts, and (ii) the capacity to understand relationships among the disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects and so cannot be used directly for this task. To close this gap, we leverage large foundation models to disentangle both images and texts into triplets of the form (subject, predicate, object). Grounding is then accomplished by computing a structural similarity matrix between visual and textual triplets with a VLA model and propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with relationship understanding, we design a triplet-matching objective to fine-tune them on a collection of curated datasets containing abundant entity relationships. Experiments demonstrate that our approach improves visual grounding performance by up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves accuracy comparable to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC. |
format | article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2895042686</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2895042686</sourcerecordid><originalsourceid>FETCH-proquest_journals_28950426863</originalsourceid><addsrcrecordid>eNqNi0sKwkAQRAdBUNQ7NLgOxMnHuDVEdGtcuQmjtslIMhO7J35ubxAP4KrqUa8GYiyDYOEloZQjMWO--b4v46WMomAsTkck63FlHezxikTalJC9WkJmbQ2ktul7heZLD60gd9SdXUeqhlw3ulak3RvW6J6IBnaNKpFBmQukqnX9iadieFU14-yXEzHfZId067Vk7x2yK262I9NPhUxWkR_KOImD_6wP_-lGeA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2895042686</pqid></control><display><type>article</type><title>Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions</title><source>Publicly Available Content Database</source><creator>Han, Zeyu ; Zhu, Fangrui ; Lao, Qianru ; Jiang, Huaizu</creator><creatorcontrib>Han, Zeyu ; Zhu, Fangrui ; Lao, Qianru ; Jiang, Huaizu</creatorcontrib><description>Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagate it to an instance-level similarity matrix. 
Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated dataset containing abundant entity relationships. Experiments demonstrate that our visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Datasets ; Similarity</subject><ispartof>arXiv.org, 2024-04</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/2895042686?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>777,781,25734,36993,44571</link.rule.ids></links><search><creatorcontrib>Han, Zeyu</creatorcontrib><creatorcontrib>Zhu, Fangrui</creatorcontrib><creatorcontrib>Lao, Qianru</creatorcontrib><creatorcontrib>Jiang, Huaizu</creatorcontrib><title>Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions</title><title>arXiv.org</title><description>Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and 
textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagate it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated dataset containing abundant entity relationships. Experiments demonstrate that our visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model. 
Code is available at https://github.com/Show-han/Zeroshot_REC.</description><subject>Datasets</subject><subject>Similarity</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNi0sKwkAQRAdBUNQ7NLgOxMnHuDVEdGtcuQmjtslIMhO7J35ubxAP4KrqUa8GYiyDYOEloZQjMWO--b4v46WMomAsTkck63FlHezxikTalJC9WkJmbQ2ktul7heZLD60gd9SdXUeqhlw3ulak3RvW6J6IBnaNKpFBmQukqnX9iadieFU14-yXEzHfZId067Vk7x2yK262I9NPhUxWkR_KOImD_6wP_-lGeA</recordid><startdate>20240409</startdate><enddate>20240409</enddate><creator>Han, Zeyu</creator><creator>Zhu, Fangrui</creator><creator>Lao, Qianru</creator><creator>Jiang, Huaizu</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240409</creationdate><title>Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions</title><author>Han, Zeyu ; Zhu, Fangrui ; Lao, Qianru ; Jiang, Huaizu</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28950426863</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Datasets</topic><topic>Similarity</topic><toplevel>online_resources</toplevel><creatorcontrib>Han, Zeyu</creatorcontrib><creatorcontrib>Zhu, Fangrui</creatorcontrib><creatorcontrib>Lao, Qianru</creatorcontrib><creatorcontrib>Jiang, Huaizu</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology 
Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Han, Zeyu</au><au>Zhu, Fangrui</au><au>Lao, Qianru</au><au>Jiang, Huaizu</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions</atitle><jtitle>arXiv.org</jtitle><date>2024-04-09</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects so cannot be directly used for this task. 
To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagate it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated dataset containing abundant entity relationships. Experiments demonstrate that our visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-04 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2895042686 |
source | Publicly Available Content Database |
subjects | Datasets ; Similarity |
title | Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-20T03%3A02%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Zero-shot%20Referring%20Expression%20Comprehension%20via%20Structural%20Similarity%20Between%20Images%20and%20Captions&rft.jtitle=arXiv.org&rft.au=Han,%20Zeyu&rft.date=2024-04-09&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2895042686%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_28950426863%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2895042686&rft_id=info:pmid/&rfr_iscdi=true |
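
The description in this record outlines grounding as scoring visual triplets against textual triplets and then propagating those scores to individual instances. The following is a minimal Python sketch of that idea under stated assumptions; it is not the authors' implementation, which is available at the repository linked in the description. The function names, the dictionary layout, the averaging of component-wise cosine similarities, the max-based propagation rule, and the toy 2-D vectors standing in for VLA embeddings are all hypothetical choices made for illustration.

```python
import math

# Hypothetical layout: each triplet is a dict holding one embedding per part.
PARTS = ("subject", "predicate", "object")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def triplet_similarity(visual, textual):
    """Assumed structural similarity: mean cosine over subject, predicate, object."""
    return sum(cosine(visual[p], textual[p]) for p in PARTS) / len(PARTS)


def instance_scores(visual_triplets, textual_triplets, num_instances):
    """Propagate triplet-level scores to the instance acting as each triplet's subject.

    Assumed rule: an instance's score is the best similarity reached by any
    visual triplet it heads. Instances that head no triplet keep -inf.
    """
    scores = [float("-inf")] * num_instances
    for vt in visual_triplets:
        best = max(triplet_similarity(vt, tt) for tt in textual_triplets)
        idx = vt["subject_id"]
        scores[idx] = max(scores[idx], best)
    return scores


# Toy example: the textual triplet matches the first visual triplet exactly.
textual = [{"subject": [1.0, 0.0], "predicate": [1.0, 1.0], "object": [0.0, 1.0]}]
visual = [
    {"subject_id": 0, "subject": [1.0, 0.0], "predicate": [1.0, 1.0], "object": [0.0, 1.0]},
    {"subject_id": 1, "subject": [0.0, 1.0], "predicate": [1.0, -1.0], "object": [1.0, 0.0]},
]
scores = instance_scores(visual, textual, num_instances=2)
best_instance = max(range(len(scores)), key=scores.__getitem__)
print(best_instance, scores)
```

In a real pipeline the toy vectors would be replaced by embeddings from a VLA model such as CLIP, and the instance with the highest score would give the predicted bounding box. Scoring all three parts of the triplet together, rather than the subject alone, is what lets the relationship and the related object influence which instance is selected.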