Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning

Knowledge-based visual reasoning remains a daunting task, since it requires machines not only to interpret the concepts and relationships in visual scenes but also to associate them with external world knowledge to conduct a chain of reasoning over open-world questions. Previous works, however, treat visual perception and language-based reasoning as two independent modules, failing to attend to both modules throughout all stages of reasoning. To this end, we propose Visual Chain-of-Thought Prompting (VCTP) for knowledge-based reasoning, which interleaves visual content and natural language in an iterative, step-by-step reasoning process. VCTP contains three stages: see, think, and confirm. The see stage scans the image and grounds candidate visual concepts with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to adaptively attend to the key visual concepts raised by the natural-language question; it then transforms this key visual context into text context with a visual captioning model and prompts the LLM to generate the answer. The confirm stage further uses the LLM to generate a supporting rationale for the answer, which is passed through a cross-modality classifier to verify that it is consistent with the visual context. We iterate through the think and confirm stages to ensure that the verified rationale is consistent with the answer. We conduct experiments on a range of knowledge-based visual reasoning datasets and find that VCTP enjoys several benefits: (1) it achieves better performance than previous few-shot learning baselines; (2) it makes the whole reasoning process transparent and trustworthy by providing a rationale for each reasoning step; and (3) it is computation-efficient compared with fine-tuning baselines. Our code is available at https://github.com/UMass-Foundation-Model/VisualCoT.git

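The abstract above describes an iterative see-think-confirm loop. The following is a minimal, hypothetical Python sketch of that control flow only; every helper (detect_concepts, caption_region, llm, verify_consistency) is a stub standing in for the perception model, captioner, LLM, and cross-modality verifier, and it does not reflect the authors' actual implementation (see the linked repository for that).

```python
# Hypothetical sketch of the see-think-confirm loop outlined in the abstract.
# All helpers below are stand-in stubs, not the authors' models or API.

def detect_concepts(image):
    """See: ground candidate visual concepts with a perception model (stubbed)."""
    return ["person", "umbrella", "wet street"]

def caption_region(image, concept):
    """Turn one grounded concept into a short text description (stubbed)."""
    return f"a {concept} is visible in the scene"

def llm(prompt):
    """Placeholder for a few-shot prompted large language model (stubbed)."""
    if "Rationale:" in prompt:
        return "the street is wet and the person is holding an open umbrella"
    return "it is raining"

def verify_consistency(image, rationale):
    """Placeholder cross-modality check of the rationale against the image (stubbed)."""
    return True

def visual_cot(image, question, max_rounds=3):
    concepts = detect_concepts(image)  # see: scan the image once
    answer, rationale = None, None
    for _ in range(max_rounds):
        # think: turn the key visual context into text and ask the LLM for an answer
        context = "; ".join(caption_region(image, c) for c in concepts)
        answer = llm(f"Context: {context}\nQuestion: {question}\nAnswer:")
        # confirm: generate a rationale and verify it against the visual context
        rationale = llm(f"Question: {question}\nAnswer: {answer}\nRationale:")
        if verify_consistency(image, rationale):
            break  # verified rationale is consistent with the answer
    return answer, rationale

if __name__ == "__main__":
    print(visual_cot(image=None, question="Why is the person carrying an umbrella?"))
```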

Bibliographic Details
Main Authors: Chen, Zhenfang, Zhou, Qinhong, Shen, Yikang, Hong, Yining, Sun, Zhiqing, Gutfreund, Dan, Gan, Chuang
Format: Conference Proceeding
Language: English
cited_by
cites
container_end_page 1262
container_issue 2
container_start_page 1254
container_title
container_volume 38
creator Chen, Zhenfang
Zhou, Qinhong
Shen, Yikang
Hong, Yining
Sun, Zhiqing
Gutfreund, Dan
Gan, Chuang
description Knowledge-based visual reasoning remains a daunting task, since it requires machines not only to interpret the concepts and relationships in visual scenes but also to associate them with external world knowledge to conduct a chain of reasoning over open-world questions. Previous works, however, treat visual perception and language-based reasoning as two independent modules, failing to attend to both modules throughout all stages of reasoning. To this end, we propose Visual Chain-of-Thought Prompting (VCTP) for knowledge-based reasoning, which interleaves visual content and natural language in an iterative, step-by-step reasoning process. VCTP contains three stages: see, think, and confirm. The see stage scans the image and grounds candidate visual concepts with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to adaptively attend to the key visual concepts raised by the natural-language question; it then transforms this key visual context into text context with a visual captioning model and prompts the LLM to generate the answer. The confirm stage further uses the LLM to generate a supporting rationale for the answer, which is passed through a cross-modality classifier to verify that it is consistent with the visual context. We iterate through the think and confirm stages to ensure that the verified rationale is consistent with the answer. We conduct experiments on a range of knowledge-based visual reasoning datasets and find that VCTP enjoys several benefits: (1) it achieves better performance than previous few-shot learning baselines; (2) it makes the whole reasoning process transparent and trustworthy by providing a rationale for each reasoning step; and (3) it is computation-efficient compared with fine-tuning baselines. Our code is available at https://github.com/UMass-Foundation-Model/VisualCoT.git
doi_str_mv 10.1609/aaai.v38i2.27888
format conference_proceeding
fulltext fulltext
identifier ISSN: 2159-5399
ispartof Proceedings of the ... AAAI Conference on Artificial Intelligence, 2024, Vol.38 (2), p.1254-1262
issn 2159-5399
2374-3468
language eng
recordid cdi_crossref_primary_10_1609_aaai_v38i2_27888
source World Web Science Journals
title Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning