
Large Language Models as Automated Aligners for benchmarking Vision-Language Models

With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets via prompting on visual symbolic representations (e.g., captions, object locations, instance relationships, etc.). The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We subsequently engage LLMs like GPT-3.5 to serve as judges, implementing quantitative and qualitative automated assessments to facilitate a comprehensive evaluation of VLMs. Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating the evolving sophisticated VLMs.
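
The abstract describes a two-step pipeline: an LLM (e.g., GPT-4) is prompted with symbolic image annotations (captions, object locations, relations) to curate question-answer-reasoning triplets, and an LLM such as GPT-3.5 then acts as a judge, comparing a VLM's answers against the curated references. The Python sketch below is a rough, hypothetical illustration of that workflow, not the authors' released code: the prompt wording, the JSON schema, and the `ask_llm` callable (standing in for any GPT-style text-completion API) are all assumptions.

```python
# Illustrative sketch of the Auto-Bench idea described in the abstract.
# (1) curate question-answer-reasoning triplets from symbolic image annotations,
# (2) use an LLM as judge to score a VLM's answer against the reference.
# `ask_llm` is a placeholder for any prompt-in/text-out LLM call; all prompts and
# field names are assumptions for illustration only.
import json
from typing import Callable, Dict, List

AskLLM = Callable[[str], str]  # prompt in, model text out

def curate_triplets(ask_llm: AskLLM, caption: str, objects: List[Dict],
                    relations: List[str], n: int = 3) -> List[Dict]:
    """Prompt an LLM with symbolic annotations to generate QAR triplets."""
    prompt = (
        "You are given a symbolic description of an image.\n"
        f"Caption: {caption}\n"
        f"Objects (name, box): {json.dumps(objects)}\n"
        f"Relations: {json.dumps(relations)}\n"
        f"Write {n} question-answer-reasoning triplets that test visual understanding. "
        'Return a JSON list of objects with keys "question", "answer", "reasoning".'
    )
    return json.loads(ask_llm(prompt))

def judge_answer(ask_llm: AskLLM, question: str, reference: str, vlm_answer: str) -> Dict:
    """Use an LLM as judge: compare a VLM's answer with the curated reference."""
    prompt = (
        "Judge whether the candidate answer agrees with the reference answer.\n"
        f"Question: {question}\nReference answer: {reference}\n"
        f"Candidate answer: {vlm_answer}\n"
        'Return JSON with keys "agree" (true/false) and "comment".'
    )
    return json.loads(ask_llm(prompt))

if __name__ == "__main__":
    # Stub LLM so the sketch runs offline; swap in a real GPT-4/GPT-3.5 call to
    # approximate the curation and judging steps described in the abstract.
    def fake_llm(prompt: str) -> str:
        if "triplets" in prompt:
            return json.dumps([{"question": "What is to the left of the dog?",
                                "answer": "A bicycle",
                                "reasoning": "The bicycle's box lies left of the dog's box."}])
        return json.dumps({"agree": True, "comment": "Matches the reference."})

    triplets = curate_triplets(fake_llm, "A dog next to a bicycle",
                               [{"name": "dog", "box": [120, 40, 300, 220]},
                                {"name": "bicycle", "box": [10, 60, 110, 230]}],
                               ["bicycle is left of dog"])
    print(judge_answer(fake_llm, triplets[0]["question"], triplets[0]["answer"], "a bike"))
```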

Bibliographic Details
Published in: arXiv.org, 2023-11 (posted 2023-11-24)
Main Authors: Ji, Yuanfeng; Ge, Chongjian; Kong, Weikai; Xie, Enze; Liu, Zhengying; Li, Zhengguo; Luo, Ping
Format: Article
Language: English
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Subjects: Alignment; Automation; Benchmarks; Cognition; Cognition & reasoning; Cognitive tasks; Intelligence; Large language models; Questions; Reasoning; Vision
Online Access: https://www.proquest.com/docview/2894147569