
Large Language Models as Automated Aligners for benchmarking Vision-Language Models

With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets via prompting on visual symbolic representations (e.g., captions, object locations, instance relationships, etc.). The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We subsequently engage LLMs like GPT-3.5 to serve as judges, implementing quantitative and qualitative automated assessments to facilitate a comprehensive evaluation of VLMs. Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating the evolving sophisticated VLMs.
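
The abstract describes a two-step pipeline: an LLM (e.g., GPT-4) is prompted with symbolic image annotations (captions, object locations, relations) to curate question-answer-reasoning triplets, and an LLM such as GPT-3.5 then acts as a judge, comparing a VLM's answers against the curated references. The Python sketch below is a rough, hypothetical illustration of that workflow, not the authors' released code: the prompt wording, the JSON schema, and the `ask_llm` callable (standing in for any GPT-style text-completion API) are all assumptions.

```python
# Illustrative sketch of the Auto-Bench idea described in the abstract.
# (1) curate question-answer-reasoning triplets from symbolic image annotations,
# (2) use an LLM as judge to score a VLM's answer against the reference.
# `ask_llm` is a placeholder for any prompt-in/text-out LLM call; all prompts and
# field names are assumptions for illustration only.
import json
from typing import Callable, Dict, List

AskLLM = Callable[[str], str]  # prompt in, model text out

def curate_triplets(ask_llm: AskLLM, caption: str, objects: List[Dict],
                    relations: List[str], n: int = 3) -> List[Dict]:
    """Prompt an LLM with symbolic annotations to generate QAR triplets."""
    prompt = (
        "You are given a symbolic description of an image.\n"
        f"Caption: {caption}\n"
        f"Objects (name, box): {json.dumps(objects)}\n"
        f"Relations: {json.dumps(relations)}\n"
        f"Write {n} question-answer-reasoning triplets that test visual understanding. "
        'Return a JSON list of objects with keys "question", "answer", "reasoning".'
    )
    return json.loads(ask_llm(prompt))

def judge_answer(ask_llm: AskLLM, question: str, reference: str, vlm_answer: str) -> Dict:
    """Use an LLM as judge: compare a VLM's answer with the curated reference."""
    prompt = (
        "Judge whether the candidate answer agrees with the reference answer.\n"
        f"Question: {question}\nReference answer: {reference}\n"
        f"Candidate answer: {vlm_answer}\n"
        'Return JSON with keys "agree" (true/false) and "comment".'
    )
    return json.loads(ask_llm(prompt))

if __name__ == "__main__":
    # Stub LLM so the sketch runs offline; swap in a real GPT-4/GPT-3.5 call to
    # approximate the curation and judging steps described in the abstract.
    def fake_llm(prompt: str) -> str:
        if "triplets" in prompt:
            return json.dumps([{"question": "What is to the left of the dog?",
                                "answer": "A bicycle",
                                "reasoning": "The bicycle's box lies left of the dog's box."}])
        return json.dumps({"agree": True, "comment": "Matches the reference."})

    triplets = curate_triplets(fake_llm, "A dog next to a bicycle",
                               [{"name": "dog", "box": [120, 40, 300, 220]},
                                {"name": "bicycle", "box": [10, 60, 110, 230]}],
                               ["bicycle is left of dog"])
    print(judge_answer(fake_llm, triplets[0]["question"], triplets[0]["answer"], "a bike"))
```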

Bibliographic Details
Published in: arXiv.org, 2023-11 (posted 2023-11-24)
Main Authors: Ji, Yuanfeng; Ge, Chongjian; Kong, Weikai; Xie, Enze; Liu, Zhengying; Li, Zhengguo; Luo, Ping
Format: Article
Language: English
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Subjects: Alignment; Automation; Benchmarks; Cognition; Cognition & reasoning; Cognitive tasks; Intelligence; Large language models; Questions; Reasoning; Vision
Online Access: https://www.proquest.com/docview/2894147569