ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they...

Bibliographic Details
Published in:	arXiv.org 2024-11
Main Authors:	Rawte, Vipula ; Jain, Sarthak ; Sinha, Aarush ; Kaushik, Garv ; Bansal, Aman ; Vishwanath, Prathiksha Rumale ; Samyak Rajesh Jain ; Reganti, Aishwarya Naresh ; Jain, Vinija ; Chadha, Aman ; Sheth, Amit P ; Das, Amitava
Format:	Article
Language:	English
Subjects:	Benchmarks ; Configuration management ; Hallucinations ; Performance evaluation ; Video

container_title	arXiv.org
creator	Rawte, Vipula ; Jain, Sarthak ; Sinha, Aarush ; Kaushik, Garv ; Bansal, Aman ; Vishwanath, Prathiksha Rumale ; Samyak Rajesh Jain ; Reganti, Aishwarya Naresh ; Jain, Vinija ; Chadha, Aman ; Sheth, Amit P ; Das, Amitava
description	Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.
format	article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3130500810</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3130500810</sourcerecordid><originalsourceid>FETCH-proquest_journals_31305008103</originalsourceid><addsrcrecordid>eNqNik8LgjAcQEcQJOV3-EFnYW5Z0i3D8JA3saMMnTabW-1P9PHz0Afo9Hi8t0ABoTSO0h0hKxRaO2KMyf5AkoQG6FaLjB_hBBX_uMjpqBYd15Bx1d4nZh7QawP5m0nPnFADFExK3wo1m1YgFFyZGTiUXjox6Y5JKHXHpd2gZc-k5eGPa7S95NW5iJ5Gvzy3rhm1N2pODY0pTjBOY0z_u76kL0Br</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3130500810</pqid></control><display><type>article</type><title>ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</title><source>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</source><creator>Rawte, Vipula ; Jain, Sarthak ; Sinha, Aarush ; Kaushik, Garv ; Bansal, Aman ; Vishwanath, Prathiksha Rumale ; Samyak Rajesh Jain ; Reganti, Aishwarya Naresh ; Jain, Vinija ; Chadha, Aman ; Sheth, Amit P ; Das, Amitava</creator><creatorcontrib>Rawte, Vipula ; Jain, Sarthak ; Sinha, Aarush ; Kaushik, Garv ; Bansal, Aman ; Vishwanath, Prathiksha Rumale ; Samyak Rajesh Jain ; Reganti, Aishwarya Naresh ; Jain, Vinija ; Chadha, Aman ; Sheth, Amit P ; Das, Amitava</creatorcontrib><description>Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. 
We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Benchmarks ; Configuration management ; Hallucinations ; Performance evaluation ; Video</subject><ispartof>arXiv.org, 2024-11</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/3130500810?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>780,784,25753,37012,44590</link.rule.ids></links><search><creatorcontrib>Rawte, Vipula</creatorcontrib><creatorcontrib>Jain, Sarthak</creatorcontrib><creatorcontrib>Sinha, Aarush</creatorcontrib><creatorcontrib>Kaushik, Garv</creatorcontrib><creatorcontrib>Bansal, Aman</creatorcontrib><creatorcontrib>Vishwanath, Prathiksha Rumale</creatorcontrib><creatorcontrib>Samyak Rajesh Jain</creatorcontrib><creatorcontrib>Reganti, Aishwarya Naresh</creatorcontrib><creatorcontrib>Jain, Vinija</creatorcontrib><creatorcontrib>Chadha, Aman</creatorcontrib><creatorcontrib>Sheth, Amit P</creatorcontrib><creatorcontrib>Das, Amitava</creatorcontrib><title>ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</title><title>arXiv.org</title><description>Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. 
Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.</description><subject>Benchmarks</subject><subject>Configuration management</subject><subject>Hallucinations</subject><subject>Performance evaluation</subject><subject>Video</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNik8LgjAcQEcQJOV3-EFnYW5Z0i3D8JA3saMMnTabW-1P9PHz0Afo9Hi8t0ABoTSO0h0hKxRaO2KMyf5AkoQG6FaLjB_hBBX_uMjpqBYd15Bx1d4nZh7QawP5m0nPnFADFExK3wo1m1YgFFyZGTiUXjox6Y5JKHXHpd2gZc-k5eGPa7S95NW5iJ5Gvzy3rhm1N2pODY0pTjBOY0z_u76kL0Br</recordid><startdate>20241116</startdate><enddate>20241116</enddate><creator>Rawte, Vipula</creator><creator>Jain, Sarthak</creator><creator>Sinha, Aarush</creator><creator>Kaushik, Garv</creator><creator>Bansal, Aman</creator><creator>Vishwanath, Prathiksha Rumale</creator><creator>Samyak Rajesh Jain</creator><creator>Reganti, Aishwarya Naresh</creator><creator>Jain, Vinija</creator><creator>Chadha, Aman</creator><creator>Sheth, Amit P</creator><creator>Das, Amitava</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20241116</creationdate><title>ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</title><author>Rawte, Vipula ; Jain, Sarthak ; Sinha, Aarush ; Kaushik, Garv ; Bansal, Aman ; Vishwanath, Prathiksha Rumale ; Samyak Rajesh Jain ; Reganti, Aishwarya Naresh ; Jain, Vinija ; Chadha, Aman ; Sheth, Amit P ; Das, Amitava</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31305008103</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Benchmarks</topic><topic>Configuration management</topic><topic>Hallucinations</topic><topic>Performance evaluation</topic><topic>Video</topic><toplevel>online_resources</toplevel><creatorcontrib>Rawte, Vipula</creatorcontrib><creatorcontrib>Jain, Sarthak</creatorcontrib><creatorcontrib>Sinha, Aarush</creatorcontrib><creatorcontrib>Kaushik, Garv</creatorcontrib><creatorcontrib>Bansal, Aman</creatorcontrib><creatorcontrib>Vishwanath, Prathiksha Rumale</creatorcontrib><creatorcontrib>Samyak Rajesh Jain</creatorcontrib><creatorcontrib>Reganti, Aishwarya Naresh</creatorcontrib><creatorcontrib>Jain, Vinija</creatorcontrib><creatorcontrib>Chadha, Aman</creatorcontrib><creatorcontrib>Sheth, Amit P</creatorcontrib><creatorcontrib>Das, Amitava</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central 
(Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>AUTh Library subscriptions: ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Rawte, Vipula</au><au>Jain, Sarthak</au><au>Sinha, Aarush</au><au>Kaushik, Garv</au><au>Bansal, Aman</au><au>Vishwanath, Prathiksha Rumale</au><au>Samyak Rajesh Jain</au><au>Reganti, Aishwarya Naresh</au><au>Jain, Vinija</au><au>Chadha, Aman</au><au>Sheth, Amit P</au><au>Das, Amitava</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</atitle><jtitle>arXiv.org</jtitle><date>2024-11-16</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. 
We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-11
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3130500810
source	Publicly Available Content Database (Proquest) (PQ_SDU_P3)
subjects	Benchmarks ; Configuration management ; Hallucinations ; Performance evaluation ; Video
title	ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T11%3A17%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=ViBe:%20A%20Text-to-Video%20Benchmark%20for%20Evaluating%20Hallucination%20in%20Large%20Multimodal%20Models&rft.jtitle=arXiv.org&rft.au=Rawte,%20Vipula&rft.date=2024-11-16&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3130500810%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_31305008103%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3130500810&rft_id=info:pmid/&rfr_iscdi=true