
Evaluating Multimodal Large Language Models across Distribution Shifts and Augmentations


Bibliographic Details
Main Authors: Verma, Aayush Atul, Saeidi, Amir, Hegde, Shamanthak, Therala, Ajay, Bardoliya, Fenil Denish, Machavarapu, Nagaraju, Ravindhiran, Shri Ajay Kumar, Malyala, Srija, Chatterjee, Agneet, Yang, Yezhou, Baral, Chitta
Format: Conference Proceeding
Language: English
Description
Summary: Foundational models such as Multimodal Large Language Models (MLLMs), with their ability to interpret images and generate intricate responses, have seen widespread adoption across multiple computer vision and natural language processing tasks. However, they suffer from hallucinations and struggle with complex reasoning tasks. In this work, we evaluate the performance of MLLMs under multiple multimodal augmentations and in out-of-distribution settings. We benchmark 3 models on 2 vision-language datasets, VQAv2 and CLEVR, and assess their performance under adversarial transformations in both the vision and language modalities. We introduce image perturbations using various augmentations, including noise addition, blurring, and median filtering, and generate adversarial questions containing conjunctions, disjunctions, and negations. Additionally, we conduct a detailed fine-grained analysis of the models' performance on particular question categories, such as those related to shape and color, across images featuring identical or varying objects. Our findings indicate a notable decrease in the performance of current MLLMs on synthetic images, with a gradual decline observed across both vision and language augmentations. Specifically, Gaussian noise addition emerges as the most detrimental augmentation, and we observe a significant drop in performance on complex questions containing multiple connectives. At a time of rapid development and deployment of MLLMs in real-world settings, we believe our findings are a first step toward benchmarking the robustness and out-of-distribution behavior of such models.
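The image perturbations named in the summary (noise addition, blurring, median filtering) could be sketched as follows with NumPy. This is an illustrative sketch only: the kernel sizes and noise level are assumptions, not the paper's actual settings, and a simple box blur stands in for whatever blur the authors used.

```python
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 25.0, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise to an 8-bit grayscale image (sigma is an assumed value)."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def box_blur(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Blur by averaging over a k-by-k window (box blur as a stand-in for Gaussian blur)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def median_filter(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Replace each pixel with the median of its k-by-k neighborhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    windows = [padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
               for dy in range(k) for dx in range(k)]
    return np.median(np.stack(windows), axis=0).astype(np.uint8)

# Demo on a synthetic flat grayscale image
img = np.full((8, 8), 128, dtype=np.uint8)
perturbed = {
    "noise": add_gaussian_noise(img),
    "blur": box_blur(img),
    "median": median_filter(img),
}
```

A constant image is unchanged by blurring and median filtering, while noise addition perturbs individual pixels; in the paper's setting the perturbed images would be fed to the MLLM alongside the original VQA question.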
ISSN: 2160-7516
DOI: 10.1109/CVPRW63382.2024.00540