Q-Bench+: A Benchmark for Multi-Modal Foundation Models on Low-Level Vision From Single Images to Pairs
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024-12, Vol. 46 (12), pp. 10404-10418
Format: Article
Language: English
Summary: The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision towards versatile foundational models. However, evaluating MLLMs on low-level visual perception and understanding remains an under-explored domain. To this end, we design benchmark settings that emulate human language responses related to low-level vision: low-level visual perception (A1), via visual question answering about low-level attributes (e.g., clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, since pairwise comparison better avoids ambiguity in responses and has been adopted in many human experiments, we extend the low-level perception-related question answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs on low-level descriptions for 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e., score prediction, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, tested against human opinions on 7 image quality assessment (IQA) datasets. Across 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (as humans do). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
ISSN: 0162-8828, 1939-3539, 2160-9292
DOI: 10.1109/TPAMI.2024.3445770
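
The "softmax-based approach" mentioned in the summary refers to converting an MLLM's next-token probabilities into a scalar quality score by comparing the logits of contrasting anchor words (e.g., "good" vs. "poor") after a quality-related prompt. The following is a minimal sketch of that idea, assuming a Hugging Face-style multimodal causal LM; the names `model`, `processor`, the prompt wording, and the anchor tokens are illustrative assumptions, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def softmax_quality_score(model, processor, image, device="cuda"):
    """Return a quality rating in [0, 1] from an MLLM's next-token logits."""
    prompt = "Rate the quality of the image. The quality of the image is"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

    # Logits for the token position that would follow the prompt.
    logits = model(**inputs).logits[0, -1]

    # Token ids for the two anchor words (assumed to be single tokens here;
    # some tokenizers may require a leading-space variant such as " good").
    good_id = processor.tokenizer.convert_tokens_to_ids("good")
    poor_id = processor.tokenizer.convert_tokens_to_ids("poor")

    # Binary softmax over the two anchors; the probability assigned to "good"
    # serves as the quantifiable quality rating.
    probs = torch.softmax(logits[[good_id, poor_id]], dim=0)
    return probs[0].item()
```

Because the score is derived from closed-set logits rather than free-form generated text, it can be computed for any MLLM that exposes output logits and then correlated (e.g., via SRCC/PLCC) with human mean opinion scores on IQA datasets.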