Loading…

HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding

The significant advancements in visual understanding and instruction following from Multimodal Large Language Models (MLLMs) have opened up more possibilities for broader applications in diverse and universal human-centric scenarios. However, existing image-text data may not support the precise moda...

Full description

Saved in:

Bibliographic Details
Published in:	arXiv.org 2024-10
Main Authors:	Li, Keliang, Yang, Zaifei, Zhao, Jiahe, Shen, Hongze, Hou, Ruibing, Chang, Hong, Shan, Shiguang, Chen, Xilin
Format:	Article
Language:	English
Subjects:	Annotations Benchmarks Datasets Image enhancement Large language models
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	The significant advancements in visual understanding and instruction following from Multimodal Large Language Models (MLLMs) have opened up more possibilities for broader applications in diverse and universal human-centric scenarios. However, existing image-text data may not support the precise modality alignment and integration of multi-grained information, which is crucial for human-centric visual understanding. In this paper, we introduce HERM-Bench, a benchmark for evaluating the human-centric understanding capabilities of MLLMs. Our work reveals the limitations of existing MLLMs in understanding complex human-centric scenarios. To address these challenges, we present HERM-100K, a comprehensive dataset with multi-level human-centric annotations, aimed at enhancing MLLMs' training. Furthermore, we develop HERM-7B, a MLLM that leverages enhanced training data from HERM-100K. Evaluations on HERM-Bench demonstrate that HERM-7B significantly outperforms existing MLLMs across various human-centric dimensions, reflecting the current inadequacy of data annotations used in MLLM training for human-centric visual understanding. This research emphasizes the importance of specialized datasets and benchmarks in advancing the MLLMs' capabilities for human-centric understanding.
ISSN:	2331-8422