LingoQA: Visual Question Answering for Autonomous Driving

Bibliographic Details
Published in:	arXiv.org 2024-09
Main Authors:	Marcu, Ana-Maria; Long, Chen; Hünermann, Jan; Karnsund, Alice; Hanotte, Benoit; Prajwal Chidananda; Nair, Saurabh; Badrinarayanan, Vijay; Kendall, Alex; Shotton, Jamie; Arani, Elahe; Sinavski, Oleg
Format:	Article
Language:	English
Subjects:	Ablation; Benchmarks; Correlation coefficients; Performance evaluation; Questions

Description
Summary:	We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient with human evaluations, surpassing existing techniques such as METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
ISSN:	2331-8422
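
Note on the Spearman figure in the summary: the coefficient measures how well the ranking produced by an automatic metric agrees with the ranking produced by human raters. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function names and the five score pairs are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five models (illustrative values only).
metric_scores = [0.31, 0.48, 0.52, 0.60, 0.75]  # automatic metric
human_scores = [0.35, 0.45, 0.58, 0.55, 0.80]   # human raters

rho = spearman(metric_scores, human_scores)
print(round(rho, 3))  # 0.9
```

In this toy data the two rankings differ only by one swapped pair, which gives a coefficient of 0.9. A value of 1.0 would mean the metric orders the models exactly as the human raters do.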