
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

Bibliographic Details
Published in: arXiv.org, 2024-07
Main Authors: Hsiao, Yu-Chung; Zubach, Fedir; Baechler, Gilles; Carbune, Victor; Lin, Jason; Wang, Maria; Sunkara, Srinivas; Zhu, Yun; Chen, Jindong
Format: Article
Language: English
Summary: We present a new benchmark and dataset, ScreenQA, for screen content understanding via question answering. Existing screen datasets focus either on structure and component-level understanding, or on much higher-level composite tasks such as navigation and task completion. We attempt to bridge the gap between the two by annotating 86K question-answer pairs over the RICO dataset, in the hope of benchmarking screen reading comprehension capability. This work is also the first to annotate answers for different application scenarios, including both full sentences and short forms, as well as supporting UI content on screen together with its bounding boxes. With this rich annotation, we discuss and define the evaluation metrics of the benchmark, show applications of the dataset, and provide a few baselines using closed- and open-source models.
ISSN: 2331-8422