
Detect2Interact: Localizing Object Key Field in Visual Question Answering with LLMs

Bibliographic Details
Published in: IEEE Intelligent Systems, 2024-05, Vol. 39 (3), p. 35-44
Main Authors: Wang, Jialou, Zhu, Manli, Li, Yulei, Li, Honglei, Yang, Longzhi, Woo, Wai Lok
Format: Article
Language:English
Description
Summary: Localization plays a crucial role in enhancing the practicality and precision of visual question answering (VQA) systems. By enabling fine-grained identification and interaction with specific parts of an object, it significantly improves the system’s ability to provide contextually relevant and spatially accurate responses. In this article, we introduce “Detect2Interact,” which addresses the challenges in accurately mapping objects within images to generate nuanced and spatially aware responses by introducing an advanced approach for fine-grained object visual key field detection. First, we use the Segment Anything Model (SAM) to generate detailed spatial maps of objects in images. Next, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4’s commonsense knowledge. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms the existing VQA system with object detection by providing a more reasonable and finer visual representation.
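
The following is a minimal sketch of the three-stage pipeline the abstract describes, assuming the publicly available segment-anything and openai Python packages. The image path, SAM checkpoint file, prompt wording, and the hard-coded object description (standing in for Vision Studio output) are illustrative assumptions, not the authors' released implementation.

    # Hypothetical sketch: SAM supplies spatial masks, a captioning service
    # (a stand-in for Vision Studio) supplies semantic object descriptions,
    # and GPT-4 reasons over both to pick the object "key field" relevant
    # to the question. Names and prompts below are assumptions.
    import numpy as np
    from PIL import Image
    from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
    from openai import OpenAI

    # Step 1: detailed spatial maps from the Segment Anything Model.
    image = np.array(Image.open("scene.jpg").convert("RGB"))
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    masks = SamAutomaticMaskGenerator(sam).generate(image)  # dicts with 'bbox', 'area', 'segmentation'

    # Step 2: semantic object descriptions. In the article these come from
    # Vision Studio; here a fixed string stands in for that output.
    object_descriptions = "a kettle with a black handle and a lid on a stove"

    # Step 3: GPT-4 commonsense reasoning selects the segmented region (key
    # field) the question refers to, e.g. the handle rather than the kettle.
    regions = "\n".join(
        f"region {i}: bbox={m['bbox']}, area={m['area']}" for i, m in enumerate(masks)
    )
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Image objects: " + object_descriptions + "\n"
                "Segmented regions:\n" + regions + "\n"
                "Question: Where should I grab the kettle?\n"
                "Answer with the index of the region to interact with."
            ),
        }],
    )
    print(reply.choices[0].message.content)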
ISSN: 1541-1672, 1941-1294
DOI: 10.1109/MIS.2024.3384513