Detect2Interact: Localizing Object Key Field in Visual Question Answering with LLMs
Published in: | IEEE Intelligent Systems, 2024-05, Vol. 39 (3), pp. 35-44 |
---|---|
Main Authors: | , , , , , |
Format: | Article |
Language: | English |
Summary: | Localization plays a crucial role in enhancing the practicality and precision of visual question answering (VQA) systems. By enabling fine-grained identification and interaction with specific parts of an object, it significantly improves the system’s ability to provide contextually relevant and spatially accurate responses. In this article, we introduce “Detect2Interact,” which addresses the challenges in accurately mapping objects within images to generate nuanced and spatially aware responses by introducing an advanced approach for fine-grained object visual key field detection. First, we use the segment anything model to generate detailed spatial maps of objects in images. Next, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4’s commonsense knowledge. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms the existing VQA system with object detection by providing a more reasonable and finer visual representation. |
---|---|
ISSN: | 1541-1672, 1941-1294 |
DOI: | 10.1109/MIS.2024.3384513 |