Merging Patches and Tokens: A VQA System for Remote Sensing

Bibliographic Details
Main Authors: Falk, Damian; Aydin, Kaan; Scheibenreif, Linus; Borth, Damian
Format: Conference Proceeding
Language: English
Description
Summary: In this paper, we investigate the integration of transformer-based feature extractors in a Remote Sensing Visual Question Answering (RSVQA) framework. Our findings demonstrate an improvement over the baseline, achieved by adding attention modules after feature extraction and by using MUTAN (Multimodal Tucker Fusion). Further, we explore the potential of multi-task learning and observe a considerable boost in performance when the feature extractors are trained. Our results suggest that multi-task learning is a promising avenue for future RSVQA research. They also emphasize the need to select hyperparameters carefully per question type and to balance the training of the shared backbone against that of the individual classifiers when both are trained simultaneously.
ISSN: 2153-7003
DOI: 10.1109/IGARSS53475.2024.10641975