Vision-Language Models in Remote Sensing: Current Progress and Future Trends
Published in: IEEE Geoscience and Remote Sensing Magazine, June 2024, Vol. 12 (2), pp. 32-66
Main Authors:
Format: Article
Language: English
Subjects:
Summary: The remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4 (GPT-4) have sparked a wave of interest and research in the field of large language models (LLMs) for artificial general intelligence (AGI). These models provide intelligent solutions that are closer to human thinking, enabling us to use general artificial intelligence (AI) to solve problems in various applications. However, in the field of remote sensing (RS), the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in RS focuses primarily on visual-understanding tasks while neglecting the semantic understanding of objects and their relationships. This is where vision-language models (VLMs) excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. VLMs go beyond visual recognition of RS images: they can model semantic relationships and generate natural-language descriptions of an image. This makes them better suited for tasks that require both visual and textual understanding, such as image captioning and visual question answering (VQA). This article provides a comprehensive review of the research on VLMs in RS, summarizing the latest progress, highlighting current challenges, and identifying potential research opportunities. Specifically, we review the application of VLMs in mainstream RS tasks, including image captioning, text-based image generation, text-based image retrieval (TBIR), VQA, scene classification, semantic segmentation, and object detection. For each task, we analyze representative works and discuss research progress. Finally, we summarize the limitations of existing works and provide possible directions for future development. This review aims to provide a comprehensive overview of the current research progress of VLMs in RS (see Figure 1), and to inspire further research in this exciting and promising field.
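As an illustration of the joint visual-textual reasoning the summary describes, the minimal sketch below performs zero-shot scene classification of a remote sensing image with a generic CLIP-style VLM via the Hugging Face `transformers` library. It is not a method from the reviewed works: the `openai/clip-vit-base-patch32` checkpoint, the `scene.png` path, and the label prompts are illustrative assumptions, and an RS-adapted checkpoint would normally be preferred.

```python
# Minimal zero-shot scene-classification sketch with a CLIP-style VLM.
# Illustrative only: the checkpoint, image path, and prompts are assumptions,
# not the method of any specific work covered by the review.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate scene categories phrased as natural-language prompts.
labels = [
    "an aerial image of an airport",
    "an aerial image of a forest",
    "an aerial image of a residential area",
    "an aerial image of farmland",
]

image = Image.open("scene.png")  # hypothetical RS image
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # Image-text similarity scores, normalized over the candidate prompts.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Image-text matching of this kind is the basic building block behind tasks such as TBIR and scene classification mentioned in the summary; captioning and VQA additionally require a text decoder on top of the joint embedding.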
ISSN: 2473-2397, 2168-6831
DOI: 10.1109/MGRS.2024.3383473