Multimodal Recognition of Landmarks Based on Vision Language Model
Main Authors: | , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Online Access: | Request full text |
Summary: | In landmark recognition, accuracy is easily degraded by variations in lighting, viewing angle, and season. Existing methods are weak at understanding deep semantics and capturing contextual relationships, which lowers recognition accuracy, and some require complex preprocessing steps. To address these issues, this paper proposes a multimodal landmark recognition method based on a vision-language model. A pretrained vision-language model is used to stabilize training and accelerate convergence. A self-attention layer based on multi-scale fusion adapts the image features. Multimodal marker vectors are introduced to fine-tune the prompt words, and a gated cross-attention layer fuses the features so that the model better captures image-text correlation and predicts more accurately. Test results on the dataset show that the proposed method outperforms current mainstream methods in accuracy, reaching 80.72% Top-1 and 92.16% Top-5 accuracy. |
---|---|
ISSN: | 2833-2423 |
DOI: | 10.1109/CISCE62493.2024.10653125 |
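
The summary reports results as Top-1 and Top-5 accuracy. As a reminder of what those metrics measure, the following is a minimal sketch of the standard top-k accuracy computation. It is not taken from the paper: the function name `top_k_accuracy` and the toy scores and labels are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(scores)


# Hypothetical toy data: 4 images, 6 candidate landmark classes (not from the paper).
toy_scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true class 1 is ranked first
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true class 2 is ranked second
    [0.05, 0.10, 0.15, 0.20, 0.22, 0.28],  # true class 0 is ranked last
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],  # true class 0 is ranked first
]
toy_labels = [1, 2, 0, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top5 = top_k_accuracy(toy_scores, toy_labels, k=5)
print(f"Top-1: {top1:.2%}, Top-5: {top5:.2%}")
```

Top-5 accuracy is never lower than Top-1 on the same predictions, because a correct top-ranked class is also within the five best.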