
From plane to hierarchy: Deformable Transformer for Remote Sensing Image Captioning


Bibliographic Details
Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023-01, Vol. 16, p. 1-14
Main Authors: Du, Runyan, Cao, Wei, Zhang, Wenkai, Zhi, Guo, Sun, Xian, Li, Shuoke, Li, Jihao
Format: Article
Language:English
Description
Summary: With the growth of remote sensing imagery, automatically understanding image content has attracted many researchers' interest in deep learning for remote sensing images. Inspired by natural image captioning, models with a CNN-RNN backbone supplemented by attention have been widely used in remote sensing image captioning. However, the current attention layer is inefficient at simultaneously mining the hidden foreground from the background of a remote sensing image and performing interactive feature learning. Meanwhile, newer mainstream language models have recently surpassed the traditional LSTM in sentence generation. To address these problems, in this paper we propose a novel idea: making flat remote sensing images stereoscopic by separating the foreground and background. Based on this hierarchical image information, we design a novel Deformable Transformer equipped with deformable scaled dot-product attention to learn multi-scale features from the foreground and background through its powerful interactive learning ability. Evaluations are conducted on four classic remote sensing image captioning datasets. Compared with state-of-the-art methods, our Transformer variant achieves higher captioning accuracy.
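Note: the record gives only the abstract, which names a "deformable scaled dot-product attention" layer without its formulation. As a rough illustration of how such a layer is commonly built (offset-based sampling in the style of Deformable DETR, not the authors' actual implementation; all class names, shapes, and hyperparameters below are assumptions), a minimal PyTorch sketch:

# Hypothetical sketch of a deformable scaled dot-product attention layer.
# This is NOT the paper's code; it only illustrates the general idea of
# combining learned sampling offsets with standard scaled dot-product attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableScaledDotProductAttention(nn.Module):
    def __init__(self, dim, num_points=4):
        super().__init__()
        self.dim = dim
        self.num_points = num_points
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Each query predicts 2-D offsets for a few sampling points.
        self.offset_proj = nn.Linear(dim, num_points * 2)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, feat_map, ref_points):
        # queries:    (B, Nq, C)   query embeddings
        # feat_map:   (B, C, H, W) 2-D feature map (e.g. CNN backbone output)
        # ref_points: (B, Nq, 2)   normalized (x, y) reference points in [0, 1]
        B, Nq, C = queries.shape
        q = self.q_proj(queries)                                       # (B, Nq, C)

        # Predict per-query sampling offsets around the reference points.
        offsets = self.offset_proj(queries).view(B, Nq, self.num_points, 2)
        locs = (ref_points.unsqueeze(2) + offsets).clamp(0, 1)         # (B, Nq, P, 2)
        grid = locs * 2 - 1                                            # map to [-1, 1] for grid_sample

        # Bilinearly sample features at the deformed locations.
        sampled = F.grid_sample(feat_map, grid, align_corners=False)   # (B, C, Nq, P)
        sampled = sampled.permute(0, 2, 3, 1)                          # (B, Nq, P, C)

        k = self.k_proj(sampled)
        v = self.v_proj(sampled)

        # Scaled dot-product attention restricted to the sampled points.
        attn = (q.unsqueeze(2) * k).sum(-1) / (C ** 0.5)               # (B, Nq, P)
        attn = attn.softmax(dim=-1)
        out = (attn.unsqueeze(-1) * v).sum(2)                          # (B, Nq, C)
        return self.out_proj(out)

In this kind of layer each query attends only to a few sampled points around its reference location, which is one plausible way to focus attention on foreground regions while keeping the standard scaled dot-product form; how the paper actually separates foreground and background is not specified in this record.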
ISSN:1939-1404
2151-1535
DOI:10.1109/JSTARS.2023.3305889