
Remote Sensing Scene Classification via Second-Order Differentiable Token Transformer Network


Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, 2024, Vol. 62, p. 1-15
Main Authors: Ni, Kang; Wu, Qianqian; Li, Sichan; Zheng, Zhizhong; Wang, Peng
Format: Article
Language: English
Description
Summary: The vision transformer has been widely applied in remote sensing image scene classification due to its excellent ability to capture global features. However, remote sensing scene images involve challenges such as scene complexity and small interclass differences. Directly utilizing the global tokens of the transformer for feature learning may increase computational complexity. Therefore, constructing a distinguishable transformer network that adaptively selects tokens can effectively improve the classification performance of remote sensing scene images while keeping computational complexity in check. Based on this, a second-order differentiable token transformer network (SDT2Net) is proposed to exploit distinguishable statistical features and nonredundant learnable tokens of remote sensing scene images. A novel transformer block, comprising an efficient attention block (EAB) and a differentiable token compression (DTC) mechanism, is inserted into SDT2Net to acquire selectable token features for each scene image, guided by sparse shift local features and a learned token compression rate. Furthermore, a fast token fusion (FTF) module is developed to acquire more distinguishable token feature representations. This module uses the fast global covariance pooling algorithm to acquire high-order visual tokens and validates the effectiveness of classification tokens and high-order visual tokens for scene classification. Compared with other recent methods, SDT2Net achieves state-of-the-art performance with a comparable number of floating-point operations (FLOPs). The code will be available at https://github.com/RSIP-NJUPT/SDT2Net.
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3407879
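
Illustrative note: the summary mentions a fast global covariance pooling algorithm for obtaining high-order visual tokens. The sketch below shows one common way such pooling is computed in the literature: the covariance of the token features followed by an approximate matrix square root via coupled Newton-Schulz iteration. It is an assumption-laden illustration, not the authors' FTF module; the function name `covariance_pooling_sqrt`, the `(N, d)` token layout, the iteration count, and the `eps` regularizer are all hypothetical choices made for this example.

```python
import numpy as np


def covariance_pooling_sqrt(tokens, num_iters=5, eps=1e-5):
    """Second-order pooling sketch (hypothetical helper, not the paper's code).

    tokens: array of shape (N, d), N token vectors of dimension d.
    Returns an approximate matrix square root of the (d, d) token covariance.
    """
    n, d = tokens.shape
    identity = np.eye(d)

    # Covariance of the tokens; eps * I keeps the matrix positive definite.
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / n + eps * identity

    # Pre-normalize by the trace so all eigenvalues lie in (0, 1],
    # which the Newton-Schulz iteration needs in order to converge.
    trace = np.trace(cov)
    y = cov / trace
    z = identity.copy()

    # Coupled Newton-Schulz iteration: y approaches the square root of
    # the normalized covariance using only matrix multiplications.
    for _ in range(num_iters):
        t = 0.5 * (3.0 * identity - z @ y)
        y = y @ t
        z = t @ z

    # Undo the trace normalization.
    return np.sqrt(trace) * y
```

With only a few iterations the result is a coarse approximation of the square root; increasing `num_iters` tightens it at the cost of more matrix products. In a network, the upper triangle of the returned matrix would typically be flattened into a feature vector, though how SDT2Net fuses it with the classification token is not specified in this record.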