
Neighbor patches merging reduces spatial redundancy to accelerate vision transformer

Bibliographic Details
Published in: Neurocomputing (Amsterdam), 2025-01, Vol. 613, Article 128733
Main Authors: Jiang, Kai; Peng, Peng; Lian, Youzao; Shao, Weihui; Xu, Weisheng
Format: Article
Language:English
Description
Summary: Vision Transformers (ViTs) deliver outstanding performance but often require substantial computational resources. Various token pruning methods have been developed to enhance throughput by removing redundant tokens; however, these methods do not address peak memory consumption, which remains equivalent to that of the unpruned networks. In this study, we introduce Neighbor Patches Merging (NEPAM), a method that significantly reduces the maximum memory footprint of ViTs while pruning tokens. NEPAM targets spatial redundancy within images and prunes redundant patches at the onset of the model, thereby achieving the optimal throughput-accuracy trade-off without fine-tuning. Experimental results demonstrate that NEPAM can accelerate the inference speed of the ViT-Base-Patch16-384 model by 25% with a negligible accuracy loss of 0.07% and a notable 18% reduction in memory usage. When applied to VideoMAE, NEPAM doubles the throughput with a 0.29% accuracy loss and a 48% reduction in memory usage. These findings underscore the efficacy of NEPAM in mitigating computational requirements while maintaining model performance.
ISSN: 0925-2312
DOI: 10.1016/j.neucom.2024.128733
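
For readers skimming the abstract, the sketch below illustrates the general idea of merging spatially adjacent patch tokens before the first transformer block, which is why both throughput and peak activation memory can improve. The pairing rule, cosine-similarity threshold, and averaging step here are illustrative assumptions, not the paper's actual NEPAM procedure; consult the article via the DOI above for the real method.

```python
# Illustrative sketch only: merge similar neighboring patch embeddings before
# any transformer block runs. Threshold, pairing, and averaging are assumptions.
import torch
import torch.nn.functional as F


def merge_neighbor_patches(patches: torch.Tensor, grid_w: int, threshold: float = 0.9):
    """Merge horizontally adjacent patch embeddings whose cosine similarity
    exceeds `threshold`, returning a shorter token sequence.

    patches: (num_patches, dim) embeddings laid out row-major on a grid
             that is `grid_w` patches wide.
    """
    merged = []
    for row in patches.view(-1, grid_w, patches.shape[-1]):  # iterate grid rows
        i = 0
        while i < grid_w:
            if i + 1 < grid_w and F.cosine_similarity(row[i], row[i + 1], dim=0) > threshold:
                merged.append((row[i] + row[i + 1]) / 2)  # average the similar pair
                i += 2                                    # skip the merged neighbor
            else:
                merged.append(row[i])
                i += 1
    return torch.stack(merged)  # (<= num_patches, dim)


# Toy usage: a 14x14 grid of 768-dim patch tokens (ViT-Base-like) is reduced
# before entering the transformer, so every subsequent layer sees fewer tokens.
tokens = torch.randn(14 * 14, 768)
reduced = merge_neighbor_patches(tokens, grid_w=14, threshold=0.9)
print(tokens.shape, "->", reduced.shape)
```

Because the reduction happens at the model's input rather than inside later blocks, no layer ever materializes the full unpruned token sequence, which is consistent with the memory savings the abstract reports.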