Neighbor patches merging reduces spatial redundancy to accelerate vision transformer
| Published in: | Neurocomputing (Amsterdam), 2025-01, Vol. 613, p. 128733, Article 128733 |
|---|---|
| Main Authors: | |
| Format: | Article |
| Language: | English |
| Summary: | Vision Transformers (ViTs) deliver outstanding performance but often require substantial computational resources. Various token pruning methods have been developed to enhance throughput by removing redundant tokens; however, these methods do not address peak memory consumption, which remains equivalent to that of the unpruned networks. In this study, we introduce Neighbor Patches Merging (NEPAM), a method that significantly reduces the maximum memory footprint of ViTs while pruning tokens. NEPAM targets spatial redundancy within images and prunes redundant patches at the onset of the model, thereby achieving the optimal throughput-accuracy trade-off without fine-tuning. Experimental results demonstrate that NEPAM can accelerate the inference speed of the ViT-Base-Patch16-384 model by 25% with a negligible accuracy loss of 0.07% and a notable 18% reduction in memory usage. When applied to VideoMAE, NEPAM doubles the throughput with a 0.29% accuracy loss and a 48% reduction in memory usage. These findings underscore the efficacy of NEPAM in mitigating computational requirements while maintaining model performance. |
| ISSN: | 0925-2312 |
| DOI: | 10.1016/j.neucom.2024.128733 |
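
The abstract in the Summary above says only that NEPAM prunes spatially redundant patches at the onset of the model, before any transformer block, so both the token count and the peak activation memory shrink. The PyTorch sketch below illustrates that general idea under stated assumptions: it pairs horizontally adjacent patch embeddings, measures their cosine similarity, and averages the `r` most similar pairs while keeping the rest. The function name `merge_similar_neighbor_pairs`, the horizontal pairing scheme, and the fixed merge budget `r` are illustrative assumptions, not the paper's actual algorithm or API.

```python
# Hedged illustration only: the exact NEPAM procedure is not spelled out in this
# record, so everything below is one plausible reading of "pruning redundant
# patches at the onset of the model", not the authors' implementation.
import torch
import torch.nn.functional as F


def merge_similar_neighbor_pairs(tokens: torch.Tensor, grid_h: int, grid_w: int,
                                 r: int) -> torch.Tensor:
    """tokens: (B, N, C) patch embeddings in row-major grid order, no CLS token.

    Fuses the r most similar horizontally adjacent patch pairs per image by
    averaging them and keeps every other patch, returning (B, N - r, C).
    """
    B, N, C = tokens.shape
    assert N == grid_h * grid_w and grid_w % 2 == 0 and r <= N // 2
    grid = tokens.view(B, grid_h, grid_w, C)

    left = grid[:, :, 0::2, :].reshape(B, -1, C)     # even columns, (B, N//2, C)
    right = grid[:, :, 1::2, :].reshape(B, -1, C)    # odd columns,  (B, N//2, C)
    sim = F.cosine_similarity(left, right, dim=-1)   # neighbor similarity, (B, N//2)

    merge_idx = sim.topk(r, dim=1).indices           # the r most redundant pairs
    keep_mask = torch.ones_like(sim, dtype=torch.bool)
    keep_mask[torch.arange(B, device=tokens.device).unsqueeze(1), merge_idx] = False

    merged = (left + right) / 2                      # one fused token per merged pair

    out = []
    for b in range(B):  # per-sample gather keeps the sketch simple; spatial order is not preserved
        kept = torch.stack([left[b][keep_mask[b]],
                            right[b][keep_mask[b]]], dim=1).reshape(-1, C)
        out.append(torch.cat([kept, merged[b][~keep_mask[b]]], dim=0))
    return torch.stack(out)


# Example with a ViT-Base-Patch16-384-sized grid (24 x 24 patches, 768-dim tokens).
x = torch.randn(2, 24 * 24, 768)
y = merge_similar_neighbor_pairs(x, grid_h=24, grid_w=24, r=144)
print(y.shape)  # torch.Size([2, 432, 768])
```

Because the merge happens before the first attention layer, every subsequent block processes a shorter sequence, which is consistent with the abstract's claim that peak memory, and not only throughput, improves.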