Multi-exit self-distillation with appropriate teachers

Bibliographic Details
Published in:	Frontiers of information technology & electronic engineering 2024-03, Vol.25 (4), p.585-599
Main Authors:	Sun, Wujie; Chen, Defang; Wang, Can; Ye, Deshi; Feng, Yan; Chen, Chun
Format:	Article
Language:	English
Subjects:	Communications Engineering; Computational efficiency; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Distillation; Electrical Engineering; Electronics and Microelectronics; Instrumentation; Knowledge management; Learning; Networks; Research Article; Teachers

Description
Summary:	Multi-exit architecture allows early-stop inference to reduce computational cost, which can be used in resource-constrained circumstances. Recent works combine the multi-exit architecture with self-distillation to simultaneously achieve high efficiency and decent performance at different network depths. However, existing methods mainly transfer knowledge from deep exits or a single ensemble to guide all exits, without considering that inappropriate learning gaps between students and teachers may degrade the model performance, especially in shallow exits. To address this issue, we propose Multi-exit self-distillation with Appropriate TEachers (MATE) to provide diverse and appropriate teacher knowledge for each exit. In MATE, multiple ensemble teachers are obtained from all exits with different trainable weights. Each exit subsequently receives knowledge from all teachers, while focusing mainly on its primary teacher to keep an appropriate gap for efficient knowledge transfer. In this way, MATE achieves diversity in knowledge distillation while ensuring learning efficiency. Experimental results on CIFAR-100, TinyImageNet, and three fine-grained datasets demonstrate that MATE consistently outperforms state-of-the-art multi-exit self-distillation methods with various network architectures.
ISSN:	2095-9184 2095-9230
DOI:	10.1631/FITEE.2200644
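
Illustrative sketch (not from the article): the summary describes ensemble teachers built from all exits with trainable weights, and each exit learning from every teacher while concentrating on its primary one. The record gives no formulas, so everything below is an assumption made for illustration: the softmax-normalised `teacher_weights`, one teacher per exit, the `primary_share` split, the temperature, and all function names are hypothetical. In a real model the weights would be trained and the teacher outputs would typically be excluded from gradient flow.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ensemble_teachers(exit_logits, teacher_weights):
    """Build one ensemble teacher per row of teacher_weights.

    Each row holds unnormalised weights over the exits (assumed trainable);
    softmax turns the row into a convex combination of the exit logits.
    """
    n_exits = len(exit_logits)
    n_classes = len(exit_logits[0])
    teachers = []
    for row in teacher_weights:
        w = softmax(row)
        teachers.append([
            sum(w[k] * exit_logits[k][c] for k in range(n_exits))
            for c in range(n_classes)
        ])
    return teachers


def distillation_losses(exit_logits, teacher_weights,
                        primary_share=0.7, temperature=3.0):
    """Per-exit distillation loss as a weighted sum of KL terms.

    Assumption: exit k treats teacher k as its primary teacher, gives it
    primary_share of the weight, and splits the rest evenly over the
    other teachers. Requires at least two exits.
    """
    teachers = ensemble_teachers(exit_logits, teacher_weights)
    n = len(exit_logits)
    losses = []
    for k, logits in enumerate(exit_logits):
        student = softmax(logits, temperature)
        loss = 0.0
        for t, teacher_logits in enumerate(teachers):
            share = primary_share if t == k else (1.0 - primary_share) / (n - 1)
            teacher = softmax(teacher_logits, temperature)
            loss += share * kl_divergence(teacher, student)
        losses.append(loss)
    return losses
```

For example, calling `distillation_losses` with three exits' logits and a 3x3 `teacher_weights` matrix whose diagonal is largest makes each teacher lean toward its own exit, so a shallow exit is guided mainly by a teacher close to its own capacity rather than by the deepest exit alone.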