Collaborative multi-knowledge distillation under the influence of softmax regression representation
Published in: Multimedia Systems, 2024-12, Vol. 30 (6), Article 331
Format: Article
Language: English
Summary: Knowledge distillation transfers knowledge from a powerful yet cumbersome teacher model to a student model with fewer parameters, thereby achieving effective model compression. Most knowledge distillation methods have focused on what knowledge to transfer and where to distill it, which increases the difficulty of model interpretation; moreover, few works have examined the role of the teacher classifier in distillation. In this study, we propose a novel collaborative multi-knowledge distillation under the influence of softmax regression representation. First, we propose a stage-wise logit knowledge distillation in which the teacher classifier serves as an auxiliary structure for aligning the features of the student and teacher models. By leveraging the teacher classifier, the student features are aligned with the teacher features in the logit space, eliminating the need for a complex feature projector that requires extensive computation to match the features of the teacher and student models. Second, considering the teacher classifier's adaptability to classification features, we introduce a stage-wise feature knowledge distillation. This mechanism maps the features of the student model to a latent space with the same dimensions as the features of the teacher model and guides the student's features to align with the teacher's final features using a mean square error (MSE) loss. Finally, we propose a pseudo-teacher knowledge distillation loss to optimize the modeling of the deformation relationship between the student and teacher features; this loss provides additional gradient information for optimizing the parameters of the feature projector. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate the superiority of the proposed model compared with state-of-the-art methods. The code is available at https://github.com/chenKP/CMKD.git
ISSN: 0942-4962, 1432-1882
DOI: 10.1007/s00530-024-01537-z
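
The abstract describes two alignment mechanisms that can be sketched concretely: projecting student features into the teacher's feature space and matching them with an MSE loss, and reusing the frozen teacher classifier on the projected student features so that alignment also happens in the logit space. The following PyTorch sketch illustrates these two losses only; the module names, dimensions, and temperature are illustrative assumptions, the pseudo-teacher loss is omitted because its exact formulation is not given in the abstract, and none of this code is taken from the authors' repository.

```python
# Hypothetical sketch of the stage-wise feature and logit distillation losses
# described in the abstract. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageWiseDistiller(nn.Module):
    def __init__(self, student_dim, teacher_dim, teacher_classifier, temperature=4.0):
        super().__init__()
        # Lightweight projector mapping student features to the teacher's feature space.
        self.projector = nn.Linear(student_dim, teacher_dim)
        # The teacher's classifier is reused, frozen, as an auxiliary head.
        self.teacher_classifier = teacher_classifier
        for p in self.teacher_classifier.parameters():
            p.requires_grad = False
        self.T = temperature

    def forward(self, student_feat, teacher_feat, teacher_logits):
        # Map student features into the teacher's latent space.
        proj_feat = self.projector(student_feat)

        # Stage-wise feature KD: align projected student features with the
        # teacher's final features via an MSE loss.
        loss_feat = F.mse_loss(proj_feat, teacher_feat)

        # Stage-wise logit KD: pass the projected student features through the
        # frozen teacher classifier and match the teacher's soft predictions.
        student_logits = self.teacher_classifier(proj_feat)
        loss_logit = F.kl_div(
            F.log_softmax(student_logits / self.T, dim=1),
            F.softmax(teacher_logits / self.T, dim=1),
            reduction="batchmean",
        ) * (self.T ** 2)

        return loss_feat, loss_logit


# Example usage with arbitrary dimensions (teacher.fc is assumed to be the
# teacher's final linear classifier):
# distiller = StageWiseDistiller(student_dim=128, teacher_dim=512,
#                                teacher_classifier=teacher.fc)
# loss_feat, loss_logit = distiller(s_feat, t_feat, t_logits)
```

In this reading, routing the projected student features through the frozen teacher classifier is what removes the need for a separately trained matching head in the logit branch, while the projector itself is trained by the feature-alignment (and, in the paper, pseudo-teacher) gradients.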