Collaborative multi-knowledge distillation under the influence of softmax regression representation

Bibliographic Details
Published in: Multimedia Systems 2024-12, Vol. 30 (6), Article 331
Main Authors: Zhao, Hong; Chen, Kangping; Chang, Zhaobin; Huang, Dailin
Format: Article
Language: English
Description
Summary: Knowledge distillation can transfer knowledge from a powerful yet cumbersome teacher model to a less-parameterized student model, thus effectively achieving model compression. Most knowledge distillation methods have focused on what knowledge to transfer and where to distill it; this emphasis complicates model interpretation, and few works have examined the role of the teacher classifier in distillation. In this study, we propose a novel collaborative multi-knowledge distillation under the influence of softmax regression representation. First, we propose a stage-wise logit knowledge distillation, in which the teacher classifier serves as an auxiliary structure that aligns the features of the student and teacher models. By passing student features through the teacher classifier, they are aligned with the teacher features in the logit space, eliminating the need for a complex, computationally expensive feature projector to match features between the teacher and student models. Second, considering the teacher classifier's adaptability to classification features, we introduce a stage-wise feature knowledge distillation. This mechanism maps the student model's features to a latent space with the same dimensions as the teacher model's features and guides the student's features toward the teacher's final features using a Mean Square Error (MSE) loss. Finally, we propose a pseudo-teacher knowledge distillation loss to optimize the modeling of the deformation relationship between the student and teacher features; this loss provides additional gradient information for the parameters of the feature projector. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate the superiority of the proposed model over state-of-the-art methods. The code is available at https://github.com/chenKP/CMKD.git
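
The three distillation terms described above can be sketched roughly as follows. This PyTorch-style example is an illustrative reconstruction based only on the abstract, not the released implementation in the linked repository; the module and function names (FeatureProjector, cmkd_losses), feature shapes, the temperature, and the assumed form of the pseudo-teacher loss are all hypothetical.

```python
# Illustrative sketch of the three distillation terms described in the abstract.
# Names, shapes, and the pseudo-teacher loss form are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureProjector(nn.Module):
    """Maps a student feature map to the teacher's final feature dimension
    (assumed 1x1 conv followed by global average pooling)."""

    def __init__(self, s_channels: int, t_channels: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(s_channels, t_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(t_channels),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        # (B, s_channels, H, W) -> (B, t_channels)
        return self.pool(self.proj(f_s)).flatten(1)


def cmkd_losses(student_stage_feats, teacher_final_feat, teacher_logits,
                teacher_classifier, projectors, labels, temperature: float = 4.0):
    """Returns the stage-wise logit, stage-wise feature, and pseudo-teacher terms.

    teacher_classifier is the teacher's frozen final linear layer; projectors is a
    list of FeatureProjector modules, one per selected student stage.
    """
    logit_kd, feat_kd, pseudo_kd = 0.0, 0.0, 0.0
    soft_t = F.softmax(teacher_logits / temperature, dim=1)

    for f_s, proj in zip(student_stage_feats, projectors):
        z_s = proj(f_s)                          # student feature in the teacher's space
        pseudo_logits = teacher_classifier(z_s)  # reuse the teacher classifier on student features

        # Stage-wise logit KD: align student and teacher in the logit space.
        logit_kd = logit_kd + F.kl_div(
            F.log_softmax(pseudo_logits / temperature, dim=1),
            soft_t, reduction="batchmean") * temperature ** 2

        # Stage-wise feature KD: MSE against the teacher's final features.
        feat_kd = feat_kd + F.mse_loss(z_s, teacher_final_feat)

        # Pseudo-teacher KD (assumed form): supervise the pseudo-teacher logits with the
        # ground-truth labels to give the projector an additional gradient signal.
        pseudo_kd = pseudo_kd + F.cross_entropy(pseudo_logits, labels)

    return logit_kd, feat_kd, pseudo_kd
```

In training, these terms would be weighted and added to the student's standard cross-entropy loss; the weights and the choice of student stages are hyperparameters the abstract does not specify.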
ISSN: 0942-4962; 1432-1882
DOI: 10.1007/s00530-024-01537-z