Improved Knowledge Distillation via Teacher Assistants for Sentiment Analysis
Main Authors:
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
Summary: Bidirectional Encoder Representations from Transformers (BERT) has achieved state-of-the-art results on various NLP tasks. However, the size of BERT makes it challenging to apply in time-sensitive scenarios. Several lines of research compress BERT using different techniques, of which Knowledge Distillation (KD) is the most popular. Nevertheless, more recent studies challenge the effectiveness of KD from an arbitrarily large teacher model. So far, research on the negative impact of the teacher-student gap on knowledge transfer has been confined mainly to computer vision, and it has been limited to distillation between teachers and students with similar model architectures. To fill this gap in the literature, we implemented a teacher assistant (TA) model lying between a fine-tuned BERT model and non-transformer-based machine learning models, including a CNN and a Bi-LSTM, for sentiment analysis. We show that teacher-assistant-facilitated KD outperforms traditional KD while maintaining competitive inference efficiency. In particular, a well-designed CNN model retains 97% of BERT's performance on sentiment analysis while being 1410x smaller. We also find that BERT is not necessarily a better teacher model than non-transformer-based neural networks.
ISSN: 2472-8322
DOI: 10.1109/SSCI52147.2023.10371965
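The abstract above describes a two-stage, teacher-assistant knowledge distillation (TAKD) pipeline: a fine-tuned BERT teacher first distills into an intermediate TA model, which in turn distills into a compact non-transformer student (CNN or Bi-LSTM). The sketch below illustrates this setup with the standard temperature-scaled distillation objective applied at each stage; it is a minimal illustration, not the paper's implementation. The model classes, hyperparameters (`T`, `alpha`, learning rate), and the assumption that teacher and student consume the same input batch are all placeholders.

```python
# Minimal sketch of teacher-assistant knowledge distillation (TAKD).
# Assumes PyTorch; models, hyperparameters, and data loading are illustrative
# placeholders, not taken from the paper.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Temperature-scaled soft-target KL term plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


def distill(teacher, student, loader, epochs=3, lr=1e-3, device="cpu"):
    """Train `student` to mimic a frozen `teacher` on labeled sentiment data."""
    teacher.eval()
    student.train()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher(inputs)       # frozen teacher forward pass
            student_logits = student(inputs)
            loss = distillation_loss(student_logits, teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student


# Two-stage TAKD chain (hypothetical model objects):
#   ta      = distill(bert_teacher, ta_model, train_loader)   # stage 1: BERT -> TA
#   student = distill(ta, cnn_student, train_loader)          # stage 2: TA -> CNN/Bi-LSTM
```

The key difference from traditional KD is the extra hop: the student never distills directly from BERT; instead, the intermediate TA narrows the capacity gap between teacher and student at each stage.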