Improved Knowledge Distillation via Teacher Assistants for Sentiment Analysis
Main Authors:
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
Summary: Bidirectional Encoder Representations from Transformers (BERT) has achieved state-of-the-art results on various NLP tasks. However, the size of BERT makes it challenging to apply in time-sensitive scenarios. Several lines of research compress BERT using different techniques, of which Knowledge Distillation (KD) is the most popular. Nevertheless, more recent studies challenge the effectiveness of KD from an arbitrarily large teacher model. So far, research on the negative impact of the teacher-student gap on knowledge transfer has been confined mainly to computer vision, and it has been limited to distillation between teachers and students with similar model architectures. To fill this gap in the literature, we implemented a teacher assistant (TA) model lying between a fine-tuned BERT model and non-transformer-based machine learning models, including a CNN and a Bi-LSTM, for sentiment analysis. We show that teacher-assistant-facilitated KD outperforms traditional KD while maintaining competitive inference efficiency. In particular, a well-designed CNN model retains 97% of BERT's performance on sentiment analysis while being 1410x smaller. We also find that BERT is not necessarily a better teacher model than non-transformer-based neural networks.
ISSN: 2472-8322
DOI: 10.1109/SSCI52147.2023.10371965
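The abstract above describes a two-stage, teacher-assistant knowledge distillation (TAKD) pipeline: a fine-tuned BERT teacher first distills into an intermediate TA model, which in turn distills into a compact non-transformer student (CNN or Bi-LSTM). The sketch below illustrates this setup with the standard temperature-scaled distillation objective applied at each stage; it is a minimal illustration, not the paper's implementation. The model classes, hyperparameters (`T`, `alpha`, learning rate), and the assumption that teacher and student consume the same input batch are all placeholders.

```python
# Minimal sketch of teacher-assistant knowledge distillation (TAKD).
# Assumes PyTorch; models, hyperparameters, and data loading are illustrative
# placeholders, not taken from the paper.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Temperature-scaled soft-target KL term plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


def distill(teacher, student, loader, epochs=3, lr=1e-3, device="cpu"):
    """Train `student` to mimic a frozen `teacher` on labeled sentiment data."""
    teacher.eval()
    student.train()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher(inputs)       # frozen teacher forward pass
            student_logits = student(inputs)
            loss = distillation_loss(student_logits, teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student


# Two-stage TAKD chain (hypothetical model objects):
#   ta      = distill(bert_teacher, ta_model, train_loader)   # stage 1: BERT -> TA
#   student = distill(ta, cnn_student, train_loader)          # stage 2: TA -> CNN/Bi-LSTM
```

The key difference from traditional KD is the extra hop: the student never distills directly from BERT; instead, the intermediate TA narrows the capacity gap between teacher and student at each stage.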