
Efficient and Large Scale Pre-training Techniques for Japanese Natural Language Processing

Bibliographic Details
Main Authors: Kasagi, Akihiko, Asaoka, Masahiro, Tabuchi, Akihiro, Oyama, Yosuke, Honda, Takumi, Sakai, Yasufumi, Dang, Thang, Tabaru, Tsuguchika
Format: Conference Proceeding
Language: English
Online Access: Request full text
Description
Summary: Pre-training in natural language processing greatly affects the accuracy of downstream tasks. However, pre-training is a bottleneck in the AI system development process because training the neural network model on large-scale input data takes a long time. Our purpose in this paper is to obtain a highly accurate pre-training model in a short time using a large-scale computation environment. Since reducing the time per iteration is difficult even when using a large number of computation nodes, it is necessary to reduce the number of iterations. We therefore focus on the learning efficiency per iteration and choose a dense Masked Language Model (MLM) pre-training task in order to utilize the significant power of a large-scale cluster. We implemented BERT-xlarge using the dense MLM on Megatron-LM and evaluated the improvement in learning time and learning efficiency on a Japanese language dataset using 768 GPUs on the AI Bridging Cloud Infrastructure (ABCI). Our BERT-xlarge improves the learning efficiency per iteration by a factor of 10 and completes pre-training in 4.65 hours; the same pre-training would take 4.9 months on a single GPU. We also evaluated two fine-tuning tasks, JSNLI and Twitter evaluation analysis, to compare the accuracy of downstream tasks between our BERTs and other BERTs. As a result, our BERT-3.9b achieved 94.30% accuracy on JSNLI, and our BERT-xlarge achieved 90.63% accuracy on Twitter evaluation analysis.
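The abstract does not spell out how the dense MLM objective differs from standard BERT masking, so the following PyTorch sketch is purely illustrative and is not the authors' Megatron-LM implementation. It contrasts the conventional ~15% masking rate with a denser rate, so that more token positions carry a label and contribute to the loss in each iteration, which is the kind of per-iteration efficiency gain the abstract describes. The 50% rate, the token ids, and the mask_tokens helper are assumptions chosen only for the example.

```python
# Illustrative sketch (not the paper's code): standard sparse MLM masking
# versus a denser masking scheme with more prediction targets per iteration.
import torch

MASK_ID = 103        # assumed [MASK] id for a BERT-style tokenizer
PAD_ID = 0           # assumed padding id
IGNORE_INDEX = -100  # positions ignored by torch.nn.CrossEntropyLoss

def mask_tokens(input_ids: torch.Tensor, mask_prob: float):
    """Return (masked_inputs, labels); only masked positions keep a label."""
    labels = input_ids.clone()
    # Sample which positions to predict; padding is never selected.
    probability = torch.full(input_ids.shape, mask_prob)
    probability[input_ids == PAD_ID] = 0.0
    selected = torch.bernoulli(probability).bool()
    labels[~selected] = IGNORE_INDEX       # unselected positions carry no loss
    masked_inputs = input_ids.clone()
    masked_inputs[selected] = MASK_ID      # replace selected tokens with [MASK]
    return masked_inputs, labels

batch = torch.randint(5, 1000, (8, 128))   # toy batch of token ids
sparse_inputs, sparse_labels = mask_tokens(batch, mask_prob=0.15)  # standard MLM
dense_inputs, dense_labels = mask_tokens(batch, mask_prob=0.50)    # denser variant (illustrative rate)

# Fraction of positions that contribute to the loss in each scheme.
print((sparse_labels != IGNORE_INDEX).float().mean().item(),
      (dense_labels != IGNORE_INDEX).float().mean().item())
```

In practice the labels produced this way would feed a cross-entropy loss with ignore_index=-100, so the denser scheme simply supplies more supervised positions per batch; the actual dense MLM formulation, masking schedule, and hyperparameters used in the paper are given in the full text.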
ISSN: 2379-1896
DOI: 10.1109/CANDAR53791.2021.00022