Efficient and Large Scale Pre-training Techniques for Japanese Natural Language Processing

Bibliographic Details
Main Authors: Kasagi, Akihiko, Asaoka, Masahiro, Tabuchi, Akihiro, Oyama, Yosuke, Honda, Takumi, Sakai, Yasufumi, Dang, Thang, Tabaru, Tsuguchika
Format: Conference Proceeding
Language: English
cited_by
cites
container_end_page 113
container_issue
container_start_page 108
container_title
container_volume
creator Kasagi, Akihiko
Asaoka, Masahiro
Tabuchi, Akihiro
Oyama, Yosuke
Honda, Takumi
Sakai, Yasufumi
Dang, Thang
Tabaru, Tsuguchika
description Pre-training in natural language processing greatly affects the accuracy of downstream tasks. However, pre-training is a bottleneck in the AI system development process because training the neural network model on large-scale input data takes a long time. Our purpose in this paper is to obtain a highly accurate pre-training model in a short time using a large-scale computation environment. Since reducing the time per iteration is difficult even when using a large number of computation nodes, it is necessary to reduce the number of iterations. Therefore, we focus on the learning efficiency per iteration and choose a dense Masked Language Model (MLM) pre-training task in order to utilize the significant power of a large-scale cluster. We implemented BERT-xlarge using the dense MLM on Megatron-LM and evaluated the improvement in learning time and learning efficiency for a Japanese language dataset using 768 GPUs on the AI Bridging Cloud Infrastructure (ABCI). Our BERT-xlarge improves the learning efficiency per iteration by 10 times and completes pre-training in 4.65 hours; the same pre-training would take 4.9 months on a single GPU. We also evaluated two fine-tuning tasks, JSNLI and Twitter evaluation analysis, to compare the accuracy of downstream tasks between our BERTs and other BERTs. As a result, our BERT-3.9b achieved 94.30% accuracy on JSNLI, and our BERT-xlarge achieved 90.63% accuracy on Twitter evaluation analysis.
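The record does not detail how the dense MLM objective differs from standard BERT pre-training, so the following is a minimal sketch only, assuming that "dense" means selecting a larger fraction of token positions per sequence as prediction targets than BERT's usual 15%, which would raise the learning signal per iteration as the abstract describes. The function mlm_mask, the masking rates, and the token ids here are hypothetical illustrations, not the authors' implementation.

```python
import torch

def mlm_mask(input_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Select token positions to predict and replace them with [MASK].

    A denser mask_prob gives the model more prediction targets per
    iteration, which is the kind of per-iteration learning-efficiency
    gain the abstract attributes to the dense MLM task (assumption).
    """
    labels = input_ids.clone()
    # Draw a Bernoulli mask over every position in the batch.
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~selected] = ignore_index       # loss is computed only on selected positions
    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id    # replace selected tokens with the [MASK] id
    return corrupted, labels

# Toy comparison: standard vs. denser masking on a random batch of token ids.
ids = torch.randint(5, 32000, (8, 128))
_, sparse_labels = mlm_mask(ids, mask_token_id=4, mask_prob=0.15)
_, dense_labels = mlm_mask(ids, mask_token_id=4, mask_prob=0.50)
print("prediction targets per sequence:",
      (sparse_labels != -100).sum(dim=1).float().mean().item(),
      "vs", (dense_labels != -100).sum(dim=1).float().mean().item())
```

For scale, the abstract's own numbers imply a speedup of roughly (4.9 months × 30 days × 24 h) / 4.65 h ≈ 760× on 768 GPUs, i.e. close to linear scaling, assuming a 30-day month.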
doi_str_mv 10.1109/CANDAR53791.2021.00022
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISSN: 2379-1896
ispartof 2021 Ninth International Symposium on Computing and Networking (CANDAR), 2021, p.108-113
issn 2379-1896
language eng
recordid cdi_ieee_primary_9643927
source IEEE Xplore All Conference Series
subjects Bert
Bit error rate
Blogs
Computational modeling
Deep Learning
Graphics processing units
Memory management
NLP
Social networking (online)
Training
title Efficient and Large Scale Pre-training Techniques for Japanese Natural Language Processing
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T06%3A18%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Efficient%20and%20Large%20Scale%20Pre-training%20Techniques%20for%20Japanese%20Natural%20Language%20Processing&rft.btitle=2021%20Ninth%20International%20Symposium%20on%20Computing%20and%20Networking%20(CANDAR)&rft.au=Kasagi,%20Akihiko&rft.date=2021-11&rft.spage=108&rft.epage=113&rft.pages=108-113&rft.eissn=2379-1896&rft.coden=IEEPAD&rft_id=info:doi/10.1109/CANDAR53791.2021.00022&rft.eisbn=1665442468&rft.eisbn_list=9781665442466&rft_dat=%3Cieee_CHZPO%3E9643927%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i203t-664154d551090167a44e5c81e8b6b4f92b63ec0db4e364440c13c806ab4d4f2d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=9643927&rfr_iscdi=true