Efficient and Large Scale Pre-training Techniques for Japanese Natural Language Processing

Bibliographic Details
Main Authors: Kasagi, Akihiko, Asaoka, Masahiro, Tabuchi, Akihiro, Oyama, Yosuke, Honda, Takumi, Sakai, Yasufumi, Dang, Thang, Tabaru, Tsuguchika
Format: Conference Proceeding
Language: English
cited_by
cites
container_end_page 113
container_issue
container_start_page 108
container_title
container_volume
creator Kasagi, Akihiko
Asaoka, Masahiro
Tabuchi, Akihiro
Oyama, Yosuke
Honda, Takumi
Sakai, Yasufumi
Dang, Thang
Tabaru, Tsuguchika
description Pre-training in natural language processing greatly affects the accuracy of downstream tasks. However, pre-training is a bottleneck in the AI system development process because training the neural network model on large-scale input data takes a long time. Our purpose in this paper is to obtain a highly accurate pre-training model in a short time using a large-scale computation environment. Since reducing the time per iteration is difficult even when using a large number of computation nodes, it is necessary to reduce the number of iterations. Therefore, we focus on the learning efficiency per iteration and choose a dense Masked Language Model (MLM) pre-training task in order to utilize the significant power of a large-scale cluster. We implemented BERT-xlarge using the dense MLM on Megatron-LM and evaluated the improvement in learning time and learning efficiency for a Japanese language dataset using 768 GPUs on the AI Bridging Cloud Infrastructure (ABCI). Our BERT-xlarge improves the learning efficiency per iteration by 10 times and completes pre-training in 4.65 hours; the same pre-training would take 4.9 months on a single GPU. We also evaluated two fine-tuning tasks, JSNLI and Twitter evaluation analysis, to compare the accuracy of downstream tasks between our BERTs and other BERTs. As a result, our BERT-3.9b achieved 94.30% accuracy on JSNLI, and our BERT-xlarge achieved 90.63% accuracy on Twitter evaluation analysis.
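The record does not detail how the dense MLM objective differs from standard BERT pre-training, so the following is a minimal sketch only, assuming that "dense" means selecting a larger fraction of token positions per sequence as prediction targets than BERT's usual 15%, which would raise the learning signal per iteration as the abstract describes. The function mlm_mask, the masking rates, and the token ids here are hypothetical illustrations, not the authors' implementation.

```python
import torch

def mlm_mask(input_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Select token positions to predict and replace them with [MASK].

    A denser mask_prob gives the model more prediction targets per
    iteration, which is the kind of per-iteration learning-efficiency
    gain the abstract attributes to the dense MLM task (assumption).
    """
    labels = input_ids.clone()
    # Draw a Bernoulli mask over every position in the batch.
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~selected] = ignore_index       # loss is computed only on selected positions
    corrupted = input_ids.clone()
    corrupted[selected] = mask_token_id    # replace selected tokens with the [MASK] id
    return corrupted, labels

# Toy comparison: standard vs. denser masking on a random batch of token ids.
ids = torch.randint(5, 32000, (8, 128))
_, sparse_labels = mlm_mask(ids, mask_token_id=4, mask_prob=0.15)
_, dense_labels = mlm_mask(ids, mask_token_id=4, mask_prob=0.50)
print("prediction targets per sequence:",
      (sparse_labels != -100).sum(dim=1).float().mean().item(),
      "vs", (dense_labels != -100).sum(dim=1).float().mean().item())
```

For scale, the abstract's own numbers imply a speedup of roughly (4.9 months × 30 days × 24 h) / 4.65 h ≈ 760× on 768 GPUs, i.e. close to linear scaling, assuming a 30-day month.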
doi_str_mv 10.1109/CANDAR53791.2021.00022
format conference_proceeding
fulltext fulltext_linktorsrc
identifier EISSN: 2379-1896
ispartof 2021 Ninth International Symposium on Computing and Networking (CANDAR), 2021, p.108-113
issn 2379-1896
language eng
recordid cdi_ieee_primary_9643927
source IEEE Xplore All Conference Series
subjects Bert
Bit error rate
Blogs
Computational modeling
Deep Learning
Graphics processing units
Memory management
NLP
Social networking (online)
Training
title Efficient and Large Scale Pre-training Techniques for Japanese Natural Language Processing
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T06%3A18%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Efficient%20and%20Large%20Scale%20Pre-training%20Techniques%20for%20Japanese%20Natural%20Language%20Processing&rft.btitle=2021%20Ninth%20International%20Symposium%20on%20Computing%20and%20Networking%20(CANDAR)&rft.au=Kasagi,%20Akihiko&rft.date=2021-11&rft.spage=108&rft.epage=113&rft.pages=108-113&rft.eissn=2379-1896&rft.coden=IEEPAD&rft_id=info:doi/10.1109/CANDAR53791.2021.00022&rft.eisbn=1665442468&rft.eisbn_list=9781665442466&rft_dat=%3Cieee_CHZPO%3E9643927%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i203t-664154d551090167a44e5c81e8b6b4f92b63ec0db4e364440c13c806ab4d4f2d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=9643927&rfr_iscdi=true