Efficient and Large Scale Pre-training Techniques for Japanese Natural Language Processing
Pre-training in natural language processing greatly affects the accuracy of downstream tasks. However, pre-training is a bottleneck in the AI system development process because training a neural network model on large-scale input data takes a long time. Our purpose in this paper is to obtain a highly accurate pre-trained model in a short time using a large-scale computation environment. Since reducing the time per iteration is difficult even when using a large number of computation nodes, it is necessary to reduce the number of iterations. Therefore, we focus on the learning efficiency per iteration and choose a dense Masked Language Model (MLM) pre-training task in order to utilize the significant power of a large-scale cluster. We implemented BERT-xlarge with the dense MLM on Megatron-LM and evaluated the improvement in learning time and learning efficiency on a Japanese-language dataset using 768 GPUs on the AI Bridging Cloud Infrastructure (ABCI). Our BERT-xlarge improves the learning efficiency per iteration by a factor of 10 and completes pre-training in 4.65 hours; the same pre-training would take 4.9 months on a single GPU. We also evaluated two fine-tuning tasks, JSNLI and Twitter evaluation analysis, to compare the downstream-task accuracy of our BERTs with that of other BERTs. As a result, our BERT-3.9b achieved 94.30% accuracy on JSNLI, and our BERT-xlarge achieved 90.63% accuracy on Twitter evaluation analysis.
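The key idea in the abstract is to pack more prediction targets into each iteration via a dense MLM objective, so that each forward/backward pass yields roughly ten times more learning signal than standard masking. The record does not specify the exact masking scheme or ratio, so the sketch below is only illustrative: the `mask_tokens` helper and the 15% vs. 50% ratios are assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of standard vs. "dense" masked-language-model (MLM) masking.
# The masking ratios are assumptions for illustration; the record does not state
# the ratio used by the paper's dense MLM.
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_ratio):
    """Return (masked_tokens, labels). Labels hold the original token at masked
    positions and None elsewhere, i.e. only masked positions contribute to the loss."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_ratio:
            masked.append(MASK_TOKEN)
            labels.append(tok)      # this position produces a training signal
        else:
            masked.append(tok)
            labels.append(None)     # no loss computed here
    return masked, labels

tokens = "efficient large scale pre training for japanese nlp".split()

# Standard BERT-style MLM: ~15% of positions become prediction targets per iteration.
std_masked, std_labels = mask_tokens(tokens, mask_ratio=0.15)

# A denser MLM (hypothetical 50% ratio): many more prediction targets per iteration,
# so each forward/backward pass yields more learning signal.
dense_masked, dense_labels = mask_tokens(tokens, mask_ratio=0.5)

print(sum(l is not None for l in std_labels), "targets (standard)")
print(sum(l is not None for l in dense_labels), "targets (dense)")
```

As a rough consistency check on the reported timings: 4.9 months is roughly 3,600 hours, and 3,600 / 4.65 ≈ 770, which is in line with near-linear scaling across the 768 GPUs used on ABCI.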
Main Authors: | Kasagi, Akihiko; Asaoka, Masahiro; Tabuchi, Akihiro; Oyama, Yosuke; Honda, Takumi; Sakai, Yasufumi; Dang, Thang; Tabaru, Tsuguchika |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | Bert; Bit error rate; Blogs; Computational modeling; Deep Learning; Graphics processing units; Memory management; NLP; Social networking (online); Training |
cited_by | |
---|---|
cites | |
container_end_page | 113 |
container_issue | |
container_start_page | 108 |
container_title | |
container_volume | |
creator | Kasagi, Akihiko; Asaoka, Masahiro; Tabuchi, Akihiro; Oyama, Yosuke; Honda, Takumi; Sakai, Yasufumi; Dang, Thang; Tabaru, Tsuguchika |
description | Pre-training in natural language processing greatly affects the accuracy of downstream tasks. However, pre-training is a bottleneck in the AI system development process because training a neural network model on large-scale input data takes a long time. Our purpose in this paper is to obtain a highly accurate pre-trained model in a short time using a large-scale computation environment. Since reducing the time per iteration is difficult even when using a large number of computation nodes, it is necessary to reduce the number of iterations. Therefore, we focus on the learning efficiency per iteration and choose a dense Masked Language Model (MLM) pre-training task in order to utilize the significant power of a large-scale cluster. We implemented BERT-xlarge with the dense MLM on Megatron-LM and evaluated the improvement in learning time and learning efficiency on a Japanese-language dataset using 768 GPUs on the AI Bridging Cloud Infrastructure (ABCI). Our BERT-xlarge improves the learning efficiency per iteration by a factor of 10 and completes pre-training in 4.65 hours; the same pre-training would take 4.9 months on a single GPU. We also evaluated two fine-tuning tasks, JSNLI and Twitter evaluation analysis, to compare the downstream-task accuracy of our BERTs with that of other BERTs. As a result, our BERT-3.9b achieved 94.30% accuracy on JSNLI, and our BERT-xlarge achieved 90.63% accuracy on Twitter evaluation analysis. |
doi_str_mv | 10.1109/CANDAR53791.2021.00022 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2379-1896 |
ispartof | 2021 Ninth International Symposium on Computing and Networking (CANDAR), 2021, p.108-113 |
issn | 2379-1896 |
language | eng |
recordid | cdi_ieee_primary_9643927 |
source | IEEE Xplore All Conference Series |
subjects | Bert; Bit error rate; Blogs; Computational modeling; Deep Learning; Graphics processing units; Memory management; NLP; Social networking (online); Training |
title | Efficient and Large Scale Pre-training Techniques for Japanese Natural Language Processing |