
Prune Once for All: Sparse Pre-Trained Language Models

Bibliographic Details
Published in: arXiv.org, 2021-11
Main Authors: Zafrir, Ofir; Larey, Ariel; Boudoukh, Guy; Shen, Haihao; Wasserblat, Moshe
Format: Article
Language: English
EISSN: 2331-8422
Source: Publicly Available Content (ProQuest)
Subjects: Accuracy; Algorithms; Coders; Compression ratio; Distillation; Knowledge management; Language; Natural language; Natural language processing; Training
Online Access: Get full text

Description
Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a compression ratio of 40X for the encoder with less than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
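
The abstract sketches a concrete recipe: prune the weights once during pre-training (combined with distillation), keep the resulting sparsity pattern fixed while fine-tuning on downstream tasks, and optionally quantize to 8-bit. Below is a minimal, hypothetical PyTorch sketch of just the fixed-sparsity-pattern part, not the authors' implementation: the 90% sparsity level, the magnitude-pruning criterion, and the model/loader/loss_fn names are assumptions for illustration, and the distillation and quantization-aware-training components are omitted.

# Minimal sketch (not the paper's code): prune once by magnitude, then keep
# the resulting sparsity pattern fixed while fine-tuning on a downstream task.
# Assumed/hypothetical: sparsity=0.9 and the model/loader/loss_fn passed in.
import torch

def magnitude_masks(model, sparsity=0.9):
    """Build a {param_name: 0/1 mask} dict zeroing the smallest-magnitude weights."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                      # skip biases, LayerNorm, etc.
            continue
        k = max(1, int(sparsity * param.numel()))
        threshold = param.abs().flatten().kthvalue(k).values
        masks[name] = (param.abs() > threshold).float()
    return masks

def apply_masks(model, masks):
    """Re-zero pruned weights so the sparsity pattern survives each update."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

def finetune_sparse(model, loader, loss_fn, epochs=3, lr=2e-5):
    """Fine-tune while preserving a fixed, magnitude-derived sparsity pattern."""
    masks = magnitude_masks(model)               # "prune once" ...
    apply_masks(model, masks)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, labels in loader:
            opt.zero_grad()
            loss_fn(model(inputs), labels).backward()
            opt.step()
            apply_masks(model, masks)            # ... and keep the pattern fixed
    return model

Under these assumptions the 40X figure quoted in the abstract is also easy to sanity-check: 8-bit quantization shrinks fp32 weights by roughly 4x, and if about 90% of the encoder weights are zero and can be stored or skipped efficiently, that contributes roughly another 10x, on the order of 4 x 10 = 40x. The exact sparsity level is not stated in this record, so treat that breakdown as a plausible reading rather than a reported result.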