A Practical Survey on Faster and Lighter Transformers

Bibliographic Details
Published in: ACM Computing Surveys, 2023-07, Vol. 55 (14s), p. 1-40, Article 304
Main Authors: Fournier, Quentin; Caron, Gaétan Marceau; Aloise, Daniel
Format: Article
Language: English
Subjects: Computing methodologies; Neural networks
Publisher: New York, NY: ACM
DOI: 10.1145/3586074
ISSN: 0360-0300
EISSN: 1557-7341

Abstract:
Recurrent neural networks are effective models for processing sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model based solely on the attention mechanism, which is able to relate any two positions of the input sequence and hence model arbitrarily long dependencies. The Transformer has improved the state of the art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models' efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer's limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice to meet the desired tradeoff between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions.
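
To make the complexity claim concrete, the sketch below (not taken from the survey; the function names, the projection matrices E and F, and the dimensions are illustrative assumptions) contrasts standard scaled dot-product attention, whose n x n score matrix gives the quadratic cost in sequence length mentioned in the abstract, with a Linformer-style low-rank approximation that projects keys and values down to k << n rows, so the cost grows only linearly with n.

```python
# Minimal NumPy sketch (illustrative, not the paper's code): full scaled
# dot-product attention versus a Linformer-style low-rank approximation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Standard attention: the score matrix is n x n, hence O(n^2) time and memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (n, n)
    return softmax(scores) @ V               # (n, d)

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention: K and V are projected down to k rows (k << n),
    so the score matrix is only n x k, i.e., linear in the sequence length."""
    d = Q.shape[-1]
    scores = Q @ (E @ K).T / np.sqrt(d)      # (n, k)
    return softmax(scores) @ (F @ V)         # (n, d)

# Hypothetical dimensions: sequence length n, head dimension d, projected length k.
n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)  # learned projections in practice
F = rng.standard_normal((k, n)) / np.sqrt(n)
print(full_attention(Q, K, V).shape, linformer_attention(Q, K, V, E, F).shape)
```

Both variants return an (n, d) output; the difference is that the full version materializes an n x n attention matrix while the low-rank version only materializes an n x k one, which is the kind of capacity-versus-cost tradeoff the survey examines.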