A Practical Survey on Faster and Lighter Transformers

Bibliographic Details
Published in: ACM Computing Surveys, 2023-07, Vol. 55 (14s), p. 1-40, Article 304
Main Authors: Fournier, Quentin; Caron, Gaétan Marceau; Aloise, Daniel
Format: Article
Language: English
Subjects: Computing methodologies; Neural networks
Publisher: New York, NY: ACM
DOI: 10.1145/3586074
ISSN: 0360-0300
EISSN: 1557-7341

Abstract:
Recurrent neural networks are effective models for processing sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model based solely on the attention mechanism, which is able to relate any two positions of the input sequence and hence model arbitrarily long dependencies. The Transformer has improved the state of the art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models' efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer's limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice to meet the desired tradeoff between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions.
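
To make the complexity claim concrete, the sketch below (not taken from the survey; the function names, the projection matrices E and F, and the dimensions are illustrative assumptions) contrasts standard scaled dot-product attention, whose n x n score matrix gives the quadratic cost in sequence length mentioned in the abstract, with a Linformer-style low-rank approximation that projects keys and values down to k << n rows, so the cost grows only linearly with n.

```python
# Minimal NumPy sketch (illustrative, not the paper's code): full scaled
# dot-product attention versus a Linformer-style low-rank approximation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Standard attention: the score matrix is n x n, hence O(n^2) time and memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (n, n)
    return softmax(scores) @ V               # (n, d)

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention: K and V are projected down to k rows (k << n),
    so the score matrix is only n x k, i.e., linear in the sequence length."""
    d = Q.shape[-1]
    scores = Q @ (E @ K).T / np.sqrt(d)      # (n, k)
    return softmax(scores) @ (F @ V)         # (n, d)

# Hypothetical dimensions: sequence length n, head dimension d, projected length k.
n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)  # learned projections in practice
F = rng.standard_normal((k, n)) / np.sqrt(n)
print(full_attention(Q, K, V).shape, linformer_attention(Q, K, V, E, F).shape)
```

Both variants return an (n, d) output; the difference is that the full version materializes an n x n attention matrix while the low-rank version only materializes an n x k one, which is the kind of capacity-versus-cost tradeoff the survey examines.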