A Practical Survey on Faster and Lighter Transformers
Published in: | ACM Computing Surveys, 2023-07, Vol. 55 (14s), p. 1-40, Article 304
---|---
Main Authors: | Fournier, Quentin; Caron, Gaétan Marceau; Aloise, Daniel
Format: | Article
Language: | English
Subjects: | Computing methodologies; Neural networks
ISSN: | 0360-0300
EISSN: | 1557-7341
DOI: | 10.1145/3586074
Publisher: | ACM, New York, NY
Abstract:
Recurrent neural networks are effective models for processing sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model based solely on the attention mechanism, which can relate any two positions of the input sequence and hence model arbitrarily long dependencies. The Transformer has improved the state of the art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models' efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer's limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice to meet the desired tradeoff between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions.
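The quadratic cost the abstract refers to comes from materialising an n × n attention score matrix for a length-n sequence. The NumPy sketch below is not taken from the surveyed paper; it is a minimal illustration of standard scaled dot-product attention alongside a Linformer-style variant, one of the lower-complexity alternatives the abstract names, in which keys and values are projected along the sequence axis to k ≪ n. The projection matrix `E`, the sizes `n`, `d`, and `k`, and the function names are illustrative assumptions; Linformer learns its projections rather than sampling them randomly.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard scaled dot-product attention: the score matrix is (n, n),
    # hence O(n^2) time and memory in the sequence length n.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n)
    return softmax(scores) @ V                    # (n, d)

def linformer_style_attention(Q, K, V, E):
    # Low-rank variant in the spirit of Linformer: keys and values are
    # projected along the sequence axis to k << n, so the score matrix is
    # (n, k) and the cost drops to O(n * k). E is random here purely for
    # illustration; Linformer learns this projection.
    d = Q.shape[-1]
    scores = Q @ (E @ K).T / np.sqrt(d)           # (n, k)
    return softmax(scores) @ (E @ V)              # (n, d)

n, d, k = 1024, 64, 128                           # illustrative sizes
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)      # (k, n) projection

out_full = full_attention(Q, K, V)                # allocates a 1024 x 1024 score matrix
out_low = linformer_style_attention(Q, K, V, E)   # allocates a 1024 x 128 score matrix
print(out_full.shape, out_low.shape)              # (1024, 64) (1024, 64)
```

Both variants return an output of the same shape, but the full version allocates a 1024 × 1024 score matrix while the projected one allocates only 1024 × 128, which is the kind of compute/memory saving, traded against capacity, that the survey examines.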