
Auto Batching Scheme for Optimizing LSTM Inference on FPGA Platforms

This paper presents an innovative auto batching scheme designed to optimize Long Short-Term Memory (LSTM) inference on Field-Programmable Gate Array (FPGA) platforms. Existing block batching methods face challenges with LSTM models that have large hidden sizes: insufficient on-chip memory impedes prefetching and leads to repeated evictions and reloads, significantly reducing processing utilization. Our approach extends block batching with weight stationary block batching (WSBB), allowing computation without stalls regardless of prefetch availability. Additionally, bypass-enabled block batching (BEBB) ensures that even when on-chip memory is insufficient, on-chip contents are not contaminated while off-chip memory bandwidth is fully leveraged. Experimental results from both synthetic benchmarks (the DeepBench suite) and a real-world application (RNN-T) validate the superior performance and efficiency of the proposed method. Our auto batching scheme demonstrates up to a 3.7x speedup over previous block batching while maintaining high computational efficiency, even with limited on-chip memory. Furthermore, the FPGA-based implementation of our scheme achieves a 5x speedup over the CPU and 4.3x higher energy efficiency (GFLOP/s/W) compared to the GPU.
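
The weight stationary idea from the abstract can be sketched in a few lines. The NumPy fragment below is a minimal illustration under stated assumptions, not the authors' implementation: the function name wsbb_matvec, the block sizes, and the random data are hypothetical, and only the scheduling order (slice each weight block once, then reuse it for the entire batch before moving on) reflects the scheme the abstract describes.

    # Illustrative sketch of weight-stationary block batching (WSBB).
    # NOT the paper's implementation: names, block sizes, and data are
    # hypothetical; only the reuse-each-block-across-the-batch ordering
    # comes from the abstract.
    import numpy as np

    def wsbb_matvec(weight, inputs, block_rows=4, block_cols=4):
        """Blocked matrix-vector products for a batch of input vectors.

        weight : (H, D) array, e.g. a fused LSTM gate weight matrix
        inputs : (B, D) array, a batch of B input vectors
        Returns a (B, H) array. Each weight block is sliced ("loaded
        on-chip") once and reused for every input in the batch, so no
        block is evicted and reloaded per input.
        """
        H, D = weight.shape
        B = inputs.shape[0]
        out = np.zeros((B, H))
        for r0 in range(0, H, block_rows):
            for c0 in range(0, D, block_cols):
                # Load the weight block once; it stays "stationary".
                blk = weight[r0:r0 + block_rows, c0:c0 + block_cols]
                # Reuse the resident block for the whole batch.
                for b in range(B):
                    out[b, r0:r0 + block_rows] += blk @ inputs[b, c0:c0 + block_cols]
        return out

    # Sanity check against a plain batched matrix product.
    W = np.random.randn(16, 12)
    X = np.random.randn(8, 12)
    assert np.allclose(wsbb_matvec(W, X), X @ W.T)

On real hardware the same ordering means a weight block loaded into on-chip memory serves the whole batch before eviction, so the compute units never stall waiting for a per-input weight reload.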


Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, pp. 159380-159394
Main Authors: Kim, Byoung Jin; Chung, Eui-Young
Format: Article
Language:English
Subjects: Accelerator; batch processing; Chips (memory devices); Computational efficiency; Computational modeling; Computer memory; Efficiency; Field programmable gate arrays; FPGA; Inference; Logic gates; Long short term memory; LSTM; Memory management; Optimization; pipeline stalls; Platforms; Prefetching; System-on-chip; Throughput; Vectors
DOI: 10.1109/ACCESS.2024.3488033
ISSN: 2169-3536 (EISSN: 2169-3536)
Publisher: Piscataway: IEEE
Source: IEEE Xplore Open Access Journals