
Multiscale Co-Design Analysis of Energy, Latency, Area, and Accuracy of a ReRAM Analog Neural Training Accelerator

Neural networks are an increasingly attractive algorithm for natural language processing and pattern recognition. Deep networks with >50M parameters are made possible by modern GPU clusters operating at <50 pJ per op and, more recently, by production accelerators capable of <5 pJ per operation at the board level. However, with the slowing of CMOS scaling, new paradigms will be required to achieve the next several orders of magnitude in performance-per-watt gains. Using an analog resistive memory (ReRAM) crossbar to perform key matrix operations in an accelerator is an attractive option. This work presents a detailed design, using a state-of-the-art 14/16 nm PDK, of an analog crossbar circuit block that processes three key kernels required in training and inference of neural networks. A detailed circuit- and device-level analysis of energy, latency, area, and accuracy is given and compared to relevant designs using standard digital ReRAM and SRAM operations. The analog accelerator is shown to have a 270x energy and 540x latency advantage over a similar block using only digital ReRAM, and it requires only 11 fJ per multiply-and-accumulate (MAC). Compared to an SRAM-based accelerator, the energy is 430x better and the latency 34x better. Although training accuracy is degraded in the analog accelerator, several options to improve it are presented. The possible gains over a similar digital-only version of this accelerator block suggest that continued optimization of analog resistive memories is valuable. This detailed circuit and device analysis of a training accelerator may serve as a foundation for further architecture-level studies.
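
As a quick sanity check of the quoted figures (not part of the paper's own analysis), the per-MAC energies implied for the digital baselines can be derived from the abstract's 11 fJ/MAC analog figure and the 270x / 430x energy ratios. The short Python sketch below does only that arithmetic; the constants are taken from the abstract, and the derived values are illustrative assumptions, not numbers reported directly by the paper.

    # Back-of-envelope check of the energy figures quoted in the abstract.
    # The 11 fJ/MAC analog figure and the 270x / 430x ratios come from the
    # abstract; the implied digital-baseline energies are derived here for
    # illustration only.

    ANALOG_FJ_PER_MAC = 11.0      # analog ReRAM crossbar, per the abstract
    RERAM_DIGITAL_RATIO = 270.0   # analog energy advantage over digital ReRAM
    SRAM_RATIO = 430.0            # analog energy advantage over SRAM

    def implied_energy_fj(ratio: float, analog_fj: float = ANALOG_FJ_PER_MAC) -> float:
        """Energy per MAC implied for a digital baseline, in femtojoules."""
        return ratio * analog_fj

    if __name__ == "__main__":
        for name, ratio in [("digital ReRAM", RERAM_DIGITAL_RATIO), ("SRAM", SRAM_RATIO)]:
            fj = implied_energy_fj(ratio)
            print(f"{name:>13}: ~{fj:,.0f} fJ/MAC (~{fj / 1000:.1f} pJ/MAC)")
        # Prints roughly 3.0 pJ/MAC (digital ReRAM) and 4.7 pJ/MAC (SRAM),
        # consistent in scale with the <5 pJ/op board-level figure the
        # abstract cites for current production accelerators.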

Bibliographic Details
Published in: arXiv.org, 2018-02
Main Authors: Marinella, Matthew J; Agarwal, Sapan; Hsia, Alexander; Richter, Isaac; Jacobs-Gedrim, Robin; Niroula, John; Plimpton, Steven J; Ipek, Engin; James, Conrad D
Format: Article
Language: English
Subjects: Accelerators; Accuracy; Algorithms; Analog circuits; Circuit design; CMOS; Co-design; Design; Design analysis; Multiscale analysis; Natural language processing; Neural networks; Optimization; Pattern recognition; Static random access memory; Training
Online Access: Get full text
creator Marinella, Matthew J
Agarwal, Sapan
Hsia, Alexander
Richter, Isaac
Jacobs-Gedrim, Robin
Niroula, John
Plimpton, Steven J
Ipek, Engin
James, Conrad D
doi_str_mv 10.48550/arxiv.1707.09952
format article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2018-02
issn 2331-8422
language eng
recordid cdi_proquest_journals_2071604715
source Publicly Available Content Database
subjects Accelerators
Accuracy
Algorithms
Analog circuits
Circuit design
CMOS
Co-design
Design
Design analysis
Multiscale analysis
Natural language processing
Neural networks
Optimization
Pattern recognition
Static random access memory
Training
title Multiscale Co-Design Analysis of Energy, Latency, Area, and Accuracy of a ReRAM Analog Neural Training Accelerator