Fast Arbitrary Precision Floating Point on FPGA
Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8x speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3x speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351x and 191x CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10x speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375x CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Due to the significant dependence of some numerical codes on APFP, such as semidefinite program solvers, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration, published as an open source project.
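The core technique the abstract names is a recursively defined Karatsuba decomposition over fixed-width operands. A minimal sketch of that decomposition on Python integers (standing in for fixed-precision mantissas) is shown below; the block width `BASE_BITS` and the integer-only setting are illustrative assumptions, not the paper's actual DSP mapping or floating-point handling.

```python
BASE_BITS = 64  # illustrative cutoff: below this width, use native multiplication


def karatsuba(a: int, b: int, bits: int) -> int:
    """Multiply two non-negative `bits`-wide integers using Karatsuba's
    decomposition: 3 half-width multiplications instead of 4."""
    if bits <= BASE_BITS:
        return a * b
    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = karatsuba(a_lo, b_lo, half)
    hi = karatsuba(a_hi, b_hi, bits - half)
    # (a_lo + a_hi)(b_lo + b_hi) - lo - hi == a_lo*b_hi + a_hi*b_lo,
    # recovering the cross term with a single extra multiplication
    mid = karatsuba(a_lo + a_hi, b_lo + b_hi, bits - half + 1) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo
```

For the paper's 512-bit operands this recursion bottoms out after a few levels at machine-word-sized multiplies, which is where an FPGA implementation would substitute native DSP blocks.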
Published in: | arXiv.org, 2022-04 |
---|---|
Main Authors: | de Fine Licht, Johannes; Pattison, Christopher A; Ziogas, Alexandros Nikolaos; Simmons-Duffin, David; Hoefler, Torsten |
Format: | Article |
Language: | English |
Subjects: | Central processing units; CPUs; Field programmable gate arrays; Floating point arithmetic; Matrices (mathematics); Microprocessors; Multiplication; Pipelining (computers); Software |
Online Access: | Get full text |
container_title | arXiv.org |
---|---|
creator | de Fine Licht, Johannes; Pattison, Christopher A; Ziogas, Alexandros Nikolaos; Simmons-Duffin, David; Hoefler, Torsten |
description | Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8x speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3x speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351x and 191x CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10x speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375x CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Due to the significant dependence of some numerical codes on APFP, such as semidefinite program solvers, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration, published as an open source project. |
format | article |
publisher | Ithaca: Cornell University Library, arXiv.org |
publication_date | 2022-04-13 |
rights | 2022. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-04 |
issn | 2331-8422 |
language | eng |
source | Publicly Available Content Database |
subjects | Central processing units; CPUs; Field programmable gate arrays; Floating point arithmetic; Matrices (mathematics); Microprocessors; Multiplication; Pipelining (computers); Software |
title | Fast Arbitrary Precision Floating Point on FPGA |