Loading…

Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs

In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in...

Full description

Saved in:

Bibliographic Details
Main Authors:	Emmart, Niall, Luitjens, Justin, Weems, Charles, Woolley, Cliff
Format:	Conference Proceeding
Language:	English
Subjects:	Algorithm design and analysis GPGPU Graphics processing units Hardware Instruction sets Libraries Modular Exponentiation Modular Multiplication Multiple Precision Arithmetic NVIDIA Maxwell Registers RSA Throughput
Online Access:	Request full text
Tags:	Add Tag No Tags, Be the first to tag this record!

cited_by
cites
container_end_page	54
container_issue
container_start_page	47
container_title
container_volume
creator	Emmart, Niall Luitjens, Justin Weems, Charles Woolley, Cliff
description	In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.
doi_str_mv	10.1109/ARITH.2016.21
format	conference_proceeding
fullrecord	<record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_7563271</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>7563271</ieee_id><sourcerecordid>7563271</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43</originalsourceid><addsrcrecordid>eNotzDtPwzAUQGEjgUQpHZlYvDEl-PpxHYspKtBGaihCLWtl8A0ycpsqScXj14ME09G3HMYuQOQAwl2XT9VqnksBmEs4YhNnCzDC_RrQHrMRCFQZFoU7ZWd9_y4EOId2xG6W-yFu43fcvfG6DYfkO14f0hD3Kb76IbY73rQdf3iubqvyque1__yglPjscd2fs5PGp54m_x2z9f3dajrPFstZNS0XWQRrhozIknBSGXRaa0UCA9hCgQeNJKHBhlBKrTGEBnzwAkwojHshqUAZr9WYXf59IxFt9l3c-u5rYw0qaUH9ANfyRVo</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><source>IEEE Xplore All Conference Series</source><creator>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</creator><creatorcontrib>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</creatorcontrib><description>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</description><identifier>ISSN: 1063-6889</identifier><identifier>EISBN: 9781509016167</identifier><identifier>EISBN: 1509016163</identifier><identifier>DOI: 10.1109/ARITH.2016.21</identifier><identifier>CODEN: IEEPAD</identifier><language>eng</language><publisher>IEEE</publisher><subject>Algorithm design and analysis ; GPGPU ; Graphics processing units ; Hardware ; Instruction sets ; Libraries ; Modular Exponentiation ; Modular Multiplication ; Multiple Precision Arithmetic ; NVIDIA Maxwell ; Registers ; RSA ; Throughput</subject><ispartof>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), 2016, p.47-54</ispartof><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/7563271$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,27925,54555,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/7563271$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Emmart, Niall</creatorcontrib><creatorcontrib>Luitjens, Justin</creatorcontrib><creatorcontrib>Weems, Charles</creatorcontrib><creatorcontrib>Woolley, Cliff</creatorcontrib><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><title>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)</title><addtitle>ARITH</addtitle><description>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</description><subject>Algorithm design and analysis</subject><subject>GPGPU</subject><subject>Graphics processing units</subject><subject>Hardware</subject><subject>Instruction sets</subject><subject>Libraries</subject><subject>Modular Exponentiation</subject><subject>Modular Multiplication</subject><subject>Multiple Precision Arithmetic</subject><subject>NVIDIA Maxwell</subject><subject>Registers</subject><subject>RSA</subject><subject>Throughput</subject><issn>1063-6889</issn><isbn>9781509016167</isbn><isbn>1509016163</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2016</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNotzDtPwzAUQGEjgUQpHZlYvDEl-PpxHYspKtBGaihCLWtl8A0ycpsqScXj14ME09G3HMYuQOQAwl2XT9VqnksBmEs4YhNnCzDC_RrQHrMRCFQZFoU7ZWd9_y4EOId2xG6W-yFu43fcvfG6DYfkO14f0hD3Kb76IbY73rQdf3iubqvyque1__yglPjscd2fs5PGp54m_x2z9f3dajrPFstZNS0XWQRrhozIknBSGXRaa0UCA9hCgQeNJKHBhlBKrTGEBnzwAkwojHshqUAZr9WYXf59IxFt9l3c-u5rYw0qaUH9ANfyRVo</recordid><startdate>201607</startdate><enddate>201607</enddate><creator>Emmart, Niall</creator><creator>Luitjens, Justin</creator><creator>Weems, Charles</creator><creator>Woolley, Cliff</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>201607</creationdate><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><author>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2016</creationdate><topic>Algorithm design and analysis</topic><topic>GPGPU</topic><topic>Graphics processing units</topic><topic>Hardware</topic><topic>Instruction sets</topic><topic>Libraries</topic><topic>Modular Exponentiation</topic><topic>Modular Multiplication</topic><topic>Multiple Precision Arithmetic</topic><topic>NVIDIA Maxwell</topic><topic>Registers</topic><topic>RSA</topic><topic>Throughput</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Emmart, Niall</creatorcontrib><creatorcontrib>Luitjens, Justin</creatorcontrib><creatorcontrib>Weems, Charles</creatorcontrib><creatorcontrib>Woolley, Cliff</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE/IET Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Emmart, Niall</au><au>Luitjens, Justin</au><au>Weems, Charles</au><au>Woolley, Cliff</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</atitle><btitle>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)</btitle><stitle>ARITH</stitle><date>2016-07</date><risdate>2016</risdate><spage>47</spage><epage>54</epage><pages>47-54</pages><issn>1063-6889</issn><eisbn>9781509016167</eisbn><eisbn>1509016163</eisbn><coden>IEEPAD</coden><abstract>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</abstract><pub>IEEE</pub><doi>10.1109/ARITH.2016.21</doi><tpages>8</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1063-6889
ispartof	2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), 2016, p.47-54
issn	1063-6889
language	eng
recordid	cdi_ieee_primary_7563271
source	IEEE Xplore All Conference Series
subjects	Algorithm design and analysis GPGPU Graphics processing units Hardware Instruction sets Libraries Modular Exponentiation Modular Multiplication Multiple Precision Arithmetic NVIDIA Maxwell Registers RSA Throughput
title	Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T13%3A19%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Optimizing%20Modular%20Multiplication%20for%20NVIDIA's%20Maxwell%20GPUs&rft.btitle=2016%20IEEE%2023nd%20Symposium%20on%20Computer%20Arithmetic%20(ARITH)&rft.au=Emmart,%20Niall&rft.date=2016-07&rft.spage=47&rft.epage=54&rft.pages=47-54&rft.issn=1063-6889&rft.coden=IEEPAD&rft_id=info:doi/10.1109/ARITH.2016.21&rft.eisbn=9781509016167&rft.eisbn_list=1509016163&rft_dat=%3Cieee_CHZPO%3E7563271%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=7563271&rfr_iscdi=true