Loading…
Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs
In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in...
Saved in:
Main Authors: | , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Online Access: | Request full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
cited_by | |
---|---|
cites | |
container_end_page | 54 |
container_issue | |
container_start_page | 47 |
container_title | |
container_volume | |
creator | Emmart, Niall Luitjens, Justin Weems, Charles Woolley, Cliff |
description | In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti. |
doi_str_mv | 10.1109/ARITH.2016.21 |
format | conference_proceeding |
fullrecord | <record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_7563271</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>7563271</ieee_id><sourcerecordid>7563271</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43</originalsourceid><addsrcrecordid>eNotzDtPwzAUQGEjgUQpHZlYvDEl-PpxHYspKtBGaihCLWtl8A0ycpsqScXj14ME09G3HMYuQOQAwl2XT9VqnksBmEs4YhNnCzDC_RrQHrMRCFQZFoU7ZWd9_y4EOId2xG6W-yFu43fcvfG6DYfkO14f0hD3Kb76IbY73rQdf3iubqvyque1__yglPjscd2fs5PGp54m_x2z9f3dajrPFstZNS0XWQRrhozIknBSGXRaa0UCA9hCgQeNJKHBhlBKrTGEBnzwAkwojHshqUAZr9WYXf59IxFt9l3c-u5rYw0qaUH9ANfyRVo</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><source>IEEE Xplore All Conference Series</source><creator>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</creator><creatorcontrib>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</creatorcontrib><description>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</description><identifier>ISSN: 1063-6889</identifier><identifier>EISBN: 9781509016167</identifier><identifier>EISBN: 1509016163</identifier><identifier>DOI: 10.1109/ARITH.2016.21</identifier><identifier>CODEN: IEEPAD</identifier><language>eng</language><publisher>IEEE</publisher><subject>Algorithm design and analysis ; GPGPU ; Graphics processing units ; Hardware ; Instruction sets ; Libraries ; Modular Exponentiation ; Modular Multiplication ; Multiple Precision Arithmetic ; NVIDIA Maxwell ; Registers ; RSA ; Throughput</subject><ispartof>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), 2016, p.47-54</ispartof><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/7563271$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,27925,54555,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/7563271$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Emmart, Niall</creatorcontrib><creatorcontrib>Luitjens, Justin</creatorcontrib><creatorcontrib>Weems, Charles</creatorcontrib><creatorcontrib>Woolley, Cliff</creatorcontrib><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><title>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)</title><addtitle>ARITH</addtitle><description>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</description><subject>Algorithm design and analysis</subject><subject>GPGPU</subject><subject>Graphics processing units</subject><subject>Hardware</subject><subject>Instruction sets</subject><subject>Libraries</subject><subject>Modular Exponentiation</subject><subject>Modular Multiplication</subject><subject>Multiple Precision Arithmetic</subject><subject>NVIDIA Maxwell</subject><subject>Registers</subject><subject>RSA</subject><subject>Throughput</subject><issn>1063-6889</issn><isbn>9781509016167</isbn><isbn>1509016163</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2016</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNotzDtPwzAUQGEjgUQpHZlYvDEl-PpxHYspKtBGaihCLWtl8A0ycpsqScXj14ME09G3HMYuQOQAwl2XT9VqnksBmEs4YhNnCzDC_RrQHrMRCFQZFoU7ZWd9_y4EOId2xG6W-yFu43fcvfG6DYfkO14f0hD3Kb76IbY73rQdf3iubqvyque1__yglPjscd2fs5PGp54m_x2z9f3dajrPFstZNS0XWQRrhozIknBSGXRaa0UCA9hCgQeNJKHBhlBKrTGEBnzwAkwojHshqUAZr9WYXf59IxFt9l3c-u5rYw0qaUH9ANfyRVo</recordid><startdate>201607</startdate><enddate>201607</enddate><creator>Emmart, Niall</creator><creator>Luitjens, Justin</creator><creator>Weems, Charles</creator><creator>Woolley, Cliff</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>201607</creationdate><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><author>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2016</creationdate><topic>Algorithm design and analysis</topic><topic>GPGPU</topic><topic>Graphics processing units</topic><topic>Hardware</topic><topic>Instruction sets</topic><topic>Libraries</topic><topic>Modular Exponentiation</topic><topic>Modular Multiplication</topic><topic>Multiple Precision Arithmetic</topic><topic>NVIDIA Maxwell</topic><topic>Registers</topic><topic>RSA</topic><topic>Throughput</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Emmart, Niall</creatorcontrib><creatorcontrib>Luitjens, Justin</creatorcontrib><creatorcontrib>Weems, Charles</creatorcontrib><creatorcontrib>Woolley, Cliff</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE/IET Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Emmart, Niall</au><au>Luitjens, Justin</au><au>Weems, Charles</au><au>Woolley, Cliff</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</atitle><btitle>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)</btitle><stitle>ARITH</stitle><date>2016-07</date><risdate>2016</risdate><spage>47</spage><epage>54</epage><pages>47-54</pages><issn>1063-6889</issn><eisbn>9781509016167</eisbn><eisbn>1509016163</eisbn><coden>IEEPAD</coden><abstract>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</abstract><pub>IEEE</pub><doi>10.1109/ARITH.2016.21</doi><tpages>8</tpages></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1063-6889 |
ispartof | 2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), 2016, p.47-54 |
issn | 1063-6889 |
language | eng |
recordid | cdi_ieee_primary_7563271 |
source | IEEE Xplore All Conference Series |
subjects | Algorithm design and analysis GPGPU Graphics processing units Hardware Instruction sets Libraries Modular Exponentiation Modular Multiplication Multiple Precision Arithmetic NVIDIA Maxwell Registers RSA Throughput |
title | Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T13%3A19%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Optimizing%20Modular%20Multiplication%20for%20NVIDIA's%20Maxwell%20GPUs&rft.btitle=2016%20IEEE%2023nd%20Symposium%20on%20Computer%20Arithmetic%20(ARITH)&rft.au=Emmart,%20Niall&rft.date=2016-07&rft.spage=47&rft.epage=54&rft.pages=47-54&rft.issn=1063-6889&rft.coden=IEEPAD&rft_id=info:doi/10.1109/ARITH.2016.21&rft.eisbn=9781509016167&rft.eisbn_list=1509016163&rft_dat=%3Cieee_CHZPO%3E7563271%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=7563271&rfr_iscdi=true |