Loading…

Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs

In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in...

Full description

Saved in:
Bibliographic Details
Main Authors: Emmart, Niall, Luitjens, Justin, Weems, Charles, Woolley, Cliff
Format: Conference Proceeding
Language:English
Subjects:
Online Access:Request full text
Tags: Add Tag
No Tags, Be the first to tag this record!
cited_by
cites
container_end_page 54
container_issue
container_start_page 47
container_title
container_volume
creator Emmart, Niall
Luitjens, Justin
Weems, Charles
Woolley, Cliff
description In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.
doi_str_mv 10.1109/ARITH.2016.21
format conference_proceeding
fullrecord <record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_7563271</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>7563271</ieee_id><sourcerecordid>7563271</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43</originalsourceid><addsrcrecordid>eNotzDtPwzAUQGEjgUQpHZlYvDEl-PpxHYspKtBGaihCLWtl8A0ycpsqScXj14ME09G3HMYuQOQAwl2XT9VqnksBmEs4YhNnCzDC_RrQHrMRCFQZFoU7ZWd9_y4EOId2xG6W-yFu43fcvfG6DYfkO14f0hD3Kb76IbY73rQdf3iubqvyque1__yglPjscd2fs5PGp54m_x2z9f3dajrPFstZNS0XWQRrhozIknBSGXRaa0UCA9hCgQeNJKHBhlBKrTGEBnzwAkwojHshqUAZr9WYXf59IxFt9l3c-u5rYw0qaUH9ANfyRVo</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><source>IEEE Xplore All Conference Series</source><creator>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</creator><creatorcontrib>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</creatorcontrib><description>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</description><identifier>ISSN: 1063-6889</identifier><identifier>EISBN: 9781509016167</identifier><identifier>EISBN: 1509016163</identifier><identifier>DOI: 10.1109/ARITH.2016.21</identifier><identifier>CODEN: IEEPAD</identifier><language>eng</language><publisher>IEEE</publisher><subject>Algorithm design and analysis ; GPGPU ; Graphics processing units ; Hardware ; Instruction sets ; Libraries ; Modular Exponentiation ; Modular Multiplication ; Multiple Precision Arithmetic ; NVIDIA Maxwell ; Registers ; RSA ; Throughput</subject><ispartof>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), 2016, p.47-54</ispartof><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/7563271$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,27925,54555,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/7563271$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Emmart, Niall</creatorcontrib><creatorcontrib>Luitjens, Justin</creatorcontrib><creatorcontrib>Weems, Charles</creatorcontrib><creatorcontrib>Woolley, Cliff</creatorcontrib><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><title>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)</title><addtitle>ARITH</addtitle><description>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</description><subject>Algorithm design and analysis</subject><subject>GPGPU</subject><subject>Graphics processing units</subject><subject>Hardware</subject><subject>Instruction sets</subject><subject>Libraries</subject><subject>Modular Exponentiation</subject><subject>Modular Multiplication</subject><subject>Multiple Precision Arithmetic</subject><subject>NVIDIA Maxwell</subject><subject>Registers</subject><subject>RSA</subject><subject>Throughput</subject><issn>1063-6889</issn><isbn>9781509016167</isbn><isbn>1509016163</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2016</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNotzDtPwzAUQGEjgUQpHZlYvDEl-PpxHYspKtBGaihCLWtl8A0ycpsqScXj14ME09G3HMYuQOQAwl2XT9VqnksBmEs4YhNnCzDC_RrQHrMRCFQZFoU7ZWd9_y4EOId2xG6W-yFu43fcvfG6DYfkO14f0hD3Kb76IbY73rQdf3iubqvyque1__yglPjscd2fs5PGp54m_x2z9f3dajrPFstZNS0XWQRrhozIknBSGXRaa0UCA9hCgQeNJKHBhlBKrTGEBnzwAkwojHshqUAZr9WYXf59IxFt9l3c-u5rYw0qaUH9ANfyRVo</recordid><startdate>201607</startdate><enddate>201607</enddate><creator>Emmart, Niall</creator><creator>Luitjens, Justin</creator><creator>Weems, Charles</creator><creator>Woolley, Cliff</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>201607</creationdate><title>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</title><author>Emmart, Niall ; Luitjens, Justin ; Weems, Charles ; Woolley, Cliff</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2016</creationdate><topic>Algorithm design and analysis</topic><topic>GPGPU</topic><topic>Graphics processing units</topic><topic>Hardware</topic><topic>Instruction sets</topic><topic>Libraries</topic><topic>Modular Exponentiation</topic><topic>Modular Multiplication</topic><topic>Multiple Precision Arithmetic</topic><topic>NVIDIA Maxwell</topic><topic>Registers</topic><topic>RSA</topic><topic>Throughput</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Emmart, Niall</creatorcontrib><creatorcontrib>Luitjens, Justin</creatorcontrib><creatorcontrib>Weems, Charles</creatorcontrib><creatorcontrib>Woolley, Cliff</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE/IET Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Emmart, Niall</au><au>Luitjens, Justin</au><au>Weems, Charles</au><au>Woolley, Cliff</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs</atitle><btitle>2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH)</btitle><stitle>ARITH</stitle><date>2016-07</date><risdate>2016</risdate><spage>47</spage><epage>54</epage><pages>47-54</pages><issn>1063-6889</issn><eisbn>9781509016167</eisbn><eisbn>1509016163</eisbn><coden>IEEPAD</coden><abstract>In this paper we show how we were able to achieve record rates of multiple precision (MP) modular multiplication (mulmod) operations in the new NVIDIA MP math library (XMP) on Maxwell, NVIDIA's most recent generation of graphics processing units (GPUs). Mulmod is a key operation that is used in multiple places within the MP library, and has many real world applications, especially in cryptography, which makes it important to achieve a highly optimized implementation. Here we reveal how multiple techniques were combined to make the best use of the GPU'sinstructions, registers, memory, and threads. A particularly interesting algorithmic aspect, designed to work with the 16-bit hardware multipliers found in Maxwell, is the use of a two-pass process to first compute unaligned partial products, then shift the result 16 bits to the left, then compute the aligned partial products. The new algorithms are much faster than the prior, state of the art, row-oriented multiply and reduce approach, achieving speedups of 61% at 256 bits, and 117% at 512 bits, with peaks rates of 4027 million mulmod operations at 256 bits and 1081 million at 512 bits on a GTX 980Ti.</abstract><pub>IEEE</pub><doi>10.1109/ARITH.2016.21</doi><tpages>8</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1063-6889
ispartof 2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), 2016, p.47-54
issn 1063-6889
language eng
recordid cdi_ieee_primary_7563271
source IEEE Xplore All Conference Series
subjects Algorithm design and analysis
GPGPU
Graphics processing units
Hardware
Instruction sets
Libraries
Modular Exponentiation
Modular Multiplication
Multiple Precision Arithmetic
NVIDIA Maxwell
Registers
RSA
Throughput
title Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T13%3A19%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Optimizing%20Modular%20Multiplication%20for%20NVIDIA's%20Maxwell%20GPUs&rft.btitle=2016%20IEEE%2023nd%20Symposium%20on%20Computer%20Arithmetic%20(ARITH)&rft.au=Emmart,%20Niall&rft.date=2016-07&rft.spage=47&rft.epage=54&rft.pages=47-54&rft.issn=1063-6889&rft.coden=IEEPAD&rft_id=info:doi/10.1109/ARITH.2016.21&rft.eisbn=9781509016167&rft.eisbn_list=1509016163&rft_dat=%3Cieee_CHZPO%3E7563271%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i175t-ee7e09235694443e06d17831a146e21f6fe622446ddf1ada015d859be23135a43%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=7563271&rfr_iscdi=true