CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems

Bibliographic Details
Published in: ACM Transactions on Architecture and Code Optimization, 2018-10, Vol. 15 (3), p. 1-23
Main Authors: Kim, Hyojong; Hadidi, Ramyad; Nai, Lifeng; Kim, Hyesoon; Jayasena, Nuwan; Eckert, Yasuko; Kayiran, Onur; Loh, Gabriel
Format: Article
Language: English
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3232521
Online Access: https://doi.org/10.1145/3232521

Description:
To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques traditionally used to hide memory latency and improve thread-level parallelism (TLP), namely memory interleaving and thread block scheduling, are at odds with the efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when data and compute are misaligned. Nondeterministic thread block scheduling to improve compute resource utilization impedes co-placement of compute and data.

Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach. To this end, we propose a mechanism that identifies exclusively accessed data and places that data in the same GPU as the thread block that accesses it. The key ideas are (1) the amount of data exclusively used by a thread block can be estimated, and such exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) using an affinity-based thread block scheduling policy, we can co-place compute and data; and (3) by using dual address mode with lightweight changes to virtual-to-physical page mappings, we can selectively choose differently interleaved memory pages for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote traffic by 38% over a baseline system.
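The abstract contrasts fine-grained interleaving, which spreads consecutive chunks of an address range across GPUs for bandwidth, with coarse-grained (page-granularity) interleaving, which keeps a whole page on one GPU so exclusive data can be localized. The minimal C++ sketch below illustrates that trade-off; the constants (4 GPUs, a 256 B fine-grained stride, 4 KiB pages) and the modulo home-GPU mapping are illustrative assumptions, not details taken from the paper.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kNumGpus   = 4;
constexpr uint64_t kFineGrain = 256;   // assumed fine-grained interleaving stride
constexpr uint64_t kPageSize  = 4096;  // assumed coarse-grained (page) stride

// Fine-grained interleaving: consecutive 256 B chunks rotate across GPUs,
// spreading bandwidth but scattering any one thread block's data everywhere.
uint64_t fineGrainedHome(uint64_t addr) { return (addr / kFineGrain) % kNumGpus; }

// Coarse-grained (page) interleaving: a whole 4 KiB page stays on one GPU,
// so data a single thread block uses exclusively can be kept local.
uint64_t pageGrainedHome(uint64_t addr) { return (addr / kPageSize) % kNumGpus; }

int main() {
  const uint64_t samples[] = {0, 256, 512, 4096, 4352, 8192};
  for (uint64_t addr : samples)
    std::printf("addr %5llu -> fine-grained: GPU %llu, page-grained: GPU %llu\n",
                (unsigned long long)addr,
                (unsigned long long)fineGrainedHome(addr),
                (unsigned long long)pageGrainedHome(addr));
  return 0;
}
```

Under fine-grained interleaving, addresses only 256 B apart land on different GPUs even within a single page, which is precisely why a thread block's exclusive data cannot be localized at that granularity.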
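Key idea (2) runs each thread block on the GPU that is home to its data. The following sketch of such an affinity-based policy assumes, purely for illustration, that thread block i exclusively accesses a contiguous, equally sized slice of one array; the helper names and the 8 KiB per-block footprint are hypothetical, and the paper estimates the footprint rather than assuming it.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kNumGpus  = 4;
constexpr uint64_t kPageSize = 4096;

// Home GPU of a page under coarse-grained (page) interleaving.
uint64_t pageHome(uint64_t addr) { return (addr / kPageSize) % kNumGpus; }

// Affinity-based scheduling sketch: if thread block `blockId` exclusively
// touches bytes starting at base + blockId * bytesPerBlock, run it on the
// GPU that owns the first page of that slice, so its accesses stay local.
uint64_t scheduleBlock(uint64_t base, uint64_t blockId, uint64_t bytesPerBlock) {
  return pageHome(base + blockId * bytesPerBlock);
}

int main() {
  const uint64_t base = 0, bytesPerBlock = 8192;  // assumed 8 KiB per block
  for (uint64_t b = 0; b < 8; ++b)
    std::printf("thread block %llu -> GPU %llu\n",
                (unsigned long long)b,
                (unsigned long long)scheduleBlock(base, b, bytesPerBlock));
  return 0;
}
```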
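Key idea (3), dual address mode, lets each data structure pick its own interleaving through lightweight changes to virtual-to-physical page mappings. The toy model below shows only the selection step: shared structures keep fine-grained interleaving for bandwidth, while exclusively accessed ones use page interleaving so they can be co-located. The AddrMode flag, the strides, and the modulo mapping are all assumptions made for illustration, not the paper's mechanism.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kNumGpus   = 4;
constexpr uint64_t kFineGrain = 256;
constexpr uint64_t kPageSize  = 4096;

// Per-data-structure mode: Fine for shared data, Coarse for data a single
// thread block accesses exclusively.
enum class AddrMode { Fine, Coarse };

uint64_t homeGpu(uint64_t addr, AddrMode mode) {
  const uint64_t stride = (mode == AddrMode::Fine) ? kFineGrain : kPageSize;
  return (addr / stride) % kNumGpus;
}

int main() {
  // Hypothetical allocations: a shared table vs. a block-private buffer,
  // both touching the same address, interleaved differently.
  std::printf("shared  addr 4096 -> GPU %llu\n",
              (unsigned long long)homeGpu(4096, AddrMode::Fine));
  std::printf("private addr 4096 -> GPU %llu\n",
              (unsigned long long)homeGpu(4096, AddrMode::Coarse));
  return 0;
}
```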