CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems

Bibliographic Details
Published in: ACM Transactions on Architecture and Code Optimization, 2018-10, Vol. 15 (3), p. 1-23
Main Authors: Kim, Hyojong; Hadidi, Ramyad; Nai, Lifeng; Kim, Hyesoon; Jayasena, Nuwan; Eckert, Yasuko; Kayiran, Onur; Loh, Gabriel
Format: Article
Language: English
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3232521
Online Access: https://doi.org/10.1145/3232521

Description:
To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques traditionally used to hide memory latency and improve thread-level parallelism (TLP), namely memory interleaving and thread block scheduling, are at odds with the efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when data and compute are misaligned. Nondeterministic thread block scheduling to improve compute resource utilization impedes co-placement of compute and data.

Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach. To this end, we propose a mechanism that identifies exclusively accessed data and places that data in the same GPU as the thread block that accesses it. The key ideas are (1) the amount of data exclusively used by a thread block can be estimated, and such exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) using an affinity-based thread block scheduling policy, we can co-place compute and data; and (3) by using dual address mode with lightweight changes to virtual-to-physical page mappings, we can selectively choose differently interleaved memory pages for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote traffic by 38% over a baseline system.
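The abstract contrasts fine-grained interleaving, which spreads consecutive chunks of an address range across GPUs for bandwidth, with coarse-grained (page-granularity) interleaving, which keeps a whole page on one GPU so exclusive data can be localized. The minimal C++ sketch below illustrates that trade-off; the constants (4 GPUs, a 256 B fine-grained stride, 4 KiB pages) and the modulo home-GPU mapping are illustrative assumptions, not details taken from the paper.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kNumGpus   = 4;
constexpr uint64_t kFineGrain = 256;   // assumed fine-grained interleaving stride
constexpr uint64_t kPageSize  = 4096;  // assumed coarse-grained (page) stride

// Fine-grained interleaving: consecutive 256 B chunks rotate across GPUs,
// spreading bandwidth but scattering any one thread block's data everywhere.
uint64_t fineGrainedHome(uint64_t addr) { return (addr / kFineGrain) % kNumGpus; }

// Coarse-grained (page) interleaving: a whole 4 KiB page stays on one GPU,
// so data a single thread block uses exclusively can be kept local.
uint64_t pageGrainedHome(uint64_t addr) { return (addr / kPageSize) % kNumGpus; }

int main() {
  const uint64_t samples[] = {0, 256, 512, 4096, 4352, 8192};
  for (uint64_t addr : samples)
    std::printf("addr %5llu -> fine-grained: GPU %llu, page-grained: GPU %llu\n",
                (unsigned long long)addr,
                (unsigned long long)fineGrainedHome(addr),
                (unsigned long long)pageGrainedHome(addr));
  return 0;
}
```

Under fine-grained interleaving, addresses only 256 B apart land on different GPUs even within a single page, which is precisely why a thread block's exclusive data cannot be localized at that granularity.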
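Key idea (2) runs each thread block on the GPU that is home to its data. The following sketch of such an affinity-based policy assumes, purely for illustration, that thread block i exclusively accesses a contiguous, equally sized slice of one array; the helper names and the 8 KiB per-block footprint are hypothetical, and the paper estimates the footprint rather than assuming it.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kNumGpus  = 4;
constexpr uint64_t kPageSize = 4096;

// Home GPU of a page under coarse-grained (page) interleaving.
uint64_t pageHome(uint64_t addr) { return (addr / kPageSize) % kNumGpus; }

// Affinity-based scheduling sketch: if thread block `blockId` exclusively
// touches bytes starting at base + blockId * bytesPerBlock, run it on the
// GPU that owns the first page of that slice, so its accesses stay local.
uint64_t scheduleBlock(uint64_t base, uint64_t blockId, uint64_t bytesPerBlock) {
  return pageHome(base + blockId * bytesPerBlock);
}

int main() {
  const uint64_t base = 0, bytesPerBlock = 8192;  // assumed 8 KiB per block
  for (uint64_t b = 0; b < 8; ++b)
    std::printf("thread block %llu -> GPU %llu\n",
                (unsigned long long)b,
                (unsigned long long)scheduleBlock(base, b, bytesPerBlock));
  return 0;
}
```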
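Key idea (3), dual address mode, lets each data structure pick its own interleaving through lightweight changes to virtual-to-physical page mappings. The toy model below shows only the selection step: shared structures keep fine-grained interleaving for bandwidth, while exclusively accessed ones use page interleaving so they can be co-located. The AddrMode flag, the strides, and the modulo mapping are all assumptions made for illustration, not the paper's mechanism.

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t kNumGpus   = 4;
constexpr uint64_t kFineGrain = 256;
constexpr uint64_t kPageSize  = 4096;

// Per-data-structure mode: Fine for shared data, Coarse for data a single
// thread block accesses exclusively.
enum class AddrMode { Fine, Coarse };

uint64_t homeGpu(uint64_t addr, AddrMode mode) {
  const uint64_t stride = (mode == AddrMode::Fine) ? kFineGrain : kPageSize;
  return (addr / stride) % kNumGpus;
}

int main() {
  // Hypothetical allocations: a shared table vs. a block-private buffer,
  // both touching the same address, interleaved differently.
  std::printf("shared  addr 4096 -> GPU %llu\n",
              (unsigned long long)homeGpu(4096, AddrMode::Fine));
  std::printf("private addr 4096 -> GPU %llu\n",
              (unsigned long long)homeGpu(4096, AddrMode::Coarse));
  return 0;
}
```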