Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs
The race for Exascale computing has naturally led the current technologies to converge to multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the issue of s...
Main Authors: | Lima, J. V. F., Gautier, T., Maillard, N., Danjean, V. |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | Computer architecture; data flow model; dense linear algebra; Graphics processing units; Kernel; Linear algebra; multi-GPUs; Runtime; work stealing |
Online Access: | Request full text |
Tags: | |
cited_by | |
---|---|
cites | |
container_end_page | 82 |
container_issue | |
container_start_page | 75 |
container_title | |
container_volume | |
creator | Lima, J. V. F.; Gautier, T.; Maillard, N.; Danjean, V. |
description | The race for Exascale computing has naturally led current technologies to converge on multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the problem of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data-transfer limitations on such platforms, the available software packages usually compute, before execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid: it cannot adapt the execution to variations of the system or of the application's load. We propose a solution that is orthogonal to the ones mentioned above: extensions of the Xkaapi software stack that make it possible to exploit the full performance of a multi-GPU system through asynchronous GPU tasks. Xkaapi schedules tasks using a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers with task execution on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as computing a scheduling decision for reducing the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain the peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU's memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code. |
doi_str_mv | 10.1109/SBAC-PAD.2012.28 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1550-6533 |
ispartof | 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012, p.75-82 |
issn | 1550-6533 2643-3001 |
language | eng |
recordid | cdi_ieee_primary_6374774 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Computer architecture; data flow model; dense linear algebra; Graphics processing units; Kernel; Linear algebra; multi-GPUs; Runtime; work stealing |
title | Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T14%3A52%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Exploiting%20Concurrent%20GPU%20Operations%20for%20Efficient%20Work%20Stealing%20on%20Multi-GPUs&rft.btitle=2012%20IEEE%2024th%20International%20Symposium%20on%20Computer%20Architecture%20and%20High%20Performance%20Computing&rft.au=Lima,%20J.%20V.%20F.&rft.date=2012-10&rft.spage=75&rft.epage=82&rft.pages=75-82&rft.issn=1550-6533&rft.eissn=2643-3001&rft.isbn=1467347906&rft.isbn_list=9781467347907&rft.coden=IEEPAD&rft_id=info:doi/10.1109/SBAC-PAD.2012.28&rft.eisbn=0769549071&rft.eisbn_list=9780769549071&rft_dat=%3Cieee_6IE%3E6374774%3C/ieee_6IE%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-h1668-3d376731411546d28d7d16aa0bd93f5e858d2311600f24861e2e55c42b5d087b3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6374774&rfr_iscdi=true |
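The abstract states that Xkaapi schedules tasks with a standard work-stealing algorithm: each worker keeps its own deque of ready tasks, the owner pushes and pops at one end, and an idle worker steals from the opposite end of a victim's deque. As a rough, language-agnostic illustration of that scheme only (this is not Xkaapi's actual implementation, and the class and method names here are hypothetical), a minimal work-stealing structure might look like:

```python
import collections
import random
import threading


class WorkStealingScheduler:
    """Toy work-stealing scheduler: one deque per worker.

    Owners pop from the back (LIFO, good cache locality);
    thieves steal from the front (FIFO, taking the oldest work).
    """

    def __init__(self, n_workers):
        self.deques = [collections.deque() for _ in range(n_workers)]
        self.locks = [threading.Lock() for _ in range(n_workers)]

    def push(self, worker, task):
        # Owner pushes new ready tasks onto the back of its own deque.
        with self.locks[worker]:
            self.deques[worker].append(task)

    def pop_or_steal(self, worker):
        # Fast path: pop the most recently pushed task from own deque.
        with self.locks[worker]:
            if self.deques[worker]:
                return self.deques[worker].pop()
        # Slow path: pick victims in random order, steal from the front.
        victims = [v for v in range(len(self.deques)) if v != worker]
        random.shuffle(victims)
        for v in victims:
            with self.locks[v]:
                if self.deques[v]:
                    return self.deques[v].popleft()
        return None  # no work anywhere
```

Stealing from the opposite end is the classic design choice: it minimizes contention between owner and thief, and thieves tend to grab the oldest (often largest) subtrees of work.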
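The runtime idea the abstract emphasizes, overlapping data transfers with task execution, is typically realized with double buffering: while the kernel for block i runs, the transfer for block i+1 is already in flight on a separate asynchronous queue (on NVIDIA hardware, CUDA streams and the copy engines). A hypothetical CPU-only sketch of the resulting two-stage software pipeline, which only models the schedule and touches no GPU API:

```python
def pipeline_schedule(n_blocks):
    """Return the step-by-step schedule of a 2-stage pipeline.

    'xfer(i)' models the asynchronous host->GPU copy of block i;
    'exec(i)' models the kernel running on block i.  With double
    buffering, exec(i) and xfer(i+1) share the same time step, so
    all transfers except the first are hidden behind computation.
    """
    schedule = []
    # Prologue: the first transfer has nothing to overlap with.
    schedule.append(["xfer(0)"])
    # Steady state: compute block i while transferring block i+1.
    for i in range(n_blocks - 1):
        schedule.append([f"exec({i})", f"xfer({i + 1})"])
    # Epilogue: the last block only has its execution left.
    schedule.append([f"exec({n_blocks - 1})"])
    return schedule
```

With n blocks this schedule takes n + 1 steps instead of the 2n steps of a strictly serial transfer-then-compute loop, which is the kind of overlap that lets a runtime sustain kernel peak even when the full matrix does not fit in one GPU's memory.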