
Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems

As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment...

Bibliographic Details
Main Authors: Song, Fengguang, Ltaief, Hatem, Hadri, Bilel, Dongarra, Jack
Format: Conference Proceeding
Language: English
cited_by cdi_FETCH-LOGICAL-a193t-b0b5cd4c1761f2f587c4e746d6eb2742ba45c47e544841f322bdbb6c038098a83
cites
container_end_page 11
container_issue
container_start_page 1
container_title
container_volume
creator Song, Fengguang
Ltaief, Hatem
Hadri, Bilel
Dongarra, Jack
description As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.
doi_str_mv 10.1109/SC.2010.48
format conference_proceeding
fullrecord <record><control><sourceid>acm_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_5645553</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>5645553</ieee_id><sourcerecordid>acm_books_10_1109_SC_2010_48</sourcerecordid><originalsourceid>FETCH-LOGICAL-a193t-b0b5cd4c1761f2f587c4e746d6eb2742ba45c47e544841f322bdbb6c038098a83</originalsourceid><addsrcrecordid>eNqFkL1PwzAQxc2XRCldWFkiMTGk-OMc22MVUUAqQjRltmzHQYakRkmKVP56UopYOZ3u6fS7e8ND6ILgKSFY3RT5lOJhAXmAzghQAMG5EodoREkmUmBMHKGJEvKPqeM_RtUpmnTdGx5KCWCcjdCycKY2tvbJKgwjj02zWQdn-hDX6ewzhjKsX5PnZTI3ro9t-PohydCPm7oPLrbDU73pet8mxXaQpjtHJ5WpOz_51TF6md-u8vt08XT3kM8WqSGK9anFlrsSHBEZqWjFpXDgBWRl5i0VQK0B7kB4DiCBVIxSW1qbOcwkVtJINkaXe9_gvdcfbWhMu9U8A845G-jVnhrXaBvje6cJ1rsQdZHrXYgadh7X_19p2wZfsW8OhWpj</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems</title><source>IEEE Xplore All Conference Series</source><creator>Song, Fengguang ; Ltaief, Hatem ; Hadri, Bilel ; Dongarra, Jack</creator><creatorcontrib>Song, Fengguang ; Ltaief, Hatem ; Hadri, Bilel ; Dongarra, Jack</creatorcontrib><description>As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). 
The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.</description><identifier>ISSN: 2167-4329</identifier><identifier>ISBN: 9781424475599</identifier><identifier>ISBN: 1424475597</identifier><identifier>ISBN: 9781424475575</identifier><identifier>ISBN: 1424475570</identifier><identifier>EISSN: 2167-4337</identifier><identifier>EISBN: 1424475597</identifier><identifier>EISBN: 9781424475582</identifier><identifier>EISBN: 9781424475599</identifier><identifier>EISBN: 1424475589</identifier><identifier>DOI: 10.1109/SC.2010.48</identifier><language>eng</language><publisher>Washington, DC, USA: IEEE Computer Society</publisher><subject>Algorithm design and analysis ; Computer systems organization ; Computer systems organization -- Dependable and fault-tolerant systems and networks ; General and reference ; General and reference -- Cross-computing tools and techniques ; General and reference -- Cross-computing tools and techniques -- Performance ; Kernel ; Libraries ; Multicore processing ; Networks ; Networks -- Network performance evaluation ; Runtime ; Social and professional topics ; Social and professional topics -- Professional topics ; Social and professional topics -- Professional topics -- Computing profession ; Social and professional topics -- Professional topics -- 
Computing profession -- Testing, certification and licensing ; Tiles</subject><ispartof>2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010, p.1-11</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a193t-b0b5cd4c1761f2f587c4e746d6eb2742ba45c47e544841f322bdbb6c038098a83</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/5645553$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2058,27925,54555,54920,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/5645553$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Song, Fengguang</creatorcontrib><creatorcontrib>Ltaief, Hatem</creatorcontrib><creatorcontrib>Hadri, Bilel</creatorcontrib><creatorcontrib>Dongarra, Jack</creatorcontrib><title>Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems</title><title>2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis</title><addtitle>SC</addtitle><description>As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). 
The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.</description><subject>Algorithm design and analysis</subject><subject>Computer systems organization</subject><subject>Computer systems organization -- Dependable and fault-tolerant systems and networks</subject><subject>General and reference</subject><subject>General and reference -- Cross-computing tools and techniques</subject><subject>General and reference -- Cross-computing tools and techniques -- Performance</subject><subject>Kernel</subject><subject>Libraries</subject><subject>Multicore processing</subject><subject>Networks</subject><subject>Networks -- Network performance evaluation</subject><subject>Runtime</subject><subject>Social and professional topics</subject><subject>Social and professional topics -- Professional topics</subject><subject>Social and professional topics -- Professional topics -- Computing profession</subject><subject>Social and professional topics -- Professional topics -- Computing profession -- Testing, certification and 
licensing</subject><subject>Tiles</subject><issn>2167-4329</issn><issn>2167-4337</issn><isbn>9781424475599</isbn><isbn>1424475597</isbn><isbn>9781424475575</isbn><isbn>1424475570</isbn><isbn>1424475597</isbn><isbn>9781424475582</isbn><isbn>9781424475599</isbn><isbn>1424475589</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2010</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNqFkL1PwzAQxc2XRCldWFkiMTGk-OMc22MVUUAqQjRltmzHQYakRkmKVP56UopYOZ3u6fS7e8ND6ILgKSFY3RT5lOJhAXmAzghQAMG5EodoREkmUmBMHKGJEvKPqeM_RtUpmnTdGx5KCWCcjdCycKY2tvbJKgwjj02zWQdn-hDX6ewzhjKsX5PnZTI3ro9t-PohydCPm7oPLrbDU73pet8mxXaQpjtHJ5WpOz_51TF6md-u8vt08XT3kM8WqSGK9anFlrsSHBEZqWjFpXDgBWRl5i0VQK0B7kB4DiCBVIxSW1qbOcwkVtJINkaXe9_gvdcfbWhMu9U8A845G-jVnhrXaBvje6cJ1rsQdZHrXYgadh7X_19p2wZfsW8OhWpj</recordid><startdate>20101113</startdate><enddate>20101113</enddate><creator>Song, Fengguang</creator><creator>Ltaief, Hatem</creator><creator>Hadri, Bilel</creator><creator>Dongarra, Jack</creator><general>IEEE Computer Society</general><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>20101113</creationdate><title>Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems</title><author>Song, Fengguang ; Ltaief, Hatem ; Hadri, Bilel ; Dongarra, Jack</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a193t-b0b5cd4c1761f2f587c4e746d6eb2742ba45c47e544841f322bdbb6c038098a83</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2010</creationdate><topic>Algorithm design and analysis</topic><topic>Computer systems organization</topic><topic>Computer systems organization -- Dependable and fault-tolerant systems and networks</topic><topic>General and reference</topic><topic>General and reference -- Cross-computing 
tools and techniques</topic><topic>General and reference -- Cross-computing tools and techniques -- Performance</topic><topic>Kernel</topic><topic>Libraries</topic><topic>Multicore processing</topic><topic>Networks</topic><topic>Networks -- Network performance evaluation</topic><topic>Runtime</topic><topic>Social and professional topics</topic><topic>Social and professional topics -- Professional topics</topic><topic>Social and professional topics -- Professional topics -- Computing profession</topic><topic>Social and professional topics -- Professional topics -- Computing profession -- Testing, certification and licensing</topic><topic>Tiles</topic><toplevel>online_resources</toplevel><creatorcontrib>Song, Fengguang</creatorcontrib><creatorcontrib>Ltaief, Hatem</creatorcontrib><creatorcontrib>Hadri, Bilel</creatorcontrib><creatorcontrib>Dongarra, Jack</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library Online</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Song, Fengguang</au><au>Ltaief, Hatem</au><au>Hadri, Bilel</au><au>Dongarra, Jack</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems</atitle><btitle>2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and 
Analysis</btitle><stitle>SC</stitle><date>2010-11-13</date><risdate>2010</risdate><spage>1</spage><epage>11</epage><pages>1-11</pages><issn>2167-4329</issn><eissn>2167-4337</eissn><isbn>9781424475599</isbn><isbn>1424475597</isbn><isbn>9781424475575</isbn><isbn>1424475570</isbn><eisbn>1424475597</eisbn><eisbn>9781424475582</eisbn><eisbn>9781424475599</eisbn><eisbn>1424475589</eisbn><abstract>As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.</abstract><cop>Washington, DC, USA</cop><pub>IEEE Computer Society</pub><doi>10.1109/SC.2010.48</doi><tpages>11</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 2167-4329
ispartof 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010, p.1-11
issn 2167-4329
2167-4337
language eng
recordid cdi_ieee_primary_5645553
source IEEE Xplore All Conference Series
subjects Algorithm design and analysis
Computer systems organization
Computer systems organization -- Dependable and fault-tolerant systems and networks
General and reference
General and reference -- Cross-computing tools and techniques
General and reference -- Cross-computing tools and techniques -- Performance
Kernel
Libraries
Multicore processing
Networks
Networks -- Network performance evaluation
Runtime
Social and professional topics
Social and professional topics -- Professional topics
Social and professional topics -- Professional topics -- Computing profession
Social and professional topics -- Professional topics -- Computing profession -- Testing, certification and licensing
Tiles
title Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T08%3A58%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-acm_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Scalable%20Tile%20Communication-Avoiding%20QR%20Factorization%20on%20Multicore%20Cluster%20Systems&rft.btitle=2010%20ACM/IEEE%20International%20Conference%20for%20High%20Performance%20Computing,%20Networking,%20Storage%20and%20Analysis&rft.au=Song,%20Fengguang&rft.date=2010-11-13&rft.spage=1&rft.epage=11&rft.pages=1-11&rft.issn=2167-4329&rft.eissn=2167-4337&rft.isbn=9781424475599&rft.isbn_list=1424475597&rft.isbn_list=9781424475575&rft.isbn_list=1424475570&rft_id=info:doi/10.1109/SC.2010.48&rft.eisbn=1424475597&rft.eisbn_list=9781424475582&rft.eisbn_list=9781424475599&rft.eisbn_list=1424475589&rft_dat=%3Cacm_CHZPO%3Eacm_books_10_1109_SC_2010_48%3C/acm_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-a193t-b0b5cd4c1761f2f587c4e746d6eb2742ba45c47e544841f322bdbb6c038098a83%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5645553&rfr_iscdi=true