Low-Latency Collectives for the Intel SCC
Message passing has been adopted as the main programming paradigm for many-core processors with on-chip networks for inter-core communication. To this end, message-passing libraries such as MPI can be used, as they provide well-known interfaces to application developers. Since MPI implementations we...
Main Authors: | Kohler, A. ; Radetzki, M. ; Gschwandtner, P. ; Fahringer, T. |
---|---|
Format: | Conference Proceeding |
Language: | English |
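Subjects: | Collective operations ; Computer architecture ; Libraries ; Many-core processors ; MPI ; Optimization ; Program processors ; Synchronization ; System-on-a-chip ; Vectors |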
Online Access: | Request full text |
cited_by | |
---|---|
cites | |
container_end_page | 354 |
container_issue | |
container_start_page | 346 |
container_title | |
container_volume | |
creator | Kohler, A. ; Radetzki, M. ; Gschwandtner, P. ; Fahringer, T. |
description | Message passing has been adopted as the main programming paradigm for many-core processors with on-chip networks for inter-core communication. To this end, message-passing libraries such as MPI can be used, as they provide well-known interfaces to application developers. Since MPI implementations were originally developed for macroscopic computer networks, the different characteristics of on-chip networks may require rethinking existing solutions. With the example of Allreduce, we identify points where collective operations benefit from routines optimized for on-chip networks. The identified issues are then applied to additional collectives including Broadcast, Allgather and Alltoall. The effectiveness of the proposed optimizations is demonstrated on the Single-Chip Cloud Computer (SCC), a many-core research chip created by Intel Labs. Experiments show that collective operations subjected to the identified optimizations are accelerated by factors roughly between 2 and 3 compared to current state-of-the-art implementations. In addition to synthetic benchmarks, we show that the use of the optimized routines accelerates a scientific application by more than 40%. |
doi_str_mv | 10.1109/CLUSTER.2012.58 |
format | conference_proceeding |
fullrecord | <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6337797</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6337797</ieee_id><sourcerecordid>6337797</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-818ab2b69c4ad288a505d4b6731ad742d4c282e945d1c8b309ce93c6b816335a3</originalsourceid><addsrcrecordid>eNotjktLw0AURscXGGvWLtxk62Li3Dtz57GUoWohINh2XSaTKUZiI0lQ-u8b0G9zFgcOH2N3IEoA4R59tV1vlu8lCsCS7Bm7EUY7UlYYOmcZgrbcIckLljtjQWkjUSG6S5YBEXJCpa5ZPo6fYp4FK5zO2EPV__IqTOkQj4Xvuy7Fqf1JY7Hvh2L6SMXqMKWuWHt_y672oRtT_s8F2z4vN_6VV28vK_9U8RYMTXwOhxpr7aIKDVobSFCj6vkNhMYobFREi8kpaiDaWgoXk5NR1xa0lBTkgt3_dduU0u57aL_CcNzNzhhn5AmEPkQ6</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Low-Latency Collectives for the Intel SCC</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Kohler, A. ; Radetzki, M. ; Gschwandtner, P. ; Fahringer, T.</creator><creatorcontrib>Kohler, A. ; Radetzki, M. ; Gschwandtner, P. ; Fahringer, T.</creatorcontrib><description>Message passing has been adopted as the main programming paradigm for many-core processors with on-chip networks for inter-core communication. To this end, message-passing libraries such as MPI can be used, as they provide well-known interfaces to application developers. Since MPI implementations were originally developed for macroscopic computer networks, the different characteristics of on-chip networks may require rethinking existing solutions. With the example of All reduce, we identify points where collective operations benefit from routines optimized for on-chip networks. The identified issues are then applied to additional collectives including Broadcast, All gather and All to all. 
The effectiveness of the proposed optimizations is demonstrated on the Single-Chip Cloud Computer (SCC), a many-core research chip created by Intel Labs. Experiments show that collective operations subjected to the identified optimizations are accelerated by factors roughly between 2 to 3 compared to current state of the art implementations. In addition to synthetic benchmarks, we show that the use of the optimized routines accelerates a scientific application by more than 40%.</description><identifier>ISSN: 1552-5244</identifier><identifier>ISBN: 9781467324229</identifier><identifier>ISBN: 1467324221</identifier><identifier>EISSN: 2168-9253</identifier><identifier>EISBN: 0769548075</identifier><identifier>EISBN: 9780769548074</identifier><identifier>DOI: 10.1109/CLUSTER.2012.58</identifier><identifier>CODEN: IEEPAD</identifier><language>eng</language><publisher>IEEE</publisher><subject>Collective operations ; Computer architecture ; Libraries ; Many-core processors ; MPI ; Optimization ; Program processors ; Synchronization ; System-on-a-chip ; Vectors</subject><ispartof>2012 IEEE International Conference on Cluster Computing, 2012, p.346-354</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6337797$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,778,782,787,788,2054,27908,54538,54903,54915</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6337797$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Kohler, A.</creatorcontrib><creatorcontrib>Radetzki, M.</creatorcontrib><creatorcontrib>Gschwandtner, P.</creatorcontrib><creatorcontrib>Fahringer, T.</creatorcontrib><title>Low-Latency Collectives for the Intel SCC</title><title>2012 IEEE International Conference on Cluster 
Computing</title><addtitle>CLUSTR</addtitle><description>Message passing has been adopted as the main programming paradigm for many-core processors with on-chip networks for inter-core communication. To this end, message-passing libraries such as MPI can be used, as they provide well-known interfaces to application developers. Since MPI implementations were originally developed for macroscopic computer networks, the different characteristics of on-chip networks may require rethinking existing solutions. With the example of All reduce, we identify points where collective operations benefit from routines optimized for on-chip networks. The identified issues are then applied to additional collectives including Broadcast, All gather and All to all. The effectiveness of the proposed optimizations is demonstrated on the Single-Chip Cloud Computer (SCC), a many-core research chip created by Intel Labs. Experiments show that collective operations subjected to the identified optimizations are accelerated by factors roughly between 2 to 3 compared to current state of the art implementations. 
In addition to synthetic benchmarks, we show that the use of the optimized routines accelerates a scientific application by more than 40%.</description><subject>Collective operations</subject><subject>Computer architecture</subject><subject>Libraries</subject><subject>Many-core processors</subject><subject>MPI</subject><subject>Optimization</subject><subject>Program processors</subject><subject>Synchronization</subject><subject>System-on-a-chip</subject><subject>Vectors</subject><issn>1552-5244</issn><issn>2168-9253</issn><isbn>9781467324229</isbn><isbn>1467324221</isbn><isbn>0769548075</isbn><isbn>9780769548074</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2012</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNotjktLw0AURscXGGvWLtxk62Li3Dtz57GUoWohINh2XSaTKUZiI0lQ-u8b0G9zFgcOH2N3IEoA4R59tV1vlu8lCsCS7Bm7EUY7UlYYOmcZgrbcIckLljtjQWkjUSG6S5YBEXJCpa5ZPo6fYp4FK5zO2EPV__IqTOkQj4Xvuy7Fqf1JY7Hvh2L6SMXqMKWuWHt_y672oRtT_s8F2z4vN_6VV28vK_9U8RYMTXwOhxpr7aIKDVobSFCj6vkNhMYobFREi8kpaiDaWgoXk5NR1xa0lBTkgt3_dduU0u57aL_CcNzNzhhn5AmEPkQ6</recordid><startdate>201209</startdate><enddate>201209</enddate><creator>Kohler, A.</creator><creator>Radetzki, M.</creator><creator>Gschwandtner, P.</creator><creator>Fahringer, T.</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201209</creationdate><title>Low-Latency Collectives for the Intel SCC</title><author>Kohler, A. ; Radetzki, M. ; Gschwandtner, P. 
; Fahringer, T.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-818ab2b69c4ad288a505d4b6731ad742d4c282e945d1c8b309ce93c6b816335a3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2012</creationdate><topic>Collective operations</topic><topic>Computer architecture</topic><topic>Libraries</topic><topic>Many-core processors</topic><topic>MPI</topic><topic>Optimization</topic><topic>Program processors</topic><topic>Synchronization</topic><topic>System-on-a-chip</topic><topic>Vectors</topic><toplevel>online_resources</toplevel><creatorcontrib>Kohler, A.</creatorcontrib><creatorcontrib>Radetzki, M.</creatorcontrib><creatorcontrib>Gschwandtner, P.</creatorcontrib><creatorcontrib>Fahringer, T.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Xplore (Online service)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Kohler, A.</au><au>Radetzki, M.</au><au>Gschwandtner, P.</au><au>Fahringer, T.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Low-Latency Collectives for the Intel SCC</atitle><btitle>2012 IEEE International Conference on Cluster Computing</btitle><stitle>CLUSTR</stitle><date>2012-09</date><risdate>2012</risdate><spage>346</spage><epage>354</epage><pages>346-354</pages><issn>1552-5244</issn><eissn>2168-9253</eissn><isbn>9781467324229</isbn><isbn>1467324221</isbn><eisbn>0769548075</eisbn><eisbn>9780769548074</eisbn><coden>IEEPAD</coden><abstract>Message passing has been adopted as the main programming paradigm for 
many-core processors with on-chip networks for inter-core communication. To this end, message-passing libraries such as MPI can be used, as they provide well-known interfaces to application developers. Since MPI implementations were originally developed for macroscopic computer networks, the different characteristics of on-chip networks may require rethinking existing solutions. With the example of All reduce, we identify points where collective operations benefit from routines optimized for on-chip networks. The identified issues are then applied to additional collectives including Broadcast, All gather and All to all. The effectiveness of the proposed optimizations is demonstrated on the Single-Chip Cloud Computer (SCC), a many-core research chip created by Intel Labs. Experiments show that collective operations subjected to the identified optimizations are accelerated by factors roughly between 2 to 3 compared to current state of the art implementations. In addition to synthetic benchmarks, we show that the use of the optimized routines accelerates a scientific application by more than 40%.</abstract><pub>IEEE</pub><doi>10.1109/CLUSTER.2012.58</doi><tpages>9</tpages></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1552-5244 |
ispartof | 2012 IEEE International Conference on Cluster Computing, 2012, p.346-354 |
issn | 1552-5244 2168-9253 |
language | eng |
recordid | cdi_ieee_primary_6337797 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Collective operations ; Computer architecture ; Libraries ; Many-core processors ; MPI ; Optimization ; Program processors ; Synchronization ; System-on-a-chip ; Vectors |
title | Low-Latency Collectives for the Intel SCC |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T05%3A28%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Low-Latency%20Collectives%20for%20the%20Intel%20SCC&rft.btitle=2012%20IEEE%20International%20Conference%20on%20Cluster%20Computing&rft.au=Kohler,%20A.&rft.date=2012-09&rft.spage=346&rft.epage=354&rft.pages=346-354&rft.issn=1552-5244&rft.eissn=2168-9253&rft.isbn=9781467324229&rft.isbn_list=1467324221&rft.coden=IEEPAD&rft_id=info:doi/10.1109/CLUSTER.2012.58&rft.eisbn=0769548075&rft.eisbn_list=9780769548074&rft_dat=%3Cieee_6IE%3E6337797%3C/ieee_6IE%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i175t-818ab2b69c4ad288a505d4b6731ad742d4c282e945d1c8b309ce93c6b816335a3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6337797&rfr_iscdi=true |