Multicore-optimized wavefront diamond blocking for optimizing stencil updates
Published in: | arXiv.org 2014-10 |
---|---|
Main Authors: | Malas, Tareq; Hager, Georg; Ltaief, Hatem; Stengel, Holger; Wellein, Gerhard; Keyes, David |
Format: | Article |
Language: | English |
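Subjects: | Algorithms; Central processing units; Concurrency; CPUs; Data paths; Diamonds; Microprocessors; Tiling |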
cited_by | |
---|---|
container_title | arXiv.org |
creator | Malas, Tareq; Hager, Georg; Ltaief, Hatem; Stengel, Holger; Wellein, Gerhard; Keyes, David |
description | The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor. |
doi_str_mv | 10.48550/arxiv.1410.3060 |
format | article |
publisher | Ithaca: Cornell University Library, arXiv.org |
date | 2014-10-12 |
rights | 2014. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
oa | free_for_read |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2014-10 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2083610358 |
source | Publicly Available Content Database |
subjects | Algorithms; Central processing units; Concurrency; CPUs; Data paths; Diamonds; Microprocessors; Tiling |
title | Multicore-optimized wavefront diamond blocking for optimizing stencil updates |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T06%3A09%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Multicore-optimized%20wavefront%20diamond%20blocking%20for%20optimizing%20stencil%20updates&rft.jtitle=arXiv.org&rft.au=Malas,%20Tareq&rft.date=2014-10-12&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.1410.3060&rft_dat=%3Cproquest%3E2083610358%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-a518-23e724c824f1f52426d3d51394aa14b6373c29ca4967d4f0b164443679d5a3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2083610358&rft_id=info:pmid/&rfr_iscdi=true |