Loading…

An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance

The increasing size and complexity of massively parallel systems (e.g. HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on chec...

Full description

Saved in:

Bibliographic Details
Main Authors:	Sloan, Joseph, Kumar, Rakesh, Bronevetsky, Greg
Format:	Conference Proceeding
Language:	English
Subjects:	algorithmic error correction Circuit faults Context Error analysis error localization Fault tolerance Fault tolerant systems numerical methods partial recomputation sparse linear algebra Sparse matrices Vectors
Online Access:	Request full text
Tags:	Add Tag No Tags, Be the first to tag this record!

cited_by
cites
container_end_page	12
container_issue
container_start_page	1
container_title
container_volume
creator	Sloan, Joseph Kumar, Rakesh Bronevetsky, Greg
description	The increasing size and complexity of massively parallel systems (e.g. HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on checkpoint-rollback based schemes. Unfortunately, such schemes are primarily limited to rare error event scenarios as the overheads of such schemes become prohibitive if faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight for this approach is that even under high error scenarios, even if the result of an algorithm is erroneous, most of it is correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our novel resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach in the specific algorithmic scenario of linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm, and demonstrate that this approach both improves the performance of the Conjugate Gradient solver in high error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60% with parallel experiments up to 100 nodes.
doi_str_mv	10.1109/DSN.2013.6575309
format	conference_proceeding
fullrecord	<record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6575309</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6575309</ieee_id><sourcerecordid>6575309</sourcerecordid><originalsourceid>FETCH-LOGICAL-i175t-3f9eaec2942637105180bfaae3f62cf75200d1da9cd7ce590fd60bdad438c3143</originalsourceid><addsrcrecordid>eNpVkLtOAzEURM1LIgrpkWj8A7tc2-u1XUbhKUVQAHV04wdr5MQrZwOCryciaaimODOnGEIuGdSMgbm-eXmqOTBRt1JJAeaITIzSrGmVaBvF4ZiMOJO6Eoark3-MiVMyYrtNBVqbczLZbD4AgIFoWq1HpJuuKab3XOLQraKl2Pclo-3okKkvJReassUUf3CIeVddO9pjGSImWrzNq3477En4q35V-dOXzqOjAbdp2GmSL7i2_oKcBUwbPznkmLzd3b7OHqr58_3jbDqvIlNyqEQwHr3lpuGtUAwk07AMiF6EltugJAdwzKGxTlkvDQTXwtKha4S2gjViTK723ui9X_QlrrB8Lw6_iV_QJV7Y</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Sloan, Joseph ; Kumar, Rakesh ; Bronevetsky, Greg</creator><creatorcontrib>Sloan, Joseph ; Kumar, Rakesh ; Bronevetsky, Greg</creatorcontrib><description>The increasing size and complexity of massively parallel systems (e.g. HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on checkpoint-rollback based schemes. Unfortunately, such schemes are primarily limited to rare error event scenarios as the overheads of such schemes become prohibitive if faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight for this approach is that even under high error scenarios, even if the result of an algorithm is erroneous, most of it is correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our novel resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach in the specific algorithmic scenario of linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm, and demonstrate that this approach both improves the performance of the Conjugate Gradient solver in high error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60% with parallel experiments up to 100 nodes.</description><identifier>ISSN: 1530-0889</identifier><identifier>ISBN: 9781467364713</identifier><identifier>ISBN: 1467364711</identifier><identifier>EISSN: 2158-3927</identifier><identifier>EISBN: 9781467364720</identifier><identifier>EISBN: 146736472X</identifier><identifier>DOI: 10.1109/DSN.2013.6575309</identifier><language>eng</language><publisher>IEEE</publisher><subject>algorithmic error correction ; Circuit faults ; Context ; Error analysis ; error localization ; Fault tolerance ; Fault tolerant systems ; numerical methods ; partial recomputation ; sparse linear algebra ; Sparse matrices ; Vectors</subject><ispartof>2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013, p.1-12</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6575309$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2056,27924,54554,54919,54931</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6575309$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Sloan, Joseph</creatorcontrib><creatorcontrib>Kumar, Rakesh</creatorcontrib><creatorcontrib>Bronevetsky, Greg</creatorcontrib><title>An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance</title><title>2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)</title><addtitle>DSN</addtitle><description>The increasing size and complexity of massively parallel systems (e.g. HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on checkpoint-rollback based schemes. Unfortunately, such schemes are primarily limited to rare error event scenarios as the overheads of such schemes become prohibitive if faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight for this approach is that even under high error scenarios, even if the result of an algorithm is erroneous, most of it is correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our novel resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach in the specific algorithmic scenario of linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm, and demonstrate that this approach both improves the performance of the Conjugate Gradient solver in high error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60% with parallel experiments up to 100 nodes.</description><subject>algorithmic error correction</subject><subject>Circuit faults</subject><subject>Context</subject><subject>Error analysis</subject><subject>error localization</subject><subject>Fault tolerance</subject><subject>Fault tolerant systems</subject><subject>numerical methods</subject><subject>partial recomputation</subject><subject>sparse linear algebra</subject><subject>Sparse matrices</subject><subject>Vectors</subject><issn>1530-0889</issn><issn>2158-3927</issn><isbn>9781467364713</isbn><isbn>1467364711</isbn><isbn>9781467364720</isbn><isbn>146736472X</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2013</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNpVkLtOAzEURM1LIgrpkWj8A7tc2-u1XUbhKUVQAHV04wdr5MQrZwOCryciaaimODOnGEIuGdSMgbm-eXmqOTBRt1JJAeaITIzSrGmVaBvF4ZiMOJO6Eoark3-MiVMyYrtNBVqbczLZbD4AgIFoWq1HpJuuKab3XOLQraKl2Pclo-3okKkvJReassUUf3CIeVddO9pjGSImWrzNq3477En4q35V-dOXzqOjAbdp2GmSL7i2_oKcBUwbPznkmLzd3b7OHqr58_3jbDqvIlNyqEQwHr3lpuGtUAwk07AMiF6EltugJAdwzKGxTlkvDQTXwtKha4S2gjViTK723ui9X_QlrrB8Lw6_iV_QJV7Y</recordid><startdate>201306</startdate><enddate>201306</enddate><creator>Sloan, Joseph</creator><creator>Kumar, Rakesh</creator><creator>Bronevetsky, Greg</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>201306</creationdate><title>An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance</title><author>Sloan, Joseph ; Kumar, Rakesh ; Bronevetsky, Greg</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i175t-3f9eaec2942637105180bfaae3f62cf75200d1da9cd7ce590fd60bdad438c3143</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2013</creationdate><topic>algorithmic error correction</topic><topic>Circuit faults</topic><topic>Context</topic><topic>Error analysis</topic><topic>error localization</topic><topic>Fault tolerance</topic><topic>Fault tolerant systems</topic><topic>numerical methods</topic><topic>partial recomputation</topic><topic>sparse linear algebra</topic><topic>Sparse matrices</topic><topic>Vectors</topic><toplevel>online_resources</toplevel><creatorcontrib>Sloan, Joseph</creatorcontrib><creatorcontrib>Kumar, Rakesh</creatorcontrib><creatorcontrib>Bronevetsky, Greg</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE/IET Electronic Library</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Sloan, Joseph</au><au>Kumar, Rakesh</au><au>Bronevetsky, Greg</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance</atitle><btitle>2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)</btitle><stitle>DSN</stitle><date>2013-06</date><risdate>2013</risdate><spage>1</spage><epage>12</epage><pages>1-12</pages><issn>1530-0889</issn><eissn>2158-3927</eissn><isbn>9781467364713</isbn><isbn>1467364711</isbn><eisbn>9781467364720</eisbn><eisbn>146736472X</eisbn><abstract>The increasing size and complexity of massively parallel systems (e.g. HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on checkpoint-rollback based schemes. Unfortunately, such schemes are primarily limited to rare error event scenarios as the overheads of such schemes become prohibitive if faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight for this approach is that even under high error scenarios, even if the result of an algorithm is erroneous, most of it is correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our novel resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach in the specific algorithmic scenario of linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm, and demonstrate that this approach both improves the performance of the Conjugate Gradient solver in high error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60% with parallel experiments up to 100 nodes.</abstract><pub>IEEE</pub><doi>10.1109/DSN.2013.6575309</doi><tpages>12</tpages></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1530-0889
ispartof	2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2013, p.1-12
issn	1530-0889 2158-3927
language	eng
recordid	cdi_ieee_primary_6575309
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	algorithmic error correction Circuit faults Context Error analysis error localization Fault tolerance Fault tolerant systems numerical methods partial recomputation sparse linear algebra Sparse matrices Vectors
title	An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T06%3A06%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=An%20algorithmic%20approach%20to%20error%20localization%20and%20partial%20recomputation%20for%20low-overhead%20fault%20tolerance&rft.btitle=2013%2043rd%20Annual%20IEEE/IFIP%20International%20Conference%20on%20Dependable%20Systems%20and%20Networks%20(DSN)&rft.au=Sloan,%20Joseph&rft.date=2013-06&rft.spage=1&rft.epage=12&rft.pages=1-12&rft.issn=1530-0889&rft.eissn=2158-3927&rft.isbn=9781467364713&rft.isbn_list=1467364711&rft_id=info:doi/10.1109/DSN.2013.6575309&rft.eisbn=9781467364720&rft.eisbn_list=146736472X&rft_dat=%3Cieee_6IE%3E6575309%3C/ieee_6IE%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i175t-3f9eaec2942637105180bfaae3f62cf75200d1da9cd7ce590fd60bdad438c3143%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6575309&rfr_iscdi=true