Fast Recovery MapReduce (FAR-MR) to accelerate failure recovery in big data applications

Bibliographic Details
Published in: The Journal of Supercomputing, 2020-05, Vol. 76 (5), p. 3572-3588
Main Authors: Zhu, Yongqing; Samsudin, Juniarto; Kanagavelu, Renuga; Zhang, Weiwen; Wang, Long; Aye, Theint Theint; Goh, Rick Siow Mong
Format: Article
Language: English
Description
Summary: The existing Hadoop MapReduce fault tolerance strategy causes computing jobs to suffer a high performance penalty during failure recovery. In this paper, we propose Fast Recovery MapReduce (FAR-MR) to improve MapReduce performance during failure recovery. FAR-MR includes a novel fault tolerance strategy that combines distributed checkpointing with a proactive push mechanism to support fast recovery from both task failure and node failure. With distributed checkpointing, the computing progress of each task is periodically recorded as checkpoints and kept in distributed data storage. During failure recovery, the recovered task can obtain the last recorded progress of the failed task from the distributed storage. In addition, the proactive push mechanism enables the computing results of map tasks to be proactively transmitted to the nodes hosting the reduce tasks of the same computing job. When a failure happens, the partial output results already pushed to the reducer nodes can be used by the reduce tasks without recomputation. FAR-MR allows a failed task to be recovered efficiently at any node in the cluster. The performance evaluation shows that FAR-MR can improve computing job performance by up to 62% for task failure recovery and up to 45% for node failure recovery, compared with Hadoop MapReduce.
ISSN: 0920-8542, 1573-0484
DOI: 10.1007/s11227-018-2716-8
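
Illustration: the checkpoint-and-resume idea described in the summary can be pictured with a small sketch. This is not the FAR-MR implementation and does not use any Hadoop API. The names CheckpointStore and run_map_task, the in-memory dictionary standing in for distributed storage, the word-count workload, and the checkpoint interval are all assumptions made for illustration. The proactive push mechanism is not modelled here.

```python
from collections import Counter


class CheckpointStore:
    """Hypothetical in-memory stand-in for the distributed checkpoint storage."""

    def __init__(self):
        self._data = {}

    def save(self, task_id, offset, state):
        # Record how many input records are done and a copy of the partial result.
        self._data[task_id] = (offset, dict(state))

    def load(self, task_id):
        # A task with no checkpoint starts from offset 0 with an empty state.
        return self._data.get(task_id, (0, {}))


def run_map_task(task_id, records, store, interval=2, fail_at=None):
    """Word-count map task that checkpoints every `interval` records.

    `fail_at` simulates a task failure just before processing that record index.
    Returns the word counts and the number of records processed in this attempt.
    """
    offset, saved = store.load(task_id)
    counts = Counter(saved)
    processed = 0
    for i in range(offset, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated task failure")
        counts.update(records[i].split())
        processed += 1
        if (i + 1) % interval == 0:
            store.save(task_id, i + 1, counts)
    return dict(counts), processed


records = ["a b", "b c", "c d", "d e", "e a"]
store = CheckpointStore()

try:
    run_map_task("map-0", records, store, fail_at=3)
except RuntimeError:
    pass  # the first attempt fails after a checkpoint at offset 2

# The recovered attempt (which could run on any node that can reach the
# store) resumes from the last checkpoint instead of from record 0.
result, processed = run_map_task("map-0", records, store)
```

In this sketch the recovered attempt reprocesses only the records after the last checkpoint, which is the saving the summary attributes to distributed checkpointing. A shorter interval reduces the work repeated after a failure at the cost of more frequent writes to storage.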