Predicting DRAM-Caused Node Unavailability in Hyper-Scale Clouds

Bibliographic Details
Main Authors:	Zhang, Pengcheng; Wang, Yunong; Ma, Xuhua; Xu, Yaoheng; Yao, Bin; Zheng, Xudong; Jiang, Linquan
Format:	Conference Proceeding
Language:	English
Subjects:	Clouds; Hardware; Measurement; Predictive models; Production; Random access memory; Storms

Description
Summary:	DRAM faults are a major hardware source of cloud node unavailability. To enable early preventive actions and mitigate the impact of DRAM faults, prior studies focus on predicting DRAM uncorrectable errors (UEs), which typically cause immediate node unavailability. In our cloud of over half a million nodes, we first observe that correctable error (CE) storms, in which numerous CEs occur in a short period, account for 56% of DRAM-caused node unavailability (DCNU). We therefore propose predicting DCNU, which takes into account both UEs and CE storms. Observing that DCNUs are strongly related to the temporal statistics and spatial patterns of CEs, we design novel spatio-temporal features to train the prediction model. Because traditional metrics such as F1-score cannot capture the model's real-world effect, we propose a new metric, NURR, to quantify the reduction in node unavailability, and we tune the model hyperparameters with NURR. Our approach achieves over 40% better NURR than existing methods on historical data and runs stably in the production environment.
ISSN:	2158-3927
DOI:	10.1109/DSN53405.2022.00037