
Dependability analysis for characterizing Google cluster reliability

Bibliographic Details
Published in: International journal of communication systems, 2019-11, Vol. 32 (16), p. n/a
Main Authors: Mesbahi, Mohammad Reza; Rahmani, Amir Masoud; Hosseinzadeh, Mehdi
Format: Article
Language: English
Description
Summary: Cloud solutions are emerging as a suitable way of transforming traditional IT data centers into highly available and reliable computing resources for hosting critical applications and data. However, software and hardware failures are a common problem in cloud datacenters and can cause serious damage. In this paper, we analyze physical server failures in the Google cloud datacenter. We study the Google cluster properties to investigate the relationship between the failure rate of physical servers and job failure events, considering the failure rates of executed jobs and of servers over a 29-day period. Based on these observations, we present a reliability model for Google cluster physical machines using continuous-time Markov chains. We analyze the resulting model with the SHARPE software package to improve the understanding of failure events in the Google cloud cluster. We also explore cluster availability in terms of steady-state availability, steady-state unavailability, mean time to failure, and mean time to repair. The results show a strong correlation between machine and job failure rates.
ISSN: 1074-5351, 1099-1131
DOI: 10.1002/dac.4127
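
Illustration: the summary names steady-state availability, steady-state unavailability, mean time to failure (MTTF), and mean time to repair (MTTR) as the metrics derived from a continuous-time Markov chain. The sketch below shows how these quantities relate in the simplest possible case, a two-state (up/down) chain with constant failure and repair rates. It is not the model from the paper, which is analyzed with SHARPE and whose structure is not given in this record. The function name and the example rates are hypothetical.

```python
def two_state_availability(failure_rate, repair_rate):
    """Steady-state metrics for a two-state (up/down) CTMC.

    Assumption: a single machine alternates between "up" and "down" with
    constant failure_rate (up -> down) and repair_rate (down -> up), both
    expressed per unit time. This is a textbook simplification, not the
    paper's model.
    """
    if failure_rate <= 0 or repair_rate <= 0:
        raise ValueError("rates must be positive")

    mttf = 1.0 / failure_rate  # mean sojourn time in the "up" state
    mttr = 1.0 / repair_rate   # mean sojourn time in the "down" state

    # Balance equation: pi_up * failure_rate = pi_down * repair_rate,
    # with pi_up + pi_down = 1, gives pi_up = MTTF / (MTTF + MTTR).
    availability = mttf / (mttf + mttr)

    return {
        "mttf": mttf,
        "mttr": mttr,
        "availability": availability,
        "unavailability": 1.0 - availability,
    }


# Hypothetical rates, chosen only for illustration (not taken from the paper):
# one failure per 100 hours on average, one repair per hour on average.
metrics = two_state_availability(failure_rate=0.01, repair_rate=1.0)
```

In this two-state case the steady-state availability reduces to MTTF / (MTTF + MTTR). A model with more states would instead require solving the full set of balance equations.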