Online Model-Based Clustering for Crisis Identification in Distributed Computing


Bibliographic Details
Published in: Journal of the American Statistical Association 2011-03, Vol. 106 (493), p. 49-60
Main Authors: Woodard, Dawn B.; Goldszmidt, Moises
Format: Article
Language: English
Description
Summary: Large-scale distributed computing systems can suffer from occasional severe violation of performance goals; due to the complexity of these systems, manual diagnosis of the cause of the crisis is too slow to inform interventions taken during the crisis. Rapid automatic recognition of the recurrence of a problem can lead to cause diagnosis and informed intervention. We frame this as an online clustering problem, where the labels (causes) of some of the previous crises may be known. We give a fast and accurate solution using model-based clustering based on a Dirichlet process mixture; the evolution of each crisis is modeled as a multivariate time series. In the periods between crises we perform full Bayesian inference for the past crises, and as a new crisis occurs we apply fast approximate Bayesian updating. These inferences allow real-time expected-cost-minimizing decision making that fully accounts for uncertainty in the crisis labels and other parameters. We apply and validate our methods using simulated data and data from a production computing center with hundreds of servers running a 24/7 email-related application.
ISSN: 0162-1459, 1537-274X
DOI: 10.1198/jasa.2010.ap09545
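
Illustration: The summary mentions expected-cost-minimizing decisions that account for uncertainty in the crisis labels. The sketch below shows only the generic decision rule, not the authors' model or code: given posterior probabilities over crisis labels and a cost table, it picks the action with the lowest expected cost. The function name, the three labels, the three actions, and all numbers are hypothetical and chosen for illustration.

```python
def min_expected_cost_action(label_probs, cost):
    """Return (best_action_index, expected_costs).

    label_probs: posterior probability of each crisis label (sums to 1).
    cost: cost[a][k] is the cost of taking action a when the true label is k.
    """
    # Expected cost of each action, averaging over label uncertainty.
    expected = [sum(c * p for c, p in zip(row, label_probs)) for row in cost]
    # Index of the action with the smallest expected cost.
    best = min(range(len(expected)), key=expected.__getitem__)
    return best, expected


# Hypothetical cost table: labels are (type A, type B, previously unseen type).
COST = [
    [0, 10, 10],  # action 0: apply the known fix for type A
    [10, 0, 10],  # action 1: apply the known fix for type B
    [4, 4, 4],    # action 2: escalate to an operator
]

# Confident posterior: the fix for type A has the lowest expected cost.
confident_action, confident_costs = min_expected_cost_action([0.7, 0.2, 0.1], COST)

# Uncertain posterior: escalation becomes the cheapest choice.
uncertain_action, uncertain_costs = min_expected_cost_action([0.4, 0.3, 0.3], COST)
```

With these made-up numbers the confident posterior selects action 0 and the uncertain posterior selects action 2, which is the sense in which the decision depends on how sure the label inference is, not only on the most probable label.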