Efficient and Robust KPI Outlier Detection for Large-Scale Datacenters

Bibliographic Details
Published in: IEEE Transactions on Computers, 2023-10, Vol. 72 (10), p. 1-13
Main Authors: Sun, Yongqian; Cheng, Daguo; Yang, Tiankai; Ji, Yuhe; Zhang, Shenglin; Zhu, Man; Xiong, Xiao; Fan, Qiliang; Liang, Minghan; Pei, Dan; Ma, Tianchi; Chen, Yu
Format: Article
Language: English
Description
Summary: To ensure the performance of large-scale datacenters, operators need to monitor up to tens of millions of KPIs of various types, e.g., CPU utilization and memory utilization. For each KPI, it is crucial but challenging to detect outliers that deviate from its historical patterns or from the patterns of other KPIs in the same period. In this work, we propose OutSpot, an unsupervised outlier detection framework that integrates hierarchical agglomerative clustering (HAC) with a conditional variational autoencoder (CVAE), which significantly improves computational efficiency and comprehensively learns the above two patterns. Additionally, two simple yet effective techniques, soft threshold and median filter, are applied to precisely determine outlier KPIs. Experiments on two real-world datasets, collected from datacenters owned by a top-tier global short video service provider and a top-tier domestic operator, respectively, demonstrate that OutSpot achieves the best F1 scores of 0.95 and 0.91 and AUCs of 0.99 and 0.99 on the two datasets, significantly outperforming seven baseline outlier detection methods.
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2023.3272288
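
The summary names a median filter as one of the post-processing steps used to decide which KPIs are outliers. The record does not describe how OutSpot applies it, so the following is only a minimal sketch of the general idea, assuming a per-timestamp outlier score is already available. The function names, the window size, and the fixed cutoff are all hypothetical; in particular, the plain cutoff is a stand-in and is not the paper's soft threshold.

```python
from statistics import median


def median_filter(scores, window=3):
    """Replace each score with the median of a centered window.

    Windows are truncated at the sequence edges. The window size
    is an illustrative assumption, not a value from the paper.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [
        median(scores[max(0, i - half): i + half + 1])
        for i in range(len(scores))
    ]


def flag_outliers(scores, threshold, window=3):
    """Smooth the scores, then flag points above a fixed cutoff.

    The fixed cutoff is a placeholder assumption; it does not
    reproduce the soft threshold mentioned in the summary.
    """
    smoothed = median_filter(scores, window)
    return [s > threshold for s in smoothed]


# Hypothetical outlier scores: one isolated spike (index 2)
# followed by a sustained run of high scores (indices 5 to 7).
scores = [0.1, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9, 0.85, 0.2, 0.1]
flags = flag_outliers(scores, threshold=0.5)
print(flags)
```

In this sketch the isolated spike is smoothed away while the sustained run stays flagged, which is the usual reason for placing a median filter ahead of a thresholding step.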