Generic and robust root cause localization for multi-dimensional data in online service systems

Bibliographic Details
Published in: The Journal of systems and software 2023-09, Vol.203, p.111748, Article 111748
Main Authors: Li, Zeyan, Chen, Junjie, Chen, Yihao, Luo, Chengyang, Zhao, Yiwei, Sun, Yongqian, Sui, Kaixin, Wang, Xiping, Liu, Dapeng, Jin, Xing, Wang, Qi, Pei, Dan
Format: Article
Language: English
Description
Summary: Localizing root causes for multi-dimensional data is critical to ensure online service systems’ reliability. When a fault occurs, only the measure values within specific attribute combinations (e.g., Province = Beijing) are abnormal. Such attribute combinations are substantial clues to the underlying root causes and thus are called root causes of multi-dimensional data. This paper proposes a generic and robust root cause localization approach for multi-dimensional data, PSqueeze. We propose a generic property of root cause for multi-dimensional data, generalized ripple effect (GRE). Based on it, we propose a novel probabilistic cluster method and a robust heuristic search method. Moreover, we identify the importance of determining external root causes and propose an effective method for the first time in literature. Our experiments on two real-world datasets with 5400 faults show that the F1-score of PSqueeze outperforms baselines by 32.89%, while the localization time is around 10 s across all cases. The F1-score in determining external root causes of PSqueeze achieves 0.90. Furthermore, case studies in several production systems demonstrate that PSqueeze is helpful to fault diagnosis in the real world.
Highlights:
• Finding root cause clues from multi-dimensional data efficiently.
• A general property of root-cause attribute combinations.
• Localizing root causes by combining bottom-up and top-down.
• Extensive studies based on both simulated and injected faults.
• Success stories on real-world systems and real-world faults.
ISSN: 0164-1212, 1873-1228
DOI: 10.1016/j.jss.2023.111748