How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle

Bibliographic Details
Main Authors: Zhao, Yujin; Jiang, Ling; Tao, Ye; Zhang, Songlin; Wu, Changlong; Wu, Yifan; Jia, Tong; Li, Ying; Wu, Zhonghai
Format: Conference Proceeding
Language: English
Description
Summary: In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding the incidents can help improve change-induced incident management. The task is challenging because the life cycle of change-induced incidents is complicated by diverse change deployment and incident resolution procedures. Detailed records of the incidents and changes, together with a comprehensive analysis, are needed to gain an in-depth understanding. In this paper, we conduct a qualitative and quantitative study of 231 change-induced incidents in a real-world, large-scale online service system. Detailed change tickets and incident timelines in the post-mortems provide extensive information about the incident life cycle, enabling us to understand each incident in depth. Based on the data, we give a generic model of the complicated life cycle of change-induced incidents. Following the model, we systematically study the whole life cycle of an incident, including the introduction and resolution stages, and identify what affects the efficiency of resolution. We obtain 9 major findings from our study. Based on the findings, we discuss existing techniques and promising future directions for improving change-induced incident management.
ISSN: 2332-6549
DOI: 10.1109/ISSRE59848.2023.00027