
How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle

In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding the incidents can help improve change...

Bibliographic Details
Main Authors: Zhao, Yujin, Jiang, Ling, Tao, Ye, Zhang, Songlin, Wu, Changlong, Wu, Yifan, Jia, Tong, Li, Ying, Wu, Zhonghai
Format: Conference Proceeding
Language: English
Subjects:
cited_by
cites
container_end_page 274
container_issue
container_start_page 264
container_title
container_volume
creator Zhao, Yujin
Jiang, Ling
Tao, Ye
Zhang, Songlin
Wu, Changlong
Wu, Yifan
Jia, Tong
Li, Ying
Wu, Zhonghai
description In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding the incidents can help improve change-induced incident management. The task is challenging because the life cycle of change-induced incidents is complicated due to diverse change deployment and incident resolution procedures. Detailed records of the incidents and changes, together with a comprehensive analysis, are needed to gain an in-depth understanding. In this paper, we conduct a qualitative and quantitative study on 231 change-induced incidents in a real-world, large-scale online service system. Detailed change tickets and incident timelines in the post-mortems provide extensive information about the incident life cycle, enabling us to understand each incident in depth. Based on the data, we give a generic model of the complicated life cycle of change-induced incidents. Following the model, we systematically study the whole life cycle of the incident, including the introduction and resolution stages, and answer what affects the efficiency of resolution. We obtain 9 major findings from our study. Based on the findings, we discuss existing techniques and promising future directions for improving change-induced incident management.
doi_str_mv 10.1109/ISSRE59848.2023.00027
format conference_proceeding
fullrecord <record><control><sourceid>ieee_CHZPO</sourceid><recordid>TN_cdi_ieee_primary_10301272</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10301272</ieee_id><sourcerecordid>10301272</sourcerecordid><originalsourceid>FETCH-LOGICAL-i204t-785aca6b744b2b991cf714dc5c1bfd522e8951192d1474e827bc94f692abf6f53</originalsourceid><addsrcrecordid>eNo9jdFKwzAUQKMgOOf-QCE_0Jl7kzTNk8iYrlARrHseaXKzVbZWmg7Z3ztQfDovh3MYuwcxBxD2oazr96W2hSrmKFDOhRBoLtjMGltILSRoq-Qlm6CUmOVa2Wt2k9Ln2RIKcMLWq_6bjz1_dZ3bEl_sXLelrOzC0VPgZefbQN2YHnlFKfVd4nHoD3zcEa_HYzjxPv5LvGrjuXDye7plV9HtE83-OGXr5-XHYpVVby_l4qnK2vN-zEyhnXd5Y5RqsLEWfDSggtcemhg0IhVWA1gMoIyiAk3jrYq5RdfEPGo5ZXe_3ZaINl9De3DDaQNCCkCD8gfAPVD9</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle</title><source>IEEE Xplore All Conference Series</source><creator>Zhao, Yujin ; Jiang, Ling ; Tao, Ye ; Zhang, Songlin ; Wu, Changlong ; Wu, Yifan ; Jia, Tong ; Li, Ying ; Wu, Zhonghai</creator><creatorcontrib>Zhao, Yujin ; Jiang, Ling ; Tao, Ye ; Zhang, Songlin ; Wu, Changlong ; Wu, Yifan ; Jia, Tong ; Li, Ying ; Wu, Zhonghai</creatorcontrib><description>In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding the incidents can help improve change-induced incident management. The task is challenging because the life cycle of change-induced incidents is complicated due to diverse change deployment and incident resolution procedures. Detailed records of the incidents and changes, together with a comprehensive analysis, are needed to gain an in-depth understanding. 
In this paper, we conduct a qualitative and quantitative study on 231 change-induced incidents in a real-world, large-scale online service system. Detailed change tickets and incident timeline in the post-mortems provides extensive information about the incident life cycle, enabling us to understand each incident in depth. Based on the data, we give a generic model of the complicated life cycle of change-induced incidents. Following the model, we systematically study the whole life cycle of the incident, including the introduction and resolution stages, and answer what affects the efficiency of resolution. We obtain 9 major findings from our study. Based on the findings, we discuss existing techniques and promising future directions for improving change-induced incident management.</description><identifier>EISSN: 2332-6549</identifier><identifier>EISBN: 9798350315943</identifier><identifier>DOI: 10.1109/ISSRE59848.2023.00027</identifier><identifier>CODEN: IEEPAD</identifier><language>eng</language><publisher>IEEE</publisher><subject>Data models ; empirical study ; incident management ; life cycle ; online service system ; software change ; Software reliability ; System software ; Task analysis</subject><ispartof>2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, p.264-274</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10301272$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,27925,54555,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10301272$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Zhao, Yujin</creatorcontrib><creatorcontrib>Jiang, Ling</creatorcontrib><creatorcontrib>Tao, 
Ye</creatorcontrib><creatorcontrib>Zhang, Songlin</creatorcontrib><creatorcontrib>Wu, Changlong</creatorcontrib><creatorcontrib>Wu, Yifan</creatorcontrib><creatorcontrib>Jia, Tong</creatorcontrib><creatorcontrib>Li, Ying</creatorcontrib><creatorcontrib>Wu, Zhonghai</creatorcontrib><title>How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle</title><title>2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE)</title><addtitle>ISSRE</addtitle><description>In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding the incidents can help improve change-induced incident management. The task is challenging because the life cycle of change-induced incidents is complicated due to diverse change deployment and incident resolution procedures. Detailed records of the incidents and changes, together with a comprehensive analysis, are needed to gain an in-depth understanding. In this paper, we conduct a qualitative and quantitative study on 231 change-induced incidents in a real-world, large-scale online service system. Detailed change tickets and incident timeline in the post-mortems provides extensive information about the incident life cycle, enabling us to understand each incident in depth. Based on the data, we give a generic model of the complicated life cycle of change-induced incidents. Following the model, we systematically study the whole life cycle of the incident, including the introduction and resolution stages, and answer what affects the efficiency of resolution. We obtain 9 major findings from our study. 
Based on the findings, we discuss existing techniques and promising future directions for improving change-induced incident management.</description><subject>Data models</subject><subject>empirical study</subject><subject>incident management</subject><subject>life cycle</subject><subject>online service system</subject><subject>software change</subject><subject>Software reliability</subject><subject>System software</subject><subject>Task analysis</subject><issn>2332-6549</issn><isbn>9798350315943</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2023</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><recordid>eNo9jdFKwzAUQKMgOOf-QCE_0Jl7kzTNk8iYrlARrHseaXKzVbZWmg7Z3ztQfDovh3MYuwcxBxD2oazr96W2hSrmKFDOhRBoLtjMGltILSRoq-Qlm6CUmOVa2Wt2k9Ln2RIKcMLWq_6bjz1_dZ3bEl_sXLelrOzC0VPgZefbQN2YHnlFKfVd4nHoD3zcEa_HYzjxPv5LvGrjuXDye7plV9HtE83-OGXr5-XHYpVVby_l4qnK2vN-zEyhnXd5Y5RqsLEWfDSggtcemhg0IhVWA1gMoIyiAk3jrYq5RdfEPGo5ZXe_3ZaINl9De3DDaQNCCkCD8gfAPVD9</recordid><startdate>20231009</startdate><enddate>20231009</enddate><creator>Zhao, Yujin</creator><creator>Jiang, Ling</creator><creator>Tao, Ye</creator><creator>Zhang, Songlin</creator><creator>Wu, Changlong</creator><creator>Wu, Yifan</creator><creator>Jia, Tong</creator><creator>Li, Ying</creator><creator>Wu, Zhonghai</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>20231009</creationdate><title>How to Manage Change-Induced Incidents? 
Lessons from the Study of Incident Life Cycle</title><author>Zhao, Yujin ; Jiang, Ling ; Tao, Ye ; Zhang, Songlin ; Wu, Changlong ; Wu, Yifan ; Jia, Tong ; Li, Ying ; Wu, Zhonghai</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i204t-785aca6b744b2b991cf714dc5c1bfd522e8951192d1474e827bc94f692abf6f53</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Data models</topic><topic>empirical study</topic><topic>incident management</topic><topic>life cycle</topic><topic>online service system</topic><topic>software change</topic><topic>Software reliability</topic><topic>System software</topic><topic>Task analysis</topic><toplevel>online_resources</toplevel><creatorcontrib>Zhao, Yujin</creatorcontrib><creatorcontrib>Jiang, Ling</creatorcontrib><creatorcontrib>Tao, Ye</creatorcontrib><creatorcontrib>Zhang, Songlin</creatorcontrib><creatorcontrib>Wu, Changlong</creatorcontrib><creatorcontrib>Wu, Yifan</creatorcontrib><creatorcontrib>Jia, Tong</creatorcontrib><creatorcontrib>Li, Ying</creatorcontrib><creatorcontrib>Wu, Zhonghai</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Zhao, Yujin</au><au>Jiang, Ling</au><au>Tao, Ye</au><au>Zhang, Songlin</au><au>Wu, Changlong</au><au>Wu, Yifan</au><au>Jia, Tong</au><au>Li, Ying</au><au>Wu, Zhonghai</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>How to Manage Change-Induced Incidents? 
Lessons from the Study of Incident Life Cycle</atitle><btitle>2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE)</btitle><stitle>ISSRE</stitle><date>2023-10-09</date><risdate>2023</risdate><spage>264</spage><epage>274</epage><pages>264-274</pages><eissn>2332-6549</eissn><eisbn>9798350315943</eisbn><coden>IEEPAD</coden><abstract>In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding the incidents can help improve change-induced incident management. The task is challenging because the life cycle of change-induced incidents is complicated due to diverse change deployment and incident resolution procedures. Detailed records of the incidents and changes, together with a comprehensive analysis, are needed to gain an in-depth understanding. In this paper, we conduct a qualitative and quantitative study on 231 change-induced incidents in a real-world, large-scale online service system. Detailed change tickets and incident timeline in the post-mortems provides extensive information about the incident life cycle, enabling us to understand each incident in depth. Based on the data, we give a generic model of the complicated life cycle of change-induced incidents. Following the model, we systematically study the whole life cycle of the incident, including the introduction and resolution stages, and answer what affects the efficiency of resolution. We obtain 9 major findings from our study. Based on the findings, we discuss existing techniques and promising future directions for improving change-induced incident management.</abstract><pub>IEEE</pub><doi>10.1109/ISSRE59848.2023.00027</doi><tpages>11</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier EISSN: 2332-6549
ispartof 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, p.264-274
issn 2332-6549
language eng
recordid cdi_ieee_primary_10301272
source IEEE Xplore All Conference Series
subjects Data models
empirical study
incident management
life cycle
online service system
software change
Software reliability
System software
Task analysis
title How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T03%3A01%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=How%20to%20Manage%20Change-Induced%20Incidents?%20Lessons%20from%20the%20Study%20of%20Incident%20Life%20Cycle&rft.btitle=2023%20IEEE%2034th%20International%20Symposium%20on%20Software%20Reliability%20Engineering%20(ISSRE)&rft.au=Zhao,%20Yujin&rft.date=2023-10-09&rft.spage=264&rft.epage=274&rft.pages=264-274&rft.eissn=2332-6549&rft.coden=IEEPAD&rft_id=info:doi/10.1109/ISSRE59848.2023.00027&rft.eisbn=9798350315943&rft_dat=%3Cieee_CHZPO%3E10301272%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i204t-785aca6b744b2b991cf714dc5c1bfd522e8951192d1474e827bc94f692abf6f53%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10301272&rfr_iscdi=true