How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle
In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding the incidents can help improve change...
Main Authors: | Zhao, Yujin; Jiang, Ling; Tao, Ye; Zhang, Songlin; Wu, Changlong; Wu, Yifan; Jia, Tong; Li, Ying; Wu, Zhonghai |
---|---|
Format: | Conference Proceeding |
Language: | English |
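Subjects: | Data models; empirical study; incident management; life cycle; online service system; software change; Software reliability; System software; Task analysis |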
Online Access: | Request full text |
cited_by | |
---|---|
cites | |
container_end_page | 274 |
container_issue | |
container_start_page | 264 |
container_title | |
container_volume | |
creator | Zhao, Yujin; Jiang, Ling; Tao, Ye; Zhang, Songlin; Wu, Changlong; Wu, Yifan; Jia, Tong; Li, Ying; Wu, Zhonghai |
description | In online service systems, software changes cause a majority of incidents (i.e., unplanned interruptions and outages). Managing change-induced incidents efficiently is crucial for ensuring the reliability and availability of online service systems. Understanding the incidents can help improve change-induced incident management. The task is challenging because the life cycle of change-induced incidents is complicated due to diverse change deployment and incident resolution procedures. Detailed records of the incidents and changes, together with a comprehensive analysis, are needed to gain an in-depth understanding. In this paper, we conduct a qualitative and quantitative study on 231 change-induced incidents in a real-world, large-scale online service system. Detailed change tickets and incident timelines in the post-mortems provide extensive information about the incident life cycle, enabling us to understand each incident in depth. Based on the data, we give a generic model of the complicated life cycle of change-induced incidents. Following the model, we systematically study the whole life cycle of the incident, including the introduction and resolution stages, and answer what affects the efficiency of resolution. We obtain 9 major findings from our study. Based on the findings, we discuss existing techniques and promising future directions for improving change-induced incident management. |
doi_str_mv | 10.1109/ISSRE59848.2023.00027 |
format | conference_proceeding |
fulltext | fulltext_linktorsrc |
identifier | EISSN: 2332-6549 |
ispartof | 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, p.264-274 |
issn | 2332-6549 |
language | eng |
recordid | cdi_ieee_primary_10301272 |
source | IEEE Xplore All Conference Series |
subjects | Data models; empirical study; incident management; life cycle; online service system; software change; Software reliability; System software; Task analysis |
title | How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T03%3A01%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=How%20to%20Manage%20Change-Induced%20Incidents?%20Lessons%20from%20the%20Study%20of%20Incident%20Life%20Cycle&rft.btitle=2023%20IEEE%2034th%20International%20Symposium%20on%20Software%20Reliability%20Engineering%20(ISSRE)&rft.au=Zhao,%20Yujin&rft.date=2023-10-09&rft.spage=264&rft.epage=274&rft.pages=264-274&rft.eissn=2332-6549&rft.coden=IEEPAD&rft_id=info:doi/10.1109/ISSRE59848.2023.00027&rft.eisbn=9798350315943&rft_dat=%3Cieee_CHZPO%3E10301272%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i204t-785aca6b744b2b991cf714dc5c1bfd522e8951192d1474e827bc94f692abf6f53%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10301272&rfr_iscdi=true |