TraceSim: An Alignment Method for Computing Stack Trace Similarity

Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicate, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic...

Bibliographic Details
Published in:	Empirical software engineering : an international journal 2022-03, Vol.27 (2), Article 53
Main Authors:	Rodrigues, Irving Muller ; Khvorov, Aleksandr ; Aloise, Daniel ; Vasiliev, Roman ; Koznov, Dmitrij ; Fernandes, Eraldo Rezende ; Chernishev, George ; Luciv, Dmitry ; Povarov, Nikita
Format:	Article
Language:	English
Subjects:	Ablation ; Algorithms ; Alignment ; Compilers ; Computer Science ; Datasets ; Failure analysis ; Information retrieval ; Interpreters ; Machine learning ; Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE) ; Matching ; Programming Languages ; Reproduction (copying) ; Similarity ; Software ; Software Engineering/Programming and Operating Systems ; Source code

cited_by	cdi_FETCH-LOGICAL-c363t-db57e39599ebe2ed9684aa99f4e12e499a9702e2bdc886d08d834cd13b66721e3
cites	cdi_FETCH-LOGICAL-c363t-db57e39599ebe2ed9684aa99f4e12e499a9702e2bdc886d08d834cd13b66721e3
container_end_page
container_issue	2
container_start_page
container_title	Empirical software engineering : an international journal
container_volume	27
creator	Rodrigues, Irving Muller ; Khvorov, Aleksandr ; Aloise, Daniel ; Vasiliev, Roman ; Koznov, Dmitrij ; Fernandes, Eraldo Rezende ; Chernishev, George ; Luciv, Dmitry ; Povarov, Nikita
description	Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicate, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up analysis of software failures. This task is known as crash report deduplication. Given a huge volume of incoming reports, increasing quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, due to data distribution differences among software projects, parameters that are learned using machine learning algorithms are necessary to provide more flexibility to the methods. In this paper, we propose TraceSim – an approach for crash report deduplication which combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. It is the only approach that achieves competitive results on all datasets regarding all considered metrics. Moreover, we conduct an extensive ablation study that demonstrates the importance of each TraceSim’s element to its final performance and robustness. Finally, we provide the source code for all considered methods and evaluation methodology as well as the created datasets.
doi_str_mv	10.1007/s10664-021-10070-w
format	article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2634670626</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2634670626</sourcerecordid><originalsourceid>FETCH-LOGICAL-c363t-db57e39599ebe2ed9684aa99f4e12e499a9702e2bdc886d08d834cd13b66721e3</originalsourceid><addsrcrecordid>eNp9kMtOwzAURC0EEqXwA6wssTb4FTtmVypeUhGLlrXlxDclpUmKnarq3-M2SOxYzR1pzlxpELpm9JZRqu8io0pJQjkjB0_J7gSNWKYF0Yqp03SLnBPBM3WOLmJcUUqNltkIPSyCK2FeN_d40uLJul62DbQ9foP-s_O46gKeds1m29ftEs97V37hI4ETUq9dqPv9JTqr3DrC1a-O0cfT42L6Qmbvz6_TyYyUQome-CLTIExmDBTAwRuVS-eMqSQwDtIYZzTlwAtf5rnyNPe5kKVnolBKcwZijG6G3k3ovrcQe7vqtqFNLy1XQipNVdIx4kOqDF2MASq7CXXjwt4yag_b2GErm7Y6emp3CRIDFFO4XUL4q_6H-gFqVWu9</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2634670626</pqid></control><display><type>article</type><title>TraceSim: An Alignment Method for Computing Stack Trace Similarity</title><source>Springer Link</source><creator>Rodrigues, Irving Muller ; Khvorov, Aleksandr ; Aloise, Daniel ; Vasiliev, Roman ; Koznov, Dmitrij ; Fernandes, Eraldo Rezende ; Chernishev, George ; Luciv, Dmitry ; Povarov, Nikita</creator><creatorcontrib>Rodrigues, Irving Muller ; Khvorov, Aleksandr ; Aloise, Daniel ; Vasiliev, Roman ; Koznov, Dmitrij ; Fernandes, Eraldo Rezende ; Chernishev, George ; Luciv, Dmitry ; Povarov, Nikita</creatorcontrib><description>Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicate, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up analysis of software failures. This task is known as crash report deduplication. 
Given a huge volume of incoming reports, increasing quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, due to data distribution differences among software projects, parameters that are learned using machine learning algorithms are necessary to provide more flexibility to the methods. In this paper, we propose TraceSim – an approach for crash report deduplication which combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. It is the only approach that achieves competitive results on all datasets regarding all considered metrics. Moreover, we conduct an extensive ablation study that demonstrates the importance of each TraceSim’s element to its final performance and robustness. 
Finally, we provide the source code for all considered methods and evaluation methodology as well as the created datasets.</description><identifier>ISSN: 1382-3256</identifier><identifier>EISSN: 1573-7616</identifier><identifier>DOI: 10.1007/s10664-021-10070-w</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Ablation ; Algorithms ; Alignment ; Compilers ; Computer Science ; Datasets ; Failure analysis ; Information retrieval ; Interpreters ; Machine learning ; Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE) ; Matching ; Programming Languages ; Reproduction (copying) ; Similarity ; Software ; Software Engineering/Programming and Operating Systems ; Source code</subject><ispartof>Empirical software engineering : an international journal, 2022-03, Vol.27 (2), Article 53</ispartof><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022</rights><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c363t-db57e39599ebe2ed9684aa99f4e12e499a9702e2bdc886d08d834cd13b66721e3</citedby><cites>FETCH-LOGICAL-c363t-db57e39599ebe2ed9684aa99f4e12e499a9702e2bdc886d08d834cd13b66721e3</cites><orcidid>0000-0001-5478-4099</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Rodrigues, Irving Muller</creatorcontrib><creatorcontrib>Khvorov, Aleksandr</creatorcontrib><creatorcontrib>Aloise, Daniel</creatorcontrib><creatorcontrib>Vasiliev, Roman</creatorcontrib><creatorcontrib>Koznov, Dmitrij</creatorcontrib><creatorcontrib>Fernandes, Eraldo 
Rezende</creatorcontrib><creatorcontrib>Chernishev, George</creatorcontrib><creatorcontrib>Luciv, Dmitry</creatorcontrib><creatorcontrib>Povarov, Nikita</creatorcontrib><title>TraceSim: An Alignment Method for Computing Stack Trace Similarity</title><title>Empirical software engineering : an international journal</title><addtitle>Empir Software Eng</addtitle><description>Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicate, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up analysis of software failures. This task is known as crash report deduplication. Given a huge volume of incoming reports, increasing quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, due to data distribution differences among software projects, parameters that are learned using machine learning algorithms are necessary to provide more flexibility to the methods. In this paper, we propose TraceSim – an approach for crash report deduplication which combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. 
It is the only approach that achieves competitive results on all datasets regarding all considered metrics. Moreover, we conduct an extensive ablation study that demonstrates the importance of each TraceSim’s element to its final performance and robustness. Finally, we provide the source code for all considered methods and evaluation methodology as well as the created datasets.</description><subject>Ablation</subject><subject>Algorithms</subject><subject>Alignment</subject><subject>Compilers</subject><subject>Computer Science</subject><subject>Datasets</subject><subject>Failure analysis</subject><subject>Information retrieval</subject><subject>Interpreters</subject><subject>Machine learning</subject><subject>Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)</subject><subject>Matching</subject><subject>Programming Languages</subject><subject>Reproduction (copying)</subject><subject>Similarity</subject><subject>Software</subject><subject>Software Engineering/Programming and Operating Systems</subject><subject>Source code</subject><issn>1382-3256</issn><issn>1573-7616</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNp9kMtOwzAURC0EEqXwA6wssTb4FTtmVypeUhGLlrXlxDclpUmKnarq3-M2SOxYzR1pzlxpELpm9JZRqu8io0pJQjkjB0_J7gSNWKYF0Yqp03SLnBPBM3WOLmJcUUqNltkIPSyCK2FeN_d40uLJul62DbQ9foP-s_O46gKeds1m29ftEs97V37hI4ETUq9dqPv9JTqr3DrC1a-O0cfT42L6Qmbvz6_TyYyUQome-CLTIExmDBTAwRuVS-eMqSQwDtIYZzTlwAtf5rnyNPe5kKVnolBKcwZijG6G3k3ovrcQe7vqtqFNLy1XQipNVdIx4kOqDF2MASq7CXXjwt4yag_b2GErm7Y6emp3CRIDFFO4XUL4q_6H-gFqVWu9</recordid><startdate>20220301</startdate><enddate>20220301</enddate><creator>Rodrigues, Irving Muller</creator><creator>Khvorov, Aleksandr</creator><creator>Aloise, Daniel</creator><creator>Vasiliev, Roman</creator><creator>Koznov, Dmitrij</creator><creator>Fernandes, Eraldo Rezende</creator><creator>Chernishev, George</creator><creator>Luciv, Dmitry</creator><creator>Povarov, 
Nikita</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>L6V</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>S0W</scope><orcidid>https://orcid.org/0000-0001-5478-4099</orcidid></search><sort><creationdate>20220301</creationdate><title>TraceSim: An Alignment Method for Computing Stack Trace Similarity</title><author>Rodrigues, Irving Muller ; Khvorov, Aleksandr ; Aloise, Daniel ; Vasiliev, Roman ; Koznov, Dmitrij ; Fernandes, Eraldo Rezende ; Chernishev, George ; Luciv, Dmitry ; Povarov, Nikita</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c363t-db57e39599ebe2ed9684aa99f4e12e499a9702e2bdc886d08d834cd13b66721e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Ablation</topic><topic>Algorithms</topic><topic>Alignment</topic><topic>Compilers</topic><topic>Computer Science</topic><topic>Datasets</topic><topic>Failure analysis</topic><topic>Information retrieval</topic><topic>Interpreters</topic><topic>Machine learning</topic><topic>Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)</topic><topic>Matching</topic><topic>Programming Languages</topic><topic>Reproduction (copying)</topic><topic>Similarity</topic><topic>Software</topic><topic>Software Engineering/Programming and Operating Systems</topic><topic>Source code</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Rodrigues, Irving 
Muller</creatorcontrib><creatorcontrib>Khvorov, Aleksandr</creatorcontrib><creatorcontrib>Aloise, Daniel</creatorcontrib><creatorcontrib>Vasiliev, Roman</creatorcontrib><creatorcontrib>Koznov, Dmitrij</creatorcontrib><creatorcontrib>Fernandes, Eraldo Rezende</creatorcontrib><creatorcontrib>Chernishev, George</creatorcontrib><creatorcontrib>Luciv, Dmitry</creatorcontrib><creatorcontrib>Povarov, Nikita</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central</collection><collection>Advanced Technologies & Aerospace Database‎ (1962 - current)</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection (Proquest) (PQ_SDU_P3)</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Engineering Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>DELNET Engineering & Technology 
Collection</collection><jtitle>Empirical software engineering : an international journal</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Rodrigues, Irving Muller</au><au>Khvorov, Aleksandr</au><au>Aloise, Daniel</au><au>Vasiliev, Roman</au><au>Koznov, Dmitrij</au><au>Fernandes, Eraldo Rezende</au><au>Chernishev, George</au><au>Luciv, Dmitry</au><au>Povarov, Nikita</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>TraceSim: An Alignment Method for Computing Stack Trace Similarity</atitle><jtitle>Empirical software engineering : an international journal</jtitle><stitle>Empir Software Eng</stitle><date>2022-03-01</date><risdate>2022</risdate><volume>27</volume><issue>2</issue><artnum>53</artnum><issn>1382-3256</issn><eissn>1573-7616</eissn><abstract>Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicate, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up analysis of software failures. This task is known as crash report deduplication. Given a huge volume of incoming reports, increasing quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, due to data distribution differences among software projects, parameters that are learned using machine learning algorithms are necessary to provide more flexibility to the methods. 
In this paper, we propose TraceSim – an approach for crash report deduplication which combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. It is the only approach that achieves competitive results on all datasets regarding all considered metrics. Moreover, we conduct an extensive ablation study that demonstrates the importance of each TraceSim’s element to its final performance and robustness. Finally, we provide the source code for all considered methods and evaluation methodology as well as the created datasets.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s10664-021-10070-w</doi><orcidid>https://orcid.org/0000-0001-5478-4099</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1382-3256
ispartof	Empirical software engineering : an international journal, 2022-03, Vol.27 (2), Article 53
issn	1382-3256 1573-7616
language	eng
recordid	cdi_proquest_journals_2634670626
source	Springer Link
subjects	Ablation ; Algorithms ; Alignment ; Compilers ; Computer Science ; Datasets ; Failure analysis ; Information retrieval ; Interpreters ; Machine learning ; Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE) ; Matching ; Programming Languages ; Reproduction (copying) ; Similarity ; Software ; Software Engineering/Programming and Operating Systems ; Source code
title	TraceSim: An Alignment Method for Computing Stack Trace Similarity
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T16%3A18%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=TraceSim:%20An%20Alignment%20Method%20for%20Computing%20Stack%20Trace%20Similarity&rft.jtitle=Empirical%20software%20engineering%20:%20an%20international%20journal&rft.au=Rodrigues,%20Irving%20Muller&rft.date=2022-03-01&rft.volume=27&rft.issue=2&rft.artnum=53&rft.issn=1382-3256&rft.eissn=1573-7616&rft_id=info:doi/10.1007/s10664-021-10070-w&rft_dat=%3Cproquest_cross%3E2634670626%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c363t-db57e39599ebe2ed9684aa99f4e12e499a9702e2bdc886d08d834cd13b66721e3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2634670626&rft_id=info:pmid/&rfr_iscdi=true
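
Illustrative note: the description field states that TraceSim combines TF-IDF with optimum global alignment so that frame position, subroutine global frequency, and unmatched frames all affect the similarity of two stack traces. The sketch below shows one way such a combination could look. It is not the authors' formula and does not include the learned parameters the abstract mentions. The function names (`idf_weights`, `align_score`, `similarity`), the smoothed IDF formula, and the choice to penalise gaps and mismatches by frame weight are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_weights(traces):
    """Smoothed inverse document frequency of each frame (assumed formula)."""
    n = len(traces)
    df = Counter(frame for trace in traces for frame in set(trace))
    # Frames that occur in many traces get a weight close to 1.0.
    return {f: math.log((n + 1) / (c + 1)) + 1.0 for f, c in df.items()}


def align_score(a, b, idf, default=1.0):
    """Needleman-Wunsch global alignment score with IDF-weighted frames.

    A matched frame adds its weight; an unmatched frame (gap) subtracts
    its weight; a mismatch subtracts the weights of both frames.
    """
    def w(frame):
        return idf.get(frame, default)

    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] - w(a[i - 1])
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] - w(b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                diag = dp[i - 1][j - 1] + w(a[i - 1])
            else:
                diag = dp[i - 1][j - 1] - w(a[i - 1]) - w(b[j - 1])
            up = dp[i - 1][j] - w(a[i - 1])
            left = dp[i][j - 1] - w(b[j - 1])
            dp[i][j] = max(diag, up, left)
    return dp[-1][-1]


def similarity(a, b, idf, default=1.0):
    """Alignment score scaled so that two identical traces score 1.0."""
    total = sum(idf.get(f, default) for f in a + b)
    return 2.0 * align_score(a, b, idf, default) / total if total else 0.0


# Hypothetical corpus: each trace is a list of frame names, top frame first.
corpus = [
    ["main", "run", "parse"],
    ["main", "run", "render"],
    ["main", "run", "parse"],
    ["main", "init", "load"],
]
weights = idf_weights(corpus)
print(similarity(corpus[0], corpus[1], weights))
print(similarity(corpus[0], corpus[3], weights))
```

Under these assumptions, a frame such as `main` that appears in every trace contributes little when matched, while rare frames dominate both the reward for a match and the penalty for being left unmatched.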