
A text classification approach to API type resolution for incomplete code snippets

The Stack Overflow Q&A platform boasts an active community of users who often include code snippets in their questions and answers. Several development tools rely on these code snippets as a source of information. Although code snippets are intended as examples for humans, they may not form comp...

Bibliographic Details
Published in:	Science of computer programming 2023-04, Vol.227, p.102941, Article 102941
Main Authors:	Velázquez-Rodríguez, Camilo; Di Nucci, Dario; De Roover, Coen
Format:	Article
Language:	English
Subjects:	Fully qualified name resolution; Machine learning; Stack overflow; Text classification

cited_by	cdi_FETCH-LOGICAL-c303t-8ac91671c2b3ce30cfa37a2dcc42a0012ace6b323e2d56ce04c575af98e3c4173
cites	cdi_FETCH-LOGICAL-c303t-8ac91671c2b3ce30cfa37a2dcc42a0012ace6b323e2d56ce04c575af98e3c4173
container_end_page
container_issue
container_start_page	102941
container_title	Science of computer programming
container_volume	227
creator	Velázquez-Rodríguez, Camilo; Di Nucci, Dario; De Roover, Coen
description	The Stack Overflow Q&A platform boasts an active community of users who often include code snippets in their questions and answers. Several development tools rely on these code snippets as a source of information. Although code snippets are intended as examples for humans, they may not form compilation units. For instance, snippets illustrating how to use an API might lack the import statements for the corresponding API types. Thus, it becomes essential to determine the fully-qualified name of API types in incomplete snippets. We present RESICO, a machine learning-based text classification approach to resolving the simple name of API types to their fully-qualified names. RESICO is trained on a corpus of Java programs for which a compiler can determine the fully-qualified names. For four machine learning classifiers, we evaluate the type resolution accuracy of the resulting models on the original and an extended version of datasets of snippets previously used to evaluate the current state-of-the-art approach based on information retrieval. Results show that our approach outperforms the state-of-the-art one, although the training phase is slightly slower. We observe that most of the incorrect type resolutions are not due to ambiguities among the simple names for API types but due to similarities among the contexts in which these types are used, representing a future research challenge. Highlights: • Stack Overflow code snippets might lack information about referenced API types. • RESICO is an approach to resolve simple API names to their fully qualified versions. • RESICO encodes API references and their contexts to train a classification algorithm. • Our approach outperforms the state-of-the-art COSTER in an in-depth evaluation. • Mispredictions by the models are mainly due to similar contexts around API usages.
doi_str_mv	10.1016/j.scico.2023.102941
format	article
fulltext	fulltext
identifier	ISSN: 0167-6423
ispartof	Science of computer programming, 2023-04, Vol.227, p.102941, Article 102941
issn	0167-6423 1872-7964
language	eng
recordid	cdi_crossref_primary_10_1016_j_scico_2023_102941
source	Elsevier
subjects	Fully qualified name resolution; Machine learning; Stack overflow; Text classification
title	A text classification approach to API type resolution for incomplete code snippets
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T17%3A04%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20text%20classification%20approach%20to%20API%20type%20resolution%20for%20incomplete%20code%20snippets&rft.jtitle=Science%20of%20computer%20programming&rft.au=Vel%C3%A1zquez-Rodr%C3%ADguez,%20Camilo&rft.date=2023-04&rft.volume=227&rft.spage=102941&rft.pages=102941-&rft.artnum=102941&rft.issn=0167-6423&rft.eissn=1872-7964&rft_id=info:doi/10.1016/j.scico.2023.102941&rft_dat=%3Celsevier_cross%3ES0167642323000230%3C/elsevier_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c303t-8ac91671c2b3ce30cfa37a2dcc42a0012ace6b323e2d56ce04c575af98e3c4173%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
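
The description field above frames type resolution as text classification: an API reference and its surrounding context are encoded as text, and a classifier predicts the fully-qualified name. The following is a minimal sketch of that general idea, not the RESICO implementation. The record does not name the four classifiers evaluated or the exact encoding, so the multinomial naive Bayes model, the `tokenize` helper, the `ContextClassifier` class, and the toy `TRAINING` samples are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(simple_name, context):
    """Hypothetical encoding: the simple name plus lowercased context tokens."""
    return [f"name={simple_name}"] + context.lower().split()


class ContextClassifier:
    """Multinomial naive Bayes with add-one smoothing (a stand-in classifier)."""

    def __init__(self):
        self.label_counts = Counter()             # samples seen per fully-qualified name
        self.token_counts = defaultdict(Counter)  # token frequencies per fully-qualified name
        self.vocabulary = set()

    def fit(self, samples):
        for simple_name, context, fqn in samples:
            tokens = tokenize(simple_name, context)
            self.label_counts[fqn] += 1
            self.token_counts[fqn].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, simple_name, context):
        tokens = tokenize(simple_name, context)
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of every observed token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + vocab_size
            for token in tokens:
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-written samples (invented for this sketch): simple name, context, label.
# In the setting described above, labels would come from compilable Java programs.
TRAINING = [
    ("List", "new ArrayList add get size", "java.util.List"),
    ("List", "iterator stream collect toList", "java.util.List"),
    ("List", "setLayout Frame add select getSelectedItem", "java.awt.List"),
    ("Date", "new Date getTime SimpleDateFormat format", "java.util.Date"),
    ("Date", "ResultSet getDate valueOf PreparedStatement", "java.sql.Date"),
]

model = ContextClassifier().fit(TRAINING)
resolved = model.predict("List", "ArrayList size get")
print(resolved)
```

In this sketch the simple name alone cannot separate `java.util.List` from `java.awt.List`; the context tokens do. That also suggests why, as the description notes, similar usage contexts are a likely source of mispredictions.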