
InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees

Bibliographic Details
Main Authors: Bui, Nghi D. Q., Yu, Yijun, Jiang, Lingxiao
Format: Conference Proceeding
Language:English
Subjects:
Online Access: Request full text
container_end_page 1197
container_start_page 1186
container_title 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)
creator Bui, Nghi D. Q.
Yu, Yijun
Jiang, Lingxiao
description Learning code representations has found many uses in software engineering, such as code classification, code search, comment generation, and bug prediction. Although representations of code as tokens, syntax trees, dependency graphs, paths in trees, or combinations of their variants have been proposed, existing learning techniques have a major limitation: these models are often trained on datasets labeled for specific downstream tasks, so the resulting code representations may not be suitable for other tasks. Even the techniques that generate representations from unlabeled code are far from satisfactory when applied to downstream tasks. To overcome this limitation, this paper proposes InferCode, which adapts the self-supervised learning idea from natural language processing to the abstract syntax trees (ASTs) of code. The novelty lies in training code representations to predict subtrees automatically identified from the contexts of ASTs. With InferCode, subtrees in ASTs are treated as the labels for training the code representations, without any human labelling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream task or code unit. We have trained an instance of the InferCode model, using a Tree-Based Convolutional Neural Network (TBCNN) as the encoder, on a large set of Java code. This pre-trained model can then be applied to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, or be reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to prior techniques applied to the same downstream tasks, such as code2vec, code2seq, and ASTNN, our pre-trained InferCode model achieves higher performance by a significant margin on most of the tasks, including those involving different programming languages. The implementation of InferCode and the trained embeddings are available at https://github.com/bdqnghi/infercode.
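To make the pre-training objective described above concrete, here is a minimal, self-contained PyTorch sketch of the idea: encode a code unit from its AST and train that encoding to predict which subtrees occur in it, so that subtrees act as free pseudo-labels. This is not the authors' implementation; the toy mean-pooling encoder (standing in for TBCNN), the fixed subtree vocabulary, and the multi-label loss are all simplifying assumptions made here for brevity.

```python
# A minimal sketch of InferCode-style self-supervised pre-training,
# NOT the authors' implementation (that uses a TBCNN encoder over real
# ASTs; see https://github.com/bdqnghi/infercode). All names here
# (ToyTreeEncoder, SUBTREE_VOCAB_SIZE, ...) are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy AST node-type vocabulary; a real parser yields far more types.
NODE_TYPES = ["method", "if", "for", "call", "assign", "return"]
node_vocab = {t: i for i, t in enumerate(NODE_TYPES)}

# A "subtree" is abstracted to an id in a fixed vocabulary, e.g. a hash
# of its serialized node-type sequence (simplifying assumption).
SUBTREE_VOCAB_SIZE = 1000

class ToyTreeEncoder(nn.Module):
    """Stand-in for TBCNN: embeds node types and mean-pools them into
    one vector representing the whole code unit."""
    def __init__(self, dim=64):
        super().__init__()
        self.node_emb = nn.Embedding(len(NODE_TYPES), dim)

    def forward(self, node_type_ids):                    # (num_nodes,)
        return self.node_emb(node_type_ids).mean(dim=0)  # (dim,)

class InferCodeSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = ToyTreeEncoder(dim)
        # One logit per subtree in the vocabulary.
        self.subtree_head = nn.Linear(dim, SUBTREE_VOCAB_SIZE)

    def forward(self, node_type_ids):
        v = self.encoder(node_type_ids)      # task-agnostic code vector
        return v, self.subtree_head(v)

model = InferCodeSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake training example: a code unit with five AST nodes that contains
# subtrees 3, 17 and 42 (in practice these are extracted from the AST
# itself, which is what makes the signal free of human labelling).
nodes = torch.tensor([node_vocab["method"], node_vocab["if"],
                      node_vocab["call"], node_vocab["assign"],
                      node_vocab["return"]])
target = torch.zeros(SUBTREE_VOCAB_SIZE)
target[[3, 17, 42]] = 1.0

for step in range(10):
    v, logits = model(nodes)
    # Multi-label objective: predict every subtree present in the unit.
    loss = F.binary_cross_entropy_with_logits(logits, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After pre-training, the vector v serves as a task-agnostic code embedding: for example, the cosine similarity of two such vectors can act as an unsupervised clone-detection signal, and the encoder weights can be fine-tuned for supervised tasks such as method name prediction, matching the downstream uses listed in the description.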
doi_str_mv 10.1109/ICSE43902.2021.00109
format conference_proceeding
fulltext fulltext_linktorsrc
identifier ISSN: 1558-1225; ISBN: 1450390854, 9781450390859, 1665402962, 9781665402965; DOI: 10.1109/ICSE43902.2021.00109
ispartof 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021, p.1186-1197
issn 1558-1225
language eng
recordid cdi_ieee_primary_9402028
source IEEE Xplore All Conference Series
subjects Cloning
code clone detection
code retrieval
code search
Computer bugs
Computing methodologies
Computing methodologies -- Machine learning
Computing methodologies -- Machine learning -- Machine learning approaches
cross language
fine tuning
Predictive models
self supervised
Software and its engineering
Software and its engineering -- Software creation and management
Software and its engineering -- Software creation and management -- Software development techniques
Software and its engineering -- Software creation and management -- Software post-development issues
Software and its engineering -- Software notations and tools
Software and its engineering -- Software notations and tools -- General programming languages
Software engineering
Syntactics
Task analysis
Training
unlabelled data
title InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees
url https://doi.org/10.1109/ICSE43902.2021.00109