LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation
Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, given NL is informal, it does not lend easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow TiCoder for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
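The abstract describes an interactive, test-driven pruning loop: the LLM proposes candidate implementations and candidate tests, the user says whether each proposed test reflects their intent, and candidates inconsistent with those answers are discarded. The sketch below is an illustrative approximation of that kind of workflow, not the authors' TiCoder implementation; `generate_candidates`, `generate_tests`, and the `ask_user` callback are hypothetical placeholders for LLM calls and user interaction.

```python
"""Illustrative sketch of a test-driven interactive code-generation loop.

This is NOT the authors' TiCoder implementation; it only mirrors the workflow
the abstract describes: candidate programs and candidate tests are generated,
the user approves or rejects each proposed test, and candidates that disagree
with the user's answers are pruned.
"""
from typing import Callable, List, Tuple

# Hypothetical placeholders for LLM calls (assumptions, not a real API).
def generate_candidates(nl_intent: str, n: int) -> List[str]:
    """Return up to n candidate implementations as Python source strings."""
    raise NotImplementedError("plug in an LLM code-generation call here")

def generate_tests(nl_intent: str, n: int) -> List[Tuple[object, object]]:
    """Return up to n candidate tests as (input, expected_output) pairs."""
    raise NotImplementedError("plug in an LLM test-generation call here")

def passes(candidate_src: str, fn_name: str, arg: object, expected: object) -> bool:
    """Run one candidate on one test case; any failure counts as not passing."""
    env: dict = {}
    try:
        exec(candidate_src, env)  # in real use, sandbox untrusted generated code
        return env[fn_name](arg) == expected
    except Exception:
        return False

def interactive_codegen(nl_intent: str, fn_name: str,
                        ask_user: Callable[[object, object], bool],
                        n_candidates: int = 10, max_queries: int = 5):
    """Prune candidate programs with up to max_queries yes/no test questions."""
    candidates = generate_candidates(nl_intent, n_candidates)
    approved_tests: List[Tuple[object, object]] = []
    for arg, expected in generate_tests(nl_intent, max_queries):
        if not candidates:
            break
        if ask_user(arg, expected):
            # User confirmed the test reflects their intent:
            # keep only candidates that satisfy it.
            approved_tests.append((arg, expected))
            candidates = [c for c in candidates if passes(c, fn_name, arg, expected)]
        else:
            # User rejected the proposed behaviour:
            # drop candidates that agree with the rejected test.
            candidates = [c for c in candidates if not passes(c, fn_name, arg, expected)]
    # Surviving candidates (first one would be the top suggestion) plus the
    # approved tests, which double as automatically generated unit tests.
    return candidates, approved_tests
```

In the large-scale evaluation the abstract mentions, the user's answers would instead be simulated by an idealized oracle that consults a reference solution, and pass@1 would be measured on the top-ranked surviving candidate.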
Published in: | IEEE transactions on software engineering 2024-09, Vol.50 (9), p.2254-2268 |
---|---|
Main Authors: | Fakhoury, Sarah; Naik, Aaditya; Sakkas, Georgios; Chakraborty, Saikat; Lahiri, Shuvendu K. |
Format: | Article |
Language: | English |
Subjects: | Accuracy; Artificial intelligence; Benchmark testing; code generation; Codes; cognitive load; Datasets; human factors; Intent disambiguation; Large language models; LLMs; Natural languages; Python; Task analysis; test generation; Workflow |
Citations: | Items that this one cites |
Online Access: | Get full text |
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c217t-abfbbced29467f8d848598df9898c42559e6247dae3b80d8f824d16d5859d1bc3 |
container_end_page | 2268 |
container_issue | 9 |
container_start_page | 2254 |
container_title | IEEE transactions on software engineering |
container_volume | 50 |
creator | Fakhoury, Sarah; Naik, Aaditya; Sakkas, Georgios; Chakraborty, Saikat; Lahiri, Shuvendu K. |
description | Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, given NL is informal, it does not lend easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow TiCoder for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests. |
doi_str_mv | 10.1109/TSE.2024.3428972 |
format | article |
fulltext | fulltext |
identifier | ISSN: 0098-5589 |
ispartof | IEEE transactions on software engineering, 2024-09, Vol.50 (9), p.2254-2268 |
issn | 0098-5589; 1939-3520 |
language | eng |
recordid | cdi_proquest_journals_3106491351 |
source | IEEE Electronic Library (IEL) Journals |
subjects | Accuracy; Artificial intelligence; Benchmark testing; code generation; Codes; cognitive load; Datasets; human factors; Intent disambiguation; Large language models; LLMs; Natural languages; Python; Task analysis; test generation; Workflow |
title | LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T13%3A15%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=LLM-Based%20Test-Driven%20Interactive%20Code%20Generation:%20User%20Study%20and%20Empirical%20Evaluation&rft.jtitle=IEEE%20transactions%20on%20software%20engineering&rft.au=Fakhoury,%20Sarah&rft.date=2024-09-01&rft.volume=50&rft.issue=9&rft.spage=2254&rft.epage=2268&rft.pages=2254-2268&rft.issn=0098-5589&rft.eissn=1939-3520&rft.coden=IESEDJ&rft_id=info:doi/10.1109/TSE.2024.3428972&rft_dat=%3Cproquest_cross%3E3106491351%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c217t-abfbbced29467f8d848598df9898c42559e6247dae3b80d8f824d16d5859d1bc3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3106491351&rft_id=info:pmid/&rft_ieee_id=10606356&rfr_iscdi=true |