LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation

Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, because NL is informal, it does not lend itself easily to checking that the generated code correctly satisfies the user's intent. In this paper, we propose TiCoder, a novel interactive workflow for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the workflow's effectiveness in improving code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy across both datasets and all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
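
As a rough illustration of the workflow summarized in the abstract (not the paper's actual implementation), the Python sketch below mimics a TiCoder-style clarification loop: the model proposes candidate implementations and candidate tests, an idealized user proxy accepts or rejects each suggested test against a hidden reference intent, and that feedback prunes the candidate pool before the top-ranked survivor is scored for pass@1. Everything here is an assumption for illustration: the tiny candidate pool, the test encoding, and names such as `clarification_loop` and `user_approves` are hypothetical stand-ins for the LLM and user-feedback components.

```python
"""Toy sketch of a test-driven clarification loop (illustrative assumptions, not the paper's code)."""
from typing import Callable, List, Tuple

# Hypothetical LLM output: candidate implementations of "absolute difference of a and b".
CANDIDATES: List[Callable[[int, int], int]] = [
    lambda a, b: a - b,            # plausible but wrong when a < b
    lambda a, b: abs(a - b),       # matches the intended behaviour
    lambda a, b: abs(a) - abs(b),  # plausible but wrong in general
]

# Hypothetical LLM-proposed tests, each encoded as (input_a, input_b, expected_output).
CANDIDATE_TESTS: List[Tuple[int, int, int]] = [(2, 5, 3), (5, 2, 3), (-2, -5, 3)]

def reference(a: int, b: int) -> int:
    """Hidden ground-truth intent, used only to simulate the idealized user."""
    return abs(a - b)

def user_approves(a: int, b: int, expected: int) -> bool:
    """Idealized user proxy: approve a suggested test iff it agrees with the true intent."""
    return reference(a, b) == expected

def clarification_loop(candidates, tests, max_interactions: int = 5):
    """Ask about one suggested test at a time and prune candidates with the feedback."""
    approved = []
    for a, b, expected in tests[:max_interactions]:
        if user_approves(a, b, expected):
            approved.append((a, b, expected))
            candidates = [f for f in candidates if f(a, b) == expected]   # keep consistent code
        else:
            candidates = [f for f in candidates if f(a, b) != expected]   # a rejected test also prunes
        if len(candidates) <= 1:
            break
    return candidates, approved

if __name__ == "__main__":
    survivors, approved_tests = clarification_loop(CANDIDATES, CANDIDATE_TESTS)
    # pass@1: does the top-ranked surviving candidate match the hidden intent on held-out inputs?
    top = survivors[0]
    pass_at_1 = all(top(a, b) == reference(a, b) for a, b in [(2, 5), (5, 2), (-7, 4), (0, 0)])
    print(f"approved tests: {approved_tests}; pass@1: {pass_at_1}")
```

In the paper's at-scale experiments the user is simulated by this kind of oracle (the "idealized proxy for user feedback"), and tests approved during the dialogue can be kept as the accompanying unit tests mentioned in the abstract.
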
Bibliographic Details
Published in: IEEE Transactions on Software Engineering, 2024-09, Vol. 50 (9), p. 2254-2268
Main Authors: Fakhoury, Sarah; Naik, Aaditya; Sakkas, Georgios; Chakraborty, Saikat; Lahiri, Shuvendu K.
Format: Article
Language: English
Subjects: Accuracy; Artificial intelligence; Benchmark testing; code generation; Codes; cognitive load; Datasets; human factors; Intent disambiguation; Large language models; LLMs; Natural languages; Python; Task analysis; test generation; Workflow
DOI: 10.1109/TSE.2024.3428972
ISSN: 0098-5589
EISSN: 1939-3520