LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation

Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, because NL is informal, it does not lend itself easily to checking that the generated code correctly satisfies the user's intent. In this paper, we propose TiCoder, a novel interactive workflow for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the workflow's effectiveness in improving code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy across both datasets and all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
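
As a rough illustration of the workflow summarized in the abstract (not the paper's actual implementation), the Python sketch below mimics a TiCoder-style clarification loop: the model proposes candidate implementations and candidate tests, an idealized user proxy accepts or rejects each suggested test against a hidden reference intent, and that feedback prunes the candidate pool before the top-ranked survivor is scored for pass@1. Everything here is an assumption for illustration: the tiny candidate pool, the test encoding, and names such as `clarification_loop` and `user_approves` are hypothetical stand-ins for the LLM and user-feedback components.

```python
"""Toy sketch of a test-driven clarification loop (illustrative assumptions, not the paper's code)."""
from typing import Callable, List, Tuple

# Hypothetical LLM output: candidate implementations of "absolute difference of a and b".
CANDIDATES: List[Callable[[int, int], int]] = [
    lambda a, b: a - b,            # plausible but wrong when a < b
    lambda a, b: abs(a - b),       # matches the intended behaviour
    lambda a, b: abs(a) - abs(b),  # plausible but wrong in general
]

# Hypothetical LLM-proposed tests, each encoded as (input_a, input_b, expected_output).
CANDIDATE_TESTS: List[Tuple[int, int, int]] = [(2, 5, 3), (5, 2, 3), (-2, -5, 3)]

def reference(a: int, b: int) -> int:
    """Hidden ground-truth intent, used only to simulate the idealized user."""
    return abs(a - b)

def user_approves(a: int, b: int, expected: int) -> bool:
    """Idealized user proxy: approve a suggested test iff it agrees with the true intent."""
    return reference(a, b) == expected

def clarification_loop(candidates, tests, max_interactions: int = 5):
    """Ask about one suggested test at a time and prune candidates with the feedback."""
    approved = []
    for a, b, expected in tests[:max_interactions]:
        if user_approves(a, b, expected):
            approved.append((a, b, expected))
            candidates = [f for f in candidates if f(a, b) == expected]   # keep consistent code
        else:
            candidates = [f for f in candidates if f(a, b) != expected]   # a rejected test also prunes
        if len(candidates) <= 1:
            break
    return candidates, approved

if __name__ == "__main__":
    survivors, approved_tests = clarification_loop(CANDIDATES, CANDIDATE_TESTS)
    # pass@1: does the top-ranked surviving candidate match the hidden intent on held-out inputs?
    top = survivors[0]
    pass_at_1 = all(top(a, b) == reference(a, b) for a, b in [(2, 5), (5, 2), (-7, 4), (0, 0)])
    print(f"approved tests: {approved_tests}; pass@1: {pass_at_1}")
```

In the paper's at-scale experiments the user is simulated by this kind of oracle (the "idealized proxy for user feedback"), and tests approved during the dialogue can be kept as the accompanying unit tests mentioned in the abstract.
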
Bibliographic Details
Published in: IEEE Transactions on Software Engineering, 2024-09, Vol. 50 (9), p. 2254-2268
Main Authors: Fakhoury, Sarah; Naik, Aaditya; Sakkas, Georgios; Chakraborty, Saikat; Lahiri, Shuvendu K.
Format: Article
Language: English
Subjects: Accuracy; Artificial intelligence; Benchmark testing; code generation; Codes; cognitive load; Datasets; human factors; Intent disambiguation; Large language models; LLMs; Natural languages; Python; Task analysis; test generation; Workflow
DOI: 10.1109/TSE.2024.3428972
ISSN: 0098-5589
EISSN: 1939-3520