
Digital Ink and Surgical Dreams: Perceptions of Artificial Intelligence–Generated Essays in Residency Applications

Bibliographic Details
Published in: The Journal of Surgical Research, 2024-09, Vol. 301, p. 504-511
Main Authors: Crawford, Loralai M., Hendzlik, Peter, Lam, Justine, Cannon, Lisa M., Qi, Yanjie, DeCaporale-Ryan, Lauren, Wilson, Nicole A.
Format: Article
Language: English
Description
Summary: Large language models like Chat Generative Pre-Trained Transformer (ChatGPT) are increasingly used in academic writing, and faculty may consider the use of artificial intelligence (AI)–generated responses a form of cheating. We sought to determine whether general surgery residency faculty could distinguish AI-generated from human-written responses to a text prompt, hypothesizing that they could not do so reliably. Ten essays were generated from the prompt "Tell us in 1-2 paragraphs why you are considering the University of Rochester for General Surgery residency" (current trainees: n = 5; ChatGPT: n = 5). Ten blinded faculty reviewers rated each essay on a ten-point Likert scale for desire to interview, relevance to general surgery residency, and overall impression, and judged whether it was AI- or human-generated; scores and identification error rates were compared between the groups. There were no differences between groups in percent total points (ChatGPT 66.0 ± 13.5% vs. human 70.0 ± 23.0%, P = 0.508) or identification error rates (ChatGPT 40.0 ± 35.0% vs. human 20.0 ± 30.0%, P = 0.175). All but one essay was misidentified by at least two reviewers. Essays believed to be human-generated received higher overall impression scores (area under the curve: 0.82 ± 0.04, P
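The group comparisons and ROC analysis described above can be illustrated with a short statistical sketch. The abstract does not name the specific tests used, so the Mann-Whitney U test and the ROC AUC call below are assumptions, and all input arrays are hypothetical placeholders rather than study data (Python, using SciPy and scikit-learn):

    import numpy as np
    from scipy.stats import mannwhitneyu
    from sklearn.metrics import roc_auc_score

    # Hypothetical percent-total-point scores for the five essays per group.
    chatgpt = np.array([66.0, 52.0, 70.0, 81.0, 61.0])
    human = np.array([70.0, 93.0, 48.0, 72.0, 67.0])

    # Two-sided nonparametric group comparison (test choice is an assumption;
    # the abstract reports only P values, not which test was used).
    stat, p = mannwhitneyu(chatgpt, human, alternative="two-sided")
    print(f"Mann-Whitney U = {stat:.1f}, P = {p:.3f}")

    # ROC AUC: how well overall impression scores separate essays that
    # reviewers labeled human-written (1) from those labeled AI-generated (0).
    labeled_human = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])  # placeholder labels
    impression = np.array([8, 9, 4, 7, 5, 3, 8, 4, 9, 7])     # placeholder scores
    print(f"AUC = {roc_auc_score(labeled_human, impression):.2f}")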
ISSN: 0022-4804
1095-8673
DOI: 10.1016/j.jss.2024.06.020