Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education

Bibliographic Details
Published in: Journal of General Internal Medicine: JGIM, 2024-10
Main Authors: Sreedhar, Radhika, Chang, Linda, Gangopadhyaya, Ananya, Shiels, Peggy Woziwodzki, Loza, Julie, Chi, Euna, Gabel, Elizabeth, Park, Yoon Soo
Format: Article
Language: English
Description
Summary: The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments, but the individual feedback requires significant faculty time. As large language models (LLMs) can score work and generate feedback, we explored their use in grading formative assessments through validity and feasibility lenses. Our objective was to explore the consistency and feasibility of using an LLM to assess and provide feedback on formative assessments in undergraduate medical education. This was a cross-sectional study of pre-clinical students' critical appraisal assignments at the University of Illinois College of Medicine (UICOM) during the 2022-2023 academic year. An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared with the existing faculty grade. Differences between ChatGPT and faculty in the scoring of individual items were assessed. Scoring consistency, measured as inter-rater reliability (IRR), was calculated as percent exact agreement. A chi-squared test was used to determine whether there were significant differences in scores. Psychometric characteristics, including internal-consistency reliability, area under the precision-recall curve (AUCPR), and cost, were studied. In this cross-sectional study, faculty-graded assignments from 111 pre-clinical students were compared with ChatGPT's scoring, and the scoring of individual items was comparable. The overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P …
ISSN: 0884-8734
1525-1497
DOI: 10.1007/s11606-024-09050-9
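
The abstract above reports scoring consistency as percent exact agreement, a chi-squared test of score differences, and the area under the precision-recall curve (AUCPR). The sketch below is an illustrative reconstruction of how such metrics could be computed, assuming binary per-item scores; the example data, the variable names faculty_scores and llm_scores, and the choice of SciPy and scikit-learn are assumptions for illustration, not the authors' analysis code.

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import average_precision_score

# Hypothetical per-item scores (0 or 1) for one assignment; real rubrics may use more levels.
faculty_scores = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
llm_scores = np.array([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])

# Inter-rater reliability as percent exact agreement, as described in the abstract.
percent_agreement = 100 * np.mean(faculty_scores == llm_scores)

# Chi-squared test on the 2x2 faculty-vs-LLM contingency table of item scores.
table = np.zeros((2, 2), dtype=int)
for f, l in zip(faculty_scores, llm_scores):
    table[f, l] += 1
chi2, p_value, dof, expected = chi2_contingency(table)

# AUCPR, treating the faculty score as the reference standard and the LLM score as the prediction.
aucpr = average_precision_score(faculty_scores, llm_scores)

print(f"Percent exact agreement: {percent_agreement:.1f}%")
print(f"Chi-squared = {chi2:.2f}, p = {p_value:.3f}")
print(f"AUCPR = {aucpr:.2f}")

In practice these metrics would be computed per item across all 111 assignments rather than on a single toy example, but the calculation pattern is the same.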