A Comparative Study of Responses to Retina Questions from either Experts, Expert-Edited Large Language Models (LLMs) or LLMs Alone

Bibliographic Details
Published in: Ophthalmology Science (Online), 2024, p. 100485
Main Authors: Tailor, Prashant D., M.D.; Dalvin, Lauren A., M.D.; Chen, John J., M.D., Ph.D.; Iezzi, Raymond, M.D.; Olsen, Timothy W., M.D.; Scruggs, Brittni A., M.D., Ph.D.; Barkmeier, Andrew J., M.D.; Bakri, Sophie J., M.D.; Ryan, Edwin H., M.D.; Tang, Peter H., M.D., Ph.D.; Parke, D. Wilkin, M.D.; Belin, Peter J., M.D.; Sridhar, Jayanth, M.D.; Xu, David, M.D.; Kuriyan, Ajay E., M.D.; Yonekawa, Yoshihiro, M.D.; Starr, Matthew R., M.D.
Format: Article
Language: English
Summary:
Objective: To assess the quality, empathy, and safety of expert-edited large language model (LLM), human expert-created, and LLM responses to common retina patient questions.
Design: Randomized, masked multicenter study.
Participants: Twenty-one common retina patient questions were randomly assigned among 13 retina specialists. Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to that question (Expert+AI), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, Bard) also generated responses to each question. The original question, along with anonymized and randomized Expert+AI, Expert, and LLM responses, was evaluated by the other experts who did not write an expert response to the question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).
Main Outcome: Mean quality and empathy scores, and the proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.
Results: There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (p
ISSN: 2666-9145
DOI: 10.1016/j.xops.2024.100485