Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
Published in: | Blood 2023-11, Vol.142 (Supplement 1), p.3726-3726 |
---|---|
Format: | Article |
Language: | English |
Summary: | Introduction:
Large Language Models (LLMs) are a form of Artificial Intelligence (AI) that, by identifying patterns and connections within data, can predict the most likely words or phrases in specific contexts. Previous studies have indicated that GPT (Generative Pre-trained Transformer; OpenAI) performs well in answering single-choice clinical questions. However, its performance seems to be less satisfactory when dealing with multiple-choice questions and more intricate clinical cases (Cosima et al. 2023 EAO; Cascella et al. 2023 J Med Syst). Notably, no study has evaluated LLM responses in the context of transplantation decision making, a complex process heavily reliant on physician expertise. Additionally, most studies have focused solely on GPT's performance, without considering other competitive LLMs such as Llama-2 or VertexAI. Our study aims to assess the performance of LLMs in the domain of hematopoietic stem cell transplantation.
Methods:
We modified and anonymized the clinical histories of six hematological patients. An experienced hematologist reviewed and validated these modified clinical histories, which included demographic data, past medical history, hematological disease features (genetic data and MRD when available), treatment responses, adverse events from previous therapies, and potential donor information (related/unrelated, HLA, CMV status).
We presented these clinical cases to six experienced bone marrow transplant physicians from two major JACIE-accredited hospitals and to 11 hematology residents from the University of Milano-Bicocca. The LLMs employed for the analysis were GPT-4, VertexAI PaLM 2, and Llama-2 (13b and 70b). The LLMs were configured with different temperature settings to control the randomness of token selection, always kept at low levels to obtain more deterministic responses.
A triple-blinded survey was conducted using Typeform, where both senior hematologists and residents provided anonymized responses with personal tokens. The senior hematologists, residents, and LLM testers were unaware of the responses provided by the other groups. We calculated Fleiss' kappa (K) and the overall percentage of agreement (OA) between residents and LLMs, taking the consensus answer (CoA) among experts to be their most frequent response. Subsequently, OA and K values for residents and LLMs were compared using t-tests or Mann-Whitney tests in GraphPad v10.0.1.
Results:
The results showed perfect agreement among experts in patient transplant eligibility assessment (K=1.0) and |
---|---|
ISSN: | 0006-4971 1528-0020 |
DOI: | 10.1182/blood-2023-185854 |
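
The agreement measures named in the Methods (consensus answer, overall percentage of agreement, Fleiss' kappa) can be illustrated with a minimal Python sketch. This is not the authors' code: the abstract does not state how the statistics were implemented, the function names are hypothetical, and the example answers are invented purely for illustration. The sketch assumes every item is rated by the same number of raters, as the standard Fleiss' kappa formula requires.

```python
from collections import Counter
import math


def consensus_answer(expert_answers):
    """Return the most frequent expert response (the CoA) for one question."""
    return Counter(expert_answers).most_common(1)[0][0]


def overall_agreement(rater_answers, consensus):
    """Fraction of one rater's answers that match the consensus answers."""
    matches = sum(a == c for a, c in zip(rater_answers, consensus))
    return matches / len(consensus)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumes the same number of raters for every item.
    Returns NaN when chance agreement is 1 (kappa is undefined).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j] = number of raters assigning item i to category j
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    if math.isclose(p_exp, 1.0):
        return float("nan")
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical data: three raters answering two eligibility questions.
example_ratings = [["yes", "yes", "yes"], ["no", "no", "no"]]
kappa = fleiss_kappa(example_ratings)
coa = [consensus_answer(item) for item in example_ratings]
oa = overall_agreement(["yes", "yes"], coa)
```

In this toy example all raters agree on every item, so the kappa is 1.0, while a rater answering "yes" to both questions matches the consensus on one of the two.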