Loading…

Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making

Introduction: Large Language Models (LLMs) are a form of Artificial Intelligence (AI), by identifying patterns and connections within data, they can predict the most likely words or phrases in specific contexts. Previous studies have indicated that GPT (Generative Pre-trained Transformer; OpenAI) pe...

Full description

Saved in:
Bibliographic Details
Published in:Blood 2023-11, Vol.142 (Supplement 1), p.3726-3726
Main Authors: Civettini, Ivan, Zappaterra, Arianna, Ramazzotti, Daniele, Granelli, Bianca Maria, Rindone, Giovanni, Aroldi, Andrea, Bonfanti, Stefano, Colombo, Federica, Fedele, Marilena, Grillo, Giovanni, Parma, Matteo, Perfetti, Paola, Terruzzi, Elisabetta, Gambacorti-Passerini, Carlo, Cavalca, Fabrizio
Format: Article
Language:English
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Introduction: Large Language Models (LLMs) are a form of Artificial Intelligence (AI), by identifying patterns and connections within data, they can predict the most likely words or phrases in specific contexts. Previous studies have indicated that GPT (Generative Pre-trained Transformer; OpenAI) performs well in answering single-choice clinical questions. However, its performance seems to be less satisfactory when dealing with multiple-choice questions and more intricate clinical cases (Cosima et al. 2023 EAO; Cascella et al. 2023 J Med Syst). Notably, no study has evaluated LLMs responses in the context of Transplantation Decision Making, a complex process heavily reliant on physician expertise. Additionally, most studies focused solely on GPT's performance, without considering other competitive LLMs like Llama-2 or VertexAI. Our study aims to assess the performance of LLMs in the domain of hematopoietic stem cell transplantation. Methods: We modified and anonymized the clinical histories of six hematological patients. An experienced hematologist reviewed and validated these modified clinical histories, which included demographic data, past medical history, hematology disease features (genetic data and MRD when available), treatment responses, adverse events from previous therapies, and potential donor information (related/unrelated, HLA, CMV status). We presented these clinical cases to six experienced bone marrow transplant physicians from two major JACIE accredited hospitals and 11 hematology residents from the University Milano-Bicocca. LLMs employed for the analysis were: GPT-4, VertexAI Palm 2, Llama-2 13b and 70b. LLMs were configured with different temperature settings to control token selection randomness, always maintaining low levels for more deterministic responses. A triple-blinded survey was conducted using Typeform, where both senior hematologists and residents provided anonymized responses with personal tokens. The senior hematologists, residents, and LLMs testers were unaware of the responses provided by the other groups. We calculated Fleiss K (K) and overall percentage of agreement (OA) between residents and LLMs, considering the consensus answer (CoA) among experts as the most frequent response. Subsequently, OA and K values for both residents and LLMs were compared using T- or Mann-Whitney tests with Graphpad v 10.0.1. Results: The results showed perfect agreement among experts in patient transplant eligibility assessment (K=1.0) and
ISSN:0006-4971
1528-0020
DOI:10.1182/blood-2023-185854