Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework

Bibliographic Details
Published in:	NPJ Digital Medicine, 2024-04, Vol. 7 (1), Article 102
Main Authors:	Kresevic, Simone, Giuffrè, Mauro, Ajcevic, Milos, Accardo, Agostino, Crocè, Lory S., Shung, Dennis L.
Format:	Article
Language:	English
Subjects:	692/308/575; 692/700/1538; Accuracy; Biomedicine; Biotechnology; Chatbots; Clinical decision making; Clinical practice guidelines; Decision support systems; Engineering; Experiments; Hallucinations; Hepatitis C; Large language models; Medical research; Medicine; Medicine & Public Health; Statistical significance

Description
Summary:	Large language models (LLMs) can potentially transform healthcare, particularly in providing the right information to the right provider at the right time in the hospital workflow. This study investigates the integration of LLMs into healthcare, specifically focusing on improving clinical decision support systems (CDSSs) through accurate interpretation of medical guidelines for chronic Hepatitis C Virus infection management. Utilizing OpenAI’s GPT-4 Turbo model, we developed a customized LLM framework that incorporates retrieval augmented generation (RAG) and prompt engineering. Our framework involved guideline conversion into the best-structured format that can be efficiently processed by LLMs to provide the most accurate output. An ablation study was conducted to evaluate the impact of different formatting and learning strategies on the LLM’s answer generation accuracy. The baseline GPT-4 Turbo model’s performance was compared against five experimental setups with increasing levels of complexity: inclusion of in-context guidelines, guideline reformatting, and implementation of few-shot learning. Our primary outcome was the qualitative assessment of accuracy based on expert review, while secondary outcomes included the quantitative measurement of similarity of LLM-generated responses to expert-provided answers using text-similarity scores. The results showed a significant improvement in accuracy from 43 to 99% ( p
ISSN:	2398-6352
DOI:	10.1038/s41746-024-01091-y