Automatic Personalized Impression Generation for PET Reports Using Large Language Models

Bibliographic Details
Published in: ArXiv.org, 2023-10-17
Main Authors: Tie, Xin; Shin, Muheon; Pirasteh, Ali; Ibrahim, Nevein; Huemann, Zachary; Castellino, Sharon M; Kelly, Kara M; Garrett, John; Hu, Junjie; Cho, Steve Y; Bradshaw, Tyler J
Format: Article
Language: English
PMID: 37904738
Description: To determine whether fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician's identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman's correlations (ρ = 0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P = 0.41). Personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
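Two steps of the described pipeline lend themselves to a short sketch: prepending a physician-identity token to the findings text, and ranking candidate evaluation metrics by Spearman's correlation with physician quality scores. This is a minimal illustration, not the paper's implementation; the token format and all numeric scores below are invented for the example.

```python
def build_input(findings: str, physician_id: str) -> str:
    """Prepend an identity token so a sequence-to-sequence model can
    condition its output on the reading physician's style. The token
    format is a hypothetical stand-in for the paper's encoding."""
    return f"<physician_{physician_id}> {findings}"


def spearman(a, b):
    """Spearman's rank correlation (assumes no tied values), as used
    conceptually to compare automatic metrics against physician scores."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Toy data: scores from one candidate metric vs. physician ratings.
metric_scores = [0.62, 0.41, 0.77, 0.55, 0.30]
physician_ratings = [4, 2, 5, 1, 3]
rho = spearman(metric_scores, physician_ratings)  # 0.6 on this toy data
```

In the study's scheme, each of the 30 candidate metrics would get such a ρ against the physicians' quality scores, and the best-correlated metrics (here, domain-adapted BARTScore and PEGASUSScore) would decide which fine-tuned model advances to expert evaluation.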
ISSN: 2331-8422