Abstract P068: A hybrid modelling approach for abstracting CT imaging indications by integrating natural language processing from radiology reports with structured data from electronic health records

Bibliographic Details
Published in: Cancer prevention research (Philadelphia, Pa.), 2023-01, Vol.16 (1_Supplement), p.P068-P068
Main Authors: Khan, Aparajita, Wu, Julie, Choi, Eunji, Graber-Naidich, Anna, Henry, Solomon, Wakelee, Heather A., Kurian, Allison W., Liang, Su-Ying, Leung, Ann, Langlotz, Curtis, Backhus, Leah M., Han, Summer S.
Format: Article
Language: English
Description
Summary:
Background: Real-world evidence (RWE) studies of surveillance patterns following lung cancer (LC) diagnosis can inform the optimization of surveillance recommendations and practice. One major obstacle in RWE studies of LC surveillance is the lack of a recorded radiologic imaging indication distinguishing surveillance from other reasons (e.g., symptoms). To enable RWE studies of surveillance to detect second primary lung cancer among LC survivors, we developed a hybrid modelling approach that integrates structured data from electronic health records (EHRs) with natural language processing (NLP) from radiology reports for abstracting computed tomography (CT) imaging indications in LC survivors.

Methods: We manually reviewed and abstracted CT imaging indications, i.e., surveillance vs. others (e.g., symptoms and metastatic disease follow-up), to create a gold standard from 200 randomly selected radiology reports among 1,952 LC patients (i) who were diagnosed in 2000-2017 at Stanford Health Care (SHC) and (ii) survived ≧5 years after the diagnosis. We abstracted medically relevant key-phrases using the part-of-speech grammar and PageRank algorithms. Hierarchical clustering identified context-specific key-phrase clusters as follows: “surveillance”, “stable”, “nodule”, “symptom”, and “metastasis”. The text-based radiology reports were vectorized to generate NLP features using phrase occurrence frequencies. The structured variables from EHRs included: (i) diagnosis of lung diseases or chest symptoms in the previous 6 months, (ii) ordering provider type (oncology vs. others [e.g., emergency and internal medicine]), and (iii) time from previous CT (≧6 months). A hybrid model was then fitted using logistic regression including both structured and NLP features and validated using 10-fold cross-validation. The model’s performance was compared with that of models based solely on NLP or structured data.

Results: The dataset of 200 radiology reports included 141 LC survivors (49% White, 72% adenocarcinoma). The proposed hybrid model showed high discrimination (AUC: 0.92), outperforming models based solely on NLP (AUC: 0.88) or structured data (AUC: 0.87). The proposed model demonstrated higher sensitivity (SN: 0.73) and specificity (SP: 0.96) versus models based solely on NLP (SN: 0.53; SP: 0.96) or structured data (SN: 0.53; SP: 0.90). The hybrid model showed that the following variables were positively associated with a higher likelihood that the given CT imaging indication is “surveillance”: (i) a longer time
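
Illustrative note: the feature construction and the sensitivity/specificity metrics described in the summary can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The five cluster names come from the summary, but the member phrases in PHRASE_CLUSTERS, the function names, and the argument names are all hypothetical. The sketch omits the key-phrase extraction, the clustering, and the logistic regression fit.

```python
import re
from typing import Dict, List, Sequence, Tuple

# Cluster names are taken from the summary; the member phrases are
# invented here purely for illustration (assumption).
PHRASE_CLUSTERS: Dict[str, List[str]] = {
    "surveillance": ["surveillance", "routine follow-up"],
    "stable": ["stable", "unchanged"],
    "nodule": ["nodule", "nodules"],
    "symptom": ["cough", "chest pain", "shortness of breath"],
    "metastasis": ["metastasis", "metastatic"],
}


def nlp_features(report: str) -> List[int]:
    """Count whole-word phrase occurrences per cluster in one report."""
    text = report.lower()
    features = []
    for phrases in PHRASE_CLUSTERS.values():
        count = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
            for phrase in phrases
        )
        features.append(count)
    return features


def hybrid_features(
    report: str,
    recent_lung_dx: bool,      # lung disease / chest symptom dx in previous 6 months
    oncology_order: bool,      # ordering provider is oncology (vs. others)
    months_since_ct: float,    # time from previous CT, in months
) -> List[int]:
    """Concatenate NLP counts with the three structured EHR indicators."""
    structured = [
        int(recent_lung_dx),
        int(oncology_order),
        int(months_since_ct >= 6),
    ]
    return nlp_features(report) + structured


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity), treating 1 as 'surveillance'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec
```

In a setup like this, each report would yield an eight-element vector (five phrase-cluster counts followed by three structured indicators) that could then be passed to any logistic regression routine and scored fold by fold.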
ISSN: 1940-6215
DOI: 10.1158/1940-6215.PrecPrev22-P068