Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research

Objective: Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. Machine learning has been used to address this, but...

Bibliographic Details
Published in: arXiv.org 2020-01
Main Authors: Birnbaum, Benjamin; Nussbaum, Nathan; Seidl-Rathkopf, Katharina; Agrawal, Monica; Estevez, Melissa; Estola, Evan; Haimson, Joshua; He, Lucy; Larson, Peter; Richardson, Paul
Format: Article
Language: English
Subjects:
Online Access: Get full text
cited_by
cites
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Birnbaum, Benjamin
Nussbaum, Nathan
Seidl-Rathkopf, Katharina
Agrawal, Monica
Estevez, Melissa
Estola, Evan
Haimson, Joshua
He, Lucy
Larson, Peter
Richardson, Paul
description Objective: Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. Machine learning has been used to address this, but even high-performing algorithms may select patients in a non-random manner and bias the resulting cohort. To improve the efficiency of cohort selection while measuring potential bias, we introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis and apply it to the selection of metastatic breast cancer (mBC) patients. Materials and Methods: We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression. We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis. We compared the cohort generated by MACS to the cohort that would have been generated without MACS as reference standard, first by comparing distributions of an extensive set of clinical and demographic variables and then by comparing the results of two analyses addressing existing example research questions. Results: Our algorithm had an area under the curve (AUC) of 0.976, a sensitivity of 96.0%, and an abstraction efficiency gain of 77.9%. During Bias Analysis, we found no large differences in baseline characteristics and no differences in the example analyses. Conclusion: MACS with bias analysis can significantly improve the efficiency of cohort selection on EHR data while instilling confidence that outcomes research performed on the resulting cohort will not be biased.
format article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2347070628</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2347070628</sourcerecordid><originalsourceid>FETCH-proquest_journals_23470706283</originalsourceid><addsrcrecordid>eNqNzcFqAjEQxvFQKLi0vsOA54U0Ude7bNlLL9K7jHHcjcSMncnS-vYuxQfw9F1-H_8XUznvP-rN0rmZmauerbVu3bjVylfm74uPlGpUjVroCIEHlgJKiUKJnOE3lgEOERUwY7pNDE4s0FMmwRJzDwmlp1oDJnrcJyJ8gTIQtN3u33MOnLi_gZASShjezesJk9L8sW9m8dl-b7v6Kvwzkpb9mUeZkrp3ftnYxq7dxj-n7s6ETjw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2347070628</pqid></control><display><type>article</type><title>Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research</title><source>Publicly Available Content (ProQuest)</source><creator>Birnbaum, Benjamin ; Nussbaum, Nathan ; Seidl-Rathkopf, Katharina ; Agrawal, Monica ; Estevez, Melissa ; Estola, Evan ; Haimson, Joshua ; He, Lucy ; Larson, Peter ; Richardson, Paul</creator><creatorcontrib>Birnbaum, Benjamin ; Nussbaum, Nathan ; Seidl-Rathkopf, Katharina ; Agrawal, Monica ; Estevez, Melissa ; Estola, Evan ; Haimson, Joshua ; He, Lucy ; Larson, Peter ; Richardson, Paul</creatorcontrib><description>Objective Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. Machine learning has been used to address this, but even high-performing algorithms may select patients in a non-random manner and bias the resulting cohort. To improve the efficiency of cohort selection while measuring potential bias, we introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis and apply it to the selection of metastatic breast cancer (mBC) patients. 
Materials and Methods We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression. We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis. We compared the cohort generated by MACS to the cohort that would have been generated without MACS as reference standard, first by comparing distributions of an extensive set of clinical and demographic variables and then by comparing the results of two analyses addressing existing example research questions. Results Our algorithm had an area under the curve (AUC) of 0.976, a sensitivity of 96.0%, and an abstraction efficiency gain of 77.9%. During Bias Analysis, we found no large differences in baseline characteristics and no differences in the example analyses. Conclusion MACS with bias analysis can significantly improve the efficiency of cohort selection on EHR data while instilling confidence that outcomes research performed on the resulting cohort will not be biased.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Bias ; Demographic variables ; Demographics ; Efficiency ; Electronic health records ; Machine learning ; Regression analysis ; Unstructured data</subject><ispartof>arXiv.org, 2020-01</ispartof><rights>2020. This work is published under http://creativecommons.org/licenses/by-nc-sa/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/2347070628?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>780,784,25753,37012,44590</link.rule.ids></links><search><creatorcontrib>Birnbaum, Benjamin</creatorcontrib><creatorcontrib>Nussbaum, Nathan</creatorcontrib><creatorcontrib>Seidl-Rathkopf, Katharina</creatorcontrib><creatorcontrib>Agrawal, Monica</creatorcontrib><creatorcontrib>Estevez, Melissa</creatorcontrib><creatorcontrib>Estola, Evan</creatorcontrib><creatorcontrib>Haimson, Joshua</creatorcontrib><creatorcontrib>He, Lucy</creatorcontrib><creatorcontrib>Larson, Peter</creatorcontrib><creatorcontrib>Richardson, Paul</creatorcontrib><title>Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research</title><title>arXiv.org</title><description>Objective Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. Machine learning has been used to address this, but even high-performing algorithms may select patients in a non-random manner and bias the resulting cohort. To improve the efficiency of cohort selection while measuring potential bias, we introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis and apply it to the selection of metastatic breast cancer (mBC) patients. 
Materials and Methods We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression. We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis. We compared the cohort generated by MACS to the cohort that would have been generated without MACS as reference standard, first by comparing distributions of an extensive set of clinical and demographic variables and then by comparing the results of two analyses addressing existing example research questions. Results Our algorithm had an area under the curve (AUC) of 0.976, a sensitivity of 96.0%, and an abstraction efficiency gain of 77.9%. During Bias Analysis, we found no large differences in baseline characteristics and no differences in the example analyses. Conclusion MACS with bias analysis can significantly improve the efficiency of cohort selection on EHR data while instilling confidence that outcomes research performed on the resulting cohort will not be biased.</description><subject>Algorithms</subject><subject>Bias</subject><subject>Demographic variables</subject><subject>Demographics</subject><subject>Efficiency</subject><subject>Electronic health records</subject><subject>Machine learning</subject><subject>Regression analysis</subject><subject>Unstructured data</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNzcFqAjEQxvFQKLi0vsOA54U0Ude7bNlLL9K7jHHcjcSMncnS-vYuxQfw9F1-H_8XUznvP-rN0rmZmauerbVu3bjVylfm74uPlGpUjVroCIEHlgJKiUKJnOE3lgEOERUwY7pNDE4s0FMmwRJzDwmlp1oDJnrcJyJ8gTIQtN3u33MOnLi_gZASShjezesJk9L8sW9m8dl-b7v6Kvwzkpb9mUeZkrp3ftnYxq7dxj-n7s6ETjw</recordid><startdate>20200113</startdate><enddate>20200113</enddate><creator>Birnbaum, Benjamin</creator><creator>Nussbaum, Nathan</creator><creator>Seidl-Rathkopf, Katharina</creator><creator>Agrawal, Monica</creator><creator>Estevez, 
Melissa</creator><creator>Estola, Evan</creator><creator>Haimson, Joshua</creator><creator>He, Lucy</creator><creator>Larson, Peter</creator><creator>Richardson, Paul</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20200113</creationdate><title>Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research</title><author>Birnbaum, Benjamin ; Nussbaum, Nathan ; Seidl-Rathkopf, Katharina ; Agrawal, Monica ; Estevez, Melissa ; Estola, Evan ; Haimson, Joshua ; He, Lucy ; Larson, Peter ; Richardson, Paul</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_23470706283</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Bias</topic><topic>Demographic variables</topic><topic>Demographics</topic><topic>Efficiency</topic><topic>Electronic health records</topic><topic>Machine learning</topic><topic>Regression analysis</topic><topic>Unstructured data</topic><toplevel>online_resources</toplevel><creatorcontrib>Birnbaum, Benjamin</creatorcontrib><creatorcontrib>Nussbaum, Nathan</creatorcontrib><creatorcontrib>Seidl-Rathkopf, Katharina</creatorcontrib><creatorcontrib>Agrawal, Monica</creatorcontrib><creatorcontrib>Estevez, Melissa</creatorcontrib><creatorcontrib>Estola, Evan</creatorcontrib><creatorcontrib>Haimson, Joshua</creatorcontrib><creatorcontrib>He, Lucy</creatorcontrib><creatorcontrib>Larson, Peter</creatorcontrib><creatorcontrib>Richardson, 
Paul</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content (ProQuest)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Birnbaum, Benjamin</au><au>Nussbaum, Nathan</au><au>Seidl-Rathkopf, Katharina</au><au>Agrawal, Monica</au><au>Estevez, Melissa</au><au>Estola, Evan</au><au>Haimson, Joshua</au><au>He, Lucy</au><au>Larson, Peter</au><au>Richardson, Paul</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research</atitle><jtitle>arXiv.org</jtitle><date>2020-01-13</date><risdate>2020</risdate><eissn>2331-8422</eissn><abstract>Objective Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. 
Machine learning has been used to address this, but even high-performing algorithms may select patients in a non-random manner and bias the resulting cohort. To improve the efficiency of cohort selection while measuring potential bias, we introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis and apply it to the selection of metastatic breast cancer (mBC) patients. Materials and Methods We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression. We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis. We compared the cohort generated by MACS to the cohort that would have been generated without MACS as reference standard, first by comparing distributions of an extensive set of clinical and demographic variables and then by comparing the results of two analyses addressing existing example research questions. Results Our algorithm had an area under the curve (AUC) of 0.976, a sensitivity of 96.0%, and an abstraction efficiency gain of 77.9%. During Bias Analysis, we found no large differences in baseline characteristics and no differences in the example analyses. Conclusion MACS with bias analysis can significantly improve the efficiency of cohort selection on EHR data while instilling confidence that outcomes research performed on the resulting cohort will not be biased.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2020-01
issn 2331-8422
language eng
recordid cdi_proquest_journals_2347070628
source Publicly Available Content (ProQuest)
subjects Algorithms
Bias
Demographic variables
Demographics
Efficiency
Electronic health records
Machine learning
Regression analysis
Unstructured data
title Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T21%3A09%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Model-assisted%20cohort%20selection%20with%20bias%20analysis%20for%20generating%20large-scale%20cohorts%20from%20the%20EHR%20for%20oncology%20research&rft.jtitle=arXiv.org&rft.au=Birnbaum,%20Benjamin&rft.date=2020-01-13&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2347070628%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_23470706283%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2347070628&rft_id=info:pmid/&rfr_iscdi=true