Combining knowledge- and data-driven methods for de-identification of clinical narratives

Bibliographic Details
Published in: Journal of biomedical informatics 2015-12, Vol.58 (Suppl), p.S53-S59
Main Authors: Dehghan, Azad, Kovacevic, Aleksandar, Karystianis, George, Keane, John A., Nenadic, Goran
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c517t-7ae2decd902cb1ca4ac6787d902548a7da8ee3f822f6bef0713366a789ab813f3
cites cdi_FETCH-LOGICAL-c517t-7ae2decd902cb1ca4ac6787d902548a7da8ee3f822f6bef0713366a789ab813f3
container_end_page S59
container_issue Suppl
container_start_page S53
container_title Journal of biomedical informatics
container_volume 58
creator Dehghan, Azad
Kovacevic, Aleksandar
Karystianis, George
Keane, John A.
Nenadic, Goran
description • We present a method for automatic de-identification of clinical narratives. • We propose and validate a two-pass tagging method to improve PHI entity recognition. • We have shown that automated de-identification is comparable to the human benchmark. A recent promise of large-scale access to unstructured clinical data from electronic health records has revitalized interest in automated de-identification of clinical notes, which includes identifying mentions of Protected Health Information (PHI). We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI, defined by 25 entity types, in longitudinal clinical narratives. Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified with high confidence in the first pass; this dictionary is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, mentions of Organizations and Professions were particularly challenging. Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information, thus providing a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.
doi_str_mv 10.1016/j.jbi.2015.06.029
format article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4976126</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S1532046415001392</els_id><sourcerecordid>1785230209</sourcerecordid><originalsourceid>FETCH-LOGICAL-c517t-7ae2decd902cb1ca4ac6787d902548a7da8ee3f822f6bef0713366a789ab813f3</originalsourceid><addsrcrecordid>eNqNkUuPFCEUhYnROOPoD3BjWLqpEqjiFRMT0_GVTOJGF64IBZce2ioYobqN_146PXZ0Y2Z1uZfvnhw4CD2npKeEile7fjfFnhHKeyJ6wvQDdEn5wDoyKvLwfBbjBXpS644QSjkXj9EFE4ySgetL9G2TlymmmLb4e8o_Z_Bb6LBNHnu72s6XeICEF1hvsq845II9dNFDWmOIzq4xJ5wDdnPTcHbGyZbSpgeoT9GjYOcKz-7qFfr6_t2Xzcfu-vOHT5u3153jVK6dtMA8OK8JcxN1drROSCWPPR-Vld4qgCEoxoKYIBBJh0EIK5W2k6JDGK7Qm5Pu7X5awLtmrdjZ3Ja42PLLZBvNvzcp3phtPphRS0GZaAIv7wRK_rGHupolVgfzbBPkfTVUKkH1qJi6D8rZQBjR90BHrTVvHhpKT6grudYC4WyeEnMM2uxMC9ocgzZEmBZ023nx96vPG3-SbcDrEwDt7w8RiqkuQnLgYwG3Gp_jf-R_A9cmukY</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1749995976</pqid></control><display><type>article</type><title>Combining knowledge- and data-driven methods for de-identification of clinical narratives</title><source>Elsevier</source><creator>Dehghan, Azad ; Kovacevic, Aleksandar ; Karystianis, George ; Keane, John A. ; Nenadic, Goran</creator><creatorcontrib>Dehghan, Azad ; Kovacevic, Aleksandar ; Karystianis, George ; Keane, John A. ; Nenadic, Goran</creatorcontrib><description>[Display omitted] •We present a method for automatic de-identification of clinical narratives.•We propose and validate a two-pass tagging method to improve PHI entity recognition.•We have shown that automated de-identification is comparable to human benchmark. 
A recent promise to access unstructured clinical data from electronic health records on large-scale has revitalized the interest in automated de-identification of clinical notes, which includes the identification of mentions of Protected Health Information (PHI). We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI defined by 25 entity types in longitudinal clinical narratives. Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified in the first step with high confidence, which is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved the overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, particularly challenging were mentions of Organizations and Professions. 
Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information and thus providing a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.</description><identifier>ISSN: 1532-0464</identifier><identifier>EISSN: 1532-0480</identifier><identifier>DOI: 10.1016/j.jbi.2015.06.029</identifier><identifier>PMID: 26210359</identifier><language>eng</language><publisher>United States: Elsevier Inc</publisher><subject>Automation ; Clinical text mining ; Cohort Studies ; Computer Security ; Computer Simulation ; Confidence ; Confidentiality ; Data Mining - methods ; De-identification ; Dictionaries ; Electronic health record ; Electronic Health Records - organization &amp; administration ; Information dissemination ; Information extraction ; Machine Learning ; Models, Statistical ; Named entity recognition ; Narration ; Narratives ; Natural Language Processing ; Pattern Recognition, Automated - methods ; Profession ; Texts ; United Kingdom ; Unstructured data ; Vocabulary, Controlled</subject><ispartof>Journal of biomedical informatics, 2015-12, Vol.58 (Suppl), p.S53-S59</ispartof><rights>2015 Elsevier Inc.</rights><rights>Copyright © 2015 Elsevier Inc. 
All rights reserved.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c517t-7ae2decd902cb1ca4ac6787d902548a7da8ee3f822f6bef0713366a789ab813f3</citedby><cites>FETCH-LOGICAL-c517t-7ae2decd902cb1ca4ac6787d902548a7da8ee3f822f6bef0713366a789ab813f3</cites><orcidid>0000-0003-3491-361X ; 0000-0003-0795-5363 ; 0000-0001-7000-2835</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,314,776,780,881,27903,27904</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/26210359$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Dehghan, Azad</creatorcontrib><creatorcontrib>Kovacevic, Aleksandar</creatorcontrib><creatorcontrib>Karystianis, George</creatorcontrib><creatorcontrib>Keane, John A.</creatorcontrib><creatorcontrib>Nenadic, Goran</creatorcontrib><title>Combining knowledge- and data-driven methods for de-identification of clinical narratives</title><title>Journal of biomedical informatics</title><addtitle>J Biomed Inform</addtitle><description>[Display omitted] •We present a method for automatic de-identification of clinical narratives.•We propose and validate a two-pass tagging method to improve PHI entity recognition.•We have shown that automated de-identification is comparable to human benchmark. A recent promise to access unstructured clinical data from electronic health records on large-scale has revitalized the interest in automated de-identification of clinical notes, which includes the identification of mentions of Protected Health Information (PHI). We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI defined by 25 entity types in longitudinal clinical narratives. 
Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified in the first step with high confidence, which is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved the overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, particularly challenging were mentions of Organizations and Professions. Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information and thus providing a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.</description><subject>Automation</subject><subject>Clinical text mining</subject><subject>Cohort Studies</subject><subject>Computer Security</subject><subject>Computer Simulation</subject><subject>Confidence</subject><subject>Confidentiality</subject><subject>Data Mining - methods</subject><subject>De-identification</subject><subject>Dictionaries</subject><subject>Electronic health record</subject><subject>Electronic Health Records - organization &amp; administration</subject><subject>Information dissemination</subject><subject>Information extraction</subject><subject>Machine Learning</subject><subject>Models, Statistical</subject><subject>Named entity recognition</subject><subject>Narration</subject><subject>Narratives</subject><subject>Natural Language Processing</subject><subject>Pattern Recognition, Automated - methods</subject><subject>Profession</subject><subject>Texts</subject><subject>United Kingdom</subject><subject>Unstructured 
data</subject><subject>Vocabulary, Controlled</subject><issn>1532-0464</issn><issn>1532-0480</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><recordid>eNqNkUuPFCEUhYnROOPoD3BjWLqpEqjiFRMT0_GVTOJGF64IBZce2ioYobqN_146PXZ0Y2Z1uZfvnhw4CD2npKeEile7fjfFnhHKeyJ6wvQDdEn5wDoyKvLwfBbjBXpS644QSjkXj9EFE4ySgetL9G2TlymmmLb4e8o_Z_Bb6LBNHnu72s6XeICEF1hvsq845II9dNFDWmOIzq4xJ5wDdnPTcHbGyZbSpgeoT9GjYOcKz-7qFfr6_t2Xzcfu-vOHT5u3153jVK6dtMA8OK8JcxN1drROSCWPPR-Vld4qgCEoxoKYIBBJh0EIK5W2k6JDGK7Qm5Pu7X5awLtmrdjZ3Ja42PLLZBvNvzcp3phtPphRS0GZaAIv7wRK_rGHupolVgfzbBPkfTVUKkH1qJi6D8rZQBjR90BHrTVvHhpKT6grudYC4WyeEnMM2uxMC9ocgzZEmBZ023nx96vPG3-SbcDrEwDt7w8RiqkuQnLgYwG3Gp_jf-R_A9cmukY</recordid><startdate>20151201</startdate><enddate>20151201</enddate><creator>Dehghan, Azad</creator><creator>Kovacevic, Aleksandar</creator><creator>Karystianis, George</creator><creator>Keane, John A.</creator><creator>Nenadic, Goran</creator><general>Elsevier Inc</general><scope>6I.</scope><scope>AAFTH</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>7QO</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>7SC</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0003-3491-361X</orcidid><orcidid>https://orcid.org/0000-0003-0795-5363</orcidid><orcidid>https://orcid.org/0000-0001-7000-2835</orcidid></search><sort><creationdate>20151201</creationdate><title>Combining knowledge- and data-driven methods for de-identification of clinical narratives</title><author>Dehghan, Azad ; Kovacevic, Aleksandar ; Karystianis, George ; Keane, John A. 
; Nenadic, Goran</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c517t-7ae2decd902cb1ca4ac6787d902548a7da8ee3f822f6bef0713366a789ab813f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Automation</topic><topic>Clinical text mining</topic><topic>Cohort Studies</topic><topic>Computer Security</topic><topic>Computer Simulation</topic><topic>Confidence</topic><topic>Confidentiality</topic><topic>Data Mining - methods</topic><topic>De-identification</topic><topic>Dictionaries</topic><topic>Electronic health record</topic><topic>Electronic Health Records - organization &amp; administration</topic><topic>Information dissemination</topic><topic>Information extraction</topic><topic>Machine Learning</topic><topic>Models, Statistical</topic><topic>Named entity recognition</topic><topic>Narration</topic><topic>Narratives</topic><topic>Natural Language Processing</topic><topic>Pattern Recognition, Automated - methods</topic><topic>Profession</topic><topic>Texts</topic><topic>United Kingdom</topic><topic>Unstructured data</topic><topic>Vocabulary, Controlled</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Dehghan, Azad</creatorcontrib><creatorcontrib>Kovacevic, Aleksandar</creatorcontrib><creatorcontrib>Karystianis, George</creatorcontrib><creatorcontrib>Keane, John A.</creatorcontrib><creatorcontrib>Nenadic, Goran</creatorcontrib><collection>ScienceDirect Open Access Titles</collection><collection>Elsevier:ScienceDirect:Open Access</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>Biotechnology Research Abstracts</collection><collection>Technology Research 
Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Journal of biomedical informatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Dehghan, Azad</au><au>Kovacevic, Aleksandar</au><au>Karystianis, George</au><au>Keane, John A.</au><au>Nenadic, Goran</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Combining knowledge- and data-driven methods for de-identification of clinical narratives</atitle><jtitle>Journal of biomedical informatics</jtitle><addtitle>J Biomed Inform</addtitle><date>2015-12-01</date><risdate>2015</risdate><volume>58</volume><issue>Suppl</issue><spage>S53</spage><epage>S59</epage><pages>S53-S59</pages><issn>1532-0464</issn><eissn>1532-0480</eissn><abstract>[Display omitted] •We present a method for automatic de-identification of clinical narratives.•We propose and validate a two-pass tagging method to improve PHI entity recognition.•We have shown that automated de-identification is comparable to human benchmark. A recent promise to access unstructured clinical data from electronic health records on large-scale has revitalized the interest in automated de-identification of clinical notes, which includes the identification of mentions of Protected Health Information (PHI). 
We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI defined by 25 entity types in longitudinal clinical narratives. Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified in the first step with high confidence, which is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved the overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, particularly challenging were mentions of Organizations and Professions. Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information and thus providing a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies.</abstract><cop>United States</cop><pub>Elsevier Inc</pub><pmid>26210359</pmid><doi>10.1016/j.jbi.2015.06.029</doi><orcidid>https://orcid.org/0000-0003-3491-361X</orcidid><orcidid>https://orcid.org/0000-0003-0795-5363</orcidid><orcidid>https://orcid.org/0000-0001-7000-2835</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1532-0464
ispartof Journal of biomedical informatics, 2015-12, Vol.58 (Suppl), p.S53-S59
issn 1532-0464
1532-0480
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4976126
source Elsevier
subjects Automation
Clinical text mining
Cohort Studies
Computer Security
Computer Simulation
Confidence
Confidentiality
Data Mining - methods
De-identification
Dictionaries
Electronic health record
Electronic Health Records - organization & administration
Information dissemination
Information extraction
Machine Learning
Models, Statistical
Named entity recognition
Narration
Narratives
Natural Language Processing
Pattern Recognition, Automated - methods
Profession
Texts
United Kingdom
Unstructured data
Vocabulary, Controlled
title Combining knowledge- and data-driven methods for de-identification of clinical narratives
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T05%3A37%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Combining%20knowledge-%20and%20data-driven%20methods%20for%20de-identification%20of%20clinical%20narratives&rft.jtitle=Journal%20of%20biomedical%20informatics&rft.au=Dehghan,%20Azad&rft.date=2015-12-01&rft.volume=58&rft.issue=Suppl&rft.spage=S53&rft.epage=S59&rft.pages=S53-S59&rft.issn=1532-0464&rft.eissn=1532-0480&rft_id=info:doi/10.1016/j.jbi.2015.06.029&rft_dat=%3Cproquest_pubme%3E1785230209%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c517t-7ae2decd902cb1ca4ac6787d902548a7da8ee3f822f6bef0713366a789ab813f3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1749995976&rft_id=info:pmid/26210359&rfr_iscdi=true