
Size matters: How population size influences genotype–phenotype association studies in anonymized data


Bibliographic Details
Published in:Journal of biomedical informatics 2014-12, Vol.52, p.243-250
Main Authors: Heatherly, Raymond, Denny, Joshua C., Haines, Jonathan L., Roden, Dan M., Malin, Bradley A.
Format: Article
Language:English
cited_by cdi_FETCH-LOGICAL-c587t-38afb32383423c0fc6456687e880941271affc4495b9ca5f60c3f579825e07863
cites cdi_FETCH-LOGICAL-c587t-38afb32383423c0fc6456687e880941271affc4495b9ca5f60c3f579825e07863
container_end_page 250
container_issue
container_start_page 243
container_title Journal of biomedical informatics
container_volume 52
creator Heatherly, Raymond
Denny, Joshua C.
Haines, Jonathan L.
Roden, Dan M.
Malin, Bradley A.
description • Anonymization of large-scale clinical codes allows for reliable genome–phenome analysis. • Across various repository sizes, anonymizing the full EMR is most reliable. • Anonymization preserves utility for finding genome–phenome associations.
Electronic medical record (EMR) data are increasingly incorporated into genome–phenome association studies. Investigators hope to share these data, but there are concerns that they may be “re-identified” through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system, through which we evaluate the ability of researchers to learn from anonymized data for genome–phenome association studies under various conditions. We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome–phenome association study and compare the discoveries made using the protected data with those made using the original data, through the correlation (r²) of the p-values of association significance. Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than anonymizing small study-specific datasets on their own. When evaluated using the correlation of genome–phenome association strengths on anonymized data versus original data, across all nine simulated sites, results from the largest-scale anonymizations (population ∼100,000) retained better utility than those from smaller ones (population ∼6000–75,000). We observed a general trend of increasing r² for larger data set sizes: r² = 0.9481 for small-sized datasets, r² = 0.9493 for moderately-sized datasets, and r² = 0.9934 for large-sized datasets. This research implies that regardless of the overall size of an institution’s data, there may be significant benefits to anonymizing the entire EMR, even if the institution plans to release only data about a specific cohort of patients.
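The two ideas the description relies on, k-anonymity over sets of clinical codes and the r² utility measure over p-values, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `is_k_anonymous` and `r_squared`, the toy records, and the toy p-values are all hypothetical. The first function only checks whether a dataset already satisfies k-anonymity (it does not perform the anonymization itself), and the record does not state whether the study correlates raw or transformed p-values, so raw values are assumed here.

```python
from collections import Counter


def is_k_anonymous(records, k):
    """Return True if every distinct set of clinical codes is shared by at least k records."""
    counts = Counter(frozenset(codes) for codes in records)
    return all(n >= k for n in counts.values())


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Hypothetical toy data, not taken from the study.
records = [{"250.00", "401.9"}, {"250.00", "401.9"}, {"427.31"}, {"427.31"}]
p_original = [0.001, 0.02, 0.30, 0.75]
p_anonymized = [0.002, 0.03, 0.28, 0.80]

print(is_k_anonymous(records, k=2))
print(r_squared(p_original, p_anonymized))
```

An r² close to 1 would indicate that association results on the protected data track those on the original data closely, which is the sense in which the description speaks of retained utility.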
doi_str_mv 10.1016/j.jbi.2014.07.005
format article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4260994</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S153204641400152X</els_id><sourcerecordid>1660093156</sourcerecordid><originalsourceid>FETCH-LOGICAL-c587t-38afb32383423c0fc6456687e880941271affc4495b9ca5f60c3f579825e07863</originalsourceid><addsrcrecordid>eNqNkc-KFDEQh4Mo7rr6AF6kj16mrfxPKwiyqCsseFDPIZNOdjJ0J20nvTKefAff0Ccxw4yDXmRPKaivfqTqQ-gphhYDFi-27XYdWgKYtSBbAH4PnWNOyQqYgvunWrAz9CjnLQDGnIuH6IxwoIpzdo42n8J314ymFDfnl81V-tZMaVoGU0KKTd43Q_TD4qJ1ublxMZXd5H79-DltjnVjck42HAfK0ocKhtiYmOJurAF905tiHqMH3gzZPTm-F-jLu7efL69W1x_ff7h8c72yXMmyosr4NSVUUUaoBW8F40Io6ZSCjmEisfHeMtbxdWcN9wIs9Vx2inAHUgl6gV4fcqdlPbreulhmM-hpDqOZdzqZoP_txLDRN-lWMyKg61gNeH4MmNPXxeWix5CtGwYTXVqyxkIAdBRzcQeUyeqpHvsOKGWSMEa6iuIDaueU8-z86fMY9N673urqXe-9a5C6eq8zz_7e-jTxR3QFXh0AV29_G9yssw17qX2YnS26T-E_8b8BKoHANQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1634724429</pqid></control><display><type>article</type><title>Size matters: How population size influences genotype–phenotype association studies in anonymized data</title><source>ScienceDirect Freedom Collection</source><creator>Heatherly, Raymond ; Denny, Joshua C. ; Haines, Jonathan L. ; Roden, Dan M. ; Malin, Bradley A.</creator><creatorcontrib>Heatherly, Raymond ; Denny, Joshua C. ; Haines, Jonathan L. ; Roden, Dan M. ; Malin, Bradley A.</creatorcontrib><description>[Display omitted] •Anonymization of large-scale clinical codes allows for reliable genome–phenome analysis.•Across various repository sizes full EMR most reliable.•Preserves utility for finding genome–phenome associations. Electronic medical records (EMRs) data is increasingly incorporated into genome–phenome association studies. 
Investigators hope to share data, but there are concerns it may be “re-identified” through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system through which we evaluate the ability of researchers to learn from anonymized data for genome–phenome association studies under various conditions. We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome–phenome association study and compare the discoveries using the protected data and the original data through the correlation (r2) of the p-values of association significance. Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than small study-specific datasets anonymized unto themselves. When evaluated using the correlation of genome–phenome association strengths on anonymized data versus original data, all nine simulated sites, results from largest-scale anonymizations (population ∼100,000) retained better utility to those on smaller sizes (population ∼6000–75,000). We observed a general trend of increasing r2 for larger data set sizes: r2=0.9481 for small-sized datasets, r2=0.9493 for moderately-sized datasets, r2=0.9934 for large-sized datasets. 
This research implies that regardless of the overall size of an institution’s data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.</description><identifier>ISSN: 1532-0464</identifier><identifier>EISSN: 1532-0480</identifier><identifier>DOI: 10.1016/j.jbi.2014.07.005</identifier><identifier>PMID: 25038554</identifier><language>eng</language><publisher>United States: Elsevier Inc</publisher><subject>Algorithms ; Anonymization ; Biomedical Research - methods ; Clinical codes ; Computer Simulation ; Confidentiality ; Correlation ; Data publishing ; Databases, Genetic ; Electronic Health Records ; Genetic Association Studies - statistics &amp; numerical data ; Genotype ; Humans ; Phenotype ; Polymorphism, Single Nucleotide ; Privacy ; Releasing ; Sample Size ; Strategy ; Utilities</subject><ispartof>Journal of biomedical informatics, 2014-12, Vol.52, p.243-250</ispartof><rights>2014 Elsevier Inc.</rights><rights>Copyright © 2014 Elsevier Inc. All rights reserved.</rights><rights>2014 Elsevier Inc. All rights reserved. 
2014</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c587t-38afb32383423c0fc6456687e880941271affc4495b9ca5f60c3f579825e07863</citedby><cites>FETCH-LOGICAL-c587t-38afb32383423c0fc6456687e880941271affc4495b9ca5f60c3f579825e07863</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,314,776,780,881,27903,27904</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/25038554$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Heatherly, Raymond</creatorcontrib><creatorcontrib>Denny, Joshua C.</creatorcontrib><creatorcontrib>Haines, Jonathan L.</creatorcontrib><creatorcontrib>Roden, Dan M.</creatorcontrib><creatorcontrib>Malin, Bradley A.</creatorcontrib><title>Size matters: How population size influences genotype–phenotype association studies in anonymized data</title><title>Journal of biomedical informatics</title><addtitle>J Biomed Inform</addtitle><description>[Display omitted] •Anonymization of large-scale clinical codes allows for reliable genome–phenome analysis.•Across various repository sizes full EMR most reliable.•Preserves utility for finding genome–phenome associations. Electronic medical records (EMRs) data is increasingly incorporated into genome–phenome association studies. Investigators hope to share data, but there are concerns it may be “re-identified” through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. 
We systematically investigate this issue using a large-scale biorepository and EMR system through which we evaluate the ability of researchers to learn from anonymized data for genome–phenome association studies under various conditions. We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome–phenome association study and compare the discoveries using the protected data and the original data through the correlation (r2) of the p-values of association significance. Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than small study-specific datasets anonymized unto themselves. When evaluated using the correlation of genome–phenome association strengths on anonymized data versus original data, all nine simulated sites, results from largest-scale anonymizations (population ∼100,000) retained better utility to those on smaller sizes (population ∼6000–75,000). We observed a general trend of increasing r2 for larger data set sizes: r2=0.9481 for small-sized datasets, r2=0.9493 for moderately-sized datasets, r2=0.9934 for large-sized datasets. 
This research implies that regardless of the overall size of an institution’s data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.</description><subject>Algorithms</subject><subject>Anonymization</subject><subject>Biomedical Research - methods</subject><subject>Clinical codes</subject><subject>Computer Simulation</subject><subject>Confidentiality</subject><subject>Correlation</subject><subject>Data publishing</subject><subject>Databases, Genetic</subject><subject>Electronic Health Records</subject><subject>Genetic Association Studies - statistics &amp; numerical data</subject><subject>Genotype</subject><subject>Humans</subject><subject>Phenotype</subject><subject>Polymorphism, Single Nucleotide</subject><subject>Privacy</subject><subject>Releasing</subject><subject>Sample Size</subject><subject>Strategy</subject><subject>Utilities</subject><issn>1532-0464</issn><issn>1532-0480</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2014</creationdate><recordtype>article</recordtype><recordid>eNqNkc-KFDEQh4Mo7rr6AF6kj16mrfxPKwiyqCsseFDPIZNOdjJ0J20nvTKefAff0Ccxw4yDXmRPKaivfqTqQ-gphhYDFi-27XYdWgKYtSBbAH4PnWNOyQqYgvunWrAz9CjnLQDGnIuH6IxwoIpzdo42n8J314ymFDfnl81V-tZMaVoGU0KKTd43Q_TD4qJ1ublxMZXd5H79-DltjnVjck42HAfK0ocKhtiYmOJurAF905tiHqMH3gzZPTm-F-jLu7efL69W1x_ff7h8c72yXMmyosr4NSVUUUaoBW8F40Io6ZSCjmEisfHeMtbxdWcN9wIs9Vx2inAHUgl6gV4fcqdlPbreulhmM-hpDqOZdzqZoP_txLDRN-lWMyKg61gNeH4MmNPXxeWix5CtGwYTXVqyxkIAdBRzcQeUyeqpHvsOKGWSMEa6iuIDaueU8-z86fMY9N673urqXe-9a5C6eq8zz_7e-jTxR3QFXh0AV29_G9yssw17qX2YnS26T-E_8b8BKoHANQ</recordid><startdate>20141201</startdate><enddate>20141201</enddate><creator>Heatherly, Raymond</creator><creator>Denny, Joshua C.</creator><creator>Haines, Jonathan L.</creator><creator>Roden, Dan M.</creator><creator>Malin, Bradley A.</creator><general>Elsevier 
Inc</general><scope>6I.</scope><scope>AAFTH</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>7QO</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>7SC</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>5PM</scope></search><sort><creationdate>20141201</creationdate><title>Size matters: How population size influences genotype–phenotype association studies in anonymized data</title><author>Heatherly, Raymond ; Denny, Joshua C. ; Haines, Jonathan L. ; Roden, Dan M. ; Malin, Bradley A.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c587t-38afb32383423c0fc6456687e880941271affc4495b9ca5f60c3f579825e07863</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2014</creationdate><topic>Algorithms</topic><topic>Anonymization</topic><topic>Biomedical Research - methods</topic><topic>Clinical codes</topic><topic>Computer Simulation</topic><topic>Confidentiality</topic><topic>Correlation</topic><topic>Data publishing</topic><topic>Databases, Genetic</topic><topic>Electronic Health Records</topic><topic>Genetic Association Studies - statistics &amp; numerical data</topic><topic>Genotype</topic><topic>Humans</topic><topic>Phenotype</topic><topic>Polymorphism, Single Nucleotide</topic><topic>Privacy</topic><topic>Releasing</topic><topic>Sample Size</topic><topic>Strategy</topic><topic>Utilities</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Heatherly, Raymond</creatorcontrib><creatorcontrib>Denny, Joshua C.</creatorcontrib><creatorcontrib>Haines, Jonathan L.</creatorcontrib><creatorcontrib>Roden, Dan M.</creatorcontrib><creatorcontrib>Malin, Bradley A.</creatorcontrib><collection>ScienceDirect Open Access 
Titles</collection><collection>Elsevier:ScienceDirect:Open Access</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>Biotechnology Research Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Journal of biomedical informatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Heatherly, Raymond</au><au>Denny, Joshua C.</au><au>Haines, Jonathan L.</au><au>Roden, Dan M.</au><au>Malin, Bradley A.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Size matters: How population size influences genotype–phenotype association studies in anonymized data</atitle><jtitle>Journal of biomedical informatics</jtitle><addtitle>J Biomed Inform</addtitle><date>2014-12-01</date><risdate>2014</risdate><volume>52</volume><spage>243</spage><epage>250</epage><pages>243-250</pages><issn>1532-0464</issn><eissn>1532-0480</eissn><abstract>[Display omitted] •Anonymization of large-scale clinical codes allows for reliable genome–phenome analysis.•Across various repository sizes full EMR most reliable.•Preserves utility for finding 
genome–phenome associations. Electronic medical records (EMRs) data is increasingly incorporated into genome–phenome association studies. Investigators hope to share data, but there are concerns it may be “re-identified” through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system through which we evaluate the ability of researchers to learn from anonymized data for genome–phenome association studies under various conditions. We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome–phenome association study and compare the discoveries using the protected data and the original data through the correlation (r2) of the p-values of association significance. Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than small study-specific datasets anonymized unto themselves. When evaluated using the correlation of genome–phenome association strengths on anonymized data versus original data, all nine simulated sites, results from largest-scale anonymizations (population ∼100,000) retained better utility to those on smaller sizes (population ∼6000–75,000). We observed a general trend of increasing r2 for larger data set sizes: r2=0.9481 for small-sized datasets, r2=0.9493 for moderately-sized datasets, r2=0.9934 for large-sized datasets. 
This research implies that regardless of the overall size of an institution’s data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.</abstract><cop>United States</cop><pub>Elsevier Inc</pub><pmid>25038554</pmid><doi>10.1016/j.jbi.2014.07.005</doi><tpages>8</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1532-0464
ispartof Journal of biomedical informatics, 2014-12, Vol.52, p.243-250
issn 1532-0464
1532-0480
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4260994
source ScienceDirect Freedom Collection
subjects Algorithms
Anonymization
Biomedical Research - methods
Clinical codes
Computer Simulation
Confidentiality
Correlation
Data publishing
Databases, Genetic
Electronic Health Records
Genetic Association Studies - statistics & numerical data
Genotype
Humans
Phenotype
Polymorphism, Single Nucleotide
Privacy
Releasing
Sample Size
Strategy
Utilities
title Size matters: How population size influences genotype–phenotype association studies in anonymized data
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T12%3A08%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Size%20matters:%20How%20population%20size%20influences%20genotype%E2%80%93phenotype%20association%20studies%20in%20anonymized%20data&rft.jtitle=Journal%20of%20biomedical%20informatics&rft.au=Heatherly,%20Raymond&rft.date=2014-12-01&rft.volume=52&rft.spage=243&rft.epage=250&rft.pages=243-250&rft.issn=1532-0464&rft.eissn=1532-0480&rft_id=info:doi/10.1016/j.jbi.2014.07.005&rft_dat=%3Cproquest_pubme%3E1660093156%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c587t-38afb32383423c0fc6456687e880941271affc4495b9ca5f60c3f579825e07863%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1634724429&rft_id=info:pmid/25038554&rfr_iscdi=true