Scalable probabilistic PCA for large-scale genetic variation data

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal compon...

Bibliographic Details
Published in: PLoS genetics 2020-05, Vol.16 (5), p.e1008773-e1008773
Main Authors: Agrawal, Aman, Chiu, Alec M, Le, Minh, Halperin, Eran, Sankararaman, Sriram
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353
cites cdi_FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353
container_end_page e1008773
container_issue 5
container_start_page e1008773
container_title PLoS genetics
container_volume 16
creator Agrawal, Aman
Chiu, Alec M
Le, Minh
Halperin, Eran
Sankararaman, Sriram
description Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
doi_str_mv 10.1371/journal.pgen.1008773
format article
fullrecord <record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2479455466</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A632949411</galeid><doaj_id>oai_doaj_org_article_e5d0f67472ce459d9b3bfe1aef6cc88e</doaj_id><sourcerecordid>A632949411</sourcerecordid><originalsourceid>FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353</originalsourceid><addsrcrecordid>eNqVk99v0zAQxyMEYmPwHyCohITgocWJ7Th5QaoqflSaGGITr9bFOaeu3Lizkwn-e5w1mxq0B5AfbJ0_9707ny9JXqZkkVKRfti63rdgF_sG20VKSCEEfZScppzTuWCEPT46nyTPQtgSQnlRiqfJCc1YXhZlfposLxVYqCzO9t5VUBlrQmfU7PtqOdPOzyz4BuchQjiLgXC4uwFvoDOundXQwfPkiQYb8MW4nyVXnz9drb7Ozy--rFfL87kSZdbNqaIZpwIIg0IBQl2moKtcMMF1jkUNVFdlSkTGal4JlbKs0jlDQTHnmnJ6lrw-yO6tC3IsPsiMiZJxzvI8EusDUTvYyr03O_C_pQMjbw3ONxJ8zN-iRF4TPcTOFDJe1mVFK40poM6VKgqMWh_HaH21w1ph23mwE9HpTWs2snE3UmRFzm_TfTcKeHfdY-jkzgSF1kKLrh_yJkVGGBM0om_-Qh-ubqSa2AppWu1iXDWIymVOs5KVLE0jtXiAiqvGnVGuRW2ifeLwfuIQmQ5_dQ30Icj15Y__YL_9O3vxc8q-PWI3CLbbBGf74YuFKcgOoPIuBI_6viEpkcNQ3L2cHIZCjkMR3V4dN_Pe6W4K6B9TgAWb</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2479455466</pqid></control><display><type>article</type><title>Scalable probabilistic PCA for large-scale genetic variation data</title><source>Publicly Available Content (ProQuest)</source><source>PubMed Central</source><creator>Agrawal, Aman ; Chiu, Alec M ; Le, Minh ; Halperin, Eran ; Sankararaman, Sriram</creator><contributor>Gravel, Simon</contributor><creatorcontrib>Agrawal, Aman ; Chiu, Alec M ; Le, Minh ; Halperin, Eran ; Sankararaman, Sriram ; Gravel, Simon</creatorcontrib><description>Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). 
With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.</description><identifier>ISSN: 1553-7404</identifier><identifier>ISSN: 1553-7390</identifier><identifier>EISSN: 1553-7404</identifier><identifier>DOI: 10.1371/journal.pgen.1008773</identifier><identifier>PMID: 32469896</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Accuracy ; Adaptor Proteins, Signal Transducing - genetics ; Algorithms ; Biobanks ; Biological Specimen Banks ; Biology and Life Sciences ; Computational Biology - methods ; Computer applications ; Datasets ; European Continental Ancestry Group - genetics ; Genetic analysis ; Genetic diversity ; Genetic variation ; Genetics, Population ; Genome-wide association studies ; Genome-Wide Association Study - methods ; Genomes ; Genotype &amp; phenotype ; Genotypes ; Heritability ; Humans ; Linkage disequilibrium ; Methods ; Models, Genetic ; Mutation, Missense ; Physical Sciences ; Polymorphism, Single Nucleotide ; Population ; Population genetics ; Population structure ; Positive selection ; Principal Component Analysis ; Principal components analysis ; Research and Analysis Methods ; Single-nucleotide polymorphism ; Toll-Like Receptor 4 - genetics ; United Kingdom - 
ethnology</subject><ispartof>PLoS genetics, 2020-05, Vol.16 (5), p.e1008773-e1008773</ispartof><rights>COPYRIGHT 2020 Public Library of Science</rights><rights>2020 Agrawal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2020 Agrawal et al 2020 Agrawal et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353</citedby><cites>FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353</cites><orcidid>0000-0002-5955-9701 ; 0000-0002-1646-1149 ; 0000-0002-2373-3691</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.proquest.com/docview/2479455466/fulltextPDF?pq-origsite=primo$$EPDF$$P50$$Gproquest$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2479455466?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>230,315,728,781,785,886,25754,27925,27926,37013,37014,44591,53792,53794,75127</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/32469896$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>Gravel, Simon</contributor><creatorcontrib>Agrawal, Aman</creatorcontrib><creatorcontrib>Chiu, Alec M</creatorcontrib><creatorcontrib>Le, Minh</creatorcontrib><creatorcontrib>Halperin, Eran</creatorcontrib><creatorcontrib>Sankararaman, 
Sriram</creatorcontrib><title>Scalable probabilistic PCA for large-scale genetic variation data</title><title>PLoS genetics</title><addtitle>PLoS Genet</addtitle><description>Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.</description><subject>Accuracy</subject><subject>Adaptor Proteins, Signal Transducing - genetics</subject><subject>Algorithms</subject><subject>Biobanks</subject><subject>Biological Specimen Banks</subject><subject>Biology and Life Sciences</subject><subject>Computational Biology - methods</subject><subject>Computer applications</subject><subject>Datasets</subject><subject>European Continental Ancestry Group - genetics</subject><subject>Genetic analysis</subject><subject>Genetic diversity</subject><subject>Genetic variation</subject><subject>Genetics, Population</subject><subject>Genome-wide association studies</subject><subject>Genome-Wide Association Study - methods</subject><subject>Genomes</subject><subject>Genotype &amp; 
phenotype</subject><subject>Genotypes</subject><subject>Heritability</subject><subject>Humans</subject><subject>Linkage disequilibrium</subject><subject>Methods</subject><subject>Models, Genetic</subject><subject>Mutation, Missense</subject><subject>Physical Sciences</subject><subject>Polymorphism, Single Nucleotide</subject><subject>Population</subject><subject>Population genetics</subject><subject>Population structure</subject><subject>Positive selection</subject><subject>Principal Component Analysis</subject><subject>Principal components analysis</subject><subject>Research and Analysis Methods</subject><subject>Single-nucleotide polymorphism</subject><subject>Toll-Like Receptor 4 - genetics</subject><subject>United Kingdom - ethnology</subject><issn>1553-7404</issn><issn>1553-7390</issn><issn>1553-7404</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><sourceid>DOA</sourceid><recordid>eNqVk99v0zAQxyMEYmPwHyCohITgocWJ7Th5QaoqflSaGGITr9bFOaeu3Lizkwn-e5w1mxq0B5AfbJ0_9707ny9JXqZkkVKRfti63rdgF_sG20VKSCEEfZScppzTuWCEPT46nyTPQtgSQnlRiqfJCc1YXhZlfposLxVYqCzO9t5VUBlrQmfU7PtqOdPOzyz4BuchQjiLgXC4uwFvoDOundXQwfPkiQYb8MW4nyVXnz9drb7Ozy--rFfL87kSZdbNqaIZpwIIg0IBQl2moKtcMMF1jkUNVFdlSkTGal4JlbKs0jlDQTHnmnJ6lrw-yO6tC3IsPsiMiZJxzvI8EusDUTvYyr03O_C_pQMjbw3ONxJ8zN-iRF4TPcTOFDJe1mVFK40poM6VKgqMWh_HaH21w1ph23mwE9HpTWs2snE3UmRFzm_TfTcKeHfdY-jkzgSF1kKLrh_yJkVGGBM0om_-Qh-ubqSa2AppWu1iXDWIymVOs5KVLE0jtXiAiqvGnVGuRW2ifeLwfuIQmQ5_dQ30Icj15Y__YL_9O3vxc8q-PWI3CLbbBGf74YuFKcgOoPIuBI_6viEpkcNQ3L2cHIZCjkMR3V4dN_Pe6W4K6B9TgAWb</recordid><startdate>20200529</startdate><enddate>20200529</enddate><creator>Agrawal, Aman</creator><creator>Chiu, Alec M</creator><creator>Le, Minh</creator><creator>Halperin, Eran</creator><creator>Sankararaman, Sriram</creator><general>Public Library of Science</general><general>Public Library of Science 
(PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>IOV</scope><scope>ISN</scope><scope>ISR</scope><scope>3V.</scope><scope>7QP</scope><scope>7QR</scope><scope>7SS</scope><scope>7TK</scope><scope>7TM</scope><scope>7TO</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8FD</scope><scope>8FE</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>H94</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>LK8</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-5955-9701</orcidid><orcidid>https://orcid.org/0000-0002-1646-1149</orcidid><orcidid>https://orcid.org/0000-0002-2373-3691</orcidid></search><sort><creationdate>20200529</creationdate><title>Scalable probabilistic PCA for large-scale genetic variation data</title><author>Agrawal, Aman ; Chiu, Alec M ; Le, Minh ; Halperin, Eran ; Sankararaman, Sriram</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Accuracy</topic><topic>Adaptor Proteins, Signal Transducing - genetics</topic><topic>Algorithms</topic><topic>Biobanks</topic><topic>Biological Specimen Banks</topic><topic>Biology and Life Sciences</topic><topic>Computational Biology - 
methods</topic><topic>Computer applications</topic><topic>Datasets</topic><topic>European Continental Ancestry Group - genetics</topic><topic>Genetic analysis</topic><topic>Genetic diversity</topic><topic>Genetic variation</topic><topic>Genetics, Population</topic><topic>Genome-wide association studies</topic><topic>Genome-Wide Association Study - methods</topic><topic>Genomes</topic><topic>Genotype &amp; phenotype</topic><topic>Genotypes</topic><topic>Heritability</topic><topic>Humans</topic><topic>Linkage disequilibrium</topic><topic>Methods</topic><topic>Models, Genetic</topic><topic>Mutation, Missense</topic><topic>Physical Sciences</topic><topic>Polymorphism, Single Nucleotide</topic><topic>Population</topic><topic>Population genetics</topic><topic>Population structure</topic><topic>Positive selection</topic><topic>Principal Component Analysis</topic><topic>Principal components analysis</topic><topic>Research and Analysis Methods</topic><topic>Single-nucleotide polymorphism</topic><topic>Toll-Like Receptor 4 - genetics</topic><topic>United Kingdom - ethnology</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Agrawal, Aman</creatorcontrib><creatorcontrib>Chiu, Alec M</creatorcontrib><creatorcontrib>Le, Minh</creatorcontrib><creatorcontrib>Halperin, Eran</creatorcontrib><creatorcontrib>Sankararaman, Sriram</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Opposing Viewpoints</collection><collection>Gale In Context: Canada</collection><collection>Gale in Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Calcium &amp; Calcified Tissue Abstracts</collection><collection>Chemoreception Abstracts</collection><collection>Entomology Abstracts (Full 
archive)</collection><collection>Neurosciences Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Oncogenes and Growth Factors Abstracts</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>PML(ProQuest Medical Library)</collection><collection>Biological Science Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Publicly Available Content (ProQuest)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One 
Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>Directory of Open Access Journals</collection><jtitle>PLoS genetics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Agrawal, Aman</au><au>Chiu, Alec M</au><au>Le, Minh</au><au>Halperin, Eran</au><au>Sankararaman, Sriram</au><au>Gravel, Simon</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Scalable probabilistic PCA for large-scale genetic variation data</atitle><jtitle>PLoS genetics</jtitle><addtitle>PLoS Genet</addtitle><date>2020-05-29</date><risdate>2020</risdate><volume>16</volume><issue>5</issue><spage>e1008773</spage><epage>e1008773</epage><pages>e1008773-e1008773</pages><issn>1553-7404</issn><issn>1553-7390</issn><eissn>1553-7404</eissn><abstract>Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. 
To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>32469896</pmid><doi>10.1371/journal.pgen.1008773</doi><orcidid>https://orcid.org/0000-0002-5955-9701</orcidid><orcidid>https://orcid.org/0000-0002-1646-1149</orcidid><orcidid>https://orcid.org/0000-0002-2373-3691</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1553-7404
ispartof PLoS genetics, 2020-05, Vol.16 (5), p.e1008773-e1008773
issn 1553-7404
1553-7390
1553-7404
language eng
recordid cdi_plos_journals_2479455466
source Publicly Available Content (ProQuest); PubMed Central
subjects Accuracy
Adaptor Proteins, Signal Transducing - genetics
Algorithms
Biobanks
Biological Specimen Banks
Biology and Life Sciences
Computational Biology - methods
Computer applications
Datasets
European Continental Ancestry Group - genetics
Genetic analysis
Genetic diversity
Genetic variation
Genetics, Population
Genome-wide association studies
Genome-Wide Association Study - methods
Genomes
Genotype & phenotype
Genotypes
Heritability
Humans
Linkage disequilibrium
Methods
Models, Genetic
Mutation, Missense
Physical Sciences
Polymorphism, Single Nucleotide
Population
Population genetics
Population structure
Positive selection
Principal Component Analysis
Principal components analysis
Research and Analysis Methods
Single-nucleotide polymorphism
Toll-Like Receptor 4 - genetics
United Kingdom - ethnology
title Scalable probabilistic PCA for large-scale genetic variation data
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T13%3A35%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Scalable%20probabilistic%20PCA%20for%20large-scale%20genetic%20variation%20data&rft.jtitle=PLoS%20genetics&rft.au=Agrawal,%20Aman&rft.date=2020-05-29&rft.volume=16&rft.issue=5&rft.spage=e1008773&rft.epage=e1008773&rft.pages=e1008773-e1008773&rft.issn=1553-7404&rft.eissn=1553-7404&rft_id=info:doi/10.1371/journal.pgen.1008773&rft_dat=%3Cgale_plos_%3EA632949411%3C/gale_plos_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2479455466&rft_id=info:pmid/32469896&rft_galeid=A632949411&rfr_iscdi=true