Scalable probabilistic PCA for large-scale genetic variation data
Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal compon...
Published in: | PLoS genetics 2020-05, Vol.16 (5), p.e1008773-e1008773 |
---|---|
Main Authors: | Agrawal, Aman; Chiu, Alec M; Le, Minh; Halperin, Eran; Sankararaman, Sriram |
Format: | Article |
Language: | English |
cited_by | cdi_FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353 |
---|---|
cites | cdi_FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353 |
container_end_page | e1008773 |
container_issue | 5 |
container_start_page | e1008773 |
container_title | PLoS genetics |
container_volume | 16 |
creator | Agrawal, Aman; Chiu, Alec M; Le, Minh; Halperin, Eran; Sankararaman, Sriram |
description | Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4. |
doi_str_mv | 10.1371/journal.pgen.1008773 |
format | article |
contributor | Gravel, Simon |
date | 2020-05-29 |
orcid | https://orcid.org/0000-0002-5955-9701; https://orcid.org/0000-0002-1646-1149; https://orcid.org/0000-0002-2373-3691 |
pmid | 32469896 |
publisher | United States: Public Library of Science |
rights | 2020 Agrawal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
fulltext | fulltext |
identifier | ISSN: 1553-7404 |
ispartof | PLoS genetics, 2020-05, Vol.16 (5), p.e1008773-e1008773 |
issn | ISSN: 1553-7404, 1553-7390; EISSN: 1553-7404 |
language | eng |
recordid | cdi_plos_journals_2479455466 |
source | Publicly Available Content (ProQuest); PubMed Central |
subjects | Accuracy; Adaptor Proteins, Signal Transducing - genetics; Algorithms; Biobanks; Biological Specimen Banks; Biology and Life Sciences; Computational Biology - methods; Computer applications; Datasets; European Continental Ancestry Group - genetics; Genetic analysis; Genetic diversity; Genetic variation; Genetics, Population; Genome-wide association studies; Genome-Wide Association Study - methods; Genomes; Genotype & phenotype; Genotypes; Heritability; Humans; Linkage disequilibrium; Methods; Models, Genetic; Mutation, Missense; Physical Sciences; Polymorphism, Single Nucleotide; Population; Population genetics; Population structure; Positive selection; Principal Component Analysis; Principal components analysis; Research and Analysis Methods; Single-nucleotide polymorphism; Toll-Like Receptor 4 - genetics; United Kingdom - ethnology |
title | Scalable probabilistic PCA for large-scale genetic variation data |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T13%3A35%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Scalable%20probabilistic%20PCA%20for%20large-scale%20genetic%20variation%20data&rft.jtitle=PLoS%20genetics&rft.au=Agrawal,%20Aman&rft.date=2020-05-29&rft.volume=16&rft.issue=5&rft.spage=e1008773&rft.epage=e1008773&rft.pages=e1008773-e1008773&rft.issn=1553-7404&rft.eissn=1553-7404&rft_id=info:doi/10.1371/journal.pgen.1008773&rft_dat=%3Cgale_plos_%3EA632949411%3C/gale_plos_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c792t-3c32537a04a8caead91afb67475f6e8da3fb910724d5b7c142bf64e73e65f353%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2479455466&rft_id=info:pmid/32469896&rft_galeid=A632949411&rfr_iscdi=true |
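
The description in this record says that ProPCA computes the top PCs from a probabilistic generative model, but the record does not give the algorithm. As a rough illustration of that family of methods only, the sketch below runs a generic EM iteration for probabilistic PCA in the zero-noise limit using NumPy. It is not the ProPCA implementation; the function name `em_pca`, its parameters, and the dense in-memory matrix layout are assumptions made for this example.

```python
import numpy as np


def em_pca(Y, k, n_iter=50, seed=0):
    """Estimate the top-k principal subspace of Y with a generic EM iteration.

    Hypothetical helper for illustration (not from the ProPCA paper).
    Y is a (n_snps, n_individuals) matrix whose rows are already centered.
    Returns an orthonormal basis of shape (n_snps, k).
    """
    rng = np.random.default_rng(seed)
    # Random starting guess for the loading matrix C (n_snps x k).
    C = rng.standard_normal((Y.shape[0], k))
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y, shape (k, n_individuals).
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: loadings C = Y X^T (X X^T)^{-1}, shape (n_snps, k).
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # Orthonormalize so the columns form a basis of the estimated subspace.
    Q, _ = np.linalg.qr(C)
    return Q
```

Each iteration only needs products of the data matrix with thin `k`-column matrices, which is one reason iterative schemes of this kind can scale better than a full eigendecomposition. The returned columns span the estimated top-`k` subspace; they are not guaranteed to be the individual PCs in order, so a small SVD of the projected data would be needed to recover them.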