Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection
Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the...
Published in: | Journal of the Royal Statistical Society. Series B, Statistical methodology, 2018-06, Vol.80 (3), p.551-577 |
---|---|
Main Authors: | Candès, Emmanuel; Fan, Yingying; Janson, Lucas; Lv, Jinchi |
Format: | Article |
Language: | English |
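Subjects: | Competitors; Computer simulation; Crohn's Disease; Discovery; Disease control; False discovery rate; Generalized linear models; Genomewide association study; Gold; Inference; Innovations; Knockoff filter; Linear analysis; Logistic regression; Markov blanket; Power; Regression analysis; Statistical methods; Statistics; Testing for conditional independence in non‐linear models |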
cited_by | cdi_FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3 |
---|---|
cites | cdi_FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3 |
container_end_page | 577 |
container_issue | 3 |
container_start_page | 551 |
container_title | Journal of the Royal Statistical Society. Series B, Statistical methodology |
container_volume | 80 |
creator | Candès, Emmanuel; Fan, Yingying; Janson, Lucas; Lv, Jinchi |
description | Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n ⩾ p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case–control study of Crohn’s disease in the UK, making twice as many discoveries as the original analysis of the same data. |
doi_str_mv | 10.1111/rssb.12265 |
format | article |
fullrecord | <record><control><sourceid>jstor_proqu</sourceid><recordid>TN_cdi_proquest_journals_2029235671</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>26773168</jstor_id><sourcerecordid>26773168</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3</originalsourceid><addsrcrecordid>eNp9j01LAzEQhoMoWKsg3oWCN2FrJtnNx1GLX1BQrJ5Dkk3KLuumJi3Sf2_qqkffy8zheWd4EDoDPIWcq5iSmQIhrNpDIygZL6RgYj_vlMmCl0AO0VFKLc5hnI7Q6bPu-6ZfTnyIk2Xo6mN04HWX3MnPHKO3u9vX2UMxf7p_nF3PC1tWoipKjp0nUpqa1gyscRoTMNZDbSQQSbmn1jJjNRHG4MrUnBnhNDG4FNqVno7RxXB3FcPHxqW1asMm9vmlIphIQivGIVOXA2VjSCk6r1axeddxqwCrnbHaGatv4wzDAH82ndv-Q6qXxeLmt3M-dNq0DvGvQxjnFJigX5pkYKg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2029235671</pqid></control><display><type>article</type><title>Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection</title><source>International Bibliography of the Social Sciences (IBSS)</source><source>Business Source Ultimate</source><source>JSTOR Archival Journals and Primary Sources Collection</source><source>Alma/SFX Local Collection</source><creator>Candès, Emmanuel ; Fan, Yingying ; Janson, Lucas ; Lv, Jinchi</creator><creatorcontrib>Candès, Emmanuel ; Fan, Yingying ; Janson, Lucas ; Lv, Jinchi</creatorcontrib><description>Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. 
To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n ⩾ p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. 
Finally, we apply our procedure to data from a case–control study of Crohn’s disease in the UK, making twice as many discoveries as the original analysis of the same data.</description><identifier>ISSN: 1369-7412</identifier><identifier>EISSN: 1467-9868</identifier><identifier>DOI: 10.1111/rssb.12265</identifier><language>eng</language><publisher>Oxford: Wiley</publisher><subject>Competitors ; Computer simulation ; Crohn's Disease ; Discovery ; Disease control ; False discovery rate ; Generalized linear models ; Genomewide association study ; Gold ; Inference ; Innovations ; Knockoff filter ; Linear analysis ; Logistic regression ; Markov blanket ; Power ; Regression analysis ; Statistical methods ; Statistics ; Testing for conditional independence in non‐linear models</subject><ispartof>Journal of the Royal Statistical Society. Series B, Statistical methodology, 2018-06, Vol.80 (3), p.551-577</ispartof><rights>2018 Royal Statistical Society</rights><rights>Copyright © 2018 The Royal Statistical Society and Blackwell Publishing Ltd</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3</citedby><cites>FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/26773168$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/26773168$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,27924,27925,33223,58238,58471</link.rule.ids></links><search><creatorcontrib>Candès, Emmanuel</creatorcontrib><creatorcontrib>Fan, Yingying</creatorcontrib><creatorcontrib>Janson, Lucas</creatorcontrib><creatorcontrib>Lv, Jinchi</creatorcontrib><title>Panning for 
gold: ‘model-X’ knockoffs for high dimensional controlled variable selection</title><title>Journal of the Royal Statistical Society. Series B, Statistical methodology</title><description>Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n ⩾ p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. 
Finally, we apply our procedure to data from a case–control study of Crohn’s disease in the UK, making twice as many discoveries as the original analysis of the same data.</description><subject>Competitors</subject><subject>Computer simulation</subject><subject>Crohn's Disease</subject><subject>Discovery</subject><subject>Disease control</subject><subject>False discovery rate</subject><subject>Generalized linear models</subject><subject>Genomewide association study</subject><subject>Gold</subject><subject>Inference</subject><subject>Innovations</subject><subject>Knockoff filter</subject><subject>Linear analysis</subject><subject>Logistic regression</subject><subject>Markov blanket</subject><subject>Power</subject><subject>Regression analysis</subject><subject>Statistical methods</subject><subject>Statistics</subject><subject>Testing for conditional independence in non‐linear models</subject><issn>1369-7412</issn><issn>1467-9868</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>8BJ</sourceid><recordid>eNp9j01LAzEQhoMoWKsg3oWCN2FrJtnNx1GLX1BQrJ5Dkk3KLuumJi3Sf2_qqkffy8zheWd4EDoDPIWcq5iSmQIhrNpDIygZL6RgYj_vlMmCl0AO0VFKLc5hnI7Q6bPu-6ZfTnyIk2Xo6mN04HWX3MnPHKO3u9vX2UMxf7p_nF3PC1tWoipKjp0nUpqa1gyscRoTMNZDbSQQSbmn1jJjNRHG4MrUnBnhNDG4FNqVno7RxXB3FcPHxqW1asMm9vmlIphIQivGIVOXA2VjSCk6r1axeddxqwCrnbHaGatv4wzDAH82ndv-Q6qXxeLmt3M-dNq0DvGvQxjnFJigX5pkYKg</recordid><startdate>201806</startdate><enddate>201806</enddate><creator>Candès, Emmanuel</creator><creator>Fan, Yingying</creator><creator>Janson, Lucas</creator><creator>Lv, Jinchi</creator><general>Wiley</general><general>Oxford University Press</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8BJ</scope><scope>8FD</scope><scope>FQK</scope><scope>JBE</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>201806</creationdate><title>Panning for 
gold</title><author>Candès, Emmanuel ; Fan, Yingying ; Janson, Lucas ; Lv, Jinchi</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Competitors</topic><topic>Computer simulation</topic><topic>Crohn's Disease</topic><topic>Discovery</topic><topic>Disease control</topic><topic>False discovery rate</topic><topic>Generalized linear models</topic><topic>Genomewide association study</topic><topic>Gold</topic><topic>Inference</topic><topic>Innovations</topic><topic>Knockoff filter</topic><topic>Linear analysis</topic><topic>Logistic regression</topic><topic>Markov blanket</topic><topic>Power</topic><topic>Regression analysis</topic><topic>Statistical methods</topic><topic>Statistics</topic><topic>Testing for conditional independence in non‐linear models</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Candès, Emmanuel</creatorcontrib><creatorcontrib>Fan, Yingying</creatorcontrib><creatorcontrib>Janson, Lucas</creatorcontrib><creatorcontrib>Lv, Jinchi</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>International Bibliography of the Social Sciences (IBSS)</collection><collection>Technology Research Database</collection><collection>International Bibliography of the Social Sciences</collection><collection>International Bibliography of the Social Sciences</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Journal of the Royal Statistical Society. 
Series B, Statistical methodology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Candès, Emmanuel</au><au>Fan, Yingying</au><au>Janson, Lucas</au><au>Lv, Jinchi</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection</atitle><jtitle>Journal of the Royal Statistical Society. Series B, Statistical methodology</jtitle><date>2018-06</date><risdate>2018</risdate><volume>80</volume><issue>3</issue><spage>551</spage><epage>577</epage><pages>551-577</pages><issn>1369-7412</issn><eissn>1467-9868</eissn><abstract>Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n ⩾ p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. 
To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case–control study of Crohn’s disease in the UK, making twice as many discoveries as the original analysis of the same data.</abstract><cop>Oxford</cop><pub>Wiley</pub><doi>10.1111/rssb.12265</doi><tpages>27</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1369-7412 |
ispartof | Journal of the Royal Statistical Society. Series B, Statistical methodology, 2018-06, Vol.80 (3), p.551-577 |
issn | 1369-7412 1467-9868 |
language | eng |
recordid | cdi_proquest_journals_2029235671 |
source | International Bibliography of the Social Sciences (IBSS); Business Source Ultimate; JSTOR Archival Journals and Primary Sources Collection; Alma/SFX Local Collection |
subjects | Competitors; Computer simulation; Crohn's Disease; Discovery; Disease control; False discovery rate; Generalized linear models; Genomewide association study; Gold; Inference; Innovations; Knockoff filter; Linear analysis; Logistic regression; Markov blanket; Power; Regression analysis; Statistical methods; Statistics; Testing for conditional independence in non‐linear models |
title | Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T08%3A00%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Panning%20for%20gold:%20%E2%80%98model-X%E2%80%99%20knockoffs%20for%20high%20dimensional%20controlled%20variable%20selection&rft.jtitle=Journal%20of%20the%20Royal%20Statistical%20Society.%20Series%20B,%20Statistical%20methodology&rft.au=Cand%C3%A8s,%20Emmanuel&rft.date=2018-06&rft.volume=80&rft.issue=3&rft.spage=551&rft.epage=577&rft.pages=551-577&rft.issn=1369-7412&rft.eissn=1467-9868&rft_id=info:doi/10.1111/rssb.12265&rft_dat=%3Cjstor_proqu%3E26773168%3C/jstor_proqu%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2029235671&rft_id=info:pmid/&rft_jstor_id=26773168&rfr_iscdi=true |