
Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection

Bibliographic Details
Published in: Journal of the Royal Statistical Society. Series B, Statistical methodology, 2018-06, Vol.80 (3), p.551-577
Main Authors: Candès, Emmanuel; Fan, Yingying; Janson, Lucas; Lv, Jinchi
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3
cites cdi_FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3
container_end_page 577
container_issue 3
container_start_page 551
container_title Journal of the Royal Statistical Society. Series B, Statistical methodology
container_volume 80
creator Candès, Emmanuel
Fan, Yingying
Janson, Lucas
Lv, Jinchi
description Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n ⩾ p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case–control study of Crohn’s disease in the UK, making twice as many discoveries as the original analysis of the same data.
doi_str_mv 10.1111/rssb.12265
format article
fullrecord <record><control><sourceid>jstor_proqu</sourceid><recordid>TN_cdi_proquest_journals_2029235671</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>26773168</jstor_id><sourcerecordid>26773168</sourcerecordid><originalsourceid>FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3</originalsourceid><addsrcrecordid>eNp9j01LAzEQhoMoWKsg3oWCN2FrJtnNx1GLX1BQrJ5Dkk3KLuumJi3Sf2_qqkffy8zheWd4EDoDPIWcq5iSmQIhrNpDIygZL6RgYj_vlMmCl0AO0VFKLc5hnI7Q6bPu-6ZfTnyIk2Xo6mN04HWX3MnPHKO3u9vX2UMxf7p_nF3PC1tWoipKjp0nUpqa1gyscRoTMNZDbSQQSbmn1jJjNRHG4MrUnBnhNDG4FNqVno7RxXB3FcPHxqW1asMm9vmlIphIQivGIVOXA2VjSCk6r1axeddxqwCrnbHaGatv4wzDAH82ndv-Q6qXxeLmt3M-dNq0DvGvQxjnFJigX5pkYKg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2029235671</pqid></control><display><type>article</type><title>Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection</title><source>International Bibliography of the Social Sciences (IBSS)</source><source>Business Source Ultimate</source><source>JSTOR Archival Journals and Primary Sources Collection</source><source>Alma/SFX Local Collection</source><creator>Candès, Emmanuel ; Fan, Yingying ; Janson, Lucas ; Lv, Jinchi</creator><creatorcontrib>Candès, Emmanuel ; Fan, Yingying ; Janson, Lucas ; Lv, Jinchi</creatorcontrib><description>Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. 
To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n ⩾ p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. 
Finally, we apply our procedure to data from a case–control study of Crohn’s disease in the UK, making twice as many discoveries as the original analysis of the same data.</description><identifier>ISSN: 1369-7412</identifier><identifier>EISSN: 1467-9868</identifier><identifier>DOI: 10.1111/rssb.12265</identifier><language>eng</language><publisher>Oxford: Wiley</publisher><subject>Competitors ; Computer simulation ; Crohn's Disease ; Discovery ; Disease control ; False discovery rate ; Generalized linear models ; Genomewide association study ; Gold ; Inference ; Innovations ; Knockoff filter ; Linear analysis ; Logistic regression ; Markov blanket ; Power ; Regression analysis ; Statistical methods ; Statistics ; Testing for conditional independence in non‐linear models</subject><ispartof>Journal of the Royal Statistical Society. Series B, Statistical methodology, 2018-06, Vol.80 (3), p.551-577</ispartof><rights>2018 Royal Statistical Society</rights><rights>Copyright © 2018 The Royal Statistical Society and Blackwell Publishing Ltd</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3</citedby><cites>FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/26773168$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/26773168$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,27924,27925,33223,58238,58471</link.rule.ids></links><search><creatorcontrib>Candès, Emmanuel</creatorcontrib><creatorcontrib>Fan, Yingying</creatorcontrib><creatorcontrib>Janson, Lucas</creatorcontrib><creatorcontrib>Lv, Jinchi</creatorcontrib><title>Panning for 
gold: ‘model-X’ knockoffs for high dimensional controlled variable selection</title><title>Journal of the Royal Statistical Society. Series B, Statistical methodology</title><description>Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n ⩾ p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. 
Finally, we apply our procedure to data from a case–control study of Crohn’s disease in the UK, making twice as many discoveries as the original analysis of the same data.</description><subject>Competitors</subject><subject>Computer simulation</subject><subject>Crohn's Disease</subject><subject>Discovery</subject><subject>Disease control</subject><subject>False discovery rate</subject><subject>Generalized linear models</subject><subject>Genomewide association study</subject><subject>Gold</subject><subject>Inference</subject><subject>Innovations</subject><subject>Knockoff filter</subject><subject>Linear analysis</subject><subject>Logistic regression</subject><subject>Markov blanket</subject><subject>Power</subject><subject>Regression analysis</subject><subject>Statistical methods</subject><subject>Statistics</subject><subject>Testing for conditional independence in non‐linear models</subject><issn>1369-7412</issn><issn>1467-9868</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>8BJ</sourceid><recordid>eNp9j01LAzEQhoMoWKsg3oWCN2FrJtnNx1GLX1BQrJ5Dkk3KLuumJi3Sf2_qqkffy8zheWd4EDoDPIWcq5iSmQIhrNpDIygZL6RgYj_vlMmCl0AO0VFKLc5hnI7Q6bPu-6ZfTnyIk2Xo6mN04HWX3MnPHKO3u9vX2UMxf7p_nF3PC1tWoipKjp0nUpqa1gyscRoTMNZDbSQQSbmn1jJjNRHG4MrUnBnhNDG4FNqVno7RxXB3FcPHxqW1asMm9vmlIphIQivGIVOXA2VjSCk6r1axeddxqwCrnbHaGatv4wzDAH82ndv-Q6qXxeLmt3M-dNq0DvGvQxjnFJigX5pkYKg</recordid><startdate>201806</startdate><enddate>201806</enddate><creator>Candès, Emmanuel</creator><creator>Fan, Yingying</creator><creator>Janson, Lucas</creator><creator>Lv, Jinchi</creator><general>Wiley</general><general>Oxford University Press</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8BJ</scope><scope>8FD</scope><scope>FQK</scope><scope>JBE</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>201806</creationdate><title>Panning for 
gold</title><author>Candès, Emmanuel ; Fan, Yingying ; Janson, Lucas ; Lv, Jinchi</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Competitors</topic><topic>Computer simulation</topic><topic>Crohn's Disease</topic><topic>Discovery</topic><topic>Disease control</topic><topic>False discovery rate</topic><topic>Generalized linear models</topic><topic>Genomewide association study</topic><topic>Gold</topic><topic>Inference</topic><topic>Innovations</topic><topic>Knockoff filter</topic><topic>Linear analysis</topic><topic>Logistic regression</topic><topic>Markov blanket</topic><topic>Power</topic><topic>Regression analysis</topic><topic>Statistical methods</topic><topic>Statistics</topic><topic>Testing for conditional independence in non‐linear models</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Candès, Emmanuel</creatorcontrib><creatorcontrib>Fan, Yingying</creatorcontrib><creatorcontrib>Janson, Lucas</creatorcontrib><creatorcontrib>Lv, Jinchi</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>International Bibliography of the Social Sciences (IBSS)</collection><collection>Technology Research Database</collection><collection>International Bibliography of the Social Sciences</collection><collection>International Bibliography of the Social Sciences</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Journal of the Royal Statistical Society. 
Series B, Statistical methodology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Candès, Emmanuel</au><au>Fan, Yingying</au><au>Janson, Lucas</au><au>Lv, Jinchi</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection</atitle><jtitle>Journal of the Royal Statistical Society. Series B, Statistical methodology</jtitle><date>2018-06</date><risdate>2018</risdate><volume>80</volume><issue>3</issue><spage>551</spage><epage>577</epage><pages>551-577</pages><issn>1369-7412</issn><eissn>1467-9868</eissn><abstract>Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n ⩾ p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. 
To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case–control study of Crohn’s disease in the UK, making twice as many discoveries as the original analysis of the same data.</abstract><cop>Oxford</cop><pub>Wiley</pub><doi>10.1111/rssb.12265</doi><tpages>27</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1369-7412
ispartof Journal of the Royal Statistical Society. Series B, Statistical methodology, 2018-06, Vol.80 (3), p.551-577
issn 1369-7412
1467-9868
language eng
recordid cdi_proquest_journals_2029235671
source International Bibliography of the Social Sciences (IBSS); Business Source Ultimate; JSTOR Archival Journals and Primary Sources Collection; Alma/SFX Local Collection
subjects Competitors
Computer simulation
Crohn's Disease
Discovery
Disease control
False discovery rate
Generalized linear models
Genomewide association study
Gold
Inference
Innovations
Knockoff filter
Linear analysis
Logistic regression
Markov blanket
Power
Regression analysis
Statistical methods
Statistics
Testing for conditional independence in non‐linear models
title Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T08%3A00%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Panning%20for%20gold:%20%E2%80%98model-X%E2%80%99%20knockoffs%20for%20high%20dimensional%20controlled%20variable%20selection&rft.jtitle=Journal%20of%20the%20Royal%20Statistical%20Society.%20Series%20B,%20Statistical%20methodology&rft.au=Cand%C3%A8s,%20Emmanuel&rft.date=2018-06&rft.volume=80&rft.issue=3&rft.spage=551&rft.epage=577&rft.pages=551-577&rft.issn=1369-7412&rft.eissn=1467-9868&rft_id=info:doi/10.1111/rssb.12265&rft_dat=%3Cjstor_proqu%3E26773168%3C/jstor_proqu%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c4585-470ef299bd3d61cbea021bcf1db912937f3cc6bca28bb05bd76b8ea2b048ae4f3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2029235671&rft_id=info:pmid/&rft_jstor_id=26773168&rfr_iscdi=true