Inference after variable selection using restricted permutation methods

When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the m...

Bibliographic Details
Published in:	Canadian journal of statistics 2009-12, Vol.37 (4), p.625-644
Main Authors:	Wang, Rui; Lagakos, Stephen W.
Format:	Article
Language:	English
Subjects:	AIDS; Comparative analysis; Confidence interval; covariates; Data analysis; Datasets; Economic models; Economic statistics; Empirical research; Estimating techniques; Feature selection; Inference; Information economics; Linear models; Linear regression; Marginal analysis; Mathematical economics; Mathematical independent variables; Measurement; MSC 2000: Primary 62G09; Parametric models; Permutation tests; Probabilities; regression; Restricted permutation methods; sample splitting; secondary 62J05; Simulation; Simulation techniques; Statistical analysis; Statistical data; Statistical inference; Statistical methods; Statistical models; Statistical tables; Studies; Variable selector; Variable-selection algorithm; Variance analysis

cited_by	cdi_FETCH-LOGICAL-c5989-d57feff896b9a4aed1d08143bf22e213750134ec09aabad826a12cd47d49867e3
cites	cdi_FETCH-LOGICAL-c5989-d57feff896b9a4aed1d08143bf22e213750134ec09aabad826a12cd47d49867e3
container_end_page	644
container_issue	4
container_start_page	625
container_title	Canadian journal of statistics
container_volume	37
creator	Wang, Rui; Lagakos, Stephen W.
description	When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the marginal association between the selected covariates and the response. If an independent data set were available, the parameters of interest could be estimated by using standard inference methods to fit the postulated marginal model to the independent data set. However, when applied to the same data set used by the variable selector, standard ("naive") methods can lead to distorted inferences. The authors develop testing and interval estimation methods for parameters reflecting the marginal association between the selected covariates and response variable, based on the same data set used for variable selection. They provide theoretical justification for the proposed methods, present results to guide their implementation, and use simulations to assess and compare their performance to a sample-splitting approach. The methods are illustrated with data from a recent AIDS study. Lorsque le statisticien doit choisir entre plusieurs covariables et qu'il n'a qu'une seule variable réponse, il doit souvent appliquer un algorithme de sélection de variables aux jeux de données afin d'identifier un sous-ensemble de covariables potentiellement associées à la variable réponse. Par la suite, il peut faire l'inférence sur les paramètres d'un modèle de l'association marginale entre les covariables choisies et la variable réponse. Si un autre jeu de données indépendant était disponible, les paramètres d'intérêt pourraient être estimés par les méthodes d'inférence courantes pour ajuster le modèle marginal considéré à ce jeu de données. 
Cependant, lorsque ces méthodes d'inférence sont utilisées sur le jeu de données qui a servi à la sélection des covariables, ces méthodes "naïves" peuvent produire un biais d'estimation. Les auteurs ont développé des tests et des méthodes d'estimation par intervalle pour les paramètres tenant compte de l'association marginale entre les covariables sélectionnés et la variable réponse, en utilisant les mêmes données qui ont servi à choisir les covariables. Ils donnent une justification théorique pour les méthodes proposées, présentent des résultats pour guider leur implantation, et ils utilisent des simulations pour mesurer et comparer leur performance
doi_str_mv	10.1002/cjs.10039
format	article
fullrecord	<record><control><sourceid>jstor_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_2848082</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>25653502</jstor_id><sourcerecordid>25653502</sourcerecordid><originalsourceid>FETCH-LOGICAL-c5989-d57feff896b9a4aed1d08143bf22e213750134ec09aabad826a12cd47d49867e3</originalsourceid><addsrcrecordid>eNqNkktv1DAUhS1ERaeFBT8AFLEpXYT6ldjeINERDK1aWDA8dpbj3LSe5jHYSaH_Hqdph4dUwBtbOp-PfI8PQo8JfkEwpgd2FcYDU_fQjAgsU8WzL_fRDDOi0kxQvo12QlhFIiOEPkDbFLNcilzO0OKorcBDayExVQ8-uTTemaKGJEANtnddmwzBtWeJh9B7Z3sokzX4ZujNtdhAf96V4SHaqkwd4NHNvos-vnm9nL9NT94vjuavTlKbKanSMhMVVJVUeaEMN1CSEkvCWVFRCpQwkWHCOFisjClMKWluCLUlFyVXMhfAdtHLyXc9FA2UFtrem1qvvWuMv9Kdcfp3pXXn-qy71FRyiSWNBns3Br77OsSZdOOChbo2LXRD0JLFxykh2f-RMufin6TgTGAaA4_k87-SRLIsrjwbTZ_9ga66wbcxXE3jpzPO1AjtT5D1XQgeqk0SBOuxGzp2Q193I7JPf41uQ96WIQIHE_DN1XB1t5OeH3-4tXwy3ViFvvM_HeMALMNj2Omku9DD941u_IXORfxs_fndQh8eq_mn5elSH7IflpfcRg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>200234397</pqid></control><display><type>article</type><title>Inference after variable selection using restricted permutation methods</title><source>International Bibliography of the Social Sciences (IBSS)</source><source>JSTOR Archival Journals and Primary Sources Collection</source><source>Wiley-Blackwell Read & Publish Collection</source><creator>Wang, Rui ; Lagakos, Stephen W.</creator><creatorcontrib>Wang, Rui ; Lagakos, Stephen W.</creatorcontrib><description>When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the marginal association between the selected covariates and the response. 
If an independent data set were available, the parameters of interest could be estimated by using standard inference methods to fit the postulated marginal model to the independent data set. However, when applied to the same data set used by the variable selector, standard ("naive") methods can lead to distorted inferences. The authors develop testing and interval estimation methods for parameters reflecting the marginal association between the selected covariates and response variable, based on the same data set used for variable selection. They provide theoretical justification for the proposed methods, present results to guide their implementation, and use simulations to assess and compare their performance to a sample-splitting approach. The methods are illustrated with data from a recent AIDS study. Lorsque le statisticien doit choisir entre plusieurs covariables et qu'il n'a qu'une seule variable réponse, il doit souvent appliquer un algorithme de sélection de variables aux jeux de données afin d'identifier un sous-ensemble de covariables potentiellement associées à la variable réponse. Par la suite, il peut faire l'inférence sur les paramètres d'un modèle de l'association marginale entre les covariables choisies et la variable réponse. Si un autre jeu de données indépendant était disponible, les paramètres d'intérêt pourraient être estimés par les méthodes d'inférence courantes pour ajuster le modèle marginal considéré à ce jeu de données. Cependant, lorsque ces méthodes d'inférence sont utilisées sur le jeu de données qui a servi à la sélection des covariables, ces méthodes "naïves" peuvent produire un biais d'estimation. Les auteurs ont développé des tests et des méthodes d'estimation par intervalle pour les paramètres tenant compte de l'association marginale entre les covariables sélectionnés et la variable réponse, en utilisant les mêmes données qui ont servi à choisir les covariables. 
Ils donnent une justification théorique pour les méthodes proposées, présentent des résultats pour guider leur implantation, et ils utilisent des simulations pour mesurer et comparer leur performance par rapport à l'approche consistant à diviser l'échantillon. Ces méthodes sont appliquées à des données provenant d'une récente étude sur le SIDA.</description><identifier>ISSN: 0319-5724</identifier><identifier>EISSN: 1708-945X</identifier><identifier>DOI: 10.1002/cjs.10039</identifier><identifier>PMID: 20368768</identifier><language>eng</language><publisher>Hoboken, USA: John Wiley & Sons, Inc</publisher><subject>AIDS ; Comparative analysis ; Confidence interval ; covariates ; Data analysis ; Datasets ; Economic models ; Economic statistics ; Empirical research ; Estimating techniques ; Feature selection ; Inference ; Information economics ; Linear models ; Linear regression ; Marginal analysis ; Mathematical economics ; Mathematical independent variables ; Measurement ; MSC 2000: Primary 62G09 ; Parametric models ; Permutation tests ; Probabilities ; regression ; Restricted permutation methods ; sample splitting ; secondary 62J05 ; Simulation ; Simulation techniques ; Statistical analysis ; Statistical data ; Statistical inference ; Statistical methods ; Statistical models ; Statistical tables ; Studies ; Variable selector ; Variable-selection algorithm ; Variance analysis</subject><ispartof>Canadian journal of statistics, 2009-12, Vol.37 (4), p.625-644</ispartof><rights>2009 Statistical Society of Canada/Société statistique du Canada</rights><rights>Copyright © 2009 Statistical Society of Canada</rights><rights>Copyright Statistical Society of Canada Dec 
2009</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c5989-d57feff896b9a4aed1d08143bf22e213750134ec09aabad826a12cd47d49867e3</citedby><cites>FETCH-LOGICAL-c5989-d57feff896b9a4aed1d08143bf22e213750134ec09aabad826a12cd47d49867e3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/25653502$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/25653502$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>230,314,780,784,885,27924,27925,33223,33224,58238,58471</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/20368768$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Wang, Rui</creatorcontrib><creatorcontrib>Lagakos, Stephen W.</creatorcontrib><title>Inference after variable selection using restricted permutation methods</title><title>Canadian journal of statistics</title><addtitle>Can J Statistics</addtitle><description>When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the marginal association between the selected covariates and the response. If an independent data set were available, the parameters of interest could be estimated by using standard inference methods to fit the postulated marginal model to the independent data set. However, when applied to the same data set used by the variable selector, standard ("naive") methods can lead to distorted inferences. 
The authors develop testing and interval estimation methods for parameters reflecting the marginal association between the selected covariates and response variable, based on the same data set used for variable selection. They provide theoretical justification for the proposed methods, present results to guide their implementation, and use simulations to assess and compare their performance to a sample-splitting approach. The methods are illustrated with data from a recent AIDS study. Lorsque le statisticien doit choisir entre plusieurs covariables et qu'il n'a qu'une seule variable réponse, il doit souvent appliquer un algorithme de sélection de variables aux jeux de données afin d'identifier un sous-ensemble de covariables potentiellement associées à la variable réponse. Par la suite, il peut faire l'inférence sur les paramètres d'un modèle de l'association marginale entre les covariables choisies et la variable réponse. Si un autre jeu de données indépendant était disponible, les paramètres d'intérêt pourraient être estimés par les méthodes d'inférence courantes pour ajuster le modèle marginal considéré à ce jeu de données. Cependant, lorsque ces méthodes d'inférence sont utilisées sur le jeu de données qui a servi à la sélection des covariables, ces méthodes "naïves" peuvent produire un biais d'estimation. Les auteurs ont développé des tests et des méthodes d'estimation par intervalle pour les paramètres tenant compte de l'association marginale entre les covariables sélectionnés et la variable réponse, en utilisant les mêmes données qui ont servi à choisir les covariables. Ils donnent une justification théorique pour les méthodes proposées, présentent des résultats pour guider leur implantation, et ils utilisent des simulations pour mesurer et comparer leur performance par rapport à l'approche consistant à diviser l'échantillon. 
Ces méthodes sont appliquées à des données provenant d'une récente étude sur le SIDA.</description><subject>AIDS</subject><subject>Comparative analysis</subject><subject>Confidence interval</subject><subject>covariates</subject><subject>Data analysis</subject><subject>Datasets</subject><subject>Economic models</subject><subject>Economic statistics</subject><subject>Empirical research</subject><subject>Estimating techniques</subject><subject>Feature selection</subject><subject>Inference</subject><subject>Information economics</subject><subject>Linear models</subject><subject>Linear regression</subject><subject>Marginal analysis</subject><subject>Mathematical economics</subject><subject>Mathematical independent variables</subject><subject>Measurement</subject><subject>MSC 2000: Primary 62G09</subject><subject>Parametric models</subject><subject>Permutation tests</subject><subject>Probabilities</subject><subject>regression</subject><subject>Restricted permutation methods</subject><subject>sample splitting</subject><subject>secondary 62J05</subject><subject>Simulation</subject><subject>Simulation techniques</subject><subject>Statistical analysis</subject><subject>Statistical data</subject><subject>Statistical inference</subject><subject>Statistical methods</subject><subject>Statistical models</subject><subject>Statistical tables</subject><subject>Studies</subject><subject>Variable selector</subject><subject>Variable-selection algorithm</subject><subject>Variance 
analysis</subject><issn>0319-5724</issn><issn>1708-945X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2009</creationdate><recordtype>article</recordtype><sourceid>8BJ</sourceid><recordid>eNqNkktv1DAUhS1ERaeFBT8AFLEpXYT6ldjeINERDK1aWDA8dpbj3LSe5jHYSaH_Hqdph4dUwBtbOp-PfI8PQo8JfkEwpgd2FcYDU_fQjAgsU8WzL_fRDDOi0kxQvo12QlhFIiOEPkDbFLNcilzO0OKorcBDayExVQ8-uTTemaKGJEANtnddmwzBtWeJh9B7Z3sokzX4ZujNtdhAf96V4SHaqkwd4NHNvos-vnm9nL9NT94vjuavTlKbKanSMhMVVJVUeaEMN1CSEkvCWVFRCpQwkWHCOFisjClMKWluCLUlFyVXMhfAdtHLyXc9FA2UFtrem1qvvWuMv9Kdcfp3pXXn-qy71FRyiSWNBns3Br77OsSZdOOChbo2LXRD0JLFxykh2f-RMufin6TgTGAaA4_k87-SRLIsrjwbTZ_9ga66wbcxXE3jpzPO1AjtT5D1XQgeqk0SBOuxGzp2Q193I7JPf41uQ96WIQIHE_DN1XB1t5OeH3-4tXwy3ViFvvM_HeMALMNj2Omku9DD941u_IXORfxs_fndQh8eq_mn5elSH7IflpfcRg</recordid><startdate>200912</startdate><enddate>200912</enddate><creator>Wang, Rui</creator><creator>Lagakos, Stephen W.</creator><general>John Wiley & Sons, Inc</general><general>Statistical Society of Canada</general><general>Wiley Subscription Services, Inc</general><scope>BSCLL</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8BJ</scope><scope>8FD</scope><scope>FQK</scope><scope>H8D</scope><scope>JBE</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>200912</creationdate><title>Inference after variable selection using restricted permutation methods</title><author>Wang, Rui ; Lagakos, Stephen W.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c5989-d57feff896b9a4aed1d08143bf22e213750134ec09aabad826a12cd47d49867e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2009</creationdate><topic>AIDS</topic><topic>Comparative analysis</topic><topic>Confidence interval</topic><topic>covariates</topic><topic>Data analysis</topic><topic>Datasets</topic><topic>Economic 
models</topic><topic>Economic statistics</topic><topic>Empirical research</topic><topic>Estimating techniques</topic><topic>Feature selection</topic><topic>Inference</topic><topic>Information economics</topic><topic>Linear models</topic><topic>Linear regression</topic><topic>Marginal analysis</topic><topic>Mathematical economics</topic><topic>Mathematical independent variables</topic><topic>Measurement</topic><topic>MSC 2000: Primary 62G09</topic><topic>Parametric models</topic><topic>Permutation tests</topic><topic>Probabilities</topic><topic>regression</topic><topic>Restricted permutation methods</topic><topic>sample splitting</topic><topic>secondary 62J05</topic><topic>Simulation</topic><topic>Simulation techniques</topic><topic>Statistical analysis</topic><topic>Statistical data</topic><topic>Statistical inference</topic><topic>Statistical methods</topic><topic>Statistical models</topic><topic>Statistical tables</topic><topic>Studies</topic><topic>Variable selector</topic><topic>Variable-selection algorithm</topic><topic>Variance analysis</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wang, Rui</creatorcontrib><creatorcontrib>Lagakos, Stephen W.</creatorcontrib><collection>Istex</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>International Bibliography of the Social Sciences (IBSS)</collection><collection>Technology Research Database</collection><collection>International Bibliography of the Social Sciences</collection><collection>Aerospace Database</collection><collection>International Bibliography of the Social Sciences</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts 
Professional</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Canadian journal of statistics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wang, Rui</au><au>Lagakos, Stephen W.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Inference after variable selection using restricted permutation methods</atitle><jtitle>Canadian journal of statistics</jtitle><addtitle>Can J Statistics</addtitle><date>2009-12</date><risdate>2009</risdate><volume>37</volume><issue>4</issue><spage>625</spage><epage>644</epage><pages>625-644</pages><issn>0319-5724</issn><eissn>1708-945X</eissn><abstract>When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the marginal association between the selected covariates and the response. If an independent data set were available, the parameters of interest could be estimated by using standard inference methods to fit the postulated marginal model to the independent data set. However, when applied to the same data set used by the variable selector, standard ("naive") methods can lead to distorted inferences. The authors develop testing and interval estimation methods for parameters reflecting the marginal association between the selected covariates and response variable, based on the same data set used for variable selection. They provide theoretical justification for the proposed methods, present results to guide their implementation, and use simulations to assess and compare their performance to a sample-splitting approach. The methods are illustrated with data from a recent AIDS study. 
Lorsque le statisticien doit choisir entre plusieurs covariables et qu'il n'a qu'une seule variable réponse, il doit souvent appliquer un algorithme de sélection de variables aux jeux de données afin d'identifier un sous-ensemble de covariables potentiellement associées à la variable réponse. Par la suite, il peut faire l'inférence sur les paramètres d'un modèle de l'association marginale entre les covariables choisies et la variable réponse. Si un autre jeu de données indépendant était disponible, les paramètres d'intérêt pourraient être estimés par les méthodes d'inférence courantes pour ajuster le modèle marginal considéré à ce jeu de données. Cependant, lorsque ces méthodes d'inférence sont utilisées sur le jeu de données qui a servi à la sélection des covariables, ces méthodes "naïves" peuvent produire un biais d'estimation. Les auteurs ont développé des tests et des méthodes d'estimation par intervalle pour les paramètres tenant compte de l'association marginale entre les covariables sélectionnés et la variable réponse, en utilisant les mêmes données qui ont servi à choisir les covariables. Ils donnent une justification théorique pour les méthodes proposées, présentent des résultats pour guider leur implantation, et ils utilisent des simulations pour mesurer et comparer leur performance par rapport à l'approche consistant à diviser l'échantillon. Ces méthodes sont appliquées à des données provenant d'une récente étude sur le SIDA.</abstract><cop>Hoboken, USA</cop><pub>John Wiley & Sons, Inc</pub><pmid>20368768</pmid><doi>10.1002/cjs.10039</doi><tpages>20</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0319-5724
ispartof	Canadian journal of statistics, 2009-12, Vol.37 (4), p.625-644
issn	0319-5724 1708-945X
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_2848082
source	International Bibliography of the Social Sciences (IBSS); JSTOR Archival Journals and Primary Sources Collection; Wiley-Blackwell Read & Publish Collection
subjects	AIDS; Comparative analysis; Confidence interval; covariates; Data analysis; Datasets; Economic models; Economic statistics; Empirical research; Estimating techniques; Feature selection; Inference; Information economics; Linear models; Linear regression; Marginal analysis; Mathematical economics; Mathematical independent variables; Measurement; MSC 2000: Primary 62G09; Parametric models; Permutation tests; Probabilities; regression; Restricted permutation methods; sample splitting; secondary 62J05; Simulation; Simulation techniques; Statistical analysis; Statistical data; Statistical inference; Statistical methods; Statistical models; Statistical tables; Studies; Variable selector; Variable-selection algorithm; Variance analysis
title	Inference after variable selection using restricted permutation methods
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T15%3A22%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Inference%20after%20variable%20selection%20using%20restricted%20permutation%20methods&rft.jtitle=Canadian%20journal%20of%20statistics&rft.au=Wang,%20Rui&rft.date=2009-12&rft.volume=37&rft.issue=4&rft.spage=625&rft.epage=644&rft.pages=625-644&rft.issn=0319-5724&rft.eissn=1708-945X&rft_id=info:doi/10.1002/cjs.10039&rft_dat=%3Cjstor_pubme%3E25653502%3C/jstor_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c5989-d57feff896b9a4aed1d08143bf22e213750134ec09aabad826a12cd47d49867e3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=200234397&rft_id=info:pmid/20368768&rft_jstor_id=25653502&rfr_iscdi=true