An Improved Approximation to the Estimation of the Critical F Values in Best Subset Regression

Bibliographic Details
Published in: Journal of Chemical Information and Modeling, 2007-01, Vol. 47 (1), p. 143-149
Main Authors: Salt, David W, Ajmani, Subhash, Crichton, Ray, Livingstone, David J
Format: Article
Language: English
Description
Summary: Variable selection methods are routinely applied in regression modeling to identify a small number of descriptors which “best” explain the variation in the response variable. Most statistical packages that perform regression have some form of stepping algorithm that can be used in this identification process. Unfortunately, when a subset of p variables measured on a sample of n objects is selected from a set of k (>p) to maximize the squared sample multiple regression coefficient, the significance of the resulting regression is upwardly biased. The extent of this bias is investigated by using Monte Carlo simulation and is presented as an inflation factor which, when multiplied by the usual tabulated F ratio, gives an estimate of the true 5% critical value. The results show that selection bias can be very high even for moderate-size data sets. Selecting three variables from 50 generated at random with 20 observations will almost certainly provide a significant result if the usual tabulated F values are used. An interpolation formula is provided for the calculation of the inflation factor for different combinations of (n, p, k). Four real data sets are examined to illustrate the effect of correlated descriptor variables on the degree of inflation.
ISSN: 1549-9596, 1549-960X
DOI: 10.1021/ci060113n
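
The Monte Carlo idea described in the summary can be illustrated with a minimal sketch. This is not the authors' code: the record does not give their simulation design, so independent standard normal descriptors and response, an exhaustive search over all p-variable subsets, and the function names `best_subset_f` and `inflation_factor` are all assumptions made for illustration. The tabulated 5% F value is passed in by the caller (3.59 for 2 and 17 degrees of freedom in the example), and the inflation factor is taken as the simulated 95th percentile of the best-subset F ratio divided by that tabulated value.

```python
import itertools
import numpy as np


def best_subset_f(X, y, p):
    """Largest overall F ratio over all p-column subsets of X (exhaustive search)."""
    n, k = X.shape
    centered = y - y.mean()
    total_ss = centered @ centered
    best_r2 = 0.0
    for cols in itertools.combinations(range(k), p):
        # Least-squares fit with an intercept on the chosen columns.
        design = np.column_stack([np.ones(n), X[:, list(cols)]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        r2 = 1.0 - (resid @ resid) / total_ss
        best_r2 = max(best_r2, r2)
    # Usual regression F ratio on (p, n - p - 1) degrees of freedom.
    return (best_r2 / p) / ((1.0 - best_r2) / (n - p - 1))


def inflation_factor(n, p, k, tabulated_f, n_sim=200, seed=0):
    """Simulated 5% critical value of the best-subset F divided by the tabulated F.

    Assumption: descriptors and response are independent N(0, 1) noise, so any
    apparent fit is due to selection alone.
    """
    rng = np.random.default_rng(seed)
    f_values = [
        best_subset_f(rng.standard_normal((n, k)), rng.standard_normal(n), p)
        for _ in range(n_sim)
    ]
    simulated_critical = np.quantile(f_values, 0.95)
    return simulated_critical / tabulated_f


if __name__ == "__main__":
    # Small illustrative case: 2 variables chosen from 8, 20 observations.
    # 3.59 is the tabulated 5% point of F with (2, 17) degrees of freedom.
    print(inflation_factor(n=20, p=2, k=8, tabulated_f=3.59))
```

A factor above 1 indicates that the tabulated F value is too lenient once selection is taken into account. The exhaustive search grows combinatorially with k, so a case such as three variables from 50 would need far more computation than this sketch is meant for.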