Generic Feature Selection with Short Fat Data

Bibliographic Details
Published in: Journal of the Indian Society of Agricultural Statistics 2014, Vol. 68 (2), p. 145-162
Main Authors: Clarke, B, Chu, J-H
Format: Article
Language: English
Description
Summary: Consider a regression problem in which there are many more explanatory variables than data points, i.e., p ≫ n. Essentially, without reducing the number of variables, inference is impossible. So, we group the explanatory variables into blocks by clustering, evaluate statistics on the blocks, and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of n, p, classes of statistics, clustering algorithms, penalty terms, and data types. When is not large, the discrimination over number of statistics is weak, but computations suggest regressing on approximately [ / ] statistics where is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an norm with high enough .
ISSN:0019-6363
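
The pipeline outlined in the summary (cluster the variables into blocks, compute a statistic per block, regress the response on those statistics under a penalty) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: plain k-means as the clustering algorithm, the block mean as the statistic, a ridge (squared L2) penalty, the simulated data, and the hypothetical names `kmeans_columns`, `block_means`, `ridge`, `n`, `p`, `k`, and `lam`.

```python
import random


def kmeans_columns(cols, k, iters=20, seed=0):
    """Cluster variables (columns) into k blocks with plain k-means (assumed choice)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(cols, k)]
    labels = [0] * len(cols)
    for _ in range(iters):
        # Assign each variable to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(c, centers[j])))
            for c in cols
        ]
        # Recompute each center as the mean of its member variables.
        for j in range(k):
            members = [c for c, lab in zip(cols, labels) if lab == j]
            if members:
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels


def block_means(cols, labels, k):
    """One statistic per non-empty block: the mean of its variables, per observation."""
    stats = []
    for j in range(k):
        members = [c for c, lab in zip(cols, labels) if lab == j]
        if members:
            stats.append([sum(v) / len(members) for v in zip(*members)])
    # Transpose to n rows (observations) by m columns (block statistics).
    return [list(row) for row in zip(*stats)]


def ridge(Z, y, lam):
    """Solve (Z'Z + lam*I) b = Z'y; the ridge penalty is an assumed choice."""
    m = len(Z[0])
    # Augmented matrix [Z'Z + lam*I | Z'y].
    A = [
        [sum(r[a] * r[b] for r in Z) + (lam if a == b else 0.0) for b in range(m)]
        + [sum(r[a] * yi for r, yi in zip(Z, y))]
        for a in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * q for x, q in zip(A[r], A[col])]
    return [A[i][m] / A[i][i] for i in range(m)]


# Simulated short, fat data (illustrative values only): n observations, p variables.
rng = random.Random(1)
n, p, k = 20, 60, 5
cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]  # cols[v][i]
y = [cols[0][i] + cols[1][i] + rng.gauss(0, 0.1) for i in range(n)]

labels = kmeans_columns(cols, k)   # block membership of each variable
Z = block_means(cols, labels, k)   # n rows, one column per non-empty block
beta = ridge(Z, y, lam=1.0)        # penalized coefficient estimates
```

Swapping in a different clustering algorithm, block statistic, or penalty only requires replacing the corresponding function, which mirrors the range of choices the summary says were compared.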