Loading…

Sample size requirements for training high-dimensional risk predictors

A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the h...

Full description

Saved in:
Bibliographic Details
Published in:Biostatistics (Oxford, England) England), 2013-09, Vol.14 (4), p.639-652
Main Authors: Dobbin, Kevin K, Song, Xiao
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.
ISSN:1465-4644
1468-4357
DOI:10.1093/biostatistics/kxt022