Number of components and prediction error in partial least squares regression determined by Monte Carlo resampling strategies
Published in: | Chemometrics and intelligent laboratory systems 2019-05, Vol.188, p.79-86 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Summary: | Using a metabolomics data set with 1057 serum samples, we designed and assessed different procedures based on Monte Carlo resampling schemes to determine the optimal number of components to include in partial least squares (PLS) regression models. Corresponding estimates of prediction error were calculated and compared in a single algorithm comprising (i) a single loop Monte Carlo approach that repeatedly and randomly splits samples into calibration and validation sets, (ii) a double loop validation that splits samples into calibration/validation and prediction sets, and (iii) independent sample sets in a third loop. To mimic the common situation in which only a moderate number of samples is available for building the model, only a fraction of the 1057 analyzed samples was randomly selected from the total sample set and used in the algorithm. The results show that if the samples available for modelling are representative of the future samples to be predicted by the model, the single loop Monte Carlo procedure consistently provides the same estimates of prediction error as the double loop resampling procedures, and in 75% of the cases these estimates are the same as for independent prediction sets. This has important implications for optimal use of a training set for component selection and estimation of prediction error. Two methods were developed and compared for selecting the optimal number of PLS components, defined as the number beyond which no statistically significant improvement in prediction error is observed when additional components are included in the model. Both methods determine a probability measure and provide similar results for model selection in this application.
• A statistical probability measure is developed for model selection in multivariate regression. • Two different procedures for component selection based on Monte Carlo resampling are developed and assessed. • Estimates of prediction error from single and double loop validation are similar. This has implications for optimal use of training sets. |
---|---|
ISSN: | 0169-7439 1873-3239 |
DOI: | 10.1016/j.chemolab.2019.03.006 |
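
The single loop Monte Carlo procedure described in the summary can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the function names (`pls1_fit`, `pls1_predict_all`, `monte_carlo_rmsep`, `select_components`, `make_demo_data`), the synthetic data, the 30% validation fraction, and in particular the selection rule, which stops adding components once an extra component fails to lower the validation error in more than 5% of the random splits. The authors' two probability measures may be defined differently.

```python
import math
import random


def pls1_fit(X, y, n_comp):
    """Fit single-response PLS by NIPALS-style deflation on centered data."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    f = [v - y_mean for v in y]
    comps = []
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # no covariance left between X and y
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def pls1_predict_all(model, x, n_comp):
    """Predictions for one sample using 1, 2, ..., n_comp components."""
    x_mean, y_mean, comps = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat, out = y_mean, []
    for w, p, q in comps:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
        out.append(y_hat)
    while len(out) < n_comp:  # pad if the fit stopped early
        out.append(y_hat)
    return out


def monte_carlo_rmsep(X, y, max_comp, n_splits=100, val_frac=0.3, seed=0):
    """table[s][a-1]: validation RMSEP of random split s with a components."""
    rng = random.Random(seed)
    n = len(X)
    n_val = max(1, int(round(val_frac * n)))
    table = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        val, cal = idx[:n_val], idx[n_val:]
        model = pls1_fit([X[i] for i in cal], [y[i] for i in cal], max_comp)
        sq = [0.0] * max_comp
        for i in val:
            preds = pls1_predict_all(model, X[i], max_comp)
            for a, y_hat in enumerate(preds):
                sq[a] += (y[i] - y_hat) ** 2
        table.append([math.sqrt(s / n_val) for s in sq])
    return table


def select_components(table, threshold=0.05):
    """Illustrative rule (an assumption): keep adding components while the
    next one lowers RMSEP in at least (1 - threshold) of the splits."""
    n_splits, max_comp = len(table), len(table[0])
    for a in range(max_comp - 1):
        p_no_gain = sum(row[a + 1] >= row[a] for row in table) / n_splits
        if p_no_gain > threshold:
            return a + 1
    return max_comp


def make_demo_data(n=60, seed=1):
    """Synthetic data with two latent factors (purely hypothetical)."""
    rng = random.Random(seed)
    L = [[1.0, 0.8, 0.6, 0.2, 0.1, 0.0], [0.0, 0.1, 0.3, 0.6, 0.8, 1.0]]
    X, y = [], []
    for _ in range(n):
        t1, t2 = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([t1 * L[0][j] + t2 * L[1][j] + rng.gauss(0, 0.05)
                  for j in range(6)])
        y.append(2.0 * t1 - 1.0 * t2 + rng.gauss(0, 0.1))
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    table = monte_carlo_rmsep(X, y, max_comp=4)
    print("selected components:", select_components(table))
```

Each row of `table` holds the validation error of one random calibration/validation split for every candidate number of components, so the per-split comparison in `select_components` works on paired errors. A double loop variant would hold out a further prediction set before this loop and evaluate the selected model on it.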