Loading…

Iterative random vs. Kennard-Stone sampling for IR spectrum-based classification task using PLS2-DA

External testing (ET) is preferred over auto-prediction (AP) or k-fold-cross-validation in estimating more realistic predictive ability of a statistical model. With IR spectra, Kennard-stone (KS) sampling algorithm is often used to split the data into training and test sets, i.e. respectively for mo...

Full description

Saved in:
Bibliographic Details
Main Authors: Lee, Loong Chuen, Liong, Choong-Yeun, Jemain, Abdul Aziz
Format: Conference Proceeding
Language:English
Citations: Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:External testing (ET) is preferred over auto-prediction (AP) or k-fold-cross-validation in estimating more realistic predictive ability of a statistical model. With IR spectra, Kennard-stone (KS) sampling algorithm is often used to split the data into training and test sets, i.e. respectively for model construction and for model testing. On the other hand, iterative random sampling (IRS) has not been the favored choice though it is theoretically more likely to produce reliable estimation. The aim of this preliminary work is to compare performances of KS and IRS in sampling a representative training set from an attenuated total reflectance – Fourier transform infrared spectral dataset (of four varieties of blue gel pen inks) for PLS2-DA modeling. The ‘best’ performance achievable from the dataset is estimated with AP on the full dataset (APF, error). Both IRS (n = 200) and KS were used to split the dataset in the ratio of 7:3. The classic decision rule (i.e. maximum value-based) is employed for new sample prediction via partial least squares – discriminant analysis (PLS2-DA). Error rate of each model was estimated repeatedly via: (a) AP on full data (APF, error); (b) AP on training set (APS, error); and (c) ET on the respective test set (ETS, error). A good PLS2-DA model is expected to produce APS, error and EVS, error that is similar to the APF, error. Bearing that in mind, the similarities between (a) APS, error vs. APF, error; (b) ETS, error vs. APF, error and; (c) APS, error vs. ETS, error were evaluated using correlation tests (i.e. Pearson and Spearman’s rank test), using series of PLS2-DA models computed from KS-set and IRS-set, respectively. Overall, models constructed from IRS-set exhibits more similarities between the internal and external error rates than the respective KS-set, i.e. less risk of overfitting. In conclusion, IRS is more reliable than KS in sampling representative training set.
ISSN:0094-243X
1551-7616
DOI:10.1063/1.5028031