Validation of classification models in cancer studies using simulated spectral data – A “sandbox” concept
Published in: | Chemometrics and intelligent laboratory systems 2022-06, Vol.225, p.104564, Article 104564 |
---|---|
Format: | Article |
Language: | English |
Summary: | Spectroscopy has become a popular method in research devoted to cancer diagnostics, therapy, and surgery – anywhere we need to detect tumor cells surrounded by non-cancerous ones. Usually, chemometric methods are applied to classify cancerous and non-cancerous sites, so proper validation of the classification models is required to ensure the reliability of the results. In this study, we suggest using real data to simulate spectral sets with varying characteristics (size, distribution of classes) – an analog of the “sandbox” used in software development – and validating the models under different conditions. Near-infrared spectra (939–1796 nm) measured from breast tumors and healthy tissues of laboratory mice (152 spectra) were used to simulate spectral data sets of different sizes (50, 100, 150 spectra). We proposed a simple simulation method based on a singular value decomposition of the real spectral dataset and rearrangement of the calculated residuals. Several algorithms for selecting training and test sets were applied to the simulated data (Kennard-Stone, DUPLEX, random, Monte-Carlo cross-validation), and the corresponding Support Vector Machine classification models were trained, optimized, and validated using a series of test sets with varying “healthy:tumor” class distributions (1:1, 3:1, 1:3) and sizes (10%, 30%, and 50% of the training data set). The performance of the classification models, expressed as accuracy, sensitivity, and selectivity, was compared, and a validation strategy was proposed.
• A “sandbox” concept was proposed for validation of classification models. • Using simulated spectra, it is possible to validate a model under various conditions. • Different factors can be included in the “sandbox”, e.g. the distribution of classes. |
---|---|
ISSN: | 0169-7439 1873-3239 |
DOI: | 10.1016/j.chemolab.2022.104564 |
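
The summary describes the simulation method only as a singular value decomposition of the real spectra followed by rearrangement of the residuals. One possible reading of that idea is sketched below in Python with NumPy. It is not the authors' implementation: the function name `simulate_spectra`, the number of retained components, the bootstrap-style row sampling, and the random placeholder matrix standing in for the 152 measured spectra are all assumptions made for illustration.

```python
import numpy as np


def simulate_spectra(X, n_components, n_new, rng):
    """Hypothetical sketch: simulate n_new spectra from real spectra X.

    X has one spectrum per row. The structured part comes from a
    truncated SVD; the residuals are reassigned between spectra at random.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                # mean-center the spectra
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    k = n_components
    scores = U[:, :k] * s[:k]                    # (n_spectra, k)
    loadings = Vt[:k]                            # (k, n_wavelengths)
    residuals = Xc - scores @ loadings           # part not explained by k components

    # Assumption: draw source spectra with replacement for the structured part.
    src = rng.integers(0, X.shape[0], size=n_new)
    # Assumption: "rearrangement" means pairing each structured part with
    # the residual of a randomly chosen spectrum.
    res = rng.integers(0, X.shape[0], size=n_new)

    X_new = mean + scores[src] @ loadings + residuals[res]
    return X_new, src                            # src lets the caller carry over class labels


rng = np.random.default_rng(0)
# Placeholder data only: 152 random smooth curves on a 939-1796 nm grid.
wavelengths = np.linspace(939, 1796, 100)
X_real = rng.normal(size=(152, wavelengths.size)).cumsum(axis=1)

X_sim, src = simulate_spectra(X_real, n_components=5, n_new=50, rng=rng)
print(X_sim.shape)
```

Because `src` records which real spectrum supplied the structured part of each simulated one, class labels (healthy or tumor) could be carried over by indexing the real label vector with it; simulating each class separately would be an equally plausible design.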