Validation of classification models in cancer studies using simulated spectral data – A “sandbox” concept

Bibliographic Details
Published in:	Chemometrics and Intelligent Laboratory Systems, 2022-06, Vol. 225, Article 104564
Main Authors:	Boichenko, Ekaterina; Panchenko, Andrey; Tyndyk, Margarita; Maydin, Mikhail; Kruglov, Stepan; Artyushenko, Viacheslav; Kirsanov, Dmitry
Format:	Article
Language:	English
Subjects:	Cancer diagnostics; Classification model; Machine learning; Near-infrared spectroscopy; Validation

Description
Summary:	Spectroscopy has become a popular method in research on cancer diagnostics, therapy, and surgery, wherever tumor cells must be detected among non-cancerous ones. Chemometric methods are usually applied to classify cancerous and non-cancerous sites, so classification models must be properly validated to ensure that the results are reliable. In this study, we suggest using real data to simulate spectral sets with varying characteristics (size, class distribution), an analog of the “sandbox” used in software development, and validating the models under different conditions. Near-infrared spectra (939–1796 nm) measured from breast tumors and healthy tissues of laboratory mice (152 spectra) were used to simulate spectral data sets of different sizes (50, 100, and 150 spectra). We propose a simple simulation method based on a singular value decomposition of the real spectral data set and rearrangement of the calculated residuals. Several algorithms for training and test set selection (Kennard-Stone, DUPLEX, random selection, Monte Carlo cross-validation) were applied to the simulated data, and the corresponding Support Vector Machine classification models were trained, optimized, and validated using a series of test sets with varying “healthy:tumor” class distributions (1:1, 3:1, 1:3) and sizes (10%, 30%, and 50% of the training data set). The performance of the classification models, expressed as accuracy, sensitivity, and selectivity, was compared, and a validation strategy was proposed.
• A “sandbox” concept was proposed for validation of classification models.
• Using simulated spectra, it is possible to validate a model under various conditions.
• Different factors can be included in the “sandbox”, e.g. the distribution of classes.
ISSN:	0169-7439 1873-3239
DOI:	10.1016/j.chemolab.2022.104564
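
The simulation idea described in the summary (a singular value decomposition of the real spectra followed by rearrangement of the residuals) might look roughly like the sketch below. This is not the authors' implementation: the record gives no algorithmic details, so the function name `simulate_spectra`, the number of retained components, the mean-centering step, and the resampling of residual rows with replacement are all assumptions made for illustration. The toy input is random noise standing in for the 152 measured spectra.

```python
import numpy as np


def simulate_spectra(X, y, n_sim, n_components=3, seed=0):
    """Hypothetical sketch: low-rank SVD model plus rearranged residuals.

    X : (n_samples, n_wavelengths) array of real spectra
    y : (n_samples,) array of class labels
    Returns n_sim simulated spectra and their labels.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Assumption: center the spectra before decomposition.
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

    # Structured part: reconstruction from the first k singular components.
    k = n_components
    structured = (U[:, :k] * s[:k]) @ Vt[:k] + mean

    # Residual part: whatever the low-rank model does not explain.
    residuals = X - structured

    # Assumption: "rearrangement" is modeled as pairing each sampled
    # structured spectrum with a residual row drawn from another sample.
    rows = rng.choice(len(X), size=n_sim, replace=True)
    res_rows = rng.choice(len(X), size=n_sim, replace=True)

    X_sim = structured[rows] + residuals[res_rows]
    # Labels follow the structured part of each simulated spectrum.
    return X_sim, y[rows]


# Toy stand-in for the real data set: 152 "spectra" with 40 channels.
toy_rng = np.random.default_rng(42)
X_real = toy_rng.normal(size=(152, 40))
y_real = np.repeat([0, 1], 76)  # 0 = healthy, 1 = tumor (illustrative)

X_sim, y_sim = simulate_spectra(X_real, y_real, n_sim=100, seed=1)
print(X_sim.shape, y_sim.shape)
```

Calling the function with `n_sim` set to 50, 100, or 150 would give simulated sets of the sizes mentioned in the summary. Keeping the label of the structured part is one possible choice; drawing residuals only from the same class would be an equally plausible variant.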