Synthetic sampling from small datasets: A modified mega-trend diffusion approach using k-nearest neighbors

Bibliographic Details
Published in: Knowledge-based systems 2022-01, Vol.236, p.107687, Article 107687
Main Authors: Sivakumar, Jayanth; Ramamurthy, Karthik; Radhakrishnan, Menaka; Won, Daehan
Format: Article
Language: English
Description
Summary: Data generation techniques have been one of the emerging trends in machine learning over the last decade. Despite the wide availability of data, small datasets remain a problem for decision-making. Synthetic data generation is a promising remedy for the small-dataset problem. In addition, previous methodologies address data generation for only one kind of task, either supervised or unsupervised. This research proposes a modified Mega-Trend Diffusion (MTD) approach, k-Nearest Neighbor Mega-Trend Diffusion (kNNMTD), to address these challenges. The method identifies the closest subsamples using the k-Nearest Neighbors (kNN) algorithm and applies MTD to the subsample neighbors to estimate the domain ranges. The proposed methodology can generate data for any data-driven task. kNNMTD is compared with baseline MTD, CTGAN, and the synthetic minority oversampling technique (SMOTE) for classification tasks, and with SMOTE for regression (SmoteR) for regression tasks. The method is validated on some benchmark datasets, on simulated datasets, and in a case study. Pairwise correlation difference (PCD) is used to measure the similarity between real and synthetic datasets. kNNMTD outperforms baseline MTD and CTGAN on all the datasets, and the improvement is statistically significant. On some of the benchmark datasets, kNNMTD also yields low average PCD values and statistically significant differences against SMOTE and SmoteR. In the case study, kNNMTD generates data with the lowest PCD values among the compared methods for both classification (1.2077) and ordinal regression (1.6017) tasks.
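The neighbor-then-diffuse idea described in the summary can be illustrated with a loose sketch. This is not the authors' algorithm: the record does not give the details, so the code below assumes the commonly cited MTD bound formula (center of the range, skewness weights from the counts below and above the center, and a diffusion term based on ln(10^-20)), assumes Euclidean distance for the neighbor search, and assumes uniform sampling inside the estimated range. The names `mtd_bounds` and `knn_mtd_sample` and the parameters `k`, `n_per_point`, and `seed` are hypothetical.

```python
import numpy as np

def mtd_bounds(values, eps=1e-20):
    """Estimate a diffused range [lower, upper] for one attribute (assumed MTD form)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if lo == hi:                      # constant attribute: nothing to diffuse
        return lo, hi
    center = (lo + hi) / 2.0
    n_low = np.sum(values < center)   # at least 1, because lo < center
    n_up = np.sum(values > center)    # at least 1, because hi > center
    total = n_low + n_up
    spread = -2.0 * values.var(ddof=1) * np.log(eps)
    lower = center - (n_low / total) * np.sqrt(spread / n_low)
    upper = center + (n_up / total) * np.sqrt(spread / n_up)
    # Never report a range narrower than the observed one.
    return min(lower, lo), max(upper, hi)

def knn_mtd_sample(X, k=3, n_per_point=2, seed=0):
    """For each row, diffuse the ranges of its k nearest rows and sample inside them."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    synthetic = []
    for row in X:
        dist = np.linalg.norm(X - row, axis=1)
        neighbors = X[np.argsort(dist)[:k]]   # the row itself is its own nearest neighbor
        bounds = [mtd_bounds(neighbors[:, j]) for j in range(X.shape[1])]
        lows = np.array([b[0] for b in bounds])
        highs = np.array([b[1] for b in bounds])
        # Assumption: uniform draws within the estimated domain range.
        synthetic.append(rng.uniform(lows, highs, size=(n_per_point, X.shape[1])))
    return np.vstack(synthetic)

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [5.5, 8.5]])
S = knn_mtd_sample(X, k=3, n_per_point=2)
print(S.shape)
```

Restricting the range estimate to a small neighborhood keeps each synthetic point close to a local cluster of real points, which is one plausible reading of why a neighbor-based variant would preserve attribute relations better than applying MTD to the whole dataset at once.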
• Tackle small dataset challenges for both supervised and unsupervised learning tasks using synthetic data generation.
• A nearest neighbor-based megatrend diffusion is proposed in this research.
• The proposed method generates synthetic data for both supervised and unsupervised learning tasks.
• Focuses on retaining the attribute relations similar to the original dataset, reducing the information gap.
• The training data is used only to identify the domain ranges which in turn improves the privacy of any sensitive data.
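The summary evaluates how well attribute relations are retained using the pairwise correlation difference (PCD). The record does not define the metric, so the sketch below assumes the usual definition from the synthetic-data literature: the Frobenius norm of the difference between the Pearson correlation matrices of the real and synthetic datasets. The function name `pairwise_correlation_difference` is hypothetical, and the data are random placeholders rather than results from the paper.

```python
import numpy as np

def pairwise_correlation_difference(real, synthetic):
    """Frobenius norm of the gap between two correlation matrices (assumed PCD form)."""
    corr_real = np.corrcoef(np.asarray(real, dtype=float), rowvar=False)
    corr_syn = np.corrcoef(np.asarray(synthetic, dtype=float), rowvar=False)
    return float(np.linalg.norm(corr_real - corr_syn, ord="fro"))

rng = np.random.default_rng(0)
real = rng.normal(size=(50, 3))
real[:, 2] = real[:, 0] + 0.1 * rng.normal(size=50)   # make two columns strongly related
independent = rng.normal(size=(50, 3))                # placeholder "synthetic" data

pcd_self = pairwise_correlation_difference(real, real)
pcd_other = pairwise_correlation_difference(real, independent)
print(pcd_self, pcd_other)
```

Under this definition a lower value means the synthetic data reproduces the real correlations more closely, and identical datasets score zero. Columns with zero variance make the correlation undefined, so they would need to be dropped or handled before calling the function.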
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2021.107687