
Data Twinning

Bibliographic Details
Published in: Statistical analysis and data mining 2022-10, Vol. 15 (5), p. 598-610
Main Authors: Vakayil, Akhil; Joseph, V. Roshan
Format: Article
Language: English
Description
Summary: In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.
ISSN: 1932-1864; 1932-1872
DOI: 10.1002/sam.11574