Data Twinning

In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the...

Bibliographic Details
Published in: Statistical analysis and data mining, 2022-10, Vol.15 (5), p.598-610
Main Authors: Vakayil, Akhil; Joseph, V. Roshan
Format: Article
Language: English
container_end_page 610
container_issue 5
container_start_page 598
container_title Statistical analysis and data mining
container_volume 15
creator Vakayil, Akhil
Joseph, V. Roshan
description In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.
doi_str_mv 10.1002/sam.11574
format article
eissn 1932-1872
oa free_for_read
orcidid https://orcid.org/0000-0002-9430-5301
https://orcid.org/0000-0003-3684-6617
publisher Hoboken: Wiley Subscription Services, Inc., A Wiley Company
rights 2022 The Authors. published by Wiley Periodicals LLC.
2022. This article is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
toplevel peer_reviewed
online_resources
tpages 13
fulltext fulltext
identifier ISSN: 1932-1864
ispartof Statistical analysis and data mining, 2022-10, Vol.15 (5), p.598-610
issn 1932-1864
1932-1872
language eng
recordid cdi_proquest_journals_2709298098
source Wiley-Blackwell Read & Publish Collection
subjects Algorithms
Big Data
Data compression
data splitting
Datasets
testing
training
validation
title Data Twinning
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T13%3A38%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Data%20Twinning&rft.jtitle=Statistical%20analysis%20and%20data%20mining&rft.au=Vakayil,%20Akhil&rft.date=2022-10&rft.volume=15&rft.issue=5&rft.spage=598&rft.epage=610&rft.pages=598-610&rft.issn=1932-1864&rft.eissn=1932-1872&rft_id=info:doi/10.1002/sam.11574&rft_dat=%3Cproquest_cross%3E2709298098%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c3324-af5f1d6b8992218c78929608c10d1ffb522b89854e88cb7e0bb2b5c8cb354e5d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2709298098&rft_id=info:pmid/&rfr_iscdi=true