Data Twinning
In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the...
Published in: | Statistical analysis and data mining 2022-10, Vol.15 (5), p.598-610 |
---|---|
Main Authors: | Vakayil, Akhil ; Joseph, V. Roshan |
Format: | Article |
Language: | English |
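Subjects: | Algorithms ; Big Data ; Data compression ; data splitting ; Datasets ; testing ; training ; validation |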
cited_by | cdi_FETCH-LOGICAL-c3324-af5f1d6b8992218c78929608c10d1ffb522b89854e88cb7e0bb2b5c8cb354e5d3 |
---|---|
cites | cdi_FETCH-LOGICAL-c3324-af5f1d6b8992218c78929608c10d1ffb522b89854e88cb7e0bb2b5c8cb354e5d3 |
container_end_page | 610 |
container_issue | 5 |
container_start_page | 598 |
container_title | Statistical analysis and data mining |
container_volume | 15 |
creator | Vakayil, Akhil ; Joseph, V. Roshan |
description | In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation. |
doi_str_mv | 10.1002/sam.11574 |
format | article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2709298098</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2709298098</sourcerecordid><originalsourceid>FETCH-LOGICAL-c3324-af5f1d6b8992218c78929608c10d1ffb522b89854e88cb7e0bb2b5c8cb354e5d3</originalsourceid><addsrcrecordid>eNp1j09PAjEQxRujiYgc_AYmnjwszMxud9sjwb8JxoN4btpua5bALrYQwre3uoYbp3mZ-c17eYzdIIwRgCZRr8eIvCrO2ABlThmKis6Puiwu2VWMSwBeAhYDNnrQW3272Ddt27Rf1-zC61V0o_85ZJ9Pj4vZSzZ_f36dTeeZzXMqMu25x7o0QkoiFLYSkmQJwiLU6L3hROkmeOGEsKZyYAwZbpPO047X-ZDd9b6b0H3vXNyqZbcLbYpUVEEyEyBFou57yoYuxuC82oRmrcNBIajftiq1VX9tEzvp2X2zcofToPqYvvUfP3QBUxk</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2709298098</pqid></control><display><type>article</type><title>Data Twinning</title><source>Wiley-Blackwell Read & Publish Collection</source><creator>Vakayil, Akhil ; Joseph, V. Roshan</creator><creatorcontrib>Vakayil, Akhil ; Joseph, V. Roshan</creatorcontrib><description>In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. 
Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.</description><identifier>ISSN: 1932-1864</identifier><identifier>EISSN: 1932-1872</identifier><identifier>DOI: 10.1002/sam.11574</identifier><language>eng</language><publisher>Hoboken: Wiley Subscription Services, Inc., A Wiley Company</publisher><subject>Algorithms ; Big Data ; Data compression ; data splitting ; Datasets ; testing ; training ; validation</subject><ispartof>Statistical analysis and data mining, 2022-10, Vol.15 (5), p.598-610</ispartof><rights>2022 The Authors. published by Wiley Periodicals LLC.</rights><rights>2022. This article is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c3324-af5f1d6b8992218c78929608c10d1ffb522b89854e88cb7e0bb2b5c8cb354e5d3</citedby><cites>FETCH-LOGICAL-c3324-af5f1d6b8992218c78929608c10d1ffb522b89854e88cb7e0bb2b5c8cb354e5d3</cites><orcidid>0000-0002-9430-5301 ; 0000-0003-3684-6617</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27901,27902</link.rule.ids></links><search><creatorcontrib>Vakayil, Akhil</creatorcontrib><creatorcontrib>Joseph, V. Roshan</creatorcontrib><title>Data Twinning</title><title>Statistical analysis and data mining</title><description>In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. 
Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.</description><subject>Algorithms</subject><subject>Big Data</subject><subject>Data compression</subject><subject>data splitting</subject><subject>Datasets</subject><subject>testing</subject><subject>training</subject><subject>validation</subject><issn>1932-1864</issn><issn>1932-1872</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>24P</sourceid><recordid>eNp1j09PAjEQxRujiYgc_AYmnjwszMxud9sjwb8JxoN4btpua5bALrYQwre3uoYbp3mZ-c17eYzdIIwRgCZRr8eIvCrO2ABlThmKis6Puiwu2VWMSwBeAhYDNnrQW3272Ddt27Rf1-zC61V0o_85ZJ9Pj4vZSzZ_f36dTeeZzXMqMu25x7o0QkoiFLYSkmQJwiLU6L3hROkmeOGEsKZyYAwZbpPO047X-ZDd9b6b0H3vXNyqZbcLbYpUVEEyEyBFou57yoYuxuC82oRmrcNBIajftiq1VX9tEzvp2X2zcofToPqYvvUfP3QBUxk</recordid><startdate>202210</startdate><enddate>202210</enddate><creator>Vakayil, Akhil</creator><creator>Joseph, V. Roshan</creator><general>Wiley Subscription Services, Inc., A Wiley Company</general><general>Wiley Subscription Services, Inc</general><scope>24P</scope><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0002-9430-5301</orcidid><orcidid>https://orcid.org/0000-0003-3684-6617</orcidid></search><sort><creationdate>202210</creationdate><title>Data Twinning</title><author>Vakayil, Akhil ; Joseph, V. 
Roshan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c3324-af5f1d6b8992218c78929608c10d1ffb522b89854e88cb7e0bb2b5c8cb354e5d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Big Data</topic><topic>Data compression</topic><topic>data splitting</topic><topic>Datasets</topic><topic>testing</topic><topic>training</topic><topic>validation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Vakayil, Akhil</creatorcontrib><creatorcontrib>Joseph, V. Roshan</creatorcontrib><collection>Wiley-Blackwell Open Access Collection</collection><collection>CrossRef</collection><jtitle>Statistical analysis and data mining</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Vakayil, Akhil</au><au>Joseph, V. Roshan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Data Twinning</atitle><jtitle>Statistical analysis and data mining</jtitle><date>2022-10</date><risdate>2022</risdate><volume>15</volume><issue>5</issue><spage>598</spage><epage>610</epage><pages>598-610</pages><issn>1932-1864</issn><eissn>1932-1872</eissn><abstract>In this work, we develop a method named Twinning for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model‐independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. 
Twinning can also be used for generating multiple splits of a given dataset to aid divide‐and‐conquer procedures and k‐fold cross validation.</abstract><cop>Hoboken</cop><pub>Wiley Subscription Services, Inc., A Wiley Company</pub><doi>10.1002/sam.11574</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0002-9430-5301</orcidid><orcidid>https://orcid.org/0000-0003-3684-6617</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1932-1864 |
ispartof | Statistical analysis and data mining, 2022-10, Vol.15 (5), p.598-610 |
issn | 1932-1864 ; 1932-1872 |
language | eng |
recordid | cdi_proquest_journals_2709298098 |
source | Wiley-Blackwell Read & Publish Collection |
subjects | Algorithms ; Big Data ; Data compression ; data splitting ; Datasets ; testing ; training ; validation |
title | Data Twinning |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T13%3A38%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Data%20Twinning&rft.jtitle=Statistical%20analysis%20and%20data%20mining&rft.au=Vakayil,%20Akhil&rft.date=2022-10&rft.volume=15&rft.issue=5&rft.spage=598&rft.epage=610&rft.pages=598-610&rft.issn=1932-1864&rft.eissn=1932-1872&rft_id=info:doi/10.1002/sam.11574&rft_dat=%3Cproquest_cross%3E2709298098%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c3324-af5f1d6b8992218c78929608c10d1ffb522b89854e88cb7e0bb2b5c8cb354e5d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2709298098&rft_id=info:pmid/&rfr_iscdi=true |