Loading…

Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms

Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optim...

Full description

Saved in:
Bibliographic Details
Main Authors: Kuzin, Yakov, Shcheka, Dmitriy, Polyntsov, Michael, Stupakov, Kirill, Firsov, Mikhail, Chernishev, George
Format: Conference Proceeding
Language:English
Subjects:
Online Access:Request full text
Tags: Add Tag
No Tags, Be the first to tag this record!
cited_by
cites
container_end_page 424
container_issue 1
container_start_page 413
container_title
container_volume 35
creator Kuzin, Yakov
Shcheka, Dmitriy
Polyntsov, Michael
Stupakov, Kirill
Firsov, Mikhail
Chernishev, George
description Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optimization, data cleaning and deduplication, anomaly detection, and much more. Existing discovery methods have approached this problem solely from the algorithmic standpoint, without focusing on the implementation side. At the same time, this problem is very computationally intensive, and therefore this part should not be ignored, as it brings ODs closer to industrial use. In this paper, we study two algorithms for OD discovery which target different OD axiomatizations - FASTOD and ORDER. We start by reimplementing these algorithms in C++ in order to speed them up and lower their memory consumption. We then analyze their bottlenecks and propose several techniques which improve their performance even further. To perform evaluation, we have implemented these algorithms inside Desbordante - a science-intensive, high-performance, and open-source data profiling tool developed in C++. Experiments have demonstrated a performance improvement of up to 3x obtained by reimplemented versions, and, with the application of our techniques, up to 10x. Memory consumption has been lowered by up to 2.9x.
doi_str_mv 10.23919/FRUCT61870.2024.10516381
format conference_proceeding
fullrecord <record><control><sourceid>doaj_CHZPO</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10516381</ieee_id><doaj_id>oai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93</doaj_id><sourcerecordid>oai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93</sourcerecordid><originalsourceid>FETCH-LOGICAL-d157t-c46359a403de6a61aaae684f479e075e7dd1ee4f3faa5b639d70e0c59005554d3</originalsourceid><addsrcrecordid>eNpN0MtOAjEYBeBqNJEgb-CiPgDYTm9Td4SLkpCQGFhPftq_UAJT7IwmvL1E1Lg6J2fxLQ4hj5wNCmG5fZq-rUZLzUtzHlghB5wprkXJr0jPmtKqQqtCas6uSacQTPVNoeTNv35Hek2zY4wVpdLWmg4Ji-wx01jTMTbrlD3ULT7TJbptHd8_sKEhZToJIbqIdUtnh-MeD-cGbUw1TYFegDEesfZYuxMdx8alT8wnOtxvUo7t9tDck9sA-wZ7P9klq-lkOXrtzxcvs9Fw3vdcmbbvpBbKgmTCowbNAQB1KYM0FplRaLzniDKIAKDWWlhvGDKnLGNKKelFl8wurk-wq445HiCfqgSx-h5S3lSQ2-j2WLHgSwBhzudoCaW3sOYhOFYGWDtnxdl6uFgREf-s38vFF4EEdeQ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms</title><source>IEEE Xplore All Conference Series</source><creator>Kuzin, Yakov ; Shcheka, Dmitriy ; Polyntsov, Michael ; Stupakov, Kirill ; Firsov, Mikhail ; Chernishev, George</creator><creatorcontrib>Kuzin, Yakov ; Shcheka, Dmitriy ; Polyntsov, Michael ; Stupakov, Kirill ; Firsov, Mikhail ; Chernishev, George</creatorcontrib><description>Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optimization, data cleaning and deduplication, anomaly detection, and much more. Existing discovery methods have approached this problem solely from the algorithmic standpoint, without focusing on the implementation side. At the same time, this problem is very computationally intensive, and therefore this part should not be ignored, as it brings ODs closer to industrial use. In this paper, we study two algorithms for OD discovery which target different OD axiomatizations - FASTOD and ORDER. We start by reimplementing these algorithms in C++ in order to speed them up and lower their memory consumption. We then analyze their bottlenecks and propose several techniques which improve their performance even further. To perform evaluation, we have implemented these algorithms inside Desbordante - a science-intensive, high-performance, and open-source data profiling tool developed in C++. Experiments have demonstrated a performance improvement of up to 3x obtained by reimplemented versions, and, with the application of our techniques, up to 10x. Memory consumption has been lowered by up to 2.9x.</description><identifier>ISSN: 2305-7254</identifier><identifier>EISSN: 2305-7254</identifier><identifier>EISSN: 2343-0737</identifier><identifier>EISBN: 9789526524610</identifier><identifier>EISBN: 9526524616</identifier><identifier>DOI: 10.23919/FRUCT61870.2024.10516381</identifier><language>eng</language><publisher>FRUCT</publisher><subject>C++ languages ; Cleaning ; data profiling ; fastod ; Focusing ; Memory management ; order ; order dependency ; Query processing ; Technological innovation</subject><ispartof>2024 35th Conference of Open Innovations Association (FRUCT), 2024, Vol.35 (1), p.413-424</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10516381$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,314,780,784,789,790,27924,27925,54555,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10516381$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Kuzin, Yakov</creatorcontrib><creatorcontrib>Shcheka, Dmitriy</creatorcontrib><creatorcontrib>Polyntsov, Michael</creatorcontrib><creatorcontrib>Stupakov, Kirill</creatorcontrib><creatorcontrib>Firsov, Mikhail</creatorcontrib><creatorcontrib>Chernishev, George</creatorcontrib><title>Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms</title><title>2024 35th Conference of Open Innovations Association (FRUCT)</title><addtitle>FRUCT</addtitle><description>Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optimization, data cleaning and deduplication, anomaly detection, and much more. Existing discovery methods have approached this problem solely from the algorithmic standpoint, without focusing on the implementation side. At the same time, this problem is very computationally intensive, and therefore this part should not be ignored, as it brings ODs closer to industrial use. In this paper, we study two algorithms for OD discovery which target different OD axiomatizations - FASTOD and ORDER. We start by reimplementing these algorithms in C++ in order to speed them up and lower their memory consumption. We then analyze their bottlenecks and propose several techniques which improve their performance even further. To perform evaluation, we have implemented these algorithms inside Desbordante - a science-intensive, high-performance, and open-source data profiling tool developed in C++. Experiments have demonstrated a performance improvement of up to 3x obtained by reimplemented versions, and, with the application of our techniques, up to 10x. Memory consumption has been lowered by up to 2.9x.</description><subject>C++ languages</subject><subject>Cleaning</subject><subject>data profiling</subject><subject>fastod</subject><subject>Focusing</subject><subject>Memory management</subject><subject>order</subject><subject>order dependency</subject><subject>Query processing</subject><subject>Technological innovation</subject><issn>2305-7254</issn><issn>2305-7254</issn><issn>2343-0737</issn><isbn>9789526524610</isbn><isbn>9526524616</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2024</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>DOA</sourceid><recordid>eNpN0MtOAjEYBeBqNJEgb-CiPgDYTm9Td4SLkpCQGFhPftq_UAJT7IwmvL1E1Lg6J2fxLQ4hj5wNCmG5fZq-rUZLzUtzHlghB5wprkXJr0jPmtKqQqtCas6uSacQTPVNoeTNv35Hek2zY4wVpdLWmg4Ji-wx01jTMTbrlD3ULT7TJbptHd8_sKEhZToJIbqIdUtnh-MeD-cGbUw1TYFegDEesfZYuxMdx8alT8wnOtxvUo7t9tDck9sA-wZ7P9klq-lkOXrtzxcvs9Fw3vdcmbbvpBbKgmTCowbNAQB1KYM0FplRaLzniDKIAKDWWlhvGDKnLGNKKelFl8wurk-wq445HiCfqgSx-h5S3lSQ2-j2WLHgSwBhzudoCaW3sOYhOFYGWDtnxdl6uFgREf-s38vFF4EEdeQ</recordid><startdate>20240424</startdate><enddate>20240424</enddate><creator>Kuzin, Yakov</creator><creator>Shcheka, Dmitriy</creator><creator>Polyntsov, Michael</creator><creator>Stupakov, Kirill</creator><creator>Firsov, Mikhail</creator><creator>Chernishev, George</creator><general>FRUCT</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope><scope>DOA</scope></search><sort><creationdate>20240424</creationdate><title>Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms</title><author>Kuzin, Yakov ; Shcheka, Dmitriy ; Polyntsov, Michael ; Stupakov, Kirill ; Firsov, Mikhail ; Chernishev, George</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-d157t-c46359a403de6a61aaae684f479e075e7dd1ee4f3faa5b639d70e0c59005554d3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2024</creationdate><topic>C++ languages</topic><topic>Cleaning</topic><topic>data profiling</topic><topic>fastod</topic><topic>Focusing</topic><topic>Memory management</topic><topic>order</topic><topic>order dependency</topic><topic>Query processing</topic><topic>Technological innovation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kuzin, Yakov</creatorcontrib><creatorcontrib>Shcheka, Dmitriy</creatorcontrib><creatorcontrib>Polyntsov, Michael</creatorcontrib><creatorcontrib>Stupakov, Kirill</creatorcontrib><creatorcontrib>Firsov, Mikhail</creatorcontrib><creatorcontrib>Chernishev, George</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Xplore</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection><collection>Directory of Open Access Journals</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Kuzin, Yakov</au><au>Shcheka, Dmitriy</au><au>Polyntsov, Michael</au><au>Stupakov, Kirill</au><au>Firsov, Mikhail</au><au>Chernishev, George</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms</atitle><btitle>2024 35th Conference of Open Innovations Association (FRUCT)</btitle><stitle>FRUCT</stitle><date>2024-04-24</date><risdate>2024</risdate><volume>35</volume><issue>1</issue><spage>413</spage><epage>424</epage><pages>413-424</pages><issn>2305-7254</issn><eissn>2305-7254</eissn><eissn>2343-0737</eissn><eisbn>9789526524610</eisbn><eisbn>9526524616</eisbn><abstract>Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optimization, data cleaning and deduplication, anomaly detection, and much more. Existing discovery methods have approached this problem solely from the algorithmic standpoint, without focusing on the implementation side. At the same time, this problem is very computationally intensive, and therefore this part should not be ignored, as it brings ODs closer to industrial use. In this paper, we study two algorithms for OD discovery which target different OD axiomatizations - FASTOD and ORDER. We start by reimplementing these algorithms in C++ in order to speed them up and lower their memory consumption. We then analyze their bottlenecks and propose several techniques which improve their performance even further. To perform evaluation, we have implemented these algorithms inside Desbordante - a science-intensive, high-performance, and open-source data profiling tool developed in C++. Experiments have demonstrated a performance improvement of up to 3x obtained by reimplemented versions, and, with the application of our techniques, up to 10x. Memory consumption has been lowered by up to 2.9x.</abstract><pub>FRUCT</pub><doi>10.23919/FRUCT61870.2024.10516381</doi><tpages>12</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 2305-7254
ispartof 2024 35th Conference of Open Innovations Association (FRUCT), 2024, Vol.35 (1), p.413-424
issn 2305-7254
2305-7254
2343-0737
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93
source IEEE Xplore All Conference Series
subjects C++ languages
Cleaning
data profiling
fastod
Focusing
Memory management
order
order dependency
Query processing
Technological innovation
title Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T14%3A03%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-doaj_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Order%20in%20Desbordante:%20Techniques%20for%20Efficient%20Implementation%20of%20Order%20Dependency%20Discovery%20Algorithms&rft.btitle=2024%2035th%20Conference%20of%20Open%20Innovations%20Association%20(FRUCT)&rft.au=Kuzin,%20Yakov&rft.date=2024-04-24&rft.volume=35&rft.issue=1&rft.spage=413&rft.epage=424&rft.pages=413-424&rft.issn=2305-7254&rft.eissn=2305-7254&rft_id=info:doi/10.23919/FRUCT61870.2024.10516381&rft.eisbn=9789526524610&rft.eisbn_list=9526524616&rft_dat=%3Cdoaj_CHZPO%3Eoai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93%3C/doaj_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-d157t-c46359a403de6a61aaae684f479e075e7dd1ee4f3faa5b639d70e0c59005554d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10516381&rfr_iscdi=true