Loading…
Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms
Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optim...
Saved in:
Main Authors: | , , , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Online Access: | Request full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
cited_by | |
---|---|
cites | |
container_end_page | 424 |
container_issue | 1 |
container_start_page | 413 |
container_title | |
container_volume | 35 |
creator | Kuzin, Yakov Shcheka, Dmitriy Polyntsov, Michael Stupakov, Kirill Firsov, Mikhail Chernishev, George |
description | Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optimization, data cleaning and deduplication, anomaly detection, and much more. Existing discovery methods have approached this problem solely from the algorithmic standpoint, without focusing on the implementation side. At the same time, this problem is very computationally intensive, and therefore this part should not be ignored, as it brings ODs closer to industrial use. In this paper, we study two algorithms for OD discovery which target different OD axiomatizations - FASTOD and ORDER. We start by reimplementing these algorithms in C++ in order to speed them up and lower their memory consumption. We then analyze their bottlenecks and propose several techniques which improve their performance even further. To perform evaluation, we have implemented these algorithms inside Desbordante - a science-intensive, high-performance, and open-source data profiling tool developed in C++. Experiments have demonstrated a performance improvement of up to 3x obtained by reimplemented versions, and, with the application of our techniques, up to 10x. Memory consumption has been lowered by up to 2.9x. |
doi_str_mv | 10.23919/FRUCT61870.2024.10516381 |
format | conference_proceeding |
fullrecord | <record><control><sourceid>doaj_CHZPO</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10516381</ieee_id><doaj_id>oai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93</doaj_id><sourcerecordid>oai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93</sourcerecordid><originalsourceid>FETCH-LOGICAL-d157t-c46359a403de6a61aaae684f479e075e7dd1ee4f3faa5b639d70e0c59005554d3</originalsourceid><addsrcrecordid>eNpN0MtOAjEYBeBqNJEgb-CiPgDYTm9Td4SLkpCQGFhPftq_UAJT7IwmvL1E1Lg6J2fxLQ4hj5wNCmG5fZq-rUZLzUtzHlghB5wprkXJr0jPmtKqQqtCas6uSacQTPVNoeTNv35Hek2zY4wVpdLWmg4Ji-wx01jTMTbrlD3ULT7TJbptHd8_sKEhZToJIbqIdUtnh-MeD-cGbUw1TYFegDEesfZYuxMdx8alT8wnOtxvUo7t9tDck9sA-wZ7P9klq-lkOXrtzxcvs9Fw3vdcmbbvpBbKgmTCowbNAQB1KYM0FplRaLzniDKIAKDWWlhvGDKnLGNKKelFl8wurk-wq445HiCfqgSx-h5S3lSQ2-j2WLHgSwBhzudoCaW3sOYhOFYGWDtnxdl6uFgREf-s38vFF4EEdeQ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms</title><source>IEEE Xplore All Conference Series</source><creator>Kuzin, Yakov ; Shcheka, Dmitriy ; Polyntsov, Michael ; Stupakov, Kirill ; Firsov, Mikhail ; Chernishev, George</creator><creatorcontrib>Kuzin, Yakov ; Shcheka, Dmitriy ; Polyntsov, Michael ; Stupakov, Kirill ; Firsov, Mikhail ; Chernishev, George</creatorcontrib><description>Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optimization, data cleaning and deduplication, anomaly detection, and much more. Existing discovery methods have approached this problem solely from the algorithmic standpoint, without focusing on the implementation side. At the same time, this problem is very computationally intensive, and therefore this part should not be ignored, as it brings ODs closer to industrial use. In this paper, we study two algorithms for OD discovery which target different OD axiomatizations - FASTOD and ORDER. We start by reimplementing these algorithms in C++ in order to speed them up and lower their memory consumption. We then analyze their bottlenecks and propose several techniques which improve their performance even further. To perform evaluation, we have implemented these algorithms inside Desbordante - a science-intensive, high-performance, and open-source data profiling tool developed in C++. Experiments have demonstrated a performance improvement of up to 3x obtained by reimplemented versions, and, with the application of our techniques, up to 10x. Memory consumption has been lowered by up to 2.9x.</description><identifier>ISSN: 2305-7254</identifier><identifier>EISSN: 2305-7254</identifier><identifier>EISSN: 2343-0737</identifier><identifier>EISBN: 9789526524610</identifier><identifier>EISBN: 9526524616</identifier><identifier>DOI: 10.23919/FRUCT61870.2024.10516381</identifier><language>eng</language><publisher>FRUCT</publisher><subject>C++ languages ; Cleaning ; data profiling ; fastod ; Focusing ; Memory management ; order ; order dependency ; Query processing ; Technological innovation</subject><ispartof>2024 35th Conference of Open Innovations Association (FRUCT), 2024, Vol.35 (1), p.413-424</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10516381$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,314,780,784,789,790,27924,27925,54555,54932</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10516381$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Kuzin, Yakov</creatorcontrib><creatorcontrib>Shcheka, Dmitriy</creatorcontrib><creatorcontrib>Polyntsov, Michael</creatorcontrib><creatorcontrib>Stupakov, Kirill</creatorcontrib><creatorcontrib>Firsov, Mikhail</creatorcontrib><creatorcontrib>Chernishev, George</creatorcontrib><title>Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms</title><title>2024 35th Conference of Open Innovations Association (FRUCT)</title><addtitle>FRUCT</addtitle><description>Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optimization, data cleaning and deduplication, anomaly detection, and much more. Existing discovery methods have approached this problem solely from the algorithmic standpoint, without focusing on the implementation side. At the same time, this problem is very computationally intensive, and therefore this part should not be ignored, as it brings ODs closer to industrial use. In this paper, we study two algorithms for OD discovery which target different OD axiomatizations - FASTOD and ORDER. We start by reimplementing these algorithms in C++ in order to speed them up and lower their memory consumption. We then analyze their bottlenecks and propose several techniques which improve their performance even further. To perform evaluation, we have implemented these algorithms inside Desbordante - a science-intensive, high-performance, and open-source data profiling tool developed in C++. Experiments have demonstrated a performance improvement of up to 3x obtained by reimplemented versions, and, with the application of our techniques, up to 10x. Memory consumption has been lowered by up to 2.9x.</description><subject>C++ languages</subject><subject>Cleaning</subject><subject>data profiling</subject><subject>fastod</subject><subject>Focusing</subject><subject>Memory management</subject><subject>order</subject><subject>order dependency</subject><subject>Query processing</subject><subject>Technological innovation</subject><issn>2305-7254</issn><issn>2305-7254</issn><issn>2343-0737</issn><isbn>9789526524610</isbn><isbn>9526524616</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2024</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>DOA</sourceid><recordid>eNpN0MtOAjEYBeBqNJEgb-CiPgDYTm9Td4SLkpCQGFhPftq_UAJT7IwmvL1E1Lg6J2fxLQ4hj5wNCmG5fZq-rUZLzUtzHlghB5wprkXJr0jPmtKqQqtCas6uSacQTPVNoeTNv35Hek2zY4wVpdLWmg4Ji-wx01jTMTbrlD3ULT7TJbptHd8_sKEhZToJIbqIdUtnh-MeD-cGbUw1TYFegDEesfZYuxMdx8alT8wnOtxvUo7t9tDck9sA-wZ7P9klq-lkOXrtzxcvs9Fw3vdcmbbvpBbKgmTCowbNAQB1KYM0FplRaLzniDKIAKDWWlhvGDKnLGNKKelFl8wurk-wq445HiCfqgSx-h5S3lSQ2-j2WLHgSwBhzudoCaW3sOYhOFYGWDtnxdl6uFgREf-s38vFF4EEdeQ</recordid><startdate>20240424</startdate><enddate>20240424</enddate><creator>Kuzin, Yakov</creator><creator>Shcheka, Dmitriy</creator><creator>Polyntsov, Michael</creator><creator>Stupakov, Kirill</creator><creator>Firsov, Mikhail</creator><creator>Chernishev, George</creator><general>FRUCT</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope><scope>DOA</scope></search><sort><creationdate>20240424</creationdate><title>Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms</title><author>Kuzin, Yakov ; Shcheka, Dmitriy ; Polyntsov, Michael ; Stupakov, Kirill ; Firsov, Mikhail ; Chernishev, George</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-d157t-c46359a403de6a61aaae684f479e075e7dd1ee4f3faa5b639d70e0c59005554d3</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2024</creationdate><topic>C++ languages</topic><topic>Cleaning</topic><topic>data profiling</topic><topic>fastod</topic><topic>Focusing</topic><topic>Memory management</topic><topic>order</topic><topic>order dependency</topic><topic>Query processing</topic><topic>Technological innovation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kuzin, Yakov</creatorcontrib><creatorcontrib>Shcheka, Dmitriy</creatorcontrib><creatorcontrib>Polyntsov, Michael</creatorcontrib><creatorcontrib>Stupakov, Kirill</creatorcontrib><creatorcontrib>Firsov, Mikhail</creatorcontrib><creatorcontrib>Chernishev, George</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Xplore</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection><collection>Directory of Open Access Journals</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Kuzin, Yakov</au><au>Shcheka, Dmitriy</au><au>Polyntsov, Michael</au><au>Stupakov, Kirill</au><au>Firsov, Mikhail</au><au>Chernishev, George</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms</atitle><btitle>2024 35th Conference of Open Innovations Association (FRUCT)</btitle><stitle>FRUCT</stitle><date>2024-04-24</date><risdate>2024</risdate><volume>35</volume><issue>1</issue><spage>413</spage><epage>424</epage><pages>413-424</pages><issn>2305-7254</issn><eissn>2305-7254</eissn><eissn>2343-0737</eissn><eisbn>9789526524610</eisbn><eisbn>9526524616</eisbn><abstract>Science-intensive data profiling focuses on discovery and validation of various patterns in datasets. This study considers discovery of one such pattern - order dependency (OD). Simply put, OD states that some list of columns is ordered according to another one. It is of use for database query optimization, data cleaning and deduplication, anomaly detection, and much more. Existing discovery methods have approached this problem solely from the algorithmic standpoint, without focusing on the implementation side. At the same time, this problem is very computationally intensive, and therefore this part should not be ignored, as it brings ODs closer to industrial use. In this paper, we study two algorithms for OD discovery which target different OD axiomatizations - FASTOD and ORDER. We start by reimplementing these algorithms in C++ in order to speed them up and lower their memory consumption. We then analyze their bottlenecks and propose several techniques which improve their performance even further. To perform evaluation, we have implemented these algorithms inside Desbordante - a science-intensive, high-performance, and open-source data profiling tool developed in C++. Experiments have demonstrated a performance improvement of up to 3x obtained by reimplemented versions, and, with the application of our techniques, up to 10x. Memory consumption has been lowered by up to 2.9x.</abstract><pub>FRUCT</pub><doi>10.23919/FRUCT61870.2024.10516381</doi><tpages>12</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2305-7254 |
ispartof | 2024 35th Conference of Open Innovations Association (FRUCT), 2024, Vol.35 (1), p.413-424 |
issn | 2305-7254 2305-7254 2343-0737 |
language | eng |
recordid | cdi_doaj_primary_oai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93 |
source | IEEE Xplore All Conference Series |
subjects | C++ languages Cleaning data profiling fastod Focusing Memory management order order dependency Query processing Technological innovation |
title | Order in Desbordante: Techniques for Efficient Implementation of Order Dependency Discovery Algorithms |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T14%3A03%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-doaj_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Order%20in%20Desbordante:%20Techniques%20for%20Efficient%20Implementation%20of%20Order%20Dependency%20Discovery%20Algorithms&rft.btitle=2024%2035th%20Conference%20of%20Open%20Innovations%20Association%20(FRUCT)&rft.au=Kuzin,%20Yakov&rft.date=2024-04-24&rft.volume=35&rft.issue=1&rft.spage=413&rft.epage=424&rft.pages=413-424&rft.issn=2305-7254&rft.eissn=2305-7254&rft_id=info:doi/10.23919/FRUCT61870.2024.10516381&rft.eisbn=9789526524610&rft.eisbn_list=9526524616&rft_dat=%3Cdoaj_CHZPO%3Eoai_doaj_org_article_0fd8aa3723064a8d9ab1ffc08fabcc93%3C/doaj_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-d157t-c46359a403de6a61aaae684f479e075e7dd1ee4f3faa5b639d70e0c59005554d3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10516381&rfr_iscdi=true |