
Data Engineering for HPC with Python

Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to trans...


Bibliographic Details
Published in: arXiv.org 2020-10
Main Authors: Abeykoon, Vibhatha; Perera, Niranda; Widanage, Chathura; Kamburugamuve, Supun; Thejaka, Amila Kanewala; Maithree, Hasara; Wickramasinghe, Pulasthi; Uyar, Ahmet; Fox, Geoffrey
Format: Article
Language: English
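Subjects: Data processing; Deep learning; Distributed memory; Engineering; Engineering education; Machine learning; Mathematical analysis; Matrix algebra; Matrix methods; Tensors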
container_title arXiv.org
creator Abeykoon, Vibhatha
Perera, Niranda
Widanage, Chathura
Kamburugamuve, Supun
Thejaka, Amila Kanewala
Maithree, Hasara
Wickramasinghe, Pulasthi
Uyar, Ahmet
Fox, Geoffrey
description Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.
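The data-parallel, table-based processing the description mentions can be illustrated with a minimal sketch. It is not the paper's API: the function names (`partition_rows`, `local_sum`, `to_matrix`) and the sample table are hypothetical, and the "workers" are simulated in a plain loop rather than run as MPI processes. It only shows the general idea of hash-partitioning table rows across workers, aggregating locally, merging the partial results, and converting the table to a matrix-like layout for a learning library.

```python
from collections import defaultdict


def partition_rows(rows, key, world_size):
    """Hash-partition rows so that equal keys land on the same (simulated) worker."""
    parts = [[] for _ in range(world_size)]
    for row in rows:
        parts[hash(row[key]) % world_size].append(row)
    return parts


def local_sum(rows, key, value):
    """Group-by-sum computed on one worker's partition only."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def to_matrix(rows, columns):
    """Turn table rows into a row-major matrix (list of lists of floats)."""
    return [[float(row[c]) for c in columns] for row in rows]


# Hypothetical sample table; each dict is one row.
table = [
    {"id": 1, "x": 0.5, "y": 2.0},
    {"id": 2, "x": 1.5, "y": 4.0},
    {"id": 1, "x": 2.5, "y": 6.0},
    {"id": 3, "x": 3.5, "y": 8.0},
]

# Simulate two workers. In a real distributed run each partition would
# live in a separate process and be exchanged over the network.
parts = partition_rows(table, "id", world_size=2)

# Keys never span partitions, so merging the local results is a plain union.
merged = {}
for part in parts:
    merged.update(local_sum(part, "id", "x"))

matrix = to_matrix(table, ["x", "y"])
```

Because rows with the same key always go to the same partition, each worker can aggregate independently and no partial sums need to be combined across workers afterwards.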
format article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2450882355</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2450882355</sourcerecordid><originalsourceid>FETCH-proquest_journals_24508823553</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mRQcUksSVRwzUvPzEtNLcrMS1dIyy9S8AhwVijPLMlQCKgsycjP42FgTUvMKU7lhdLcDMpuriHOHroFRfmFpanFJfFZ-aVFeUCpeCMTUwMLCyNjoOHEqQIAecEt4g</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2450882355</pqid></control><display><type>article</type><title>Data Engineering for HPC with Python</title><source>Publicly Available Content (ProQuest)</source><creator>Abeykoon, Vibhatha ; Perera, Niranda ; Widanage, Chathura ; Kamburugamuve, Supun ; Thejaka, Amila Kanewala ; Maithree, Hasara ; Wickramasinghe, Pulasthi ; Uyar, Ahmet ; Fox, Geoffrey</creator><creatorcontrib>Abeykoon, Vibhatha ; Perera, Niranda ; Widanage, Chathura ; Kamburugamuve, Supun ; Thejaka, Amila Kanewala ; Maithree, Hasara ; Wickramasinghe, Pulasthi ; Uyar, Ahmet ; Fox, Geoffrey</creatorcontrib><description>Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. 
Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Data processing ; Deep learning ; Distributed memory ; Engineering ; Engineering education ; Machine learning ; Mathematical analysis ; Matrix algebra ; Matrix methods ; Tensors</subject><ispartof>arXiv.org, 2020-10</ispartof><rights>2020. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/2450882355?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>776,780,25732,36991,44569</link.rule.ids></links><search><creatorcontrib>Abeykoon, Vibhatha</creatorcontrib><creatorcontrib>Perera, Niranda</creatorcontrib><creatorcontrib>Widanage, Chathura</creatorcontrib><creatorcontrib>Kamburugamuve, Supun</creatorcontrib><creatorcontrib>Thejaka, Amila Kanewala</creatorcontrib><creatorcontrib>Maithree, Hasara</creatorcontrib><creatorcontrib>Wickramasinghe, Pulasthi</creatorcontrib><creatorcontrib>Uyar, Ahmet</creatorcontrib><creatorcontrib>Fox, Geoffrey</creatorcontrib><title>Data Engineering for HPC with Python</title><title>arXiv.org</title><description>Data engineering is becoming 
an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.</description><subject>Data processing</subject><subject>Deep learning</subject><subject>Distributed memory</subject><subject>Engineering</subject><subject>Engineering education</subject><subject>Machine learning</subject><subject>Mathematical analysis</subject><subject>Matrix algebra</subject><subject>Matrix methods</subject><subject>Tensors</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mRQcUksSVRwzUvPzEtNLcrMS1dIyy9S8AhwVijPLMlQCKgsycjP42FgTUvMKU7lhdLcDMpuriHOHroFRfmFpanFJfFZ-aVFeUCpeCMTUwMLCyNjoOHEqQIAecEt4g</recordid><startdate>20201013</startdate><enddate>20201013</enddate><creator>Abeykoon, Vibhatha</creator><creator>Perera, Niranda</creator><creator>Widanage, Chathura</creator><creator>Kamburugamuve, Supun</creator><creator>Thejaka, 
Amila Kanewala</creator><creator>Maithree, Hasara</creator><creator>Wickramasinghe, Pulasthi</creator><creator>Uyar, Ahmet</creator><creator>Fox, Geoffrey</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20201013</creationdate><title>Data Engineering for HPC with Python</title><author>Abeykoon, Vibhatha ; Perera, Niranda ; Widanage, Chathura ; Kamburugamuve, Supun ; Thejaka, Amila Kanewala ; Maithree, Hasara ; Wickramasinghe, Pulasthi ; Uyar, Ahmet ; Fox, Geoffrey</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_24508823553</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Data processing</topic><topic>Deep learning</topic><topic>Distributed memory</topic><topic>Engineering</topic><topic>Engineering education</topic><topic>Machine learning</topic><topic>Mathematical analysis</topic><topic>Matrix algebra</topic><topic>Matrix methods</topic><topic>Tensors</topic><toplevel>online_resources</toplevel><creatorcontrib>Abeykoon, Vibhatha</creatorcontrib><creatorcontrib>Perera, Niranda</creatorcontrib><creatorcontrib>Widanage, Chathura</creatorcontrib><creatorcontrib>Kamburugamuve, Supun</creatorcontrib><creatorcontrib>Thejaka, Amila Kanewala</creatorcontrib><creatorcontrib>Maithree, Hasara</creatorcontrib><creatorcontrib>Wickramasinghe, Pulasthi</creatorcontrib><creatorcontrib>Uyar, Ahmet</creatorcontrib><creatorcontrib>Fox, Geoffrey</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology 
Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content (ProQuest)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Abeykoon, Vibhatha</au><au>Perera, Niranda</au><au>Widanage, Chathura</au><au>Kamburugamuve, Supun</au><au>Thejaka, Amila Kanewala</au><au>Maithree, Hasara</au><au>Wickramasinghe, Pulasthi</au><au>Uyar, Ahmet</au><au>Fox, Geoffrey</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Data Engineering for HPC with Python</atitle><jtitle>arXiv.org</jtitle><date>2020-10-13</date><risdate>2020</risdate><eissn>2331-8422</eissn><abstract>Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. 
There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2020-10
issn 2331-8422
language eng
recordid cdi_proquest_journals_2450882355
source Publicly Available Content (ProQuest)
subjects Data processing
Deep learning
Distributed memory
Engineering
Engineering education
Machine learning
Mathematical analysis
Matrix algebra
Matrix methods
Tensors
title Data Engineering for HPC with Python
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T18%3A18%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Data%20Engineering%20for%20HPC%20with%20Python&rft.jtitle=arXiv.org&rft.au=Abeykoon,%20Vibhatha&rft.date=2020-10-13&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2450882355%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_24508823553%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2450882355&rft_id=info:pmid/&rfr_iscdi=true