I can't believe it's not (only) software: bionic distributed storage for Parquet files
Published in: | Proceedings of the VLDB Endowment 2019-08, Vol.12 (12), p.1838-1841 |
---|---|
Main Authors: | , |
Format: | Article |
Language: | English |
Summary: | There is a steady increase in the size of data stored and processed as part of data science applications, leading to bottlenecks and inefficiencies at various layers of the stack. One way of reducing such bottlenecks and increasing energy efficiency is by tailoring the underlying distributed storage solution to the application domain, using resources more efficiently. We explore this idea in the context of a popular column-oriented storage format used in big data workloads, namely Apache Parquet.
Our prototype uses an FPGA-based storage node that offers high bandwidth data deduplication and a companion software library that exposes an API for Parquet file access. This way the storage node remains general purpose and could be shared by applications from different domains, while, at the same time, benefiting from deduplication well suited to Apache Parquet files and from selective reads of columns in the file.
In this demonstration we show, on the one hand, that by relying on the FPGA's dataflow processing model, it is possible to implement in-line deduplication without increasing latencies significantly or reducing throughput. On the other hand, we highlight the benefits of implementing the application-specific aspects in a software library instead of FPGA circuits and how this enables, for instance, regular data science frameworks running in Python to access the data on the storage node and to offload filtering operations. |
---|---|
ISSN: | 2150-8097 |
DOI: | 10.14778/3352063.3352079 |
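
The summary describes in-line deduplication on the storage node. As a rough illustration of the general idea only, the following minimal Python sketch deduplicates fixed-size chunks by content hash. It is not the paper's FPGA design: the class name `DedupStore`, the fixed-size chunking, and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed-size chunking
        self.chunks = {}              # digest -> chunk bytes, stored once
        self.files = {}               # file name -> ordered list of digests

    def put(self, name, data):
        """Split data into chunks and store each distinct chunk only once."""
        digests = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Keep the existing copy if this content was already stored.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Reassemble a file from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        """Bytes physically kept after deduplication."""
        return sum(len(c) for c in self.chunks.values())
```

In this sketch, writing two files that share chunks stores each shared chunk once, so `stored_bytes()` can be smaller than the total size of the files written.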