RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation
Published in: Proceedings of the VLDB Endowment, April 2021, Vol. 14 (8), pp. 1254-1261
Format: Article
Language: English
Summary: Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, and data annotation. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.
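The denoising objective sketched in the abstract can be illustrated with a minimal corruption step. This is a hypothetical sketch, not the authors' code: the `linearize` serialization ("attr is value" segments), the `[MASK]` token, and the `corrupt` helper are illustrative assumptions; the encoder-decoder model that maps the corrupted serialization back to the original tuple is omitted.

```python
import random

MASK = "[MASK]"  # placeholder token; the paper's actual vocabulary may differ

def linearize(tuple_dict):
    """Serialize a relational tuple as 'attr is value' segments
    (one common convention; RPT's exact serialization may differ)."""
    return " ; ".join(f"{a} is {v}" for a, v in tuple_dict.items())

def corrupt(tuple_dict, mask_prob=0.3, rng=random):
    """Randomly replace attribute values with the mask token."""
    return {a: (MASK if rng.random() < mask_prob else v)
            for a, v in tuple_dict.items()}

# One pre-training pair: the model would learn source -> target,
# i.e., reconstruct the original tuple from its corrupted version.
row = {"name": "Ada Lovelace", "city": "London", "year": "1815"}
rng = random.Random(0)
source = linearize(corrupt(row, mask_prob=0.5, rng=rng))
target = linearize(row)
```

Each `(source, target)` pair plays the same role as a noised/clean sentence pair in BART-style pre-training: the bidirectional encoder reads the corrupted tuple, and the autoregressive decoder emits the clean one token by token.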
ISSN: 2150-8097
DOI: 10.14778/3457390.3457391