RLclean: An unsupervised integrated data cleaning framework based on deep reinforcement learning
Published in: | Information sciences 2024-11, Vol.682, p.121281, Article 121281 |
---|---|
Main Authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Data cleaning, a prerequisite to subsequent data analysis, has long been a focus of data science research. Datasets with errors can severely detract from the quality of downstream analytical results. Unfortunately, despite the proliferation of data cleaning methods, cleaning remains time-consuming and frequently entails considerable labor expenses. In reality, errors are often heterogeneous and require different solutions. As a result, stand-alone methods are often inadequate for addressing dirty data with multiple types of errors, while studies have demonstrated that combining such methods always requires human intervention and the results remain unsatisfactory.
In this paper, we propose an unsupervised integrated data cleaning framework, namely RLclean. Based on deep reinforcement learning, RLclean takes advantage of multiple data cleaning techniques, enabling it to effectively clean multiple types of errors and achieve satisfactory results. Additionally, it eliminates the need for costly human involvement, as the cleaning strategy is learned in a data-driven manner, which further allows the framework to self-adapt to diverse domains. RLclean mainly consists of two parts: (i) an integrated error detection model that unites multiple techniques to detect different types of errors from multiple views; and (ii) an integrated data repair model that learns the optimal repair operations and repairs dirty data in an unsupervised manner. Extensive experiments on benchmark datasets have demonstrated the superiority of RLclean over state-of-the-art methods. |
---|---|
ISSN: | 0020-0255 |
DOI: | 10.1016/j.ins.2024.121281 |