Big Data Quality Anomaly Scoring Framework Using Artificial Intelligence
Main Authors: | , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Summary: | As Big Data becomes increasingly essential to decision-making, the quality of the underlying data is critical. Data quality anomalies can lead to incorrect results, so automated frameworks are needed to detect them. Although several methodologies have been proposed to ensure the quality of Big Data, most rely on conventional cleaning tools and focus only on outlier values. However, Big Data may contain hidden anomalies that require more intelligent processing to detect. Anomalies are not limited to outliers; they also include nonconforming values, duplicates, missing values, and more. Furthermore, no available methodology assesses the degree of anomalousness of a dataset, which is an essential metric for further data analysis. To address this gap, this paper proposes a data quality anomaly scoring framework that provides a more comprehensive and intelligent approach to Big Data quality anomaly detection and detects quality anomalies related to six quality dimensions: accuracy, completeness, conformity, uniqueness, consistency, and readability. The framework also produces a quality anomaly score that indicates how anomalous poor-quality data is and how far it lies from good-quality data. The framework was evaluated on a large dataset, showing promising results with an F-score of up to 92.80% and an accuracy of up to 99.62%. |
ISSN: | 2327-1884 |
DOI: | 10.1109/CiSt56084.2023.10409909 |