Not all datasets are born equal: On heterogeneous tabular data and adversarial examples
Published in: Knowledge-Based Systems, 2022-04, Vol. 242, p. 108377, Article 108377
Format: Article
Language: English
Summary: Recent work on adversarial learning has mainly focused on neural networks and domains in which those networks excel, such as computer vision and audio processing. Typically, the data in those domains is homogeneous, whereas domains with heterogeneous tabular datasets remain underexplored, despite their prevalence. When searching for adversarial patterns within heterogeneous input spaces, an attacker must simultaneously preserve the complex domain-specific validity rules of the data and the adversarial nature of the identified samples. As such, applying adversarial manipulations to heterogeneous datasets has proven challenging, and a generic attack method has not yet been proposed. However, this study argues that machine learning models trained on heterogeneous tabular data are as susceptible to adversarial manipulations as those trained on continuous or homogeneous data, such as images. To support this claim, a generic optimization framework for identifying adversarial perturbations in heterogeneous input spaces is introduced. The framework defines distribution-aware constraints to preserve the consistency of the adversarial examples and then incorporates them by embedding the heterogeneous input into a continuous latent space. Due to the nature of the underlying datasets, we focus on ℓ0 perturbations and demonstrate their applicability in real life. The effectiveness of the suggested approach is demonstrated on three datasets from different content domains. The results show that despite the constraints imposed on input validity in heterogeneous datasets, machine learning models trained on such data remain susceptible to adversarial examples.
Highlights:
• Attacks on tabular data ignore complex (nominal) features and feature correlations.
• Mathematically define a valid real-world heterogeneous adversarial example.
• Use an embedding function to preserve feature correlations and value consistency.
• Implement and evaluate the framework in three data domains and learning models.
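The abstract's core idea — finding an ℓ0-bounded perturbation (changing only a few features) while respecting domain validity rules — can be illustrated with a minimal sketch. This is not the authors' framework: the greedy search, the linear model, the candidate values, and the validity check below are all illustrative stand-ins for the paper's optimization and latent-space embedding.

```python
import numpy as np

def greedy_l0_attack(x, score, candidate_values, is_valid, budget=2):
    """Greedily change up to `budget` features of x, one at a time.

    x: 1-D array of encoded feature values (mixed categorical/numeric).
    score: callable giving the model's confidence in the original class;
           the attack tries to minimize it.
    candidate_values: {feature_index: list of allowed replacement values},
           modeling the discrete choices a heterogeneous feature admits.
    is_valid: callable enforcing (hypothetical) domain validity rules,
           standing in for the paper's distribution-aware constraints.
    """
    x_adv = x.copy()
    for _ in range(budget):
        best_score, best_change = score(x_adv), None
        for i, values in candidate_values.items():
            for v in values:
                cand = x_adv.copy()
                cand[i] = v
                if not is_valid(cand):
                    continue  # reject candidates that break validity rules
                s = score(cand)
                if s < best_score:
                    best_score, best_change = s, (i, v)
        if best_change is None:
            break  # no single-feature change helps; stop early
        i, v = best_change
        x_adv[i] = v
    return x_adv

# Toy usage: a logistic model over three features; only the first two
# (treated as binary categorical) may be perturbed, with an l0 budget of 1.
w = np.array([2.0, -1.0, 0.5])
score = lambda z: 1.0 / (1.0 + np.exp(-w @ z))
x = np.array([1.0, 0.0, 1.0])
candidates = {0: [0.0, 1.0], 1: [0.0, 1.0]}
x_adv = greedy_l0_attack(x, score, candidates, lambda z: True, budget=1)
```

The search stays within the ℓ0 budget by construction (at most `budget` coordinates differ from `x`), and every accepted candidate passes the validity check — the two requirements the abstract says an attacker must satisfy simultaneously.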
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2022.108377