
A methodology for conducting efficient sanitization of HTTP training datasets

Bibliographic Details
Published in: Future Generation Computer Systems, 2020-08, Vol. 109, pp. 67-82
Main Authors: Díaz-Verdejo, Jesús E., Estepa, Antonio, Estepa, Rafael, Madinabeitia, German, Muñoz-Calle, Fco. Javier
Format: Article
Language: English
Description
Summary: The performance of anomaly-based intrusion detection systems depends on the quality of the datasets used to form normal activity profiles. Suitable datasets should include high volumes of real-life data free from attack instances. Because of this requirement, obtaining quality datasets from collected data requires a process of data sanitization that may be prohibitive if done manually, or uncertain if fully automated. In this work, we propose a sanitization approach for obtaining datasets from HTTP traces suited for training, testing, or validating anomaly-based attack detectors. Our methodology has two sequential phases. In the first phase, we remove known attacks from the data using a pattern-based approach that relies on tools that detect URI-based known attacks. In the second phase, we complement the result of the first phase with systematic and efficient assisted manual labeling, focusing the expert examination not on the raw data (which would be millions of URIs) but on the set of words that compose the URIs. This dramatically reduces the volume of data that requires expert discernment, making manual sanitization of large datasets feasible. We applied our method to sanitize a trace of 45 million requests received by the library web server of the University of Seville, generating clean datasets in less than 84 h with only 33 h of manual supervision. We also applied the method to some public benchmark datasets, confirming that attacks unnoticed by signature-based detectors can be discovered in a reduced time span.
Highlights:
• Semi-automated generation of clean datasets from real HTTP traffic.
• Suited for the development and assessment of anomaly-based attack detectors.
• Supervision is performed on the observed vocabulary rather than on the captured data.
• Clean datasets produced from a 45 M-request web service trace with reduced manual supervision.
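The paper details the actual labeling procedure; purely as an illustration of the word-level idea described in the abstract, the following Python sketch (function names, tokenization rule, and example URIs are assumptions for illustration, not the authors' implementation) shows how a trace of URIs can be reduced to a much smaller vocabulary for expert review, and how words marked as suspicious map back to the requests that still need request-level inspection.

import re
from collections import Counter

def uri_words(uri):
    # Split a URI into lowercase word tokens (path segments, parameter
    # names and values), using non-alphanumeric characters as delimiters.
    return [w for w in re.split(r"[^a-z0-9]+", uri.lower()) if w]

def build_vocabulary(uris):
    # Count every distinct word observed across the trace; the expert
    # reviews this vocabulary instead of millions of raw requests.
    vocab = Counter()
    for uri in uris:
        vocab.update(uri_words(uri))
    return vocab

def flag_suspicious_uris(uris, suspicious_words):
    # Return only the URIs containing at least one word the expert marked
    # as suspicious, so request-level inspection stays small.
    suspicious_words = set(suspicious_words)
    return [u for u in uris if suspicious_words & set(uri_words(u))]

# Hypothetical example: three requests collapse to a handful of words.
trace = [
    "/catalog/search?q=networks",
    "/catalog/item/42",
    "/catalog/search?q=union+select+passwd",
]
vocab = build_vocabulary(trace)
marked = flag_suspicious_uris(trace, {"select", "passwd"})
print(sorted(vocab), marked)

In this toy run the expert would inspect roughly a dozen distinct words rather than every request, which is the downsizing effect the abstract describes.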
ISSN: 0167-739X, 1872-7115
DOI: 10.1016/j.future.2020.03.033