Machine Learning with Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection

Bibliographic Details
Published in: IEEE Access, 2022-01, Vol. 10, p. 1-1
Main Authors: Lin, Ying-Dar; Liu, Zi-Qiang; Hwang, Ren-Hung; Nguyen, Van-Linh; Lin, Po-Ching; Lai, Yuan-Cheng
Format: Article
Language: English
Description
Summary: As a result of the explosion of security attacks and the complexity of modern networks, machine learning (ML) has recently become the favored approach for intrusion detection systems (IDS). However, the ML approach usually faces three challenges: massive attack variants, imbalanced data issues, and appropriate data segmentation. Improper handling of these issues significantly degrades ML performance, e.g., producing high false-negative rates and low recall. Despite many efforts in the literature, detecting security attacks in a complicated network environment with imperfect data collection remains an open issue. This work proposes a machine learning framework that combines a variational autoencoder and a multilayer perceptron to deal with imbalanced datasets and to detect the explosion of attack variants on the Internet. The detection engine also includes an efficient range-based sequential search algorithm to address the segmentation challenge in pre-processing data from multiple sources (network packets, system/statistic logs). Our work is the first attempt to demonstrate the effect of using an appropriate combination of ML models for boosting IDS detection capability in a heterogeneous environment, where imperfect data collection is common. Experimental results on a public system log dataset (e.g., HDFS) show that our method achieves up to approximately 97% F1 score and 98% recall, a promising result compared with the same measurements for other solutions. Moreover, the proposed treatment of imbalanced datasets improves the F1 score by up to 35% and the recall rate by up to 27%. The testing results also indicate that our model can detect new attack variants. Code is available at https://github.com/tuonglinhhm/Hybrid-Learning-AutoEncoder-IDS.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3149295
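
Note on the metrics in the summary: the abstract reports recall and F1 score rather than accuracy. The minimal sketch below illustrates why that choice matters on imbalanced data. It is not taken from the paper or its repository; the function name `recall_and_f1`, the toy labels, and the 95/5 class split are all assumptions made for illustration.

```python
def recall_and_f1(y_true, y_pred, positive=1):
    """Return (recall, f1) for the positive (attack) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Hypothetical imbalanced toy data: 95 normal records (0) and 5 attacks (1).
y_true = [0] * 95 + [1] * 5
# Hypothetical detector that flags only 1 of the 5 attacks, with no false alarms.
y_pred = [0] * 95 + [1] + [0] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall, f1 = recall_and_f1(y_true, y_pred)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the detector misses four of five attacks yet still scores high on accuracy, because the majority class dominates the count. Recall and F1 expose the missed attacks, which is why they are the natural measures for the imbalance problem the summary describes.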