Algorithmically generated malicious domain names detection based on n-grams features
Published in: | Expert systems with applications 2021-05, Vol.170, p.114551, Article 114551 |
---|---|
Format: | Article |
Language: | English |
Summary: | • Identification of a botnet command and control server through DNS request analysis. • Focus on Domain name Generation Algorithms (DGAs). • Use of a Machine Learning Classifier for malicious domain name detection. • Domain name characterization through lexical features (n-gram based). • Classification based on the Kullback-Leibler divergence and Jaccard Index metrics.
Botnets are one of the major cyber infections used in several criminal activities. In most botnets, a Domain Generation Algorithm (DGA) is used by bots to make DNS queries aimed at establishing a connection with the Command and Control (C&C) server. Identifying such queries by monitoring network DNS traffic is therefore crucial for bot detection. In this paper we present a methodology to detect DGA-generated domain names based on a supervised machine learning process, trained with a dataset of known benign and malicious domain names. The proposed approach represents each domain name through a set of features that express the similarity between the 2-grams and 3-grams in a single unclassified domain name and those in domain names known to be malicious or benign. We used the Kullback-Leibler divergence and the Jaccard Index to estimate the similarity, and we tested different machine learning algorithms to classify each domain name as benign or DGA-based (with both binary and multi-class approaches). The results of our experiments demonstrate that the proposed methodology, which exploits only lexical features of domain names, attains a good level of accuracy and yields a general model able to classify previously unseen domains effectively. It is also able to outperform some of the state-of-the-art featureless classification methods based on deep learning. |
---|---|
ISSN: | 0957-4174 1873-6793 |
DOI: | 10.1016/j.eswa.2020.114551 |
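
The summary describes features that compare the 2-grams and 3-grams of one domain name with those of known benign and malicious names, using the Jaccard Index and the Kullback-Leibler divergence. The Python sketch below illustrates that idea only; it is not the authors' implementation. The function names, the feature names, the additive smoothing constant `eps`, and the toy reference lists in the usage note are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def ngrams(name, n):
    """Return the overlapping character n-grams of a domain label."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def jaccard_index(a, b):
    """Jaccard Index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) over the union of observed n-grams.

    Additive smoothing with eps is an assumption of this sketch; it keeps
    the divergence finite when an n-gram is missing from one side.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    divergence = 0.0
    for gram in vocab:
        p = (p_counts[gram] + eps) / p_total
        q = (q_counts[gram] + eps) / q_total
        divergence += p * log(p / q)
    return divergence


def lexical_features(domain, benign_names, malicious_names):
    """Build a hypothetical feature dict for one domain name.

    For n in {2, 3} and each reference class, compute the Jaccard Index
    and the KL divergence between the domain's n-grams and the class's.
    """
    features = {}
    references = (("benign", benign_names), ("dga", malicious_names))
    for n in (2, 3):
        grams = ngrams(domain, n)
        for label, names in references:
            ref_counts = Counter(g for d in names for g in ngrams(d, n))
            features[f"jaccard_{n}_{label}"] = jaccard_index(
                set(grams), set(ref_counts)
            )
            features[f"kl_{n}_{label}"] = kl_divergence(
                Counter(grams), ref_counts
            )
    return features
```

As a usage example, `lexical_features("google", ["google"], ["xkqzvw"])` returns eight values (Jaccard and KL for 2-grams and 3-grams against each class). Under the assumptions above, a vector of this kind could then be passed to any supervised classifier for the binary or multi-class decision mentioned in the summary.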