A Novel Feature Encoding Scheme for Machine Learning Based Malware Detection Systems
Published in: | IEEE Access, 2024, Vol. 12, pp. 91187-91216 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Malware detection is an ever-evolving area in which strides in detection capabilities are matched by radical attempts to bypass detection. As the sophistication of malware continues to increase, the demand for innovative approaches to improve detection capabilities becomes paramount. Machine learning and deep learning models are increasingly used for malware detection; however, one of the most important and frequently overlooked aspects of building such models is feature encoding. This research paper explores the importance of feature encoding in improving the efficiency of threat detection and proposes a novel entropy-based encoding scheme for the categorical features present in data extracted from malicious inputs. The KDDCUP99, UNSW-NB15 and CIC-Evasive-PDFMal2022 datasets are used to evaluate the effectiveness of the proposed encoding scheme. Its results are validated against seven other encoding schemes to ascertain its credibility and usability. The efficiency of the proposed system is evaluated by using differently encoded versions of the datasets to train various machine learning models and measuring the classification performance of the models on each dataset. The machine learning models trained with the proposed encoding scheme produced stable classification results and outperformed those trained with other encoding schemes when dimensionality reduction was applied to the data. The ensemble classifier trained using the proposed scheme classified the data with an F1 score of 99.99% when the dimension-reduced, entropy-encoded KDDCUP99 dataset was used to build the model. On the CIC-Evasive-PDFMal2022 dataset, the entropy encoding exhibited slightly improved classification performance, with the ensemble methods yielding a peak F1 score of 99.27%.
We have also determined the feature importance values of the features present in the datasets to study how the contribution levels of the features change when multiple categorical encoding schemes are applied to the data. |
---|---|
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2024.3420080 |
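
The summary does not spell out how the entropy-based encoding is computed, so the following is only an illustrative sketch of one plausible reading, not the authors' scheme. It assumes that each category of a categorical feature is replaced by the Shannon entropy (in bits) of the class labels observed alongside it. The function name `entropy_encode` and the toy `protocol`/`label` columns are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy_encode(categories, labels):
    """Return a {category: entropy} mapping.

    ASSUMPTION: the value for a category is the Shannon entropy, in bits,
    of the class labels that co-occur with it. The paper's actual
    formula may differ.
    """
    # Count how often each label appears with each category.
    label_counts = defaultdict(Counter)
    for cat, lab in zip(categories, labels):
        label_counts[cat][lab] += 1

    mapping = {}
    for cat, counts in label_counts.items():
        total = sum(counts.values())
        # H = sum over labels of p * log2(1 / p)
        mapping[cat] = sum(
            (n / total) * math.log2(total / n) for n in counts.values()
        )
    return mapping


# Hypothetical toy data in the spirit of a network-traffic dataset.
protocol = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
label = ["attack", "normal", "normal", "normal", "attack", "attack"]

mapping = entropy_encode(protocol, label)
encoded = [mapping[p] for p in protocol]
```

In this toy data `tcp` is evenly split between the two labels and maps to 1.0, while `udp` and `icmp` each co-occur with a single label and both map to 0.0. That collision is a known weakness of this naive reading. A practical scheme would need some way to keep such categories apart, and the summary does not say how the paper handles this.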