Virtual Sample Generation for Retraining the Malicious PDF Detection Model

Bibliographic Details
Published in:Journal of physics. Conference series 2020-07, Vol.1584 (1), p.12056
Main Authors: He, Kang; Liu, Long; Lu, Dong-Zhe; Zhu, Yuefei
Format: Article
Language:English
Description
Summary:PDF files are widely used to launch cyberattacks because of their popularity and the growing number of related vulnerabilities. Machine learning algorithms have been developed to detect malicious PDF files. As exploits of new vulnerabilities appear, the assumption that the training data and the test data share the same distribution no longer holds, and the original model's ability to detect exploits of new vulnerabilities gradually weakens. In a real environment, it is very difficult to obtain many exploit samples for the same CVE, so it is difficult to improve the machine learning models by retraining. Virtual sample generation (VSG) can be used to generate sufficient virtual samples from small sample sets and thereby improve the generalization of the existing model. This paper proposes a new VSG algorithm based on prior knowledge, which performs better than other VSG algorithms at improving the detection of exploits of new vulnerabilities.
ISSN:1742-6588
1742-6596
DOI:10.1088/1742-6596/1584/1/012056