
Training a language model to learn the syntax of commands


Bibliographic Details
Published in: Array (New York) 2024-09, Vol. 23, Article 100355
Main Authors: Hussain, Zafar; Nurminen, Jukka K.; Ranta-aho, Perttu
Format: Article
Language:English
Description
Summary: To protect systems from malicious activities, it is important to differentiate between valid and harmful commands. One way to achieve this is by learning the syntax of the commands, which is a complex task because of the expansive and evolving nature of command syntax. To address this, we harnessed the power of a language model. Our methodology involved constructing a specialized vocabulary from our commands dataset and training a custom tokenizer with a Masked Language Model head, resulting in the development of a BERT-like language model. This model exhibits proficiency in learning command syntax by predicting masked tokens. In comparative analyses, our language model outperformed the Markov Model in categorizing commands using clustering algorithms (DBSCAN, HDBSCAN, OPTICS). The language model achieved higher Silhouette scores (0.72, 0.88, 0.85) compared to the Markov Model (0.53, 0.25, 0.06) and demonstrated significantly lower noise levels (2.63%, 5.39%, 8.49%) versus the Markov Model’s higher noise rates (9.31%, 29.85%, 50.35%). Further validation with manually crafted syntax and BERTScore assessments consistently produced metrics above 0.90 for precision, recall, and F1-score. Our language model excels at learning command syntax, enhancing protective measures against malicious activities.

Highlights:
•We trained a BERT-like Language Model with the commands data.
•We created a second-order Markov Model to compare with our Language Model.
•We created clusters after the two models had predicted the randomly masked tokens for all the data.
•We evaluated the performance of the clustering algorithms against four metrics.
•We evaluated the performance of the Language Model against manually crafted syntax of select commands.
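The first methodological step the abstract mentions is constructing a specialized vocabulary from a commands dataset. A minimal sketch of that idea, using a handful of invented shell commands and plain whitespace tokenization (both assumptions for illustration; the authors' actual tokenization scheme is not described here), might look like:

```python
# Illustrative sketch: build a token vocabulary from a tiny commands
# corpus. A real command tokenizer would need to handle flags, paths,
# and arguments more carefully; whitespace splitting is an assumption
# made here for brevity.
from collections import Counter

# Hypothetical example commands, not from the authors' dataset.
commands = [
    "ls -la /home/user",
    "rm -rf /tmp/cache",
    "ls -la /var/log",
    "cat /var/log/syslog",
]

counts = Counter(tok for cmd in commands for tok in cmd.split())

# Special tokens first (BERT-style convention), then corpus tokens in
# descending frequency order.
special = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
vocab = {tok: i for i, tok in enumerate(special)}
for tok, _ in counts.most_common():
    vocab[tok] = len(vocab)

print(len(vocab), vocab["[MASK]"])
```

Such a vocabulary would then back a custom tokenizer for masked-token training.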
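The comparative evaluation the abstract describes, clustering command representations and judging quality by Silhouette score and noise level, can be sketched roughly as follows. The synthetic Gaussian "embeddings" and the parameter values are stand-in assumptions, not the authors' actual features or settings:

```python
# Illustrative sketch: cluster command representations with DBSCAN and
# OPTICS, then compare cluster quality via Silhouette score and the
# fraction of noise points (label -1), mirroring the metrics in the
# abstract. The vectors are synthetic stand-ins for model embeddings.
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three synthetic "command families" as well-separated blobs.
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.2, size=(50, 8)),
    rng.normal(loc=3.0, scale=0.2, size=(50, 8)),
    rng.normal(loc=-3.0, scale=0.2, size=(50, 8)),
])

for name, algo in [("DBSCAN", DBSCAN(eps=1.0, min_samples=5)),
                   ("OPTICS", OPTICS(min_samples=5))]:
    labels = algo.fit_predict(embeddings)
    mask = labels != -1                      # drop noise points
    noise_pct = 100.0 * (~mask).sum() / len(labels)
    score = silhouette_score(embeddings[mask], labels[mask])
    print(f"{name}: silhouette={score:.2f}, noise={noise_pct:.2f}%")
```

A higher Silhouette score with a lower noise percentage is what the abstract reports in favor of the language-model embeddings over the Markov Model.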
ISSN: 2590-0056
DOI: 10.1016/j.array.2024.100355