NeuroYara: Learning to Rank for Yara Rules Generation through Deep Language Modeling and Discriminative N-gram Encoding

Bibliographic Details
Published in: IEEE Transactions on Dependable and Secure Computing, 2024-09, p. 1-17
Main Authors: Mansour, Ziad; Ou, Weihan; Ding, Steven H. H.; Zulkernine, Mohammad; Charland, Philippe
Format: Article
Language: English
Description
Summary: Signature-based malware detection methods are recognized for their simplicity, explainability, and efficiency. One of the most commonly used tools is Yara, which provides the syntax for crafting malware signatures. However, developing high-quality Yara rules requires significant expertise in malware analysis, and training such skilled analysts is both resource-intensive and time-consuming. A few works have attempted to automate signature generation, but the signatures they produce typically underperform manually written ones. In addition, these automated methods often depend on large static databases of hard-coded byte n-grams to minimize false positives. Instead of storing a large, non-inclusive database to score byte n-grams, we propose a novel architecture that uses two learning-to-rank neural networks to capture the effectiveness of, and correlations among, the n-grams extracted for rule construction. This approach provides better flexibility and coverage of possible n-grams while reducing the required storage from several gigabytes to only 10 MB. Combining these two models with a hierarchical density-based clustering method allows us to group multiple n-grams into logical conditions, yielding Yara rules of higher quality. Experimental results show that our framework, NeuroYara, reduces the resources analysts must invest while generating rules with a low false-positive rate, outperforming both existing tools and manually written rules.
ISSN: 1545-5971, 1941-0018
DOI: 10.1109/TDSC.2024.3449641