GraphMoCo: A graph momentum contrast model for large-scale binary function representation learning


Bibliographic Details
Published in:Neurocomputing (Amsterdam) 2024-03, Vol.575, p.127273, Article 127273
Main Authors: Sun, Runjin, Guo, Shize, Guo, Jinhong, Li, Wei, Zhang, Xingyu, Guo, Xi, Pan, Zhisong
Format: Article
Language:English
Description
Summary:In the field of cybersecurity, the ability to compute similarity scores between binary functions is of utmost importance. Since a single binary file may contain an extensive number of functions, an effective learning framework must exhibit both high accuracy and efficiency when handling substantial volumes of data. Nonetheless, conventional methods encounter several limitations. First, accurately annotating pairs of functions with appropriate labels poses a significant challenge, making it difficult to employ supervised learning methods without the risk of overfitting. Second, while SOTA models often rely on pre-trained encoders or fine-grained graph comparison techniques, these approaches suffer from high time and memory consumption. Third, the momentum update algorithm used in graph-based contrastive learning models can result in information leakage; surprisingly, none of the existing articles address this issue. This research focuses on the challenges of large-scale Binary Code Similarity Detection (BCSD). To overcome the aforementioned problems, we propose GraphMoCo: a graph momentum contrast model that leverages multimodal structural information for efficient, large-scale binary function representation learning. We adopt an unsupervised learning strategy, which eliminates the need for manual labeling. By leveraging the intrinsic structural information at multiple levels of the binary code, our model achieves higher accuracy with a simple CNN-based encoder. By introducing a preshuffle mechanism, the information leakage in the graph momentum update algorithm is mitigated. The evaluation results indicate that GraphMoCo outperforms SOTA approaches on the function pair search task, with an average improvement of 7% in AUC and 10% in MRR and Recall@1.
Furthermore, GraphMoCo achieves a MAP of 0.93 on the more challenging Dataset 2, which comprises a larger function pool. In a real-world scenario, specifically known-vulnerability search, GraphMoCo achieves an MRR that surpasses existing SOTA models by 5%.
•At the token level, a cross-platform representation model is introduced to mitigate out-of-vocabulary problems.
•At the block level, a multimodal CNN encoding scheme called StrandCNN is used to generate embeddings.
•The issue of information leakage is discussed and a preshuffle mechanism is developed to mitigate it.
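The abstract does not spell out the momentum update or the preshuffle step. As a hedged illustration only, the sketch below shows a generic MoCo-style exponential moving average of key-encoder parameters, plus a batch pre-shuffle (with its inverse permutation) of the kind used in momentum-contrast pipelines to prevent batch-statistics leakage between paired samples. The function names (`momentum_update`, `preshuffle`) and the NumPy formulation are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def momentum_update(theta_q, theta_k, m=0.999):
    """MoCo-style EMA: slowly pull the key encoder's parameters
    toward the query encoder's parameters (hypothetical sketch)."""
    return {name: m * theta_k[name] + (1.0 - m) * theta_q[name]
            for name in theta_k}

def preshuffle(batch, rng):
    """Shuffle sample order before the key encoder sees the batch,
    so batch-level statistics cannot leak pairing information.
    Returns the shuffled batch and the inverse permutation that
    restores the original order after encoding."""
    perm = rng.permutation(len(batch))
    inverse = np.argsort(perm)
    return batch[perm], inverse

# Usage: shuffle, (encode), then undo the shuffle.
rng = np.random.default_rng(0)
batch = np.arange(8)
shuffled, inv = preshuffle(batch, rng)
restored = shuffled[inv]  # identical to the original batch
```

The inverse permutation is what makes the trick cheap: the key embeddings are computed on the shuffled batch and re-indexed afterward, so query/key pairs still line up without the key encoder ever observing the original batch order.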
ISSN:0925-2312
1872-8286
DOI:10.1016/j.neucom.2024.127273