
Codee: A Tensor Embedding Scheme for Binary Code Search

Given a target binary function, binary code search retrieves the top-K most similar functions from a repository, where similar functions are those compiled from the same source code. Searching binary code is particularly challenging because of the large variation introduced by compiler tool-chains, optimization options, and CPU architectures, and because repositories contain thousands of binaries. Furthermore, current binary code search schemes suffer from several pivotal issues, including inaccurate text- or token-based analysis, slow graph matching, and complex deep learning pipelines. In this paper, we present an unsupervised tensor embedding scheme, Codee, that carries out code search efficiently and accurately at the binary function level. First, we use an NLP-based neural network to generate semantic-aware token embeddings. Second, we propose an efficient basic-block embedding generation algorithm based on a network representation learning model: it learns both the semantic information of instructions and the structural information of the control flow to produce basic-block embeddings. All basic-block embeddings in a function then form a variable-length function feature vector. Third, we build a tensor and generate function embeddings via the tensor singular value decomposition (t-SVD), which compresses the variable-length vectors into short fixed-length vectors so that later search is efficient. We further propose a dynamic tensor compression algorithm to incrementally update the function embedding database. Finally, we use locality-sensitive hashing to find the top-K similar functions in the repository. Compared with state-of-the-art cross-optimization-level code search schemes such as Asm2Vec and DeepBinDiff, our scheme achieves higher average search accuracy, shorter feature vectors, and faster feature generation on four datasets: OpenSSL, Coreutils, libgmp, and libcurl. Compared with cross-platform and cross-optimization-level code search schemes such as Gemini and SAFE, the average recall of our method also outperforms theirs.
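
To make the search stage described in the abstract concrete, the following minimal Python sketch illustrates the two ideas it combines: compressing variable-length per-function block embeddings into short fixed-length vectors, and retrieving the top-K most similar functions with locality-sensitive hashing. It is not the authors' implementation: the paper compresses with a tensor SVD (t-SVD), for which an ordinary per-function truncated SVD is used here as a simplified stand-in, and all names and sizes (compress_function, LSHIndex, 64-dimensional block embeddings, 16 hash bits) are illustrative assumptions.

import numpy as np
from collections import defaultdict


def compress_function(block_embeddings, k=16):
    # Collapse a (num_blocks x d) matrix of basic-block embeddings into one
    # fixed-length vector of size k*d by keeping the top singular directions.
    # (Simplified stand-in for the paper's tensor-SVD compression.)
    u, s, vt = np.linalg.svd(block_embeddings, full_matrices=False)
    r = min(k, len(s))
    vec = (s[:r, None] * vt[:r]).ravel()           # weight directions by their singular values
    out = np.zeros(k * block_embeddings.shape[1])  # pad so every function gets the same length
    out[:vec.size] = vec
    return out / (np.linalg.norm(out) + 1e-12)     # unit norm, so dot product = cosine similarity


class LSHIndex:
    # Random-hyperplane LSH for cosine similarity: similar embeddings tend to share
    # a bucket, so a query is compared only against a small candidate set.
    def __init__(self, dim, num_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_bits, dim))
        self.buckets = defaultdict(list)
        self.vectors = {}

    def _key(self, v):
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, name, v):
        self.vectors[name] = v
        self.buckets[self._key(v)].append(name)

    def top_k(self, query, k=5):
        # Fall back to scanning everything if the query's bucket happens to be empty.
        candidates = self.buckets.get(self._key(query)) or list(self.vectors)
        scored = [(name, float(query @ self.vectors[name])) for name in candidates]
        return sorted(scored, key=lambda t: -t[1])[:k]


# Usage: index a toy repository of functions with random 64-dimensional block embeddings.
rng = np.random.default_rng(1)
index = LSHIndex(dim=16 * 64)
for i in range(100):
    blocks = rng.standard_normal((int(rng.integers(3, 30)), 64))  # variable number of basic blocks
    index.add(f"func_{i}", compress_function(blocks))

query_blocks = rng.standard_normal((12, 64))
print(index.top_k(compress_function(query_blocks), k=5))

In practice one would use several hash tables (or multi-probe LSH) so that near neighbors are not missed by a single unlucky bucket split; a single table is kept here only to keep the sketch short.
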

Bibliographic Details
Published in: IEEE Transactions on Software Engineering, 2022-07, Vol. 48 (7), pp. 2224-2244
Main Authors: Yang, Jia; Fu, Cai; Liu, Xiao-Yang; Yin, Heng; Zhou, Pan
Format: Article
Language: English
DOI: 10.1109/TSE.2021.3056139
ISSN: 0098-5589
EISSN: 1939-3520
Publisher: New York: IEEE
Source: IEEE Electronic Library (IEL) Journals
Subjects: Algorithms; Binary codes; Code search; Codes; Data models; Deep learning; Embedding; Feature extraction; Function feature extraction; Graph matching; Machine learning; Mathematical analysis; Neural networks; Optimization; Repositories; Search problems; Searching; Semantics; Singular value decomposition; Task analysis; Tensor embedding; Tensors; tSVD
Online Access: https://ieeexplore.ieee.org/document/9345532