
Codee: A Tensor Embedding Scheme for Binary Code Search

Given a target binary function, binary code search retrieves the top-K most similar functions from a repository, where similar functions are those compiled from the same source code. Searching binary code is particularly challenging because of the large variation introduced by compiler tool-chains, optimization options, and CPU architectures, and because repositories contain thousands of binaries. Furthermore, current binary code search schemes suffer from several pivotal issues, including inaccurate text- or token-based analysis, slow graph matching, and complex deep learning pipelines. In this paper, we present an unsupervised tensor embedding scheme, Codee, that carries out code search efficiently and accurately at the binary function level. First, we use an NLP-based neural network to generate semantic-aware token embeddings. Second, we propose an efficient basic-block embedding generation algorithm based on a network representation learning model: it learns both the semantic information of instructions and the structural information of the control flow to produce basic-block embeddings. All basic-block embeddings in a function then form a variable-length function feature vector. Third, we build a tensor and generate function embeddings via the tensor singular value decomposition (t-SVD), which compresses the variable-length vectors into short fixed-length vectors so that later search is efficient. We further propose a dynamic tensor compression algorithm to incrementally update the function embedding database. Finally, we use locality-sensitive hashing to find the top-K similar functions in the repository. Compared with state-of-the-art cross-optimization-level code search schemes such as Asm2Vec and DeepBinDiff, our scheme achieves higher average search accuracy, shorter feature vectors, and faster feature generation on four datasets: OpenSSL, Coreutils, libgmp, and libcurl. Compared with cross-platform and cross-optimization-level code search schemes such as Gemini and SAFE, the average recall of our method also outperforms theirs.
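
To make the search stage described in the abstract concrete, the following minimal Python sketch illustrates the two ideas it combines: compressing variable-length per-function block embeddings into short fixed-length vectors, and retrieving the top-K most similar functions with locality-sensitive hashing. It is not the authors' implementation: the paper compresses with a tensor SVD (t-SVD), for which an ordinary per-function truncated SVD is used here as a simplified stand-in, and all names and sizes (compress_function, LSHIndex, 64-dimensional block embeddings, 16 hash bits) are illustrative assumptions.

import numpy as np
from collections import defaultdict


def compress_function(block_embeddings, k=16):
    # Collapse a (num_blocks x d) matrix of basic-block embeddings into one
    # fixed-length vector of size k*d by keeping the top singular directions.
    # (Simplified stand-in for the paper's tensor-SVD compression.)
    u, s, vt = np.linalg.svd(block_embeddings, full_matrices=False)
    r = min(k, len(s))
    vec = (s[:r, None] * vt[:r]).ravel()           # weight directions by their singular values
    out = np.zeros(k * block_embeddings.shape[1])  # pad so every function gets the same length
    out[:vec.size] = vec
    return out / (np.linalg.norm(out) + 1e-12)     # unit norm, so dot product = cosine similarity


class LSHIndex:
    # Random-hyperplane LSH for cosine similarity: similar embeddings tend to share
    # a bucket, so a query is compared only against a small candidate set.
    def __init__(self, dim, num_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_bits, dim))
        self.buckets = defaultdict(list)
        self.vectors = {}

    def _key(self, v):
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, name, v):
        self.vectors[name] = v
        self.buckets[self._key(v)].append(name)

    def top_k(self, query, k=5):
        # Fall back to scanning everything if the query's bucket happens to be empty.
        candidates = self.buckets.get(self._key(query)) or list(self.vectors)
        scored = [(name, float(query @ self.vectors[name])) for name in candidates]
        return sorted(scored, key=lambda t: -t[1])[:k]


# Usage: index a toy repository of functions with random 64-dimensional block embeddings.
rng = np.random.default_rng(1)
index = LSHIndex(dim=16 * 64)
for i in range(100):
    blocks = rng.standard_normal((int(rng.integers(3, 30)), 64))  # variable number of basic blocks
    index.add(f"func_{i}", compress_function(blocks))

query_blocks = rng.standard_normal((12, 64))
print(index.top_k(compress_function(query_blocks), k=5))

In practice one would use several hash tables (or multi-probe LSH) so that near neighbors are not missed by a single unlucky bucket split; a single table is kept here only to keep the sketch short.
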

Bibliographic Details
Published in: IEEE Transactions on Software Engineering, 2022-07, Vol. 48 (7), pp. 2224-2244
Main Authors: Yang, Jia; Fu, Cai; Liu, Xiao-Yang; Yin, Heng; Zhou, Pan
Format: Article
Language: English
DOI: 10.1109/TSE.2021.3056139
ISSN: 0098-5589
EISSN: 1939-3520
Publisher: New York: IEEE
Source: IEEE Electronic Library (IEL) Journals
Subjects: Algorithms; Binary codes; Code search; Codes; Data models; Deep learning; Embedding; Feature extraction; Function feature extraction; Graph matching; Machine learning; Mathematical analysis; Neural networks; Optimization; Repositories; Search problems; Searching; Semantics; Singular value decomposition; Task analysis; Tensor embedding; Tensors; tSVD
Online Access: https://ieeexplore.ieee.org/document/9345532