Approach for estimating similarity between procedures in differently compiled binaries
•New software metrics based approach for detecting software clones in binary codes.•Use of new metrics immune to syntactical changes introduced by different compilers.•Evaluation of different ways of combining extracted metrics.•Knowledge about used compiler brings up to 2.28 times better success ra...
Published in: | Information and software technology 2015-02, Vol.58, p.259-271 |
---|---|
Main Authors: | Stojanovic, Sasa ; Radivojevic, Zaharije ; Cvetanovic, Milos |
Format: | Article |
Language: | English |
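Subjects: | Arithmetic ; Benchmarking ; Binary code analysis ; Binary codes ; Binary system ; Clone detection ; Codes ; Computer programs ; Estimating techniques ; Mathematical problems ; Programming languages ; Recall ; Semantic clone ; Similarity ; Software ; Software clone ; Software engineering ; Software metric ; Studies |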
cited_by | cdi_FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03 |
---|---|
cites | cdi_FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03 |
container_end_page | 271 |
container_issue | |
container_start_page | 259 |
container_title | Information and software technology |
container_volume | 58 |
creator | Stojanovic, Sasa ; Radivojevic, Zaharije ; Cvetanovic, Milos |
description | •New software-metrics-based approach for detecting software clones in binary code.•Use of new metrics immune to syntactic changes introduced by different compilers.•Evaluation of different ways of combining extracted metrics.•Knowledge of the compiler used yields up to a 2.28 times better success rate.
Detecting unauthorized use of a software library is a clone detection problem that, in the case of commercial products, is further complicated by the fact that only binary code is available.
The goal of this paper is to propose an approach for estimating the level of similarity between procedures originating from different binary code. The assumption is that clones in the binary code come from the use of a common software library that may have been compiled with different toolsets.
The approach uses a set of software metrics adapted from high-level languages and extends the set with new metrics that take into account the syntactic changes introduced by different toolsets and optimizations. Moreover, the approach compares metric values and introduces transformers and formulas that can use training data to produce a measure of similarity between two procedures in binary code. The approach has been evaluated on programs from the STAMP benchmark and the BusyBox tool, compiled with different toolsets in different modes.
The experiments with programs from the STAMP benchmark show that, when detecting the same procedures, recall can be up to 1.44 times higher using the new metrics. Knowledge of the compiling toolset used can bring up to a 2.28 times improvement in recall. The experiment with the BusyBox tool shows 43% recall at 43% precision.
The most useful newly proposed metrics are those that consider the frequency of arithmetic instructions, the number and frequency of occurrences of instructions, and the number of occurrences of target addresses in calls. The best way to combine the results of comparing metrics is to use a geometric mean or, when prior knowledge is available, an arithmetic mean with an appropriate transformer. |
doi_str_mv | 10.1016/j.infsof.2014.06.012 |
format | article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_1669894287</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0950584914001517</els_id><sourcerecordid>3510790851</sourcerecordid><originalsourceid>FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03</originalsourceid><addsrcrecordid>eNp9kD1PwzAURS0EEqXwDxgssbAkPDuJ4yxIVcWXVIkFWK3EeQZHiVPsBNR_j6syMTC95dyr-w4hlwxSBkzcdKl1Jowm5cDyFEQKjB-RBZNllgjgxTFZQFVAUsi8OiVnIXQArIQMFuRttd36sdYf1IyeYpjsUE_WvdNgB9vX3k472uD0jehoBDW2s8dAraOtNQY9uqnfUT0OW9tjSxvrYgbDOTkxdR_w4vcuyev93cv6Mdk8PzytV5tE5wBT0tR5mQmUnAlZsrZB5LrGEmQuS80yIXTTgjAgRCMFZBkUTPKclSYDaBsD2ZJcH3rjts85zleDDRr7vnY4zkExISpZ5TyaWJKrP2g3zt7FdZHiUhSsKnmk8gOl_RiCR6O2PirxO8VA7WWrTh1kq71sBUJF2TF2e4hhfPbLoldBW3RRl_WoJ9WO9v-CH0u8ifY</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1628651972</pqid></control><display><type>article</type><title>Approach for estimating similarity between procedures in differently compiled binaries</title><source>ScienceDirect Freedom Collection</source><creator>Stojanovic, Sasa ; Radivojevic, Zaharije ; Cvetanovic, Milos</creator><creatorcontrib>Stojanovic, Sasa ; Radivojevic, Zaharije ; Cvetanovic, Milos</creatorcontrib><description>•New software metrics based approach for detecting software clones in binary codes.•Use of new metrics immune to syntactical changes introduced by different compilers.•Evaluation of different ways of combining extracted metrics.•Knowledge about used compiler brings up to 2.28 times better success rate.
Detection of an unauthorized use of a software library is a clone detection problem that in case of commercial products has additional complexity due to the fact that only binary code is available.
The goal of this paper is to propose an approach for estimating the level of similarity between the procedures originating from different binary codes. The assumption is that the clones in the binary codes come from the use of a common software library that may be compiled with different toolsets.
The approach uses a set of software metrics adapted from the high level languages and it also extends the set with new metrics that take into account syntactical changes that are introduced by the usage of different toolsets and optimizations. Moreover, the approach compares metric values and introduces transformers and formulas that can use training data for production of measure of similarities between the two procedures in binary codes. The approach has been evaluated on programs from STAMP benchmark and BusyBox tool, compiled with different toolsets in different modes.
The experiments with programs from STAMP benchmark show that detecting the same procedures recall can be up to 1.44 times higher using new metrics. Knowledge about the used compiling toolset can bring up to 2.28 times improvement in recall. The experiment with BusyBox tool shows 43% recall for 43% precision.
The most useful newly proposed metrics are those that consider the frequency of arithmetic instructions, the number and frequency of occurrences for instructions, and the number of occurrences for target addresses in calls. The best way to combine the results of comparing metrics is to use a geometric mean or when previous knowledge is available, to use an arithmetic mean with appropriate transformer.</description><identifier>ISSN: 0950-5849</identifier><identifier>EISSN: 1873-6025</identifier><identifier>DOI: 10.1016/j.infsof.2014.06.012</identifier><language>eng</language><publisher>Amsterdam: Elsevier B.V</publisher><subject>Arithmetic ; Benchmarking ; Binary code analysis ; Binary codes ; Binary system ; Clone detection ; Codes ; Computer programs ; Estimating techniques ; Mathematical problems ; Programming languages ; Recall ; Semantic clone ; Similarity ; Software ; Software clone ; Software engineering ; Software metric ; Studies</subject><ispartof>Information and software technology, 2015-02, Vol.58, p.259-271</ispartof><rights>2014 Elsevier B.V.</rights><rights>Copyright Elsevier Science Ltd. 
Feb 2015</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03</citedby><cites>FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Stojanovic, Sasa</creatorcontrib><creatorcontrib>Radivojevic, Zaharije</creatorcontrib><creatorcontrib>Cvetanovic, Milos</creatorcontrib><title>Approach for estimating similarity between procedures in differently compiled binaries</title><title>Information and software technology</title><description>•New software metrics based approach for detecting software clones in binary codes.•Use of new metrics immune to syntactical changes introduced by different compilers.•Evaluation of different ways of combining extracted metrics.•Knowledge about used compiler brings up to 2.28 times better success rate.
Detection of an unauthorized use of a software library is a clone detection problem that in case of commercial products has additional complexity due to the fact that only binary code is available.
The goal of this paper is to propose an approach for estimating the level of similarity between the procedures originating from different binary codes. The assumption is that the clones in the binary codes come from the use of a common software library that may be compiled with different toolsets.
The approach uses a set of software metrics adapted from the high level languages and it also extends the set with new metrics that take into account syntactical changes that are introduced by the usage of different toolsets and optimizations. Moreover, the approach compares metric values and introduces transformers and formulas that can use training data for production of measure of similarities between the two procedures in binary codes. The approach has been evaluated on programs from STAMP benchmark and BusyBox tool, compiled with different toolsets in different modes.
The experiments with programs from STAMP benchmark show that detecting the same procedures recall can be up to 1.44 times higher using new metrics. Knowledge about the used compiling toolset can bring up to 2.28 times improvement in recall. The experiment with BusyBox tool shows 43% recall for 43% precision.
The most useful newly proposed metrics are those that consider the frequency of arithmetic instructions, the number and frequency of occurrences for instructions, and the number of occurrences for target addresses in calls. The best way to combine the results of comparing metrics is to use a geometric mean or when previous knowledge is available, to use an arithmetic mean with appropriate transformer.</description><subject>Arithmetic</subject><subject>Benchmarking</subject><subject>Binary code analysis</subject><subject>Binary codes</subject><subject>Binary system</subject><subject>Clone detection</subject><subject>Codes</subject><subject>Computer programs</subject><subject>Estimating techniques</subject><subject>Mathematical problems</subject><subject>Programming languages</subject><subject>Recall</subject><subject>Semantic clone</subject><subject>Similarity</subject><subject>Software</subject><subject>Software clone</subject><subject>Software engineering</subject><subject>Software metric</subject><subject>Studies</subject><issn>0950-5849</issn><issn>1873-6025</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><recordid>eNp9kD1PwzAURS0EEqXwDxgssbAkPDuJ4yxIVcWXVIkFWK3EeQZHiVPsBNR_j6syMTC95dyr-w4hlwxSBkzcdKl1Jowm5cDyFEQKjB-RBZNllgjgxTFZQFVAUsi8OiVnIXQArIQMFuRttd36sdYf1IyeYpjsUE_WvdNgB9vX3k472uD0jehoBDW2s8dAraOtNQY9uqnfUT0OW9tjSxvrYgbDOTkxdR_w4vcuyev93cv6Mdk8PzytV5tE5wBT0tR5mQmUnAlZsrZB5LrGEmQuS80yIXTTgjAgRCMFZBkUTPKclSYDaBsD2ZJcH3rjts85zleDDRr7vnY4zkExISpZ5TyaWJKrP2g3zt7FdZHiUhSsKnmk8gOl_RiCR6O2PirxO8VA7WWrTh1kq71sBUJF2TF2e4hhfPbLoldBW3RRl_WoJ9WO9v-CH0u8ifY</recordid><startdate>20150201</startdate><enddate>20150201</enddate><creator>Stojanovic, Sasa</creator><creator>Radivojevic, Zaharije</creator><creator>Cvetanovic, Milos</creator><general>Elsevier B.V</general><general>Elsevier Science 
Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20150201</creationdate><title>Approach for estimating similarity between procedures in differently compiled binaries</title><author>Stojanovic, Sasa ; Radivojevic, Zaharije ; Cvetanovic, Milos</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Arithmetic</topic><topic>Benchmarking</topic><topic>Binary code analysis</topic><topic>Binary codes</topic><topic>Binary system</topic><topic>Clone detection</topic><topic>Codes</topic><topic>Computer programs</topic><topic>Estimating techniques</topic><topic>Mathematical problems</topic><topic>Programming languages</topic><topic>Recall</topic><topic>Semantic clone</topic><topic>Similarity</topic><topic>Software</topic><topic>Software clone</topic><topic>Software engineering</topic><topic>Software metric</topic><topic>Studies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Stojanovic, Sasa</creatorcontrib><creatorcontrib>Radivojevic, Zaharije</creatorcontrib><creatorcontrib>Cvetanovic, Milos</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Information and software technology</jtitle></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Stojanovic, Sasa</au><au>Radivojevic, Zaharije</au><au>Cvetanovic, Milos</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Approach for estimating similarity between procedures in differently compiled binaries</atitle><jtitle>Information and software technology</jtitle><date>2015-02-01</date><risdate>2015</risdate><volume>58</volume><spage>259</spage><epage>271</epage><pages>259-271</pages><issn>0950-5849</issn><eissn>1873-6025</eissn><abstract>•New software metrics based approach for detecting software clones in binary codes.•Use of new metrics immune to syntactical changes introduced by different compilers.•Evaluation of different ways of combining extracted metrics.•Knowledge about used compiler brings up to 2.28 times better success rate.
Detection of an unauthorized use of a software library is a clone detection problem that in case of commercial products has additional complexity due to the fact that only binary code is available.
The goal of this paper is to propose an approach for estimating the level of similarity between the procedures originating from different binary codes. The assumption is that the clones in the binary codes come from the use of a common software library that may be compiled with different toolsets.
The approach uses a set of software metrics adapted from the high level languages and it also extends the set with new metrics that take into account syntactical changes that are introduced by the usage of different toolsets and optimizations. Moreover, the approach compares metric values and introduces transformers and formulas that can use training data for production of measure of similarities between the two procedures in binary codes. The approach has been evaluated on programs from STAMP benchmark and BusyBox tool, compiled with different toolsets in different modes.
The experiments with programs from STAMP benchmark show that detecting the same procedures recall can be up to 1.44 times higher using new metrics. Knowledge about the used compiling toolset can bring up to 2.28 times improvement in recall. The experiment with BusyBox tool shows 43% recall for 43% precision.
The most useful newly proposed metrics are those that consider the frequency of arithmetic instructions, the number and frequency of occurrences for instructions, and the number of occurrences for target addresses in calls. The best way to combine the results of comparing metrics is to use a geometric mean or when previous knowledge is available, to use an arithmetic mean with appropriate transformer.</abstract><cop>Amsterdam</cop><pub>Elsevier B.V</pub><doi>10.1016/j.infsof.2014.06.012</doi><tpages>13</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0950-5849 |
ispartof | Information and software technology, 2015-02, Vol.58, p.259-271 |
issn | 0950-5849 1873-6025 |
language | eng |
recordid | cdi_proquest_miscellaneous_1669894287 |
source | ScienceDirect Freedom Collection |
subjects | Arithmetic ; Benchmarking ; Binary code analysis ; Binary codes ; Binary system ; Clone detection ; Codes ; Computer programs ; Estimating techniques ; Mathematical problems ; Programming languages ; Recall ; Semantic clone ; Similarity ; Software ; Software clone ; Software engineering ; Software metric ; Studies |
title | Approach for estimating similarity between procedures in differently compiled binaries |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T14%3A25%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Approach%20for%20estimating%20similarity%20between%20procedures%20in%20differently%20compiled%20binaries&rft.jtitle=Information%20and%20software%20technology&rft.au=Stojanovic,%20Sasa&rft.date=2015-02-01&rft.volume=58&rft.spage=259&rft.epage=271&rft.pages=259-271&rft.issn=0950-5849&rft.eissn=1873-6025&rft_id=info:doi/10.1016/j.infsof.2014.06.012&rft_dat=%3Cproquest_cross%3E3510790851%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1628651972&rft_id=info:pmid/&rfr_iscdi=true |
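
The description above says that per-metric comparison results are best combined with a geometric mean, or with an arithmetic mean plus a transformer when prior knowledge is available. The record does not give the paper's actual formulas, so the following is only a minimal sketch under stated assumptions: the ratio-based `metric_similarity`, the metric names, and the identity default for `transformer` are all hypothetical illustrations, not the authors' definitions.

```python
from math import prod


def metric_similarity(a: float, b: float) -> float:
    """Compare two non-negative metric values; returns a score in [0, 1].

    Hypothetical ratio-based comparison, not the formula from the paper.
    """
    if a == b:
        return 1.0  # also covers the 0 vs 0 case without dividing by zero
    return min(a, b) / max(a, b)


def geometric_mean(scores):
    """Combine per-metric scores; a single 0 score drives the result to 0."""
    return prod(scores) ** (1.0 / len(scores))


def arithmetic_mean(scores, transformer=lambda s: s):
    """Combine per-metric scores after applying a transformer to each one.

    The transformer stands in for a mapping that could be fitted on
    training data; the identity default is an assumption.
    """
    return sum(transformer(s) for s in scores) / len(scores)


def procedure_similarity(metrics_a, metrics_b, combine=geometric_mean):
    """Estimate similarity of two procedures from their metric dictionaries."""
    scores = [metric_similarity(metrics_a[name], metrics_b[name])
              for name in metrics_a]
    return combine(scores)


# Hypothetical metric values for the same procedure built by two toolsets.
proc_x = {"arith_freq": 0.25, "instr_count": 40, "call_targets": 3}
proc_y = {"arith_freq": 0.20, "instr_count": 50, "call_targets": 3}

print(procedure_similarity(proc_x, proc_y))
print(procedure_similarity(proc_x, proc_y, combine=arithmetic_mean))
```

In this sketch the geometric mean penalizes any single badly mismatched metric more heavily than the arithmetic mean does, which is one plausible reason to prefer it when no training data is available to tune a transformer.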