
Approach for estimating similarity between procedures in differently compiled binaries

• New software metrics based approach for detecting software clones in binary codes.
• Use of new metrics immune to syntactical changes introduced by different compilers.
• Evaluation of different ways of combining extracted metrics.
• Knowledge about used compiler brings up to 2.28 times better success ra...

Bibliographic Details
Published in: Information and software technology 2015-02, Vol.58, p.259-271
Main Authors: Stojanovic, Sasa; Radivojevic, Zaharije; Cvetanovic, Milos
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03
cites cdi_FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03
container_end_page 271
container_issue
container_start_page 259
container_title Information and software technology
container_volume 58
creator Stojanovic, Sasa
Radivojevic, Zaharije
Cvetanovic, Milos
description
• New software-metrics-based approach for detecting software clones in binary code.
• Use of new metrics immune to syntactic changes introduced by different compilers.
• Evaluation of different ways of combining the extracted metrics.
• Knowledge of the compiler used yields up to a 2.28 times better success rate.
Detecting unauthorized use of a software library is a clone detection problem that, in the case of commercial products, is further complicated by the fact that only binary code is available. The goal of this paper is to propose an approach for estimating the level of similarity between procedures originating from different binaries. The assumption is that clones in the binary code come from the use of a common software library that may have been compiled with different toolsets. The approach uses a set of software metrics adapted from high-level languages and extends that set with new metrics that account for the syntactic changes introduced by different toolsets and optimizations. Moreover, the approach compares metric values and introduces transformers and formulas that can use training data to produce a measure of similarity between two procedures in binary code. The approach has been evaluated on programs from the STAMP benchmark and on the BusyBox tool, compiled with different toolsets in different modes. The experiments with programs from the STAMP benchmark show that, when detecting identical procedures, recall can be up to 1.44 times higher with the new metrics. Knowledge of the compiling toolset used can bring up to a 2.28 times improvement in recall. The experiment with the BusyBox tool shows 43% recall at 43% precision. The most useful newly proposed metrics are those that consider the frequency of arithmetic instructions, the number and frequency of occurrences of instructions, and the number of occurrences of target addresses in calls. The best way to combine the results of comparing metrics is to use a geometric mean or, when prior knowledge is available, an arithmetic mean with an appropriate transformer.
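The combination step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-metric similarity (ratio of the smaller to the larger value), the metric names, and all function names are hypothetical, and the paper's transformers and training data are not modelled here. It only shows how per-metric comparison results might be merged with a geometric or an arithmetic mean.

```python
import math
from typing import Callable, Dict, List


def metric_similarity(a: float, b: float) -> float:
    """Hypothetical per-metric similarity in [0, 1] for non-negative values:
    the ratio of the smaller value to the larger one (1.0 when equal)."""
    if a == b:
        return 1.0
    return min(a, b) / max(a, b)


def geometric_mean(values: List[float]) -> float:
    """Geometric mean; a single zero similarity forces the result to zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values: List[float]) -> float:
    """Plain arithmetic mean of the per-metric similarities."""
    return sum(values) / len(values)


def procedure_similarity(
    metrics_a: Dict[str, float],
    metrics_b: Dict[str, float],
    combine: Callable[[List[float]], float] = geometric_mean,
) -> float:
    """Compare two procedures metric by metric, then combine the results."""
    shared = [name for name in metrics_a if name in metrics_b]
    if not shared:
        raise ValueError("the two procedures share no metrics")
    sims = [metric_similarity(metrics_a[n], metrics_b[n]) for n in shared]
    return combine(sims)


# Illustrative (made-up) metric vectors for the same procedure
# built by two different toolsets.
proc_a = {"arith_freq": 0.25, "call_targets": 4, "instr_count": 40}
proc_b = {"arith_freq": 0.5, "call_targets": 4, "instr_count": 80}

print(procedure_similarity(proc_a, proc_b, geometric_mean))
print(procedure_similarity(proc_a, proc_b, arithmetic_mean))
```

Under these assumptions the geometric mean penalizes a single badly mismatched metric more strongly than the arithmetic mean does, which is one plausible reading of why it works well without prior knowledge of the toolset.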
doi_str_mv 10.1016/j.infsof.2014.06.012
format article
publisher Amsterdam: Elsevier B.V
rights 2014 Elsevier B.V.; Copyright Elsevier Science Ltd. Feb 2015
tpages 13
fulltext fulltext
identifier ISSN: 0950-5849
ispartof Information and software technology, 2015-02, Vol.58, p.259-271
issn 0950-5849
1873-6025
language eng
recordid cdi_proquest_miscellaneous_1669894287
source ScienceDirect Freedom Collection
subjects Arithmetic
Benchmarking
Binary code analysis
Binary codes
Binary system
Clone detection
Codes
Computer programs
Estimating techniques
Mathematical problems
Programming languages
Recall
Semantic clone
Similarity
Software
Software clone
Software engineering
Software metric
Studies
title Approach for estimating similarity between procedures in differently compiled binaries
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T14%3A25%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Approach%20for%20estimating%20similarity%20between%20procedures%20in%20differently%20compiled%20binaries&rft.jtitle=Information%20and%20software%20technology&rft.au=Stojanovic,%20Sasa&rft.date=2015-02-01&rft.volume=58&rft.spage=259&rft.epage=271&rft.pages=259-271&rft.issn=0950-5849&rft.eissn=1873-6025&rft_id=info:doi/10.1016/j.infsof.2014.06.012&rft_dat=%3Cproquest_cross%3E3510790851%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c400t-ba4736e8216871dbee2cae708487c1366cbd06f066b8603305182417f300dbf03%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1628651972&rft_id=info:pmid/&rfr_iscdi=true