Content-based File-type Identification Using Cosine Similarity and a Divide-and-Conquer Approach

Bibliographic Details
Published in: Technical review - IETE 2010-11, Vol.27 (6), p.465
Main Authors: Ahmed, Irfan; Lhee, Kyung-suk; Shin, Hyunjung; Hong, ManPyo
Format: Article
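Language: English
Subjects: Experiments; Frequency distribution; Methods; Neural networks; Operating systems; Software reviews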
container_issue 6
container_start_page 465
container_title Technical review - IETE
container_volume 27
creator Ahmed, Irfan
Lhee, Kyung-suk
Shin, Hyunjung
Hong, ManPyo
description Identifying the file type (TXT, EXE, JPEG, etc.) is important for computer security applications such as computer forensics, steganalysis, and antivirus programs. The common approach is to use file extensions, magic numbers, or other header information. However, these are susceptible to tampering or corruption; for instance, a file extension can easily be spoofed and magic numbers can be obfuscated. A more reliable approach may be to analyze the file content instead of relying only on the metadata. This paper proposes two methods based on the file content. First, we use cosine similarity as the metric when comparing file content, rather than the Mahalanobis distance that is popular in related approaches. Cosine similarity (unlike the Mahalanobis distance) retains its classification accuracy on a small number of highly frequent byte patterns, which leads to a smaller model size and faster detection. Second, we decompose the identification procedure into two steps, taking a divide-and-conquer approach: in the first step, files that are similar in terms of byte-pattern frequencies are grouped into several clusters. In the next step, any cluster that contains different file types is fed to a neural network for finer classification. The experiments showed that clustering followed by classification leads to higher accuracy.
doi_str_mv 10.4103/0256-4602.67149
format article
publisher New Delhi: Taylor & Francis Ltd
rights Copyright Medknow Publications & Media Pvt. Ltd. Nov 2010
toplevel peer_reviewed
online_resources
identifier ISSN: 0256-4602
ispartof Technical review - IETE, 2010-11, Vol.27 (6), p.465
issn 0256-4602
0974-5971
language eng
recordid cdi_proquest_journals_850423295
source Taylor and Francis:Jisc Collections:Taylor and Francis Read and Publish Agreement 2024-2025:Science and Technology Collection (Reading list)
subjects Experiments
Frequency distribution
Methods
Neural networks
Operating systems
Software reviews
title Content-based File-type Identification Using Cosine Similarity and a Divide-and-Conquer Approach
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T01%3A52%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Content-based%20File-type%20Identification%20Using%20Cosine%20Similarity%20and%20a%20Divide-and-Conquer%20Approach&rft.jtitle=Technical%20review%20-%20IETE&rft.au=Ahmed,%20Irfan&rft.date=2010-11-01&rft.volume=27&rft.issue=6&rft.spage=465&rft.pages=465-&rft.issn=0256-4602&rft.eissn=0974-5971&rft_id=info:doi/10.4103/0256-4602.67149&rft_dat=%3Cproquest_cross%3E2261362661%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c268t-c64d17e7f712882738717b9f59297ef628399c4605a3e5ae07716e562e2ce1f83%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=850423295&rft_id=info:pmid/&rfr_iscdi=true
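
The cosine-similarity comparison mentioned in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`byte_frequency`, `cosine_similarity`, `classify`), the use of all 256 single-byte values as features, and the toy model vectors are assumptions made for illustration. The paper itself reports working with a small number of highly frequent byte patterns and adds a clustering and neural-network stage that is not shown here.

```python
from __future__ import annotations

import math
from collections import Counter


def byte_frequency(data: bytes) -> list[float]:
    """Return the 256-bin relative byte-frequency vector of `data`."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero for empty input
    return [counts.get(b, 0) / total for b in range(256)]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(sample: bytes, models: dict[str, list[float]]) -> str:
    """Pick the file type whose model vector is most similar to the sample."""
    vec = byte_frequency(sample)
    return max(models, key=lambda name: cosine_similarity(vec, models[name]))


# Hypothetical per-type models built from toy data, for illustration only.
models = {
    "text": byte_frequency(b"hello world " * 50),
    "binary": byte_frequency(bytes(range(256)) * 4),
}
predicted = classify(b"the quick brown fox", models)
```

Because cosine similarity depends only on the direction of the frequency vectors, a model here is just one vector per file type, with no covariance matrix to store or invert as a Mahalanobis-based model would require.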