Content-based File-type Identification Using Cosine Similarity and a Divide-and-Conquer Approach
Identifying the file type (TXT, EXE, JPEG, etc.) is important for computer security applications such as computer forensics, steganalysis, and antivirus programs. The common approach for this is to use file extensions, magic numbers, or other header information. However, these are susceptible to tam...
Published in: | Technical review - IETE 2010-11, Vol.27 (6), p.465 |
---|---|
Main Authors: | Ahmed, Irfan; Lhee, Kyung-suk; Shin, Hyunjung; Hong, ManPyo |
Format: | Article |
Language: | English |
Subjects: | Experiments; Frequency distribution; Methods; Neural networks; Operating systems; Software reviews |
cited_by | cdi_FETCH-LOGICAL-c268t-c64d17e7f712882738717b9f59297ef628399c4605a3e5ae07716e562e2ce1f83 |
---|---|
cites | cdi_FETCH-LOGICAL-c268t-c64d17e7f712882738717b9f59297ef628399c4605a3e5ae07716e562e2ce1f83 |
container_end_page | |
container_issue | 6 |
container_start_page | 465 |
container_title | Technical review - IETE |
container_volume | 27 |
creator | Ahmed, Irfan Lhee, Kyung-suk Shin, Hyunjung Hong, ManPyo |
description | Identifying the file type (TXT, EXE, JPEG, etc.) is important for computer security applications such as computer forensics, steganalysis, and antivirus programs. The common approach is to use file extensions, magic numbers, or other header information. However, these are susceptible to tampering or corruption; for instance, a file extension can be easily spoofed and magic numbers can be obfuscated. A more reliable approach may be to analyze the file content rather than rely only on metadata, which is just the tip of the available information. This paper proposes two methods based on file content. First, we use the cosine distance as the similarity metric when comparing file content, rather than the Mahalanobis distance that is popular in related approaches. Cosine similarity (unlike the Mahalanobis distance) retains its classification accuracy when only a small number of highly frequent byte patterns are used, which leads to a smaller model size and faster detection. Second, we decompose the identification procedure into two steps, following a divide-and-conquer approach: in the first step, files with similar byte-pattern frequencies are grouped into several clusters; in the second step, any cluster that contains different file types is fed to a neural network for finer classification. The experiments showed that clustering followed by classification leads to higher accuracy. |
doi_str_mv | 10.4103/0256-4602.67149 |
format | article |
publisher | New Delhi: Taylor & Francis Ltd |
rights | Copyright Medknow Publications & Media Pvt. Ltd. Nov 2010 |
toplevel | peer_reviewed online_resources |
fulltext | fulltext |
identifier | ISSN: 0256-4602 |
ispartof | Technical review - IETE, 2010-11, Vol.27 (6), p.465 |
issn | 0256-4602 0974-5971 |
language | eng |
recordid | cdi_proquest_journals_850423295 |
source | Taylor and Francis:Jisc Collections:Taylor and Francis Read and Publish Agreement 2024-2025:Science and Technology Collection (Reading list) |
subjects | Experiments Frequency distribution Methods Neural networks Operating systems Software reviews |
title | Content-based File-type Identification Using Cosine Similarity and a Divide-and-Conquer Approach |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T01%3A52%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Content-based%20File-type%20Identification%20Using%20Cosine%20Similarity%20and%20a%20Divide-and-Conquer%20Approach&rft.jtitle=Technical%20review%20-%20IETE&rft.au=Ahmed,%20Irfan&rft.date=2010-11-01&rft.volume=27&rft.issue=6&rft.spage=465&rft.pages=465-&rft.issn=0256-4602&rft.eissn=0974-5971&rft_id=info:doi/10.4103/0256-4602.67149&rft_dat=%3Cproquest_cross%3E2261362661%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c268t-c64d17e7f712882738717b9f59297ef628399c4605a3e5ae07716e562e2ce1f83%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=850423295&rft_id=info:pmid/&rfr_iscdi=true |
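The first method in the description, comparing byte-frequency profiles with cosine similarity over only the most frequent byte values, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of single-byte frequencies, the default `k = 32`, and the nearest-model decision rule are all assumptions made for illustration, and the clustering and neural-network stages are not shown.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def byte_frequencies(data: bytes) -> List[float]:
    """Return the relative frequency of each of the 256 byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def top_k_indices(model: Sequence[float], k: int) -> List[int]:
    """Indices of the k most frequent byte values in a model (assumed selection rule)."""
    return sorted(range(len(model)), key=lambda b: model[b], reverse=True)[:k]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_to_model(data: bytes, model: Sequence[float], k: int = 32) -> float:
    """Compare a file to a model using only the model's k most frequent bytes."""
    freqs = byte_frequencies(data)
    idx = top_k_indices(model, k)
    return cosine_similarity([freqs[i] for i in idx], [model[i] for i in idx])


def classify(data: bytes, models: Dict[str, Sequence[float]], k: int = 32) -> str:
    """Return the name of the model most similar to the data (hypothetical rule)."""
    return max(models, key=lambda name: similarity_to_model(data, models[name], k))


if __name__ == "__main__":
    # Toy models built from synthetic data, purely for demonstration.
    models = {
        "text": byte_frequencies(b"hello world " * 50),
        "binary": byte_frequencies(bytes(range(256)) * 4),
    }
    print(classify(b"world hello hello ", models))
```

Restricting the comparison to the top `k` byte values is what keeps the per-type model small: only `k` indices and their frequencies need to be stored and compared for each file type.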