Loading…
Content-based File-type Identification Using Cosine Similarity and a Divide-and-Conquer Approach
Identifying the file type (TXT, EXE, JPEG, etc.) is important for computer security applications such as computer forensics, steganalysis, and antivirus programs. The common approach for this is to use file extensions, magic numbers, or other header information. However, these are susceptible to tam...
Saved in:
Published in: | Technical review - IETE 2010-11, Vol.27 (6), p.465 |
---|---|
Main Authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Citations: | Items that this one cites Items that cite this one |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Identifying the file type (TXT, EXE, JPEG, etc.) is important for computer security applications such as computer forensics, steganalysis, and antivirus programs. The common approach for this is to use file extensions, magic numbers, or other header information. However, these are susceptible to tampering or corruption; for instance, the file extension can be easily spoofed and the magic numbers can be obfuscated. A more reliable approach may be to analyze the file content instead of using only the tip of the information (metadata). This paper proposes two methods based on the file content. First, we use the cosine distance as a similarity metric when comparing the file content rather than the Mahalanobis distance that is popular and has been used by the other related approaches. The cosine similarity (unlike the Mahalanobis distance) retains the classification accuracy on a small number of highly frequent byte patterns which leads to a smaller model size and faster detection rate. Second, we decompose the identification procedure into two steps by taking the divide and conquer: in the first step, the similar files in terms of byte pattern frequencies are grouped into several clusters. In the next step, the cluster which contains different file types is fed to the neural network in order for finer classification. The experiments showed that the classification followed by clustering leads to higher accuracies. |
---|---|
ISSN: | 0256-4602 0974-5971 |
DOI: | 10.4103/0256-4602.67149 |