
Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection

In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In this context, automated methods of analysis are of great interest. In particular, a...


Bibliographic Details
Published in: IEEE transactions on information forensics and security 2013-01, Vol.8 (1), p.46-54
Main Authors: da Cruz Nassif, L. F., Hruschka, E. R.
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873
cites cdi_FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873
container_end_page 54
container_issue 1
container_start_page 46
container_title IEEE transactions on information forensics and security
container_volume 8
creator da Cruz Nassif, L. F.
Hruschka, E. R.
description In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing.
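The idea summarized in the description (cluster the documents hierarchically, then let a relative validity index pick the number of clusters) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the two validity indexes, so the silhouette used here is an assumption, as are the bag-of-words cosine distance, the toy documents, and all function names.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def average_link(dist, k):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest average pairwise distance until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        def avg(pair):
            a, b = pair
            return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
        a, b = min(combinations(clusters, 2), key=avg)
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return clusters


def silhouette(dist, clusters):
    """Mean silhouette width over all documents; singletons contribute 0."""
    total = 0.0
    for c in clusters:
        for i in c:
            if len(c) == 1:
                continue
            a = sum(dist[i][j] for j in c if j != i) / (len(c) - 1)
            b = min(sum(dist[i][j] for j in o) / len(o) for o in clusters if o is not c)
            total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(dist)


def best_partition(docs, k_max):
    """Try k = 2..k_max and keep the partition with the highest silhouette."""
    vecs = [Counter(d.lower().split()) for d in docs]
    dist = [[cosine_distance(u, v) for v in vecs] for u in vecs]
    candidates = [average_link(dist, k) for k in range(2, k_max + 1)]
    return max(candidates, key=lambda cl: silhouette(dist, cl))


# Hypothetical toy corpus: three finance-like and three scheduling-like texts.
docs = [
    "bank transfer invoice payment",
    "invoice payment bank account",
    "payment invoice transfer",
    "meeting schedule monday office",
    "office meeting schedule room",
    "schedule meeting monday",
]
clusters = sorted(sorted(c) for c in best_partition(docs, k_max=4))
print(clusters)  # → [[0, 1, 2], [3, 4, 5]]
```

On this toy input the two topic groups share no terms, so the silhouette peaks at two clusters. Real seized-computer data would need proper preprocessing (tokenization, stopword removal, term weighting), which this sketch leaves out.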
doi_str_mv 10.1109/TIFS.2012.2223679
format article
fullrecord <record><control><sourceid>proquest_ieee_</sourceid><recordid>TN_cdi_ieee_primary_6327658</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6327658</ieee_id><sourcerecordid>1671365936</sourcerecordid><originalsourceid>FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873</originalsourceid><addsrcrecordid>eNpdkF1LwzAUhoMoOKc_QLwpiODNZj7afHg3qtPCwAsneBfSLNFK29SkFfbvTd3YhVc5Ic9zzskLwCWCc4SguFsXy9c5hgjPMcaEMnEEJijL6IxCjI4PNSKn4CyELwjTFFE-Ae8PTg-Nafskr4fQG1-1H4l1Plk6b9pQ6WTRqnobqnAfq2TRdd4p_fmHFE28_IxC7ppuiHJStKEzuq9cew5OrKqDudifU_C2fFznz7PVy1ORL1YznWLcz4igYsNTaLBOCSJIpYxrSkvLjWGoLAW2CkLLsNUbhbiAm5JhlllhrS4FZ2QKbnd94y7fgwm9bKqgTV2r1rghSEQZIjQThEb0-h_65QYfvxcpzElKMh5XmAK0o7R3IXhjZeerRvmtRFCOWcsxazlmLfdZR-dm31kFrWrrVaurcBAxixrPRu5qx1XGmMMzJZjROPsXcaaHQw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1283435831</pqid></control><display><type>article</type><title>Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection</title><source>IEEE Xplore (Online service)</source><creator>da Cruz Nassif, L. F. ; Hruschka, E. R.</creator><creatorcontrib>da Cruz Nassif, L. F. ; Hruschka, E. R.</creatorcontrib><description>In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is difficult to be performed. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. 
We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing.</description><identifier>ISSN: 1556-6013</identifier><identifier>EISSN: 1556-6021</identifier><identifier>DOI: 10.1109/TIFS.2012.2223679</identifier><identifier>CODEN: ITIFA6</identifier><language>eng</language><publisher>New York, NY: IEEE</publisher><subject>Algorithm design and analysis ; Algorithms ; Applied sciences ; Artificial intelligence ; Clustering ; Clustering algorithms ; Computer science; control theory; systems ; Data processing. List processing. Character string processing ; Digital forensics ; Exact sciences and technology ; Forensic computing ; Forensic engineering ; Information systems. Data bases ; Inspection ; Links ; Memory and file management (including protection and security) ; Memory organisation. Data processing ; Pattern clustering ; Seizing ; Software ; Speech and sound recognition and synthesis. 
Linguistics ; Studies ; Text analysis ; Text mining ; Texts</subject><ispartof>IEEE transactions on information forensics and security, 2013-01, Vol.8 (1), p.46-54</ispartof><rights>2014 INIST-CNRS</rights><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Jan 2013</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873</citedby><cites>FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6327658$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,4024,27923,27924,27925,54796</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&amp;idt=27109859$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>da Cruz Nassif, L. F.</creatorcontrib><creatorcontrib>Hruschka, E. R.</creatorcontrib><title>Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection</title><title>IEEE transactions on information forensics and security</title><addtitle>TIFS</addtitle><description>In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is difficult to be performed. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. 
We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing.</description><subject>Algorithm design and analysis</subject><subject>Algorithms</subject><subject>Applied sciences</subject><subject>Artificial intelligence</subject><subject>Clustering</subject><subject>Clustering algorithms</subject><subject>Computer science; control theory; systems</subject><subject>Data processing. List processing. Character string processing</subject><subject>Digital forensics</subject><subject>Exact sciences and technology</subject><subject>Forensic computing</subject><subject>Forensic engineering</subject><subject>Information systems. Data bases</subject><subject>Inspection</subject><subject>Links</subject><subject>Memory and file management (including protection and security)</subject><subject>Memory organisation. Data processing</subject><subject>Pattern clustering</subject><subject>Seizing</subject><subject>Software</subject><subject>Speech and sound recognition and synthesis. 
Linguistics</subject><subject>Studies</subject><subject>Text analysis</subject><subject>Text mining</subject><subject>Texts</subject><issn>1556-6013</issn><issn>1556-6021</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><recordid>eNpdkF1LwzAUhoMoOKc_QLwpiODNZj7afHg3qtPCwAsneBfSLNFK29SkFfbvTd3YhVc5Ic9zzskLwCWCc4SguFsXy9c5hgjPMcaEMnEEJijL6IxCjI4PNSKn4CyELwjTFFE-Ae8PTg-Nafskr4fQG1-1H4l1Plk6b9pQ6WTRqnobqnAfq2TRdd4p_fmHFE28_IxC7ppuiHJStKEzuq9cew5OrKqDudifU_C2fFznz7PVy1ORL1YznWLcz4igYsNTaLBOCSJIpYxrSkvLjWGoLAW2CkLLsNUbhbiAm5JhlllhrS4FZ2QKbnd94y7fgwm9bKqgTV2r1rghSEQZIjQThEb0-h_65QYfvxcpzElKMh5XmAK0o7R3IXhjZeerRvmtRFCOWcsxazlmLfdZR-dm31kFrWrrVaurcBAxixrPRu5qx1XGmMMzJZjROPsXcaaHQw</recordid><startdate>201301</startdate><enddate>201301</enddate><creator>da Cruz Nassif, L. F.</creator><creator>Hruschka, E. R.</creator><general>IEEE</general><general>Institute of Electrical and Electronics Engineers</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7TB</scope><scope>8FD</scope><scope>FR3</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>F28</scope></search><sort><creationdate>201301</creationdate><title>Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection</title><author>da Cruz Nassif, L. F. ; Hruschka, E. 
R.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>Algorithm design and analysis</topic><topic>Algorithms</topic><topic>Applied sciences</topic><topic>Artificial intelligence</topic><topic>Clustering</topic><topic>Clustering algorithms</topic><topic>Computer science; control theory; systems</topic><topic>Data processing. List processing. Character string processing</topic><topic>Digital forensics</topic><topic>Exact sciences and technology</topic><topic>Forensic computing</topic><topic>Forensic engineering</topic><topic>Information systems. Data bases</topic><topic>Inspection</topic><topic>Links</topic><topic>Memory and file management (including protection and security)</topic><topic>Memory organisation. Data processing</topic><topic>Pattern clustering</topic><topic>Seizing</topic><topic>Software</topic><topic>Speech and sound recognition and synthesis. Linguistics</topic><topic>Studies</topic><topic>Text analysis</topic><topic>Text mining</topic><topic>Texts</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>da Cruz Nassif, L. F.</creatorcontrib><creatorcontrib>Hruschka, E. 
R.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library Online</collection><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>ANTE: Abstracts in New Technology &amp; Engineering</collection><jtitle>IEEE transactions on information forensics and security</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>da Cruz Nassif, L. F.</au><au>Hruschka, E. R.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection</atitle><jtitle>IEEE transactions on information forensics and security</jtitle><stitle>TIFS</stitle><date>2013-01</date><risdate>2013</risdate><volume>8</volume><issue>1</issue><spage>46</spage><epage>54</epage><pages>46-54</pages><issn>1556-6013</issn><eissn>1556-6021</eissn><coden>ITIFA6</coden><abstract>In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is difficult to be performed. 
In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing.</abstract><cop>New York, NY</cop><pub>IEEE</pub><doi>10.1109/TIFS.2012.2223679</doi><tpages>9</tpages></addata></record>
fulltext fulltext
identifier ISSN: 1556-6013
ispartof IEEE transactions on information forensics and security, 2013-01, Vol.8 (1), p.46-54
issn 1556-6013
1556-6021
language eng
recordid cdi_ieee_primary_6327658
source IEEE Xplore (Online service)
subjects Algorithm design and analysis
Algorithms
Applied sciences
Artificial intelligence
Clustering
Clustering algorithms
Computer science; control theory; systems
Data processing. List processing. Character string processing
Digital forensics
Exact sciences and technology
Forensic computing
Forensic engineering
Information systems. Data bases
Inspection
Links
Memory and file management (including protection and security)
Memory organisation. Data processing
Pattern clustering
Seizing
Software
Speech and sound recognition and synthesis. Linguistics
Studies
Text analysis
Text mining
Texts
title Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T02%3A05%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Document%20Clustering%20for%20Forensic%20Analysis:%20An%20Approach%20for%20Improving%20Computer%20Inspection&rft.jtitle=IEEE%20transactions%20on%20information%20forensics%20and%20security&rft.au=da%20Cruz%20Nassif,%20L.%20F.&rft.date=2013-01&rft.volume=8&rft.issue=1&rft.spage=46&rft.epage=54&rft.pages=46-54&rft.issn=1556-6013&rft.eissn=1556-6021&rft.coden=ITIFA6&rft_id=info:doi/10.1109/TIFS.2012.2223679&rft_dat=%3Cproquest_ieee_%3E1671365936%3C/proquest_ieee_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1283435831&rft_id=info:pmid/&rft_ieee_id=6327658&rfr_iscdi=true