Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection
In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In this context, automated methods of analysis are of great interest. In particular, a...
Published in: | IEEE transactions on information forensics and security 2013-01, Vol.8 (1), p.46-54 |
---|---|
Main Authors: | da Cruz Nassif, L. F. ; Hruschka, E. R. |
Format: | Article |
Language: | English |
Subjects: | |
cited_by | cdi_FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873 |
---|---|
cites | cdi_FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873 |
container_end_page | 54 |
container_issue | 1 |
container_start_page | 46 |
container_title | IEEE transactions on information forensics and security |
container_volume | 8 |
creator | da Cruz Nassif, L. F. ; Hruschka, E. R. |
description | In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing. |
doi_str_mv | 10.1109/TIFS.2012.2223679 |
format | article |
fullrecord | <record><control><sourceid>proquest_ieee_</sourceid><recordid>TN_cdi_ieee_primary_6327658</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6327658</ieee_id><sourcerecordid>1671365936</sourcerecordid><originalsourceid>FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873</originalsourceid><addsrcrecordid>eNpdkF1LwzAUhoMoOKc_QLwpiODNZj7afHg3qtPCwAsneBfSLNFK29SkFfbvTd3YhVc5Ic9zzskLwCWCc4SguFsXy9c5hgjPMcaEMnEEJijL6IxCjI4PNSKn4CyELwjTFFE-Ae8PTg-Nafskr4fQG1-1H4l1Plk6b9pQ6WTRqnobqnAfq2TRdd4p_fmHFE28_IxC7ppuiHJStKEzuq9cew5OrKqDudifU_C2fFznz7PVy1ORL1YznWLcz4igYsNTaLBOCSJIpYxrSkvLjWGoLAW2CkLLsNUbhbiAm5JhlllhrS4FZ2QKbnd94y7fgwm9bKqgTV2r1rghSEQZIjQThEb0-h_65QYfvxcpzElKMh5XmAK0o7R3IXhjZeerRvmtRFCOWcsxazlmLfdZR-dm31kFrWrrVaurcBAxixrPRu5qx1XGmMMzJZjROPsXcaaHQw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1283435831</pqid></control><display><type>article</type><title>Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection</title><source>IEEE Xplore (Online service)</source><creator>da Cruz Nassif, L. F. ; Hruschka, E. R.</creator><creatorcontrib>da Cruz Nassif, L. F. ; Hruschka, E. R.</creatorcontrib><description>In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is difficult to be performed. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. 
We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing.</description><identifier>ISSN: 1556-6013</identifier><identifier>EISSN: 1556-6021</identifier><identifier>DOI: 10.1109/TIFS.2012.2223679</identifier><identifier>CODEN: ITIFA6</identifier><language>eng</language><publisher>New York, NY: IEEE</publisher><subject>Algorithm design and analysis ; Algorithms ; Applied sciences ; Artificial intelligence ; Clustering ; Clustering algorithms ; Computer science; control theory; systems ; Data processing. List processing. Character string processing ; Digital forensics ; Exact sciences and technology ; Forensic computing ; Forensic engineering ; Information systems. Data bases ; Inspection ; Links ; Memory and file management (including protection and security) ; Memory organisation. Data processing ; Pattern clustering ; Seizing ; Software ; Speech and sound recognition and synthesis. 
Linguistics ; Studies ; Text analysis ; Text mining ; Texts</subject><ispartof>IEEE transactions on information forensics and security, 2013-01, Vol.8 (1), p.46-54</ispartof><rights>2014 INIST-CNRS</rights><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Jan 2013</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873</citedby><cites>FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6327658$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,4024,27923,27924,27925,54796</link.rule.ids><backlink>$$Uhttp://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=27109859$$DView record in Pascal Francis$$Hfree_for_read</backlink></links><search><creatorcontrib>da Cruz Nassif, L. F.</creatorcontrib><creatorcontrib>Hruschka, E. R.</creatorcontrib><title>Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection</title><title>IEEE transactions on information forensics and security</title><addtitle>TIFS</addtitle><description>In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is difficult to be performed. In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. 
We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing.</description><subject>Algorithm design and analysis</subject><subject>Algorithms</subject><subject>Applied sciences</subject><subject>Artificial intelligence</subject><subject>Clustering</subject><subject>Clustering algorithms</subject><subject>Computer science; control theory; systems</subject><subject>Data processing. List processing. Character string processing</subject><subject>Digital forensics</subject><subject>Exact sciences and technology</subject><subject>Forensic computing</subject><subject>Forensic engineering</subject><subject>Information systems. Data bases</subject><subject>Inspection</subject><subject>Links</subject><subject>Memory and file management (including protection and security)</subject><subject>Memory organisation. Data processing</subject><subject>Pattern clustering</subject><subject>Seizing</subject><subject>Software</subject><subject>Speech and sound recognition and synthesis. 
Linguistics</subject><subject>Studies</subject><subject>Text analysis</subject><subject>Text mining</subject><subject>Texts</subject><issn>1556-6013</issn><issn>1556-6021</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2013</creationdate><recordtype>article</recordtype><recordid>eNpdkF1LwzAUhoMoOKc_QLwpiODNZj7afHg3qtPCwAsneBfSLNFK29SkFfbvTd3YhVc5Ic9zzskLwCWCc4SguFsXy9c5hgjPMcaEMnEEJijL6IxCjI4PNSKn4CyELwjTFFE-Ae8PTg-Nafskr4fQG1-1H4l1Plk6b9pQ6WTRqnobqnAfq2TRdd4p_fmHFE28_IxC7ppuiHJStKEzuq9cew5OrKqDudifU_C2fFznz7PVy1ORL1YznWLcz4igYsNTaLBOCSJIpYxrSkvLjWGoLAW2CkLLsNUbhbiAm5JhlllhrS4FZ2QKbnd94y7fgwm9bKqgTV2r1rghSEQZIjQThEb0-h_65QYfvxcpzElKMh5XmAK0o7R3IXhjZeerRvmtRFCOWcsxazlmLfdZR-dm31kFrWrrVaurcBAxixrPRu5qx1XGmMMzJZjROPsXcaaHQw</recordid><startdate>201301</startdate><enddate>201301</enddate><creator>da Cruz Nassif, L. F.</creator><creator>Hruschka, E. R.</creator><general>IEEE</general><general>Institute of Electrical and Electronics Engineers</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>IQODW</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7TB</scope><scope>8FD</scope><scope>FR3</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>F28</scope></search><sort><creationdate>201301</creationdate><title>Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection</title><author>da Cruz Nassif, L. F. ; Hruschka, E. 
R.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2013</creationdate><topic>Algorithm design and analysis</topic><topic>Algorithms</topic><topic>Applied sciences</topic><topic>Artificial intelligence</topic><topic>Clustering</topic><topic>Clustering algorithms</topic><topic>Computer science; control theory; systems</topic><topic>Data processing. List processing. Character string processing</topic><topic>Digital forensics</topic><topic>Exact sciences and technology</topic><topic>Forensic computing</topic><topic>Forensic engineering</topic><topic>Information systems. Data bases</topic><topic>Inspection</topic><topic>Links</topic><topic>Memory and file management (including protection and security)</topic><topic>Memory organisation. Data processing</topic><topic>Pattern clustering</topic><topic>Seizing</topic><topic>Software</topic><topic>Speech and sound recognition and synthesis. Linguistics</topic><topic>Studies</topic><topic>Text analysis</topic><topic>Text mining</topic><topic>Texts</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>da Cruz Nassif, L. F.</creatorcontrib><creatorcontrib>Hruschka, E. 
R.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library Online</collection><collection>Pascal-Francis</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Mechanical & Transportation Engineering Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>ANTE: Abstracts in New Technology & Engineering</collection><jtitle>IEEE transactions on information forensics and security</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>da Cruz Nassif, L. F.</au><au>Hruschka, E. R.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection</atitle><jtitle>IEEE transactions on information forensics and security</jtitle><stitle>TIFS</stitle><date>2013-01</date><risdate>2013</risdate><volume>8</volume><issue>1</issue><spage>46</spage><epage>54</epage><pages>46-54</pages><issn>1556-6013</issn><eissn>1556-6021</eissn><coden>ITIFA6</coden><abstract>In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is difficult to be performed. 
In this context, automated methods of analysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. We illustrate the proposed approach by carrying out extensive experimentation with six well-known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations. Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algorithms provide the best results for our application domain. If suitably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and discuss several practical results that can be useful for researchers and practitioners of forensic computing.</abstract><cop>New York, NY</cop><pub>IEEE</pub><doi>10.1109/TIFS.2012.2223679</doi><tpages>9</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1556-6013 |
ispartof | IEEE transactions on information forensics and security, 2013-01, Vol.8 (1), p.46-54 |
issn | 1556-6013 ; 1556-6021 |
language | eng |
recordid | cdi_ieee_primary_6327658 |
source | IEEE Xplore (Online service) |
subjects | Algorithm design and analysis ; Algorithms ; Applied sciences ; Artificial intelligence ; Clustering ; Clustering algorithms ; Computer science; control theory; systems ; Data processing. List processing. Character string processing ; Digital forensics ; Exact sciences and technology ; Forensic computing ; Forensic engineering ; Information systems. Data bases ; Inspection ; Links ; Memory and file management (including protection and security) ; Memory organisation. Data processing ; Pattern clustering ; Seizing ; Software ; Speech and sound recognition and synthesis. Linguistics ; Studies ; Text analysis ; Text mining ; Texts |
title | Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T02%3A05%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Document%20Clustering%20for%20Forensic%20Analysis:%20An%20Approach%20for%20Improving%20Computer%20Inspection&rft.jtitle=IEEE%20transactions%20on%20information%20forensics%20and%20security&rft.au=da%20Cruz%20Nassif,%20L.%20F.&rft.date=2013-01&rft.volume=8&rft.issue=1&rft.spage=46&rft.epage=54&rft.pages=46-54&rft.issn=1556-6013&rft.eissn=1556-6021&rft.coden=ITIFA6&rft_id=info:doi/10.1109/TIFS.2012.2223679&rft_dat=%3Cproquest_ieee_%3E1671365936%3C/proquest_ieee_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c422t-3969d840e2c43131a478c66bf8ee71bb92fa00f72fcda1890db7275f9ffcb9873%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1283435831&rft_id=info:pmid/&rft_ieee_id=6327658&rfr_iscdi=true |