An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit
•Document clustering with document embedding representations combined with k-means clustering delivered the best performance.
•The epochs required for optimal training of document embeddings is in general inversely proportional to the document length.
•Document clusters can be interpreted by top terms...
Published in: | Information processing & management 2020-03, Vol.57 (2), p.102034, Article 102034 |
---|---|
Main Authors: | Curiskis, Stephan A.; Drake, Barry; Osborn, Thomas R.; Kennedy, Paul J. |
Format: | Article |
Language: | English |
Subjects: | |
cited_by | cdi_FETCH-LOGICAL-c325t-44e46f44e0bf1592dbae022c709142132ec9eb57abd3862fbf70d7412dc88a593 |
---|---|
cites | cdi_FETCH-LOGICAL-c325t-44e46f44e0bf1592dbae022c709142132ec9eb57abd3862fbf70d7412dc88a593 |
container_end_page | |
container_issue | 2 |
container_start_page | 102034 |
container_title | Information processing & management |
container_volume | 57 |
creator | Curiskis, Stephan A.; Drake, Barry; Osborn, Thomas R.; Kennedy, Paul J. |
description | •Document clustering with document embedding representations combined with k-means clustering delivered the best performance.•The epochs required for optimal training of document embeddings is in general inversely proportional to the document length.•Document clusters can be interpreted by top terms extracted from combining TF-IDF scores with word embedding similarities.•The Adjusted Rand Index and Adjusted Mutual Information are the most appropriate extrinsic evaluation measures for clustering.
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures. |
doi_str_mv | 10.1016/j.ipm.2019.04.002 |
format | article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2354808283</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0306457318307805</els_id><sourcerecordid>2354808283</sourcerecordid><originalsourceid>FETCH-LOGICAL-c325t-44e46f44e0bf1592dbae022c709142132ec9eb57abd3862fbf70d7412dc88a593</originalsourceid><addsrcrecordid>eNp9kE9r3DAQxUVoINs0HyA3Qc92Rn-8ttvTsrRNIBAom7OQpXGQ45W2krwh3z7abs-9zDCP95sZHiG3DGoGbH031e6wrzmwvgZZA_ALsmJdK6pGtOwTWYGAdSWbVlyRzylNACAbxldk2niKRz0vOrvgaRipDWbZo8_UzEvKGJ1_odpbmsPBGboPFuf5pDlP81ugwZcJaQrG6Zl6LFp8Td_o7s3lQv9Ff6O1Ln8hl6OeE97869fk-eeP3fa-enz69bDdPFZG8CZXUqJcj6XCMLKm53bQCJybFnomORMcTY9D0-rBim7Nx2FswbaScWu6Tje9uCZfz3sPMfxZMGU1hSX6clJx0cgOOt6J4mJnl4khpYijOkS31_FdMVCnSNWkSqTqFKkCqUqkhfl-ZrC8f3QYVTIOvUHrIpqsbHD_oT8A2NZ_rQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2354808283</pqid></control><display><type>article</type><title>An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit</title><source>Library & Information Science Abstracts (LISA)</source><source>ScienceDirect Journals</source><creator>Curiskis, Stephan A. ; Drake, Barry ; Osborn, Thomas R. ; Kennedy, Paul J.</creator><creatorcontrib>Curiskis, Stephan A. ; Drake, Barry ; Osborn, Thomas R. ; Kennedy, Paul J.</creatorcontrib><description>•Document clustering with document embedding representations combined with k-means clustering delivered the best performance.•The epochs required for optimal training of document embeddings is in general inversely proportional to the document length.•Document clusters can be interpreted by top terms extracted from combining TF-IDF scores with word embedding similarities.•The Adjusted Rand Index and Adjusted Mutual Information are the most appropriate extrinsic evaluation measures for clustering.
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. 
We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.</description><identifier>ISSN: 0306-4573</identifier><identifier>EISSN: 1873-5371</identifier><identifier>DOI: 10.1016/j.ipm.2019.04.002</identifier><language>eng</language><publisher>Oxford: Elsevier Ltd</publisher><subject>Clustering ; Data mining ; Data models ; Datasets ; Dirichlet problem ; Distance measurement ; Document clustering ; Embedding ; Embedding models ; Modelling ; Neural networks ; Online social networks ; Representations ; Social networks ; Text categorization ; Topic discovery ; Topic modelling ; User generated content</subject><ispartof>Information processing & management, 2020-03, Vol.57 (2), p.102034, Article 102034</ispartof><rights>2019 Elsevier Ltd</rights><rights>Copyright Pergamon Press Inc. Mar 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c325t-44e46f44e0bf1592dbae022c709142132ec9eb57abd3862fbf70d7412dc88a593</citedby><cites>FETCH-LOGICAL-c325t-44e46f44e0bf1592dbae022c709142132ec9eb57abd3862fbf70d7412dc88a593</cites><orcidid>0000-0001-7837-3171 ; 0000-0003-0572-9936</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925,34135</link.rule.ids></links><search><creatorcontrib>Curiskis, Stephan A.</creatorcontrib><creatorcontrib>Drake, Barry</creatorcontrib><creatorcontrib>Osborn, Thomas R.</creatorcontrib><creatorcontrib>Kennedy, Paul J.</creatorcontrib><title>An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit</title><title>Information processing & management</title><description>•Document clustering with document embedding representations combined with k-means clustering delivered the best 
performance.•The epochs required for optimal training of document embeddings is in general inversely proportional to the document length.•Document clusters can be interpreted by top terms extracted from combining TF-IDF scores with word embedding similarities.•The Adjusted Rand Index and Adjusted Mutual Information are the most appropriate extrinsic evaluation measures for clustering.
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. 
We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.</description><subject>Clustering</subject><subject>Data mining</subject><subject>Data models</subject><subject>Datasets</subject><subject>Dirichlet problem</subject><subject>Distance measurement</subject><subject>Document clustering</subject><subject>Embedding</subject><subject>Embedding models</subject><subject>Modelling</subject><subject>Neural networks</subject><subject>Online social networks</subject><subject>Representations</subject><subject>Social networks</subject><subject>Text categorization</subject><subject>Topic discovery</subject><subject>Topic modelling</subject><subject>User generated content</subject><issn>0306-4573</issn><issn>1873-5371</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>F2A</sourceid><recordid>eNp9kE9r3DAQxUVoINs0HyA3Qc92Rn-8ttvTsrRNIBAom7OQpXGQ45W2krwh3z7abs-9zDCP95sZHiG3DGoGbH031e6wrzmwvgZZA_ALsmJdK6pGtOwTWYGAdSWbVlyRzylNACAbxldk2niKRz0vOrvgaRipDWbZo8_UzEvKGJ1_odpbmsPBGboPFuf5pDlP81ugwZcJaQrG6Zl6LFp8Td_o7s3lQv9Ff6O1Ln8hl6OeE97869fk-eeP3fa-enz69bDdPFZG8CZXUqJcj6XCMLKm53bQCJybFnomORMcTY9D0-rBim7Nx2FswbaScWu6Tje9uCZfz3sPMfxZMGU1hSX6clJx0cgOOt6J4mJnl4khpYijOkS31_FdMVCnSNWkSqTqFKkCqUqkhfl-ZrC8f3QYVTIOvUHrIpqsbHD_oT8A2NZ_rQ</recordid><startdate>202003</startdate><enddate>202003</enddate><creator>Curiskis, Stephan A.</creator><creator>Drake, Barry</creator><creator>Osborn, Thomas R.</creator><creator>Kennedy, Paul J.</creator><general>Elsevier Ltd</general><general>Elsevier Science Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>E3H</scope><scope>F2A</scope><orcidid>https://orcid.org/0000-0001-7837-3171</orcidid><orcidid>https://orcid.org/0000-0003-0572-9936</orcidid></search><sort><creationdate>202003</creationdate><title>An evaluation of document clustering and topic modelling in two 
online social networks: Twitter and Reddit</title><author>Curiskis, Stephan A. ; Drake, Barry ; Osborn, Thomas R. ; Kennedy, Paul J.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c325t-44e46f44e0bf1592dbae022c709142132ec9eb57abd3862fbf70d7412dc88a593</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Clustering</topic><topic>Data mining</topic><topic>Data models</topic><topic>Datasets</topic><topic>Dirichlet problem</topic><topic>Distance measurement</topic><topic>Document clustering</topic><topic>Embedding</topic><topic>Embedding models</topic><topic>Modelling</topic><topic>Neural networks</topic><topic>Online social networks</topic><topic>Representations</topic><topic>Social networks</topic><topic>Text categorization</topic><topic>Topic discovery</topic><topic>Topic modelling</topic><topic>User generated content</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Curiskis, Stephan A.</creatorcontrib><creatorcontrib>Drake, Barry</creatorcontrib><creatorcontrib>Osborn, Thomas R.</creatorcontrib><creatorcontrib>Kennedy, Paul J.</creatorcontrib><collection>CrossRef</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><jtitle>Information processing & management</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Curiskis, Stephan A.</au><au>Drake, Barry</au><au>Osborn, Thomas R.</au><au>Kennedy, Paul J.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit</atitle><jtitle>Information processing & 
management</jtitle><date>2020-03</date><risdate>2020</risdate><volume>57</volume><issue>2</issue><spage>102034</spage><pages>102034-</pages><artnum>102034</artnum><issn>0306-4573</issn><eissn>1873-5371</eissn><abstract>•Document clustering with document embedding representations combined with k-means clustering delivered the best performance.•The epochs required for optimal training of document embeddings is in general inversely proportional to the document length.•Document clusters can be interpreted by top terms extracted from combining TF-IDF scores with word embedding similarities.•The Adjusted Rand Index and Adjusted Mutual Information are the most appropriate extrinsic evaluation measures for clustering.
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.</abstract><cop>Oxford</cop><pub>Elsevier Ltd</pub><doi>10.1016/j.ipm.2019.04.002</doi><orcidid>https://orcid.org/0000-0001-7837-3171</orcidid><orcidid>https://orcid.org/0000-0003-0572-9936</orcidid></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0306-4573 |
ispartof | Information processing & management, 2020-03, Vol.57 (2), p.102034, Article 102034 |
issn | 0306-4573; 1873-5371 |
language | eng |
recordid | cdi_proquest_journals_2354808283 |
source | Library & Information Science Abstracts (LISA); ScienceDirect Journals |
subjects | Clustering; Data mining; Data models; Datasets; Dirichlet problem; Distance measurement; Document clustering; Embedding; Embedding models; Modelling; Neural networks; Online social networks; Representations; Social networks; Text categorization; Topic discovery; Topic modelling; User generated content |
title | An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T17%3A18%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20evaluation%20of%20document%20clustering%20and%20topic%20modelling%20in%20two%20online%20social%20networks:%20Twitter%20and%20Reddit&rft.jtitle=Information%20processing%20&%20management&rft.au=Curiskis,%20Stephan%20A.&rft.date=2020-03&rft.volume=57&rft.issue=2&rft.spage=102034&rft.pages=102034-&rft.artnum=102034&rft.issn=0306-4573&rft.eissn=1873-5371&rft_id=info:doi/10.1016/j.ipm.2019.04.002&rft_dat=%3Cproquest_cross%3E2354808283%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c325t-44e46f44e0bf1592dbae022c709142132ec9eb57abd3862fbf70d7412dc88a593%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2354808283&rft_id=info:pmid/&rfr_iscdi=true |
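
The description in this record names the Adjusted Rand Index (ARI) as one of the most appropriate extrinsic measures for comparing a clustering against reference labels. As a rough illustration only, the sketch below computes the ARI with the standard pair-counting formula, using nothing beyond the Python standard library. The function name `adjusted_rand_index` and the toy label lists are assumptions made for this example; they are not taken from the article, and this is not the authors' code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two flat partitions.

    Hypothetical helper written for illustration; not the article's code.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions agree trivially.
        return 1.0

    # Contingency counts n_ij and the marginals a_i (rows) and b_j (columns).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    # Expected index under random labelling, and the maximum attainable index.
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2

    if max_index == expected:
        # Degenerate case, e.g. both partitions put every item in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels, invented for this example.
reference = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]   # same grouping, different cluster names
crossed = [0, 1, 0, 1]      # splits every reference cluster in half

same_score = adjusted_rand_index(reference, relabelled)
crossed_score = adjusted_rand_index(reference, crossed)
```

Because the index is corrected for chance, a clustering that only renames the reference clusters scores the maximum, while one that cuts across them can score below zero. That correction is one reason a chance-adjusted measure is easier to compare across data sets with different numbers of clusters.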