
An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Bibliographic Details
Published in: Information processing & management 2020-03, Vol.57 (2), p.102034, Article 102034
Main Authors: Curiskis, Stephan A.; Drake, Barry; Osborn, Thomas R.; Kennedy, Paul J.
Format: Article
Language: English
container_issue 2
container_start_page 102034
container_title Information processing & management
container_volume 57
creator Curiskis, Stephan A.
Drake, Barry
Osborn, Thomas R.
Kennedy, Paul J.
description
• Document clustering with document embedding representations combined with k-means clustering delivered the best performance.
• The number of epochs required for optimal training of document embeddings is, in general, inversely proportional to the document length.
• Document clusters can be interpreted by top terms extracted by combining TF-IDF scores with word embedding similarities.
• The Adjusted Rand Index and Adjusted Mutual Information are the most appropriate extrinsic evaluation measures for clustering.
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user-generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data, as such text is notoriously short and noisy, and results are often not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models, combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over datasets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all datasets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words-based approach using tf-idf weights combined with embedding distance measures.
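The description recommends the Adjusted Rand Index (ARI) as an extrinsic evaluation measure. As an illustration only, and not the authors' code, the following is a minimal pure-Python sketch of the standard ARI formula computed from pair counts. The function name `adjusted_rand_index` and the toy label lists are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same items.

    Hypothetical helper written for illustration; it follows the standard
    pair-counting definition of the ARI.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)

    # Contingency counts: items sharing a true class and a predicted cluster.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together in each cell, row and column.
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as trivially identical
    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions use a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example: ground-truth topics versus cluster assignments.
truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, relabelled
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that splits every topic
```

Because the index is corrected for chance, a clustering that matches the ground truth up to relabelling scores 1, while random assignments score close to 0 and can go negative.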
doi_str_mv 10.1016/j.ipm.2019.04.002
format article
eissn 1873-5371
publisher Oxford: Elsevier Ltd
rights 2019 Elsevier Ltd; Copyright Pergamon Press Inc. Mar 2020
orcid https://orcid.org/0000-0001-7837-3171; https://orcid.org/0000-0003-0572-9936
toplevel peer_reviewed; online_resources
fulltext fulltext
identifier ISSN: 0306-4573
ispartof Information processing & management, 2020-03, Vol.57 (2), p.102034, Article 102034
issn 0306-4573
1873-5371
language eng
recordid cdi_proquest_journals_2354808283
source Library & Information Science Abstracts (LISA); ScienceDirect Journals
subjects Clustering
Data mining
Data models
Datasets
Dirichlet problem
Distance measurement
Document clustering
Embedding
Embedding models
Modelling
Neural networks
Online social networks
Representations
Social networks
Text categorization
Topic discovery
Topic modelling
User generated content
title An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T17%3A18%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20evaluation%20of%20document%20clustering%20and%20topic%20modelling%20in%20two%20online%20social%20networks:%20Twitter%20and%20Reddit&rft.jtitle=Information%20processing%20&%20management&rft.au=Curiskis,%20Stephan%20A.&rft.date=2020-03&rft.volume=57&rft.issue=2&rft.spage=102034&rft.pages=102034-&rft.artnum=102034&rft.issn=0306-4573&rft.eissn=1873-5371&rft_id=info:doi/10.1016/j.ipm.2019.04.002&rft_dat=%3Cproquest_cross%3E2354808283%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c325t-44e46f44e0bf1592dbae022c709142132ec9eb57abd3862fbf70d7412dc88a593%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2354808283&rft_id=info:pmid/&rfr_iscdi=true