
An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

Bibliographic Details
Published in: Information Processing & Management, 2020-03, Vol. 57 (2), p. 102034, Article 102034
Main Authors: Curiskis, Stephan A., Drake, Barry, Osborn, Thomas R., Kennedy, Paul J.
Format: Article
Language:English
Description
Summary:
•Document clustering with document embedding representations combined with k-means clustering delivered the best performance.
•The number of epochs required for optimal training of document embeddings is, in general, inversely proportional to the document length.
•Document clusters can be interpreted using top terms extracted by combining tf-idf scores with word embedding similarities.
•The Adjusted Rand Index and Adjusted Mutual Information are the most appropriate extrinsic evaluation measures for clustering.

Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user-generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data, as such text is notoriously short and noisy, and results are often not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations, derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models, combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion of and recommendation for the most appropriate extrinsic measures for this task. We also report the performance of the methods over datasets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all datasets under the appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words approach that combines tf-idf weights with embedding distance measures.
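To make the pipeline summarised above concrete, the following is a minimal sketch in Python, assuming gensim's Doc2Vec for document embeddings and scikit-learn's KMeans, adjusted_rand_score and adjusted_mutual_info_score for clustering and extrinsic evaluation. The toy corpus, hyperparameters and cluster count are illustrative placeholders, not the settings or datasets used in the paper.

# Minimal sketch: document embeddings + k-means, scored with ARI and AMI.
# Assumes gensim 4.x and scikit-learn; all values below are placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Toy corpus standing in for short OSN posts, with known topic labels.
docs = [
    "new phone camera review battery life",
    "smartphone screen battery upgrade review",
    "election results polling vote count",
    "senate vote election campaign debate",
]
true_labels = [0, 0, 1, 1]

tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# Train a document embedding model. The paper observes that shorter documents
# tend to need more training epochs; 100 here is just an illustrative value.
model = Doc2Vec(vector_size=50, min_count=1, epochs=100)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Cluster the learned document vectors with k-means.
vectors = [model.dv[i] for i in range(len(docs))]
pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Extrinsic evaluation against the known labels, using the measures the
# paper recommends for this task.
print("ARI:", adjusted_rand_score(true_labels, pred_labels))
print("AMI:", adjusted_mutual_info_score(true_labels, pred_labels))

With real Twitter or Reddit data, the vector size, number of epochs and cluster count would need tuning, and the paper's interpretation step would additionally extract per-cluster top terms by combining tf-idf weights with word embedding similarities.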
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2019.04.002