Evaluation of clustering and topic modeling methods over health-related tweets and emails

Bibliographic Details
Published in: Artificial intelligence in medicine, 2021-07, Vol.117, p.102096-102096, Article 102096
Main Authors: Lossio-Ventura, Juan Antonio; Gonzales, Sergio; Morzan, Juandiego; Alatrista-Salas, Hugo; Hernandez-Boussard, Tina; Bian, Jiang
Format: Article
Language: English
container_end_page 102096
container_issue
container_start_page 102096
container_title Artificial intelligence in medicine
container_volume 117
creator Lossio-Ventura, Juan Antonio
Gonzales, Sergio
Morzan, Juandiego
Alatrista-Salas, Hugo
Hernandez-Boussard, Tina
Bian, Jiang
description • Evaluation of topic modeling and clustering on health-related tweets and emails.
• Topic modeling: LSI, LDA, BTM, GibbsLDA, Online LDA, Online Twitter LDA, and GSDMM.
• Clustering: k-means with two feature representations (TF-IDF and Doc2Vec).
• The evaluation is based on two internal and five external cluster validity indices.
The Internet provides different tools for communicating with patients, such as social media (e.g., Twitter) and email platforms. These platforms provide new data sources that shed light on patient experiences with health care and improve our understanding of patient-provider communication. Several existing topic modeling and document clustering methods have been adapted to analyze these new free-text data automatically. However, both tweets and emails are often composed of short texts, and existing topic modeling and clustering approaches perform suboptimally on such texts. Moreover, research over health-related short texts using these methods has become difficult to reproduce and benchmark, partly because no detailed comparison of state-of-the-art topic modeling and clustering methods on these short texts exists. We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., assessing the goodness of a clustering structure without external information) and five external indices (i.e., comparing the results of a cluster analysis to externally provided class labels). Overall, for a number of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices. Also, of all tweets (N = 286,971; HPV represents 94.6% of tweets and Lynch syndrome represents 5.4%), for k = 2, most of the methods could respect this initial clustering distribution. However, we found that model performance varies with the source of data and with hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets. Researchers hoping to group or classify health-related short-text data can expect to select the most suitable topic modeling and clustering methods for their specific research questions. Therefore, we presented a comparison of the most commonly used topic modeling and clustering algorithms over two health-related, short-text datasets using both internal and external clustering validation indices. Internal indices suggested Online Twitter LDA and GSDMM as the best, while external indices suggested LSI and k-means with TF-IDF as the best. In summary, our work suggested that researchers can improve their analysis of model performance by using a variety of metrics, since there is no single best metric.
doi_str_mv 10.1016/j.artmed.2021.102096
format article
oa free_for_read
orcidid https://orcid.org/0000-0002-2238-5429
https://orcid.org/0000-0002-9591-0343
pmid 34127235
publisher Netherlands: Elsevier B.V
rights Copyright © 2021 Elsevier B.V. All rights reserved.
identifier ISSN: 0933-3657
ispartof Artificial intelligence in medicine, 2021-07, Vol.117, p.102096-102096, Article 102096
issn 0933-3657
1873-2860
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9040385
source ScienceDirect Freedom Collection
subjects Cluster Analysis
Clustering
Communication
Electronic Mail
External cluster indices
Humans
Internal cluster indices
Machine Learning
Natural language processing
Social Media
Topic modeling
title Evaluation of clustering and topic modeling methods over health-related tweets and emails
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-10T03%3A14%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Evaluation%20of%20clustering%20and%20topic%20modeling%20methods%20over%20health-related%20tweets%20and%20emails&rft.jtitle=Artificial%20intelligence%20in%20medicine&rft.au=Lossio-Ventura,%20Juan%20Antonio&rft.date=2021-07-01&rft.volume=117&rft.spage=102096&rft.epage=102096&rft.pages=102096-102096&rft.artnum=102096&rft.issn=0933-3657&rft.eissn=1873-2860&rft_id=info:doi/10.1016/j.artmed.2021.102096&rft_dat=%3Cproquest_pubme%3E2541319939%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c463t-f03082ddfe5a31ddff1524ca53e7a70a58ade5ec4a4dfcf7832fb8d9f73de8ee3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2541319939&rft_id=info:pmid/34127235&rfr_iscdi=true