Experimental study on short-text clustering using transformer-based semantic similarity measure
Sentence clustering plays a central role in various text-processing activities and has received extensive attention for measuring semantic similarity between compared sentences. However, relatively little focus has been placed on evaluating clustering performance using available similarity measures...
Published in: | PeerJ. Computer science 2024-05, Vol.10, p.e2078-e2078, Article e2078 |
---|---|
Main Authors: | Abdalgader, Khaled; Matroud, Atheer A; Hossin, Khaled |
Format: | Article |
Language: | English |
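Subjects: | Analysis; Artificial Intelligence; Co-occurrence representation; Computational linguistics; Electric transformers; Embedding representation; Language processing; Laws, regulations and rules; Measurement; Natural Language and Speech; Natural language interfaces; Network Science and Online Social Networks; Sentence clustering; Sentence similarity; Sentiment Analysis; Text Mining |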
cited_by | |
---|---|
cites | cdi_FETCH-LOGICAL-c441t-efff5ed8628063711443bf561954de0fb945bd71deb0b51f7b6c6566bb213b1a3 |
container_end_page | e2078 |
container_issue | |
container_start_page | e2078 |
container_title | PeerJ. Computer science |
container_volume | 10 |
creator | Abdalgader, Khaled ; Matroud, Atheer A ; Hossin, Khaled |
description | Sentence clustering plays a central role in various text-processing activities and has received extensive attention for measuring semantic similarity between compared sentences. However, relatively little focus has been placed on evaluating clustering performance using available similarity measures that adopt low-dimensional continuous representations. Such representations are crucial in domains like sentence clustering, where traditional word co-occurrence representations often achieve poor results when clustering semantically similar sentences that share no common words. This article presents a new implementation that incorporates a sentence similarity measure based on the notion of embedding representation for evaluating the performance of three types of text clustering methods: partitional clustering, hierarchical clustering, and fuzzy clustering, on standard textual datasets. This measure derives its semantic information from pre-training models designed to simulate human knowledge about words in natural language. The article also compares the performance of the used similarity measure by training it on two state-of-the-art pre-training models to investigate which yields better results. We argue that the superior performance of the selected clustering methods stems from their more effective use of the semantic information offered by this embedding-based similarity measure. Furthermore, we use hierarchical clustering, the best-performing method, for a text summarization task and report the results. The implementation in this article demonstrates that incorporating the sentence embedding measure leads to significantly improved performance in both text clustering and text summarization tasks. |
doi_str_mv | 10.7717/peerj-cs.2078 |
format | article |
addtitle | PeerJ Comput Sci |
backlink | https://www.ncbi.nlm.nih.gov/pubmed/38855231 |
cop | United States |
date | 2024-05-29 |
galeid | A813810692 |
lds50 | peer_reviewed |
linktohtml | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11157522/ |
linktopdf | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11157522/pdf/ |
oa | free_for_read |
pmid | 38855231 |
pqid | 3066340429 |
pub | PeerJ. Ltd |
rights | 2024 Abdalgader et al. ; COPYRIGHT 2024 PeerJ. Ltd. |
fulltext | fulltext |
identifier | ISSN: 2376-5992 |
ispartof | PeerJ. Computer science, 2024-05, Vol.10, p.e2078-e2078, Article e2078 |
issn | 2376-5992 2376-5992 |
language | eng |
recordid | cdi_doaj_primary_oai_doaj_org_article_822c7978d6304184b305df0d5329a840 |
source | Publicly Available Content Database; PubMed Central |
subjects | Analysis ; Artificial Intelligence ; Co-occurrence representation ; Computational linguistics ; Electric transformers ; Embedding representation ; Language processing ; Laws, regulations and rules ; Measurement ; Natural Language and Speech ; Natural language interfaces ; Network Science and Online Social Networks ; Sentence clustering ; Sentence similarity ; Sentiment Analysis ; Text Mining |
title | Experimental study on short-text clustering using transformer-based semantic similarity measure |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T22%3A57%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Experimental%20study%20on%20short-text%20clustering%20using%20transformer-based%20semantic%20similarity%20measure&rft.jtitle=PeerJ.%20Computer%20science&rft.au=Abdalgader,%20Khaled&rft.date=2024-05-29&rft.volume=10&rft.spage=e2078&rft.epage=e2078&rft.pages=e2078-e2078&rft.artnum=e2078&rft.issn=2376-5992&rft.eissn=2376-5992&rft_id=info:doi/10.7717/peerj-cs.2078&rft_dat=%3Cgale_doaj_%3EA813810692%3C/gale_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c441t-efff5ed8628063711443bf561954de0fb945bd71deb0b51f7b6c6566bb213b1a3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3066340429&rft_id=info:pmid/38855231&rft_galeid=A813810692&rfr_iscdi=true |