Automated Phrase Mining from Massive Text Corpora
As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguist...
Published in: | IEEE transactions on knowledge and data engineering 2018-10, Vol.30 (10), p.1825-1837 |
---|---|
Main Authors: | Shang, Jingbo; Liu, Jialu; Jiang, Meng; Ren, Xiang; Voss, Clare R.; Han, Jiawei |
Format: | Article |
Language: | English |
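Subjects: | Analyzers; Automatic phrase mining; Automation; distant training; Domains; Electronic publishing; Encyclopedias; Information retrieval; Internet; Knowledge base; Knowledge based systems; Labeling; Mining; multiple languages; part-of-speech tag; phrase mining; Pragmatics; State of the art; Taxonomy |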
cited_by | cdi_FETCH-LOGICAL-c513t-fb345f1343f3e1bb9cc53d18c8481f4c7f1d1d9ee0614cdf7e37e1ed46d169443 |
---|---|
cites | cdi_FETCH-LOGICAL-c513t-fb345f1343f3e1bb9cc53d18c8481f4c7f1d1d9ee0614cdf7e37e1ed46d169443 |
container_end_page | 1837 |
container_issue | 10 |
container_start_page | 1825 |
container_title | IEEE transactions on knowledge and data engineering |
container_volume | 30 |
creator | Shang, Jingbo; Liu, Jialu; Jiang, Meng; Ren, Xiang; Voss, Clare R.; Han, Jiawei |
description | As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases. |
doi_str_mv | 10.1109/TKDE.2018.2812203 |
format | article |
fullrecord | <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_proquest_journals_2117164913</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8306825</ieee_id><sourcerecordid>2117164913</sourcerecordid><originalsourceid>FETCH-LOGICAL-c513t-fb345f1343f3e1bb9cc53d18c8481f4c7f1d1d9ee0614cdf7e37e1ed46d169443</originalsourceid><addsrcrecordid>eNpdkV1LwzAUhoMofv8AEaTgjTedOflo0xtB5ic69GJehyw9mZW1mUkr-u_t2BzqVQLned_k8BByBHQAQIvz8cPV9YBRUAOmgDHKN8guSKlSBgVs9ncqIBVc5DtkL8Y3SqnKFWyTHd7HpQC2S-Cya31tWiyT59dgIiajqqmaaeKCr5ORibH6wGSMn20y9GHugzkgW87MIh6uzn3ycnM9Ht6lj0-398PLx9RK4G3qJlxIB1xwxxEmk8JayUtQVgkFTtjcQQllgUgzELZ0OfIcAUuRlZAVQvB9crHsnXeTGkuLTRvMTM9DVZvwpb2p9N9JU73qqf_QmYSiENAXnK0Kgn_vMLa6rqLF2cw06LuoGeOM5pKxxVun_9A334WmX08zgBwyUQDvKVhSNvgYA7r1Z4DqhRC9EKIXQvRKSJ85-b3FOvFjoAeOl0CFiOux4jRTTPJvQ9iOPg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2117164913</pqid></control><display><type>article</type><title>Automated Phrase Mining from Massive Text Corpora</title><source>IEEE Xplore (Online service)</source><creator>Shang, Jingbo ; Liu, Jialu ; Jiang, Meng ; Ren, Xiang ; Voss, Clare R. ; Han, Jiawei</creator><creatorcontrib>Shang, Jingbo ; Liu, Jialu ; Jiang, Meng ; Ren, Xiang ; Voss, Clare R. ; Han, Jiawei</creatorcontrib><description>As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. 
In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.</description><identifier>ISSN: 1041-4347</identifier><identifier>EISSN: 1558-2191</identifier><identifier>DOI: 10.1109/TKDE.2018.2812203</identifier><identifier>PMID: 31105412</identifier><identifier>CODEN: ITKEEH</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Analyzers ; Automatic phrase mining ; Automation ; distant training ; Domains ; Electronic publishing ; Encyclopedias ; Information retrieval ; Internet ; Knowledge base ; Knowledge based systems ; Labeling ; Mining ; multiple languages ; part-of-speech tag ; phrase mining ; Pragmatics ; State of the art ; Taxonomy</subject><ispartof>IEEE transactions on knowledge and data engineering, 2018-10, Vol.30 (10), p.1825-1837</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2018</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c513t-fb345f1343f3e1bb9cc53d18c8481f4c7f1d1d9ee0614cdf7e37e1ed46d169443</citedby><cites>FETCH-LOGICAL-c513t-fb345f1343f3e1bb9cc53d18c8481f4c7f1d1d9ee0614cdf7e37e1ed46d169443</cites><orcidid>0000-0002-7249-4404</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8306825$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>230,314,780,784,885,27924,27925,54796</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/31105412$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Shang, Jingbo</creatorcontrib><creatorcontrib>Liu, Jialu</creatorcontrib><creatorcontrib>Jiang, Meng</creatorcontrib><creatorcontrib>Ren, Xiang</creatorcontrib><creatorcontrib>Voss, Clare R.</creatorcontrib><creatorcontrib>Han, Jiawei</creatorcontrib><title>Automated Phrase Mining from Massive Text Corpora</title><title>IEEE transactions on knowledge and data engineering</title><addtitle>TKDE</addtitle><addtitle>IEEE Trans Knowl Data Eng</addtitle><description>As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. 
In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.</description><subject>Analyzers</subject><subject>Automatic phrase mining</subject><subject>Automation</subject><subject>distant training</subject><subject>Domains</subject><subject>Electronic publishing</subject><subject>Encyclopedias</subject><subject>Information retrieval</subject><subject>Internet</subject><subject>Knowledge base</subject><subject>Knowledge based systems</subject><subject>Labeling</subject><subject>Mining</subject><subject>multiple languages</subject><subject>part-of-speech tag</subject><subject>phrase mining</subject><subject>Pragmatics</subject><subject>State of the art</subject><subject>Taxonomy</subject><issn>1041-4347</issn><issn>1558-2191</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><recordid>eNpdkV1LwzAUhoMofv8AEaTgjTedOflo0xtB5ic69GJehyw9mZW1mUkr-u_t2BzqVQLned_k8BByBHQAQIvz8cPV9YBRUAOmgDHKN8guSKlSBgVs9ncqIBVc5DtkL8Y3SqnKFWyTHd7HpQC2S-Cya31tWiyT59dgIiajqqmaaeKCr5ORibH6wGSMn20y9GHugzkgW87MIh6uzn3ycnM9Ht6lj0-398PLx9RK4G3qJlxIB1xwxxEmk8JayUtQVgkFTtjcQQllgUgzELZ0OfIcAUuRlZAVQvB9crHsnXeTGkuLTRvMTM9DVZvwpb2p9N9JU73qqf_QmYSiENAXnK0Kgn_vMLa6rqLF2cw06LuoGeOM5pKxxVun_9A334WmX08zgBwyUQDvKVhSNvgYA7r1Z4DqhRC9EKIXQvRKSJ85-b3FOvFjoAeOl0CFiOux4jRTTPJvQ9iOPg</recordid><startdate>20181001</startdate><enddate>20181001</enddate><creator>Shang, Jingbo</creator><creator>Liu, Jialu</creator><creator>Jiang, Meng</creator><creator>Ren, Xiang</creator><creator>Voss, Clare 
R.</creator><creator>Han, Jiawei</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-7249-4404</orcidid></search><sort><creationdate>20181001</creationdate><title>Automated Phrase Mining from Massive Text Corpora</title><author>Shang, Jingbo ; Liu, Jialu ; Jiang, Meng ; Ren, Xiang ; Voss, Clare R. ; Han, Jiawei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c513t-fb345f1343f3e1bb9cc53d18c8481f4c7f1d1d9ee0614cdf7e37e1ed46d169443</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Analyzers</topic><topic>Automatic phrase mining</topic><topic>Automation</topic><topic>distant training</topic><topic>Domains</topic><topic>Electronic publishing</topic><topic>Encyclopedias</topic><topic>Information retrieval</topic><topic>Internet</topic><topic>Knowledge base</topic><topic>Knowledge based systems</topic><topic>Labeling</topic><topic>Mining</topic><topic>multiple languages</topic><topic>part-of-speech tag</topic><topic>phrase mining</topic><topic>Pragmatics</topic><topic>State of the art</topic><topic>Taxonomy</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Shang, Jingbo</creatorcontrib><creatorcontrib>Liu, Jialu</creatorcontrib><creatorcontrib>Jiang, Meng</creatorcontrib><creatorcontrib>Ren, Xiang</creatorcontrib><creatorcontrib>Voss, Clare R.</creatorcontrib><creatorcontrib>Han, Jiawei</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 
1998-Present</collection><collection>IEEE Xplore (Online service)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>IEEE transactions on knowledge and data engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Shang, Jingbo</au><au>Liu, Jialu</au><au>Jiang, Meng</au><au>Ren, Xiang</au><au>Voss, Clare R.</au><au>Han, Jiawei</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Automated Phrase Mining from Massive Text Corpora</atitle><jtitle>IEEE transactions on knowledge and data engineering</jtitle><stitle>TKDE</stitle><addtitle>IEEE Trans Knowl Data Eng</addtitle><date>2018-10-01</date><risdate>2018</risdate><volume>30</volume><issue>10</issue><spage>1825</spage><epage>1837</epage><pages>1825-1837</pages><issn>1041-4347</issn><eissn>1558-2191</eissn><coden>ITKEEH</coden><abstract>As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. 
None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>31105412</pmid><doi>10.1109/TKDE.2018.2812203</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0002-7249-4404</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1041-4347 |
ispartof | IEEE transactions on knowledge and data engineering, 2018-10, Vol.30 (10), p.1825-1837 |
issn | 1041-4347; 1558-2191 |
language | eng |
recordid | cdi_proquest_journals_2117164913 |
source | IEEE Xplore (Online service) |
subjects | Analyzers; Automatic phrase mining; Automation; distant training; Domains; Electronic publishing; Encyclopedias; Information retrieval; Internet; Knowledge base; Knowledge based systems; Labeling; Mining; multiple languages; part-of-speech tag; phrase mining; Pragmatics; State of the art; Taxonomy |
title | Automated Phrase Mining from Massive Text Corpora |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T13%3A55%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Automated%20Phrase%20Mining%20from%20Massive%20Text%20Corpora&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Shang,%20Jingbo&rft.date=2018-10-01&rft.volume=30&rft.issue=10&rft.spage=1825&rft.epage=1837&rft.pages=1825-1837&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2018.2812203&rft_dat=%3Cproquest_pubme%3E2117164913%3C/proquest_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c513t-fb345f1343f3e1bb9cc53d18c8481f4c7f1d1d9ee0614cdf7e37e1ed46d169443%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2117164913&rft_id=info:pmid/31105412&rft_ieee_id=8306825&rfr_iscdi=true |