Automated Phrase Mining from Massive Text Corpora

Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, 2018-10, Vol. 30 (10), p. 1825-1837
Main Authors: Shang, Jingbo; Liu, Jialu; Jiang, Meng; Ren, Xiang; Voss, Clare R.; Han, Jiawei
Format: Article
Language: English
container_end_page 1837
container_issue 10
container_start_page 1825
container_title IEEE transactions on knowledge and data engineering
container_volume 30
creator Shang, Jingbo
Liu, Jialu
Jiang, Meng
Ren, Xiang
Voss, Clare R.
Han, Jiawei
description As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.
doi_str_mv 10.1109/TKDE.2018.2812203
format article
eissn 1558-2191
pmid 31105412
coden ITKEEH
ieee_id 8306825
publisher United States: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2018
orcid https://orcid.org/0000-0002-7249-4404
tpages 13
toplevel peer_reviewed
oa free_for_read
linktohtml https://ieeexplore.ieee.org/document/8306825
backlink https://www.ncbi.nlm.nih.gov/pubmed/31105412
fulltext fulltext
identifier ISSN: 1041-4347
ispartof IEEE transactions on knowledge and data engineering, 2018-10, Vol.30 (10), p.1825-1837
issn 1041-4347
1558-2191
language eng
recordid cdi_proquest_journals_2117164913
source IEEE Xplore (Online service)
subjects Analyzers
Automatic phrase mining
Automation
distant training
Domains
Electronic publishing
Encyclopedias
Information retrieval
Internet
Knowledge base
Knowledge based systems
Labeling
Mining
multiple languages
part-of-speech tag
phrase mining
Pragmatics
State of the art
Taxonomy
title Automated Phrase Mining from Massive Text Corpora