STAVIES: a system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques

A fully automated wrapper for information extraction from Web pages is presented. The motivation behind such systems lies in the emerging need for going beyond the concept of "human browsing". The World Wide Web is today the main "all kind of information" repository and has been...

Bibliographic Details
Published in: IEEE transactions on knowledge and data engineering, 2005-12, Vol. 17 (12), p. 1638-1652
Main Authors: Papadakis, N.K., Skoutas, D., Raftopoulos, K., Varvarigou, T.A.
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c474t-62845960c0cb58290baae0258dc07b3df913847bd9f33bd48110916b2de997d13
cites cdi_FETCH-LOGICAL-c474t-62845960c0cb58290baae0258dc07b3df913847bd9f33bd48110916b2de997d13
container_end_page 1652
container_issue 12
container_start_page 1638
container_title IEEE transactions on knowledge and data engineering
container_volume 17
creator Papadakis, N.K.
Skoutas, D.
Raftopoulos, K.
Varvarigou, T.A.
description A fully automated wrapper for information extraction from Web pages is presented. The motivation behind such systems lies in the emerging need for going beyond the concept of "human browsing". The World Wide Web is today the main "all kind of information" repository and has been so far very successful in disseminating information to humans. By automating the process of information retrieval, further utilization by targeted applications is enabled. The key idea in our novel system is to exploit the format of the Web pages to discover the underlying structure in order to finally infer and extract pieces of information from the Web page. Our system first identifies the section of the Web page that contains the information to be extracted and then extracts it by using clustering techniques and other tools of statistical origin. STAVIES can operate without human intervention and does not require any training. The main innovation and contribution of the proposed system consists of introducing a signal-wise treatment of the tag structural hierarchy and using hierarchical clustering techniques to segment the Web pages. The importance of such a treatment is significant since it permits abstracting away from the raw tag-manipulating approach. Experimental results and comparisons with other state of the art systems are presented and discussed in the paper, indicating the high performance of the proposed algorithm.
doi_str_mv 10.1109/TKDE.2005.203
format article
coden ITKEEH
ieee_id 1524964
lds50 peer_reviewed
linktohtml https://ieeexplore.ieee.org/document/1524964
publisher New York, NY: IEEE
rights 2006 INIST-CNRS
Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2005
tpages 15
fulltext fulltext
identifier ISSN: 1041-4347
ispartof IEEE transactions on knowledge and data engineering, 2005-12, Vol.17 (12), p.1638-1652
issn 1041-4347
1558-2191
language eng
recordid cdi_proquest_miscellaneous_28076585
source IEEE Electronic Library (IEL) Journals
subjects Algorithms
Applied sciences
Automation
Clustering
Clustering algorithms
Computer science; control theory; systems
Computer Society
Data mining
data source wrappers
Exact sciences and technology
Extraction
generic wrappers
Human
Humans
Index Terms- Automatic wrappers
Information retrieval
Information systems. Data bases
Intelligent agent
intelligent agents on the Web
Memory organisation. Data processing
Repositories
resource discovery
Software
Studies
Technological innovation
Web data extraction
Web mining
Web pages
Web sites
Web structure mining
Websites
World Wide Web
title STAVIES: a system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T03%3A55%3A21IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=STAVIES:%20a%20system%20for%20information%20extraction%20from%20unknown%20Web%20data%20sources%20through%20automatic%20Web%20wrapper%20generation%20using%20clustering%20techniques&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Papadakis,%20N.K.&rft.date=2005-12-01&rft.volume=17&rft.issue=12&rft.spage=1638&rft.epage=1652&rft.pages=1638-1652&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2005.203&rft_dat=%3Cproquest_ieee_%3E896185045%3C/proquest_ieee_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c474t-62845960c0cb58290baae0258dc07b3df913847bd9f33bd48110916b2de997d13%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=912078204&rft_id=info:pmid/&rft_ieee_id=1524964&rfr_iscdi=true