CALA: An unsupervised URL-based web page classification system

Unsupervised web page classification refers to the problem of clustering the pages in a web site so that each cluster includes a set of web pages that can be classified using a unique class. The existing proposals to perform web page classification do not fulfill a number of requirements that would...

Bibliographic Details
Published in: Knowledge-based systems 2014-02, Vol.57, p.168-180
Main Authors: Hernández, Inma, Rivero, Carlos R., Ruiz, David, Corchuelo, Rafael
Format: Article
Language: English
container_end_page 180
container_issue
container_start_page 168
container_title Knowledge-based systems
container_volume 57
creator Hernández, Inma
Rivero, Carlos R.
Ruiz, David
Corchuelo, Rafael
description Unsupervised web page classification refers to the problem of clustering the pages in a web site so that each cluster includes a set of web pages that can be classified using a unique class. The existing proposals to perform web page classification do not fulfill a number of requirements that would make them suitable for enterprise web information integration, namely: to be based on a lightweight crawling, so as to avoid interfering with the normal operation of the web site, to be unsupervised, which avoids the need for a training set of pre-classified pages, or to use features from outside the page to be classified, which avoids having to download it. In this article, we propose CALA, a new automated proposal to generate URL-based web page classifiers. Our proposal builds a number of URL patterns that represent the different classes of pages in a web site, so further pages can be classified by matching their URLs to the patterns. Its salient features are that it fulfills all of the previous requirements, and it has been validated by a number of experiments using real-world, top-visited web sites. Our validation proves that CALA is very effective and efficient in practice.
doi_str_mv 10.1016/j.knosys.2013.12.019
format article
els_id S0950705113003997
oa free_for_read
publisher Elsevier B.V
rights 2013 Elsevier B.V.
toplevel peer_reviewed
tpages 13
fulltext fulltext
identifier ISSN: 0950-7051
ispartof Knowledge-based systems, 2014-02, Vol.57, p.168-180
issn 0950-7051
1872-7409
language eng
recordid cdi_proquest_miscellaneous_1567086688
source Library & Information Science Abstracts (LISA); ScienceDirect Freedom Collection 2022-2024
subjects Classification
Clustering
Clusters
Construction
Enterprise web information integration
Knowledge base
Methods
Proposals
Uniform Resource Locators
URL classification
URL patterns
Web page classification
Web page clustering
Web pages
Websites
Weight reduction
World Wide Web
title CALA: An unsupervised URL-based web page classification system
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T01%3A24%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=CALA:%20An%20unsupervised%20URL-based%20web%20page%20classification%20system&rft.jtitle=Knowledge-based%20systems&rft.au=Hern%C3%A1ndez,%20Inma&rft.date=2014-02&rft.volume=57&rft.spage=168&rft.epage=180&rft.pages=168-180&rft.issn=0950-7051&rft.eissn=1872-7409&rft_id=info:doi/10.1016/j.knosys.2013.12.019&rft_dat=%3Cproquest_cross%3E1567086688%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c418t-d6c441edb361af93b139f0bc1f12dbbab0a6d6919eaa913187af40044abf55a33%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1567035251&rft_id=info:pmid/&rfr_iscdi=true