
A cost-effective method for detecting web site replicas on search engine databases

Bibliographic Details
Published in: Data & Knowledge Engineering, 2007-09, Vol. 62 (3), p. 421–437
Main Authors: da Costa Carvalho, André Luiz; de Moura, Edleno Silva; da Silva, Altigran Soares; Berlt, Klessius; Bezerra, Allan
Format: Article
Language:English
Description
Summary: Identifying replicated sites is an important task for search engines. It can reduce data storage costs, improve query processing time, and remove noise that might affect the quality of the final answers given to the user. This paper introduces a new approach to detecting web sites that are likely to be replicas in a search engine database. The method uses the websites’ structure and the content of their pages to identify possible replicas. As the experiments show, this combination improves precision and reduces the overall cost of the replica detection task, achieving a quality improvement of 47.23% over previously proposed approaches.
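
As a rough illustration only, and not the algorithm from the paper itself, the following minimal Python sketch captures the general idea the summary describes: scoring a pair of sites by the Jaccard similarity of their URL-path sets (structure) and of shingled content fingerprints (content). All function names, the weight w_struct, and the threshold are hypothetical choices, not values taken from the paper.

import hashlib
from urllib.parse import urlparse

def path_set(urls):
    # Reduce a site's structure to the set of URL paths it serves.
    return {urlparse(u).path for u in urls}

def content_fingerprints(pages, shingle_len=8):
    # Hash fixed-length word shingles of each page into a fingerprint set.
    fps = set()
    for text in pages:
        words = text.split()
        for i in range(len(words) - shingle_len + 1):
            shingle = " ".join(words[i:i + shingle_len])
            fps.add(hashlib.md5(shingle.encode()).hexdigest())
    return fps

def jaccard(a, b):
    # Set overlap in [0, 1]; empty inputs count as no similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

def likely_replicas(site_a, site_b, w_struct=0.5, threshold=0.8):
    # Blend structural and content similarity into one replica score.
    s = jaccard(path_set(site_a["urls"]), path_set(site_b["urls"]))
    c = jaccard(content_fingerprints(site_a["pages"]),
                content_fingerprints(site_b["pages"]))
    return w_struct * s + (1 - w_struct) * c >= threshold

Combining a cheap structural signal with a more expensive content signal is what allows a scheme of this kind to trade precision against processing cost, which is the tension the paper addresses.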
ISSN: 0169-023X, 1872-6933
DOI: 10.1016/j.datak.2006.08.010