An unsupervised heuristic-based approach for bibliographic metadata deduplication

Bibliographic Details
Published in: Information Processing & Management 2011-09, Vol.47 (5), p.706-718
Main Authors: Borges, Eduardo N., de Carvalho, Moisés G., Galante, Renata, Gonçalves, Marcos André, Laender, Alberto H.F.
Format: Article
Language: English
Description
Summary:
► An efficient and effective approach for bibliographic metadata record deduplication.
► Up to 188% improvement in the quality of metadata deduplication.
► Up to 44% of failure cases solved by the proposed similarity functions.

Digital libraries of scientific articles contain collections of digital objects that are usually described by bibliographic metadata records. These records can be acquired from different sources and represented using several metadata standards, which may be heterogeneous in both content and structure. As a result, many records may be duplicated in the repository, degrading the quality of services such as searching and browsing. In this article we present an approach that identifies duplicated bibliographic metadata records efficiently and effectively. We propose similarity functions especially designed for the digital library domain and evaluate them experimentally. Our results show that the proposed functions improve the quality of metadata deduplication by up to 188% compared to four different baselines. We also show that our approach achieves statistically equivalent results when compared to a state-of-the-art method for replica identification based on genetic programming, without the burden and cost of any training process.
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2011.01.009
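
The summary above does not spell out the proposed similarity functions, so the following is only a minimal sketch of what unsupervised, heuristic-based matching of bibliographic records can look like. Every name, rule, and threshold here (`normalize`, `title_similarity`, `author_similarity`, `is_duplicate`, the 0.8 and 0.5 cut-offs, the one-year tolerance) is an illustrative assumption and not the authors' method. The sketch compares titles by word overlap, reduces author names to surname plus first initial, and needs no training data.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def title_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two normalized titles."""
    words_a, words_b = set(normalize(a).split()), set(normalize(b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def author_key(name: str) -> tuple[str, str]:
    """Reduce 'Surname, Given Names' to (surname, first initial)."""
    surname, _, given = name.partition(",")
    return normalize(surname), normalize(given)[:1]


def author_similarity(a: list[str], b: list[str]) -> float:
    """Jaccard overlap between the (surname, initial) keys of two author lists."""
    keys_a = {author_key(n) for n in a}
    keys_b = {author_key(n) for n in b}
    if not keys_a or not keys_b:
        return 0.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)


def is_duplicate(r1: dict, r2: dict,
                 title_threshold: float = 0.8,    # assumed cut-off
                 author_threshold: float = 0.5,   # assumed cut-off
                 max_year_gap: int = 1) -> bool:  # assumed tolerance
    """Hand-written rule: close years, similar titles, overlapping authors."""
    if abs(r1["year"] - r2["year"]) > max_year_gap:
        return False
    if title_similarity(r1["title"], r2["title"]) < title_threshold:
        return False
    return author_similarity(r1["authors"], r2["authors"]) >= author_threshold


# Two renderings of this record, as different sources might supply them
# (the second is a made-up variant for illustration).
record_a = {
    "title": "An unsupervised heuristic-based approach for bibliographic metadata deduplication",
    "authors": ["Borges, Eduardo N.", "Gonçalves, Marcos André"],
    "year": 2011,
}
record_b = {
    "title": "An Unsupervised Heuristic-Based Approach for Bibliographic Metadata Deduplication.",
    "authors": ["Borges, E. N.", "Goncalves, M. A."],
    "year": 2011,
}
print(is_duplicate(record_a, record_b))
```

Reducing authors to surname and first initial is one simple way to tolerate abbreviated given names and missing diacritics, two common differences between sources. A real system would also need a blocking step to avoid comparing every pair of records.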