
Fast Extraction of Article Titles from XML Based Large Bibliographic Datasets

Bibliographic Details
Published in: Procedia Technology 2016, Vol. 24, p. 1263-1267
Main Authors: Swaraj, K.P., Manjula, D.
Format: Article
Language: English
Description
Summary: Large numbers of research articles are published worldwide every day. The metadata of these articles is usually made available in bibliographic datasets, which are generally distributed in XML, a format widely used for data transfer between systems and for data processing by systems. An XML bibliographic dataset contains many article tags, and the sub-tags of each article tag specify the metadata associated with that article; an article tag usually has many such sub-tags. Extracting the article title tags is essential for domain-based classification of articles. Extracting and then classifying the research article titles in a bibliographic dataset is a laborious task that is usually done manually, so a fast and efficient technique for extracting titles from datasets is needed. This article proposes a MapReduce-based approach for quickly extracting research article titles from a bibliographic dataset. Articles from the past 3 years of the DBLP bibliographic dataset are used in the study. Hadoop MapReduce is used to speed up title extraction from large XML-based bibliographic datasets. Performance analysis revealed that the proposed method is quick, efficient and highly scalable.
ISSN: 2212-0173
DOI: 10.1016/j.protcy.2016.05.108
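
The summary describes extracting article titles from an XML dataset with a map step and a reduce step. The following is a minimal single-process sketch of that idea in Python, not the authors' Hadoop implementation. The sample records, the function names (`map_titles`, `reduce_titles`, `extract_titles`) and the choice of publication year as the grouping key are all assumptions made for illustration; the record does not state which key the paper uses.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample in a DBLP-like layout; real DBLP records carry more sub-tags.
SAMPLE_XML = """<dblp>
<article key="a1"><author>A. Author</author><title>First Title</title><year>2014</year></article>
<article key="a2"><author>B. Author</author><title>Second Title</title><year>2015</year></article>
<inproceedings key="c1"><title>Not An Article</title><year>2015</year></inproceedings>
</dblp>"""


def map_titles(xml_stream):
    """Map step: stream the XML and emit one (year, title) pair per <article>."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "article":
            continue
        title_elem = elem.find("title")
        year = elem.findtext("year", default="unknown")
        if title_elem is not None:
            # itertext() also collects text inside nested markup such as <i>.
            title = "".join(title_elem.itertext()).strip()
            if title:
                yield year, title
        elem.clear()  # drop the processed record's children to limit memory use


def reduce_titles(pairs):
    """Reduce step: group titles by key (assumed here to be the year)."""
    grouped = defaultdict(list)
    for year, title in pairs:
        grouped[year].append(title)
    return dict(grouped)


def extract_titles(xml_text):
    """Run the map and reduce steps in sequence on an XML string."""
    stream = io.BytesIO(xml_text.encode("utf-8"))
    return reduce_titles(map_titles(stream))


if __name__ == "__main__":
    for year, titles in sorted(extract_titles(SAMPLE_XML).items()):
        print(year, titles)
```

In a Hadoop deployment the two functions would run as separate mapper and reducer tasks over splits of the input file; here they are chained in one process only to show the data flow. The real DBLP file also relies on a DTD for character entities, which this sketch does not handle.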