Deep web data extraction

Bibliographic Details
Main Author: Jer Lang Hong
Format: Conference Proceeding
Language: English
Description
Summary: Current automatic wrappers that use the DOM tree and visual properties of data records to extract information from the deep web have limitations, such as the inability to check the similarity of tree structures accurately. Our study shows that data records in the deep web not only share similar visual properties and tree structures but are also semantically related in their contents. We therefore propose an ontological technique that uses an existing lexical database for English (WordNet) to extract data records from deep web pages. Wrappers based on this ontological technique reduce the number of potential data regions identified for extraction, thus improving extraction accuracy. In this study, we use visual cues from the underlying browser rendering engine to locate and extract the relevant data region by measuring the text and image sizes of data records. Experimental results show that our technique is robust and performs better than existing state-of-the-art wrappers. Unlike existing ontology-based wrappers, our wrapper is domain independent and can extract a wide range of data records with different structures.
ISSN: 1062-922X, 2577-1655
DOI: 10.1109/ICSMC.2010.5642466
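
Illustrative note: the summary mentions locating the relevant data region by measuring the text and image sizes of data records. The sketch below is not the paper's algorithm; it is a minimal illustration of that idea under stated assumptions. The `Record` and `Region` types, the `region_score` function, and the rule "pick the candidate with the largest total rendered area" are all hypothetical, and the sample numbers are made up for demonstration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    # Hypothetical measurements, assumed to come from a browser rendering engine.
    text_area: float   # rendered area of the record's text, in square pixels
    image_area: float  # rendered area of the record's images, in square pixels


@dataclass
class Region:
    name: str
    records: List[Record]


def region_score(region: Region) -> float:
    """Sum the rendered text and image areas over all records in the region."""
    return sum(r.text_area + r.image_area for r in region.records)


def pick_data_region(regions: List[Region]) -> Region:
    """Return the candidate region whose records occupy the most rendered area.

    Assumption: the main data region dominates the page visually. The paper
    may use a different or more refined criterion.
    """
    if not regions:
        raise ValueError("no candidate regions")
    return max(regions, key=region_score)


# Made-up candidates: a navigation bar, a result list, and a footer.
candidates = [
    Region("nav", [Record(1200.0, 0.0), Record(1100.0, 0.0)]),
    Region("results", [Record(9000.0, 4000.0), Record(8500.0, 4000.0)]),
    Region("footer", [Record(800.0, 0.0)]),
]

best = pick_data_region(candidates)
print(best.name)
```

A size-based score like this only narrows down candidates; the semantic (WordNet-based) filtering described in the summary would be a separate step and is not shown here.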