Grammars have exceptions

Bibliographic Details
Published in: Information Systems (Oxford) 1998-12, Vol. 23 (8), p. 539-565
Main Authors: Crescenzi, Valter; Mecca, Giansalvatore
Format: Article
Language: English
Description
Summary: Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents. Hence, in this context, one of the key tasks is wrapping documents to build database abstractions of their content that can be manipulated using high-level tools. However, the degree of heterogeneity and the lack of structure make standard grammar parsers excessively rigid and often unable to capture the richness of constructs in these documents. This paper presents Minerva, a formalism for writing wrappers around Web sites and other textual data sources. The key feature of Minerva is the attempt to couple the benefits of a declarative, grammar-based approach with the flexibility of procedural programming. This is done by enriching regular grammars with an explicit exception-handling mechanism. The contributions of the paper are the definition of the formalism and the description of its implementation, which relies on a number of ad hoc techniques for parsing documents, including an extension of the traditional LL(1) policy based on dynamic tokenization.
ISSN: 0306-4379, 1873-6076
DOI: 10.1016/S0306-4379(98)00028-3