"XTRACT: A System for Extracting Document Type Descriptors from XML Documents"
Abstract
XML is rapidly emerging as the new standard for data representation and
exchange on the Web.
An XML document can be accompanied by a Document Type Descriptor (DTD) which
plays the role of a schema for an XML data collection. DTDs contain valuable
information on the structure of documents and thus have a crucial role in
the efficient storage of XML data, as well as the effective formulation and
optimization of XML queries.
In this paper, we propose XTRACT, a novel system for inferring a DTD schema
for a database of XML documents. Since the DTD syntax incorporates the full
expressive power of regular expressions, naive approaches typically fail to
produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms
proceed in three steps: (1) finding patterns in
the input sequences and replacing them with regular expressions to generate
"general" candidate DTDs, (2) factoring candidate DTDs using adaptations of
algorithms from the logic optimization literature, and (3) applying the Minimum
Description Length (MDL) principle to find the best DTD
among the candidates.
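
As a rough illustration of step (3), the following Python sketch ranks three candidate content models for the toy input sequences ab, abab and ababab by an MDL-style cost: the bits needed to write down the candidate plus the bits needed to encode each input sequence with its help. The cost model is a simplifying assumption made for this example, not the encoding scheme defined in the paper. The names `mdl_cost` and `best_dtd`, the fixed 3-bit repetition counter, and the per-candidate cost functions are all hypothetical.

```python
import math
import re

ALPHABET = "ab"      # toy assumption: each element name is one character
METACHARS = "()|*"   # regex operators that may appear in a candidate
# Bits per character when writing down a candidate expression.
SYMBOL_BITS = math.log2(len(ALPHABET) + len(METACHARS))
COUNT_BITS = 3       # assumed fixed-width field for a repetition count


def enumeration_cost(regex, seq):
    """Candidate lists every sequence: pay only for picking one alternative."""
    return math.log2(len(regex.split("|")))


def any_symbol_star_cost(regex, seq):
    """Candidate (a|b)*: pay for the count, then one choice per symbol."""
    return COUNT_BITS + len(seq) * math.log2(len(ALPHABET))


def repeated_group_cost(regex, seq):
    """Candidate (ab)*: the repetition count alone determines the sequence."""
    return COUNT_BITS


def mdl_cost(regex, data_cost, sequences):
    """Bits to describe the candidate plus bits to encode the data with it."""
    theory_bits = len(regex) * SYMBOL_BITS
    data_bits = sum(data_cost(regex, s) for s in sequences)
    return theory_bits + data_bits


def best_dtd(candidates, sequences):
    """Return the cheapest candidate that accepts every input sequence."""
    valid = [
        (mdl_cost(regex, cost_fn, sequences), regex)
        for regex, cost_fn in candidates
        if all(re.fullmatch(regex, s) for s in sequences)
    ]
    return min(valid)[1]


SEQUENCES = ["ab", "abab", "ababab"]
CANDIDATES = [
    ("ab|abab|ababab", enumeration_cost),   # exact but verbose
    ("(a|b)*", any_symbol_star_cost),       # concise but overly general
    ("(ab)*", repeated_group_cost),         # concise and precise
]

if __name__ == "__main__":
    for regex, cost_fn in CANDIDATES:
        print(f"{regex:16s} {mdl_cost(regex, cost_fn, SEQUENCES):6.2f} bits")
    print("selected:", best_dtd(CANDIDATES, SEQUENCES))
```

Under these assumptions the verbatim enumeration is penalized for its long description and the fully general `(a|b)*` is penalized for the bits needed to spell out each sequence, so `(ab)*` has the lowest total cost. This mirrors the trade-off between conciseness and precision that the MDL step is meant to resolve.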
The results of our experiments with real-life and synthetic DTDs demonstrate the
effectiveness of XTRACT's approach in inferring concise and semantically
meaningful DTD schemas for XML databases.
Copyright © 2000, Association for Computing Machinery, Inc. (ACM).