"Tree Pattern Aggregation for Scalable XML Data Dissemination"
Abstract
With the rapid growth of XML-document traffic on the Internet, scalable
content-based dissemination of XML documents to a large, dynamic group of
consumers has become an important research challenge.
To indicate the type of content that they are interested in, data consumers
typically specify their subscriptions using some XML pattern specification
language (e.g., XPath).
Given the large volume of subscribers, system scalability and efficiency
mandate the ability to aggregate
the set of consumer subscriptions to a smaller set of content specifications,
so as to both reduce their storage-space requirements as well as speed up the
document-subscription matching process.
In this paper, we provide the first systematic study of subscription aggregation
where subscriptions are specified with tree patterns (an important subclass
of XPath expressions).
The main challenge is to aggregate an input set of tree patterns into a
smaller set of generalized tree patterns such that: (1) a given space constraint
on the total size of the subscriptions is met, and (2) the loss in precision
(due to aggregation) during document filtering is minimized.
We propose an efficient tree-pattern aggregation algorithm that makes effective
use of document-distribution statistics in order to compute a precise
set of aggregate tree patterns within the allotted space budget.
As part of our solution, we also develop several novel algorithms for tree-pattern
containment and minimization, as well as "least-upper-bound" computation for a set
of tree patterns.
These results are of interest in their own right, and can prove useful in other
domains, such as XML query optimization.
Extensive results from a prototype implementation validate
our approach.
[
camera-ready paper
(pdf)
(ps.gz)
|
my talk slides
(ppt)
]
Copyright © 2002, VLDB Endowment.
Permission to copy without fee all or part of this material is granted provided
that the copies are not made or distributed for direct commercial advantage, the
VLDB copyright notice and the title of the publication and its date appear, and
notice is given that copying is by permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires a fee and/or special
permission from the Endowment.