"SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables"
by
Shivnath Babu,
Minos Garofalakis, and
Rajeev Rastogi.
Proceedings of ACM SIGMOD'2001,
Santa Barbara, California, May 2001, pp. 283-294.
Abstract
While a variety of lossy compression schemes have been developed
for certain forms of digital data (e.g., images, audio, video),
the area of lossy compression techniques for arbitrary data tables
has been left relatively unexplored. Nevertheless, such techniques
are clearly motivated by the ever-increasing data collection rates
of modern enterprises and the need for effective, guaranteed-quality
approximate answers to queries over massive relational data sets.
In this paper, we propose SPARTAN, a system that takes advantage of
attribute semantics and data-mining models to perform lossy compression
of massive data tables.
SPARTAN is based on the novel idea of exploiting predictive data
correlations and prescribed error tolerances for individual attributes
to construct concise and accurate Classification and Regression Tree (CaRT)
models for entire columns of a table.
More precisely, SPARTAN selects a certain subset of attributes
for which no values are explicitly stored in the compressed table;
instead, concise CaRTs that predict these values (within the
prescribed error bounds) are maintained.
To restrict the huge search space and construction cost of possible
CaRT predictors, SPARTAN employs sophisticated learning techniques
and novel combinatorial optimization algorithms.
Our experimentation with several real-life data sets offers convincing
evidence of the effectiveness of SPARTAN's model-based approach --
SPARTAN is able to consistently yield substantially better
compression ratios than existing semantic or syntactic compression tools
(e.g., gzip) while utilizing only small data samples for model inference.
[
camera-ready paper
(pdf)
(ps.gz)
|
Shivnath's talk slides
(ppt.gz)
]
Copyright © 2001, Association for Computing Machinery, Inc. (ACM).
Permission to make digital/hard copy of all or part of this material without
fee is granted provided that copies are not made or distributed for profit or
commercial advantage, the ACM copyright/server notice, the title of the
publication and its date appear, and notice is given that copying is by
permission of the Association for Computing Machinery, Inc. (ACM).
To copy otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.