"Scalable Data Mining with Model Constraints"
by Minos Garofalakis and
Rajeev Rastogi.
SIGKDD Explorations,
Vol. 2, No. 2, December 2000 (Special Issue on Scalable Data Mining),
pp. 39-48.
Abstract
Data mining can be abstractly defined as the process of extracting
concise and interesting models (or, patterns) from large amounts of data.
Unfortunately, conventional mining systems provide users with only
very restricted mechanisms for specifying models of interest.
As a consequence, the mining process is typically characterized by
lack of focus and users often end up paying computational costs that
are inordinately high compared to the specific models/patterns of
interest.
Exploiting user-defined model constraints during the mining process
can help alleviate this problem and ensure system performance
that is commensurate with the level of user focus.
Attaining such performance goals, however, is not straightforward and,
typically, requires the design of novel data mining algorithms that
make effective use of the model constraints.
In this paper, we provide an overview of our recent work on
scalable, constraint-based algorithms for
(1) decision tree construction with size and accuracy constraints
for the desired decision tree model, and (2) sequential pattern
extraction in the presence of structural, regular expression constraints
for the target patterns.
By "pushing" the model constraints inside the mining process,
our algorithms give mining users exactly the models that they are
looking for, while achieving performance speedups that often exceed
one order of magnitude.
Further, our work on sequential pattern mining has uncovered some
valuable insights into the tradeoffs that arise when complex constraints
that do not subscribe to "nice" properties (like anti-monotonicity)
are integrated into the mining process.
We argue that, in general, a cost-based approach (similar
to that employed in conventional query optimizers) is needed to explore
these tradeoffs in a principled manner and produce effective
execution plans for ad-hoc mining queries.
Copyright © 2000, Association for Computing Machinery, Inc. (ACM).
Permission to make digital/hard copy of all or part of this material without
fee is granted provided that copies are not made or distributed for profit or
commercial advantage, the ACM copyright/server notice, the title of the
publication and its date appear, and notice is given that copying is by
permission of the Association for Computing Machinery, Inc. (ACM).
To copy otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.