1 Introduction
Terabyte online databases, consisting
records, are becoming common as the
of millions of
price of online
storage decreases. As a result, efficient data clustering
plays an increasingly significant role in enhancing the
performance of database systems. The objective of
clustering related data items on storage devices is to
minimize the cost of query processing. Essentially,
data clustering attempts to place related data, which
are required consecutively on some query access path,
physically close to each other on the storage to minimize the number of disk accesses.
The problem of data clustering can be formulated as
a graph partitioning problem (GPP). A database can be
represented as a graph G = (N, A) in which
l
b
b
l
l
N denotes the set of nodes representing the data
items;
A denotes the set of edges which connect related
data items if they are referenced consecutively in a
query;
a weight Wj > 1 is assigned to each arc a~ G A; the
weight represents the relative significance (e. g., the
frequency) of coreferencing the related data items
connected by the edge;
a partitioning divides N approximately evenly into
v disjoint subgraphs N1, N2, . . . . N., where v is a
given parameter; we assume that each subgraph
represents a segment (e.g., page, track, cylinder,
etc. ) with a fixed capacity, and the size of a data
item is also fixed;
a cost function C is defined as ~ Wj where the
summation ranges over those arcs aj G A which
connect nodes in two different subgraphs N* and Nt,
(1 S i,i’ <v, and i # i’).
We note that the nodes can be the objects of
00DBs (object-oriented databases) or tuples of RDBs
(relational databases); the edges represent the aggregate
(is-part-of) relationships or the primary-key-foreign-key
(joining) links between the two data items represented
by the two end nodes of each edge. The goal of data
clustering is to find a partitioning of the graph G such
that the cost function C = ~ Wj is minimized. 数据挖掘实验室

