RSS
热门关键字:  数据挖掘  人工智能  数据仓库  搜索引擎  数据挖掘导论

Communities from Seed Sets

来源: 作者:unkonwn 时间:2004-12-04 点击:

Keywords

community finding, link analysis, graph conductance, random walks, seed sets 数据挖掘研究院

1 Introduction

In this paper, we present a detailed study of the following problem: given a small but cohesive "seed set" of web pages, expand this set to generate the enclosing community (or communities). This seed expansion problem has been addressed by numerous researchers as an intermediate step of various graph-based analyses on the web. To our knowledge, however, it has never been called out as an interesting primitive in its own right. As we will show, the mathematical structure underlying the problem is rich and fruitful, and allows us to develop algorithms that perform dramatically better than naive approaches. 数据挖掘研究院

The seed expansion problem first came into prominence in 2000, when Jon Kleinberg introduced the HITS algorithm [7]. That algorithm used a search engine to generate a seed set, and then performed a fixed-depth neighborhood expansion in order to generate a larger set of pages upon which the HITS algorithm was employed. This general recipe has since seen broad adoption, and is now a common technique for local link-based analysis. Variants of this technique have been employed in community finding [8], in finding similar pages [1], in variants of HITS, PageRank [13], and TrustRank [5], and in classification [2]. More sophisticated expansions have been applied by Flake et al. [4] in the context of community discovery.

数据挖掘研究院

Alongside this body of work in web analysis, the theoretical computer science community has developed algorithms that find cuts of provably small conductance within a carefully expanded neighborhood of a vertex [12]. Intuitively, these methods are efficient because they make non-uniform expansion decisions based on the structure revealed during exploration of the neighborhood surrounding the vertex. Our results are based on these techniques, and we present modified versions of these methods and theorems that apply to the seed set expansion problem.

The actual computation is done by simulating a ``truncated′′ random walk for a small number of steps, starting from a distribution concentrated on the seed set. In each step of the truncated walk, probability is spread as usual to neighboring vertices, but is then removed from any vertex with probability below a certain threshold. This bounds the number of vertices with nonzero probability, and implicitly determines which vertices are examined. After each step in the expansion process, we examine a small number of sets determined by the current random walk distribution, and eventually choose one of these sets to be the community for the seed.

数据挖掘研究院

1.1 Discussion of seed set expansion

The graphs we consider in our experiments have small diameter and a small average distance between nodes, so the number of nodes within a fixed distance of the seed set grows quickly. Nonetheless, these graphs contain distinct communities with relatively small cutsizes. We assume that the seed set is largely contained in such a community, which we refer to as a target community. In many of the graphs we consider, the seed is contained in a nested sequence of target communities. The results in section 2.5 show that if a good target community exists for the seed, then the random walk method produces some community that is largely contained within this target.

数据挖掘研究院

When expanding the seed with a truncated walk, a target community serves as a bottleneck, containing much of the probability for a significant number of steps. Since probability is removed from low-probability vertices, this prevents the support of the walk from expanding quickly beyond the bottleneck. In contrast, expanding the seed set using a fixed-depth expansion entirely ignores the bottleneck defining the community. Some branch of the BFS tree is likely to cross the bottleneck and rapidly expand in the main graph before a large fraction of the nodes in the community have been reached. Thus a fixed depth expansion will be a bad approximation to the community, and might also be an impractically large ``candidate set′′ for further processing.

数据挖掘研究院

The goal of examining a small number of nodes is clear enough, but we also need to pin down what we mean by a ``good′′ community containing a given seed set. We will do this by describing the limitations of several possible ways to define good communities.

数据挖掘研究院

Several papers including [4] have defined a community to be a subgraph bounded by a small cut, which can be obtained by first growing a candidate set using BFS, and then pruning it back using ST max-flow. Unfortunately, a small seed set will often be bounded by a cut that is smaller than the cut bounding the target community, so the mininum cut criterion will not want to grow the seed set. Flake et al. address this problem by performing this process several times, adding nodes from the candidate set at each step to ensure expansion. 数据挖掘研究院

Another idea for ensuring a reasonable expansion of the seed set that is less ad hoc than most is to optimize for conductance instead of cutsize alone. Graph conductance (aka the normalized cut metric) is a quotient-style metric that provides an incentive for growing the seed set. Unfortunately, the conductance score can be improved by adding barely related nodes (or even a disconnected component) to the seed set. We need to ensure that only nearby nodes are added to the seed set.

数据挖掘研究院

This raises a subtle but important point. While stronger parametric flow methods do exist for finding low-conductance cuts within an expanded neighborhood of the seed set, our walk-based method′s weaker spectral-style guarantee on conductance is counterbalanced by a valuable ``locality′′ property which ensures that we output a community consisting of nodes that are closely related to the seed set. In practice we try to get the best of both worlds by cleaning up the walk-based cuts with a conservative use of flow that does not disturb this locality property very much. Experiments show that the results compare well with stronger optimization techniques. 数据挖掘实验室

The random walks techniques are remarkable because they produce communities with conductance guarantees, yet can be computed locally rather than on the whole graph, often while touching fewer nodes than BFS would. The best low-conductance cuts are more likely than minimum cuts to non-trivially expand the seed set, but still ensure a small boundary defining a natural community. Finally, in a certain sense that we will discuss in section 2.5, the added nodes are close to the seed set.

数据挖掘研究院

2 Random walk methods

As part of their work on graph partitioning and graph sparsification [12], Spielman and Teng present a method for finding cuts based on the mixing of a random walk starting from a single vertex. In that paper, the method is used as a subroutine to produce balanced separators and multiway partitions.

数据挖掘研究院

In this section, we describe how to use the techniques developed by Spieman-Teng to find communities from a seed set, and describe how the size and quality of the seed set affects the results. In sections 2.1-2.3 we present background on random walks and sweeps. In section 2.4 we show that any seed set that makes up a significant fraction of a target community will produce good results with our seed expansion method, and that most seed sets that are chosen randomly from within a target community will also produce good results. In section 2.5 we present a stripped-down version of the local partitioning techniques developed by Spielman-Teng, and describe quantitatively how the size and quality of the seed set affect the running time and guarantees. In section 2.6 we describe the truncation method used by Spielman-Teng, and describe the heuristics and practical modifications we have employed to keep the running time and total number of vertices examined small. 数据挖掘研究院


2.1 Notation

The results in this section apply to graphs that are unweighted and undirected. Let $A$ denote the adjacency matrix of the graph under consideration, and let $D$ be the diagonal matrix where $D_{i,i}=d(v_{i})$, the degree of the $i$th vertex. The volume of a set of vertices is
egin{displaymath}{
m Vol}(S) = sum_{u in S} d(u).end{displaymath}

The edge border is denoted
egin{displaymath}partial(S) = left { , {u,v} , mid , {u,v} in E, u in S, v 
ot in S , 
ight },end{displaymath}

the cutsize is denoted $vert partial(S) vert$, and the conductance of a set of vertices is 数据挖掘研究院
egin{displaymath}phi(S)= frac {vert partial(S) vert } { min left ( {
m Vol}(S), {
m Vol}(ar{S}) 
ight)}.end{displaymath}

This definition of conductance should not be confused with the conductance associated with a particular random walk on the graph. In particular, the conductance associated with a lazy random walk is a factor of 2 smaller.


2.2 Lazy random walks and sweeps

Given a seed set $S$ for which we hope to find a community, we begin with the probability distribution $p_{0}=psi_{S}$, where
egin{displaymath} psi_{S} = left { egin{array}{ll} d(x) / {
m V... ...in S$},  0 & mbox{otherwise}. end{array} 
ight . end{displaymath}
数据挖掘研究院
We then simulate several steps of a lazy random walk, computing the probability distributions $p_{t}$, where
egin{displaymath}p_{t} = M^{t} p_{0},end{displaymath}

and where $M$ is the lazy random walk transition matrix
egin{displaymath}M = frac{1}{2}(I + A D^{-1}).end{displaymath}

If the graph is connected, then $p_{t}$ converges to the stationary distribution $psi_{V}$. We are not interested in this limiting distribution, but rather in the distributions obtained after a small number of walk steps. We compute $p_{t}$ for all $t$ up to some specified time $T$. After each step, we sort the vertices in descending order according the degree-normalized probabilities 数据挖掘研究院
egin{displaymath}r_{t}(v)=p_{t}(v)/d(v),end{displaymath}

letting $v^{t}_{i}$ be the $i$th vertex in this order, so that
egin{displaymath}r( v^{t}_{i} ) geq r( v^{t}_{i+1} ).end{displaymath}
数据挖掘研究院
This ordering defines a collection of sets $S^{t}_{0}, dots, S^{t}_J$, where $ S^{t}_j = { v^{t}_{i} mid 1 leq i leq j } $, and $J$ is the number of vertices with nonzero values of $p(u)/d(u)$. The cutsizes, volumes, and conductances for every set $S^{t}_1, ldots, S^{t}_J$ can be computed in time proportional to ${
m Vol}(S^{t}_J)$, by determining the change to $S^{t}_{i}$ due to the addition of vertex $v^{t}_{i+1}$. This process is referred to as a sweep. We will show that under certain conditions one of the sweep sets $S^{t}_{j}$ will be a good community for the seed $S$. To make the degree-normalized distribution $p_{t}(u)/d(u)$ more intuitive, one can view each vertex as consisting of $d(u)$ minivertices, two minivertices ${ x_{ ( {u} , {v} ) } }$ and ${ x_{ ( {v} , {u} ) } }$ for each undirected edge ${u,v}$. The vector $r_{t}$ can be interpreted as a probability distribution $q_{t}$ on minivertices by letting $q_{t}({ x_{ ( {u} , {v} ) } })=r_{t}(u)$. During the $t$-th step of the lazy walk, $frac{1}{2} q_{t}({ x_{ ( {u} , {v} ) } })$ is the amount of probability sent from $u$ to $v$. To bound the mixing of the lazy random walk, we consider an ordering of the minivertices so that $q_{t}(x^{t}_{i}) geq q_{t}(x^{t}_{i+1})$. Using the shorthand notation $q_{t}(i) = q_{t}(x^{t}_{i})$, we then define

数据挖掘研究院


egin{displaymath}P_t(k) = sum_{i=1}^{k} q_{t}(i), end{displaymath}

and the related quantity
egin{displaymath}H_{t}(k) = left [ P_{t}(k) - frac{k}{2m} 
ight ].end{displaymath}

The functions $P_t(k)$ and $H_t(k)$ provide a strong way to bound how well the walk has mixed; $P_t(k)$ is the maximum amount of probability on any set of $k$ minivertices at time $t$, and thus the amount of probability that flows over any set of $k$ edges at time $t$ is at most ${frac{1}{2}}P_t(k)$. The maximum amount of probability on a set of vertices $A$ is at most 数据挖掘研究院
egin{displaymath}P_{t}( {
m Vol}(A) ) = H_{t}( {
m Vol}(A) ) + frac{{
m Vol}(A)}{{
m Vol}(G)},end{displaymath}

and the $L_{1}$-distance between $p_t$ and the stationary distribution $psi_{V}$ can be written
egin{displaymath}vert p_t - psi_{V} vert = 2 max_{A subseteq V} langle... ...{V},1_{A} 
angle = 2 max_{k} left [ H_{t}(k) 
ight ] .end{displaymath}

We will need the following monotonicity lemma, which is a corollary of the result of Lovász and Simonovits [10].
Lemma 1   $P_{t+1}(k) leq P_{t}(k)$, for any $k in [0,2m]$.


2.3 Mixing of lazy random walks

For large seed sets, the initial distribution $p_{0}$ may already be moderately well mixed, which will lead to better bounds on $P_{t}(k)$. In particular, we have the following lemma.
Lemma 2   If $p_{0}=psi_{S}$, then
egin{displaymath}H_{0}(k) leq c min ( sqrt{k}, sqrt{m-k} ),end{displaymath}

where $c = sqrt{frac{1}{{
m Vol}(S)}}$.

egin{proof} This follows easily from the definition of $psi_{S}$. end{proof}
The following lemma due to Spielman and Teng is a strengthening of a result of Lovász and Simonovits [10] [11]. It shows that the lazy random walk mixes well unless one of the sweeps sets $S^{t}_{j}$ has both small conductance and large drop in degree-normalized probability between the vertices inside and outside of $S^{t}_{j}$.
Lemma 3   Let $p_{0}=psi_{S}$ and let $phi$ and $alpha$ be some fixed constants. Either

数据挖掘实验室


egin{displaymath} H_T (k) leq H_{0}(k) { left (1-frac{phi^{2}}{8} 
ight) }^{T} + alpha T, end{displaymath} (1)

or there exists a sweep cut $S^{t}_{j}$ with $t leq T$ such that
  1. $phi(S^{t}_{j}) leq phi$, and
  2. $q_{t}(k_{0}-phi ar{k}_{0}) - q_{t}(k_{0}+phi ar{k}_{0}) geq frac{2alph... ...re } k_{0}={
m Vol}(S^{t}_{j}) mbox{ and } ar{k}_{0}=min (k_{0},2m-k_{0}) $ .
数据挖掘实验室
egin{proof} This is Lemma~3.7 of cite{ST04} end{proof}


2.4 Good seed sets

Lemma 3 shows that in the absence of a good sweep cut, the walk mixes rapidly. If we can show that the walk does not mix as rapidly as the lemma should imply, then one of the sweep cuts must be good. One way to do this is to present a target community ${C}$ that has small volume, but contains a large amount of probability from $p_{T}$, and use this to place a lower bound on $H_T ( {
m Vol}({C}))$. The amount of probability that has escaped from ${C}$ after $T$ steps, which we will write ${langle { M^{T} psi_{S} },{ar{{C}} } 
angle}$, depends on both the target community ${C}$ and the seed set $S$. For every set ${C}$ with small conductance, there are a variety of seed sets for which ${langle { M^{T} psi_{S} },{ar{{C}} } 
angle}$ is not much larger than $phi({C}) cdot T$. To motivate this, consider the amount of escaping probability when the seed set is the entire target set ${C}$. The following lemma is also due to Spielman-Teng.
Lemma 4   For any step $t$, ${langle { M^{t} psi_{{C}} },{ ar{{C}} } 
angle} leq {frac{1}{2}}phi({C}) cdot t$ .
数据挖掘研究院
egin{proof} Since our starting distribution is $p_{0}=psi_{{C}}$, each miniv... ...C}}$ after $t$ steps is at most ${frac{1}{2}}phi({C}) cdot t$. end{proof}

Any set that is fairly large and nearly contained in the target community is also a good seed set. The bounds in the lemma below are weak for smaller seed sets, but are sufficient for seed sets that make up a significant fraction of the target community. 数据挖掘实验室

Lemma 5 (Large seed sets)   Let $S$ be a seed set such that ${
m Vol}({C}) leq eta {
m Vol}(S)$, and ${
m Vol}(S cap {C}) geq (1-delta) {
m Vol}(S)$. Then, for any step $t$,
egin{displaymath}{langle { M^{t} psi_{S} },{ ar{{C}} } 
angle} leq (1-delta) {frac{1}{2}}eta phi({C}) cdot t + delta.end{displaymath}
数据挖掘研究院

egin{proof} Assume for now that $S subseteq {C}$. The maximum amount of pr... ...) {frac{1}{2}}eta phi({C}) cdot t + delta . end{eqnarray*} end{proof}

Seed sets chosen randomly from within a target community are also likely to be good seed sets for that community. The following result is stated without proof, since it follows by applying a Chernoff-type bound to a weighted sum of independent random variables in the usual way. 数据挖掘研究院

Lemma 6 (Random seed sets)   Let $S$ be a set where each vertex from ${C}$ is included in $S$ independently with probability $epsilon$. Let $Delta = max_{v in {C}} d(v)$. For any single specific time $t$, the bound 数据挖掘研究院
egin{displaymath}{langle { M^{t} psi_{S} },{ ar{{C}} } 
angle} leq phi({C}) tend{displaymath}

holds with probability at least
egin{displaymath} 1 - left ( e ^ { frac {- epsilon t phi ({C}) {
m V... ... {- epsilon {
m Vol}({C}) } {32 Delta } } 
ight ) . end{displaymath}

数据挖掘研究院
2.5 Communities from a seed set

The following is a stripped-down version of a theorem of Spielman-Teng, extended for a seed set instead of a single starting vertex.

Theorem 1   Given a seed set $S$, choose two parameters $phi$ and $eta$, let $T=frac{4}{phi^{2}} ln (16 eta)$ and compute $p_{1}, ldots , p_{T}$ starting from $p_{0}=psi_{S}$. Assume that there exists a set of vertices ${C}$ such that ${
m Vol} ({C}) leq {frac{1}{2}}{
m Vol}(G)$, and such that $S$ is a fairly good seed set for ${C}$ in that ${langle { M^{T} psi_{S}},{ar{{C}}} 
angle} leq a phi({C}) cdot T$ . If the parameters $phi$ and $eta$ have been chosen so that
  1. $frac{{
m Vol}({C})}{{
m Vol}(S)} leq eta$, and
  2. $phi ({C}) leq frac{phi^{2}}{ 32 a ln (16 eta)}$ ,
then there exists a sweep cut $S^{t}_{j}$ with $t leq T$ such that
  1. $phi(S^{t}_{j}) leq phi$, and
  2. $q_{t}(k_{0}-phi ar{k}_{0}) - q_{t}(k_{0}+phi ar{k}_{0}) geq frac{phi... ...re } k_{0}={
m Vol}(S^{t}_{j}) mbox{ and } ar{k_{0}}=min (k_{0},2m-k_{0}) $ .
数据挖掘研究院
egin{proof} % latex2html id marker 384 If equation~eqref{cutormix} from L... ... implies that one of the sweep cuts has the desired properties. end{proof}

The sweep cut $S^{t}_{j}$ can be viewed as a community of the seed $S$. In addition to having small conductance, there is a significant drop in probability outside ${S^{t}_{j}}$ by Theorem 1. Figure 3 shows an example of this simultaneous small conductance cut and probability drop.

数据挖掘研究院

While these guarantees on ${S^{t}_{j}}$ are fairly clear, the way that the community is related to the seed set is more subtle. The most obvious thing we can say about this relationship is probably the most important: the community is obtained by taking all vertices where the value of $r_{t}(v)=p_{t}(v)/d(v)$ is above a certain threshold. If we view $r_{t} (v)$ as a measure of closeness to the seed set, this is a strong guarantee.

数据挖掘研究院

Unfortunately, $r_{t} (v)$ is not a good measure of closeness to the seed set when $t$ becomes large. As $t$ tends to infinity, $r_{t} (v)$ tends to a constant, and the ordering of the vertices determined by $r_{t} (v)$ tends to the same ordering determined by the second-largest eigenvector of $I+D^{-1}A$, which is independent of the seed set if the graph is connected. This means that the vertices from the seed set are not guaranteed to have the largest values of $r_{t} (v)$, and so the seed set is not necessarily contained in the resulting community. However, we have an upper bound on the number of walk steps taken, and we claim that $r_{t} (v)$ is a good measure of relationship to the seed set for small values of $t$. In practice, we have observed that requiring the community to be determined by a threshold on the values of $r_{t} (v)$ rules out more degenerate behavior than requiring only that the community contains the seed.

数据挖掘研究院

We cannot hope that the resulting community ${S^{t}_{j}}$ will closely match the target community ${C}$, without placing additional assumptions on the communities inside ${C}$. Instead, the target community is used to ensure that some community for $S$ will be found, and this community is likely to be mostly contained within the target community. This can be made precise if a stronger restriction is placed on $phi({C})$. Spielman-Teng showed that if $phi(C)$ is roughly $phi^{3}$, then the gap in probability described in Theorem 1 can be used to show that the majority of vertices in ${S^{t}_{j}}$ are contained in ${C}$.


2.6 Truncation

The actual subroutine created by Spielman-Teng, called Nibble, has a stronger guarantee than the method we have presented here. One aspect of Nibble that we would like to reproduce is the ability to find a small community near the seed set while keeping the number of vertices examined small. This is accomplished by using truncated walk distributions in place of exact walk distributions.

数据挖掘研究院

In the truncated walk, the probability on any vertex where $r_{t}(v) leq epsilon$ is set to 0 after each step. This ensures that the support of the truncated walk $Supp_{t}$ is small, which gives a bound on the running time of the algorithm, since the time required to perform a step of the random walk and a sweep depends on

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?