## Spielman-Teng-A: A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning

Spielman and Teng’s 2004 paper [ST04] on nearly-linear time Laplacian solvers has since been split and rewritten into three very technically involved papers: [ST08a], [ST08b] and [ST08c]. I am planning to write memos for all of them; this post is Part I for the first paper [ST08a: A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning].

This paper is no longer the most efficient result of its kind; see for instance [Andersen-Peres’09], which uses evolving sets. Also, the earlier version [ST04] gives an algorithm with a provably worse guarantee, but that one is much easier to understand. This post nevertheless aims at the harder [ST08a].

## Outline

The goal of [ST08a] is to construct a local clustering algorithm, Nibble, with the following guarantee: for any sparse cut $(S,V-S)$, if one starts a random walk from a vertex $v\in S$, one can almost recover the set $S$, in running time nearly linear in $|S|$. One can then apply this local algorithm multiple times to construct Partition, which is essentially a pseudo-approximation to the balanced cut problem. Partition is also used in [ST08b] to construct spectral sparsifiers for a graph. Formally,

Theorem 2.1 (Nibble). Nibble($G,v, \phi, b$) is an algorithm that runs in time $O(2^b (\log^6 m)/\phi^4)$, and moreover (where $\mu$ denotes the volume function $vol$):

• (N.1) If the output $C\subset V$ is non-empty, then $\Phi(C)\leq \phi$ and $\mu(C)\leq (5/6)\mu(V)$.
• (N.2) Each set $S\subseteq V$ satisfying $\mu(S)\leq (2/3)\mu(V)$ and $\Phi(S)\leq O(\phi^2/\log^3 m)$, has a subset $S^g\subseteq S$ such that, (a) $\mu(S^g)\geq \mu(S)/2$, and (b) $\forall v\in S^g$, if the output $C=$Nibble$(G,v,\phi,b)$ is nonempty, then $\mu(C\cap S)\geq 2^{b-1}$.
• (N.3) Each such $S^g$ can be partitioned into subsets $S_0^g,\dots,S_\ell^g$ such that if $v\in S_b^g$, then Nibble$(G,v,\phi,b)$ will not be empty.

In the earliest paper [ST04], the requirement on conductance of $S$ was actually $\Phi(S)\leq O(\phi^3/\log^2 m)$. This has been improved in [ST08a].

## The Algorithm for Nibble

Basically, Nibble computes truncated random-walk probability vectors for the first $\log^2 m / \phi^2$ steps. It starts with the vector $q_0=\chi_v$ and applies the transition matrix $M$ at each step, $q_t=M q_{t-1}$, but zeroes out every index $i$ whose probability is smaller than $\varepsilon \cdot deg(i)$, where $\varepsilon=\phi^2 / (2^b \log^3 m)$.
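As a toy illustration, one truncated step of this walk might look as follows. This is a minimal Python sketch of my own (the name `truncated_step` and the adjacency-list representation are mine, not from [ST08a]), using the lazy walk matrix $M = (I + AD^{-1})/2$:

```python
def truncated_step(graph, q, eps):
    """One step of the lazy walk M = (I + A D^{-1}) / 2 applied to the sparse
    probability vector q, followed by the truncation of [ST08a]: zero out
    every entry with q[i] < eps * deg(i)."""
    new_q = {v: 0.5 * q.get(v, 0.0) for v in graph}   # lazy self-loop half
    for u, p in q.items():
        share = 0.5 * p / len(graph[u])               # spread the other half
        for w in graph[u]:
            new_q[w] += share
    # truncation: keep only entries with probability >= eps * deg(i)
    return {v: p for v, p in new_q.items() if p >= eps * len(graph[v])}
```

For example, on a 4-cycle starting from $\chi_0$, one untruncated step leaves mass $1/2$ at the start vertex and $1/4$ at each neighbor; with a large enough `eps`, the neighbors' entries are zeroed out.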

At each step $t$, it sorts all vertices by normalized probability from large to small, and lets $S_j(q_t)$ denote the set of the $j$ vertices with largest normalized probabilities $q_t(i)/deg(i)$. Nibble terminates and outputs $S_j(q_t)$ if all of the following are satisfied:

• (C.1) $\Phi(S_j(q_t)) \leq \phi$,
• (C.2) $\mu(S_j(q_t)) \leq (5/6) \mu(V)$,
• (C.3) $\mu(S_j(q_t)) \geq 2^b$, and
• (C.4) the first (sorted by normalized probabilities) $2^b$ (in terms of volume) vertices all have normalized probability at least $\frac{q_t(\tilde{j})}{deg(\tilde{j})} \geq \frac{1}{\log m \cdot 2^b}$.
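The sweep over prefixes $S_j(q_t)$ can be sketched as follows. This is my own hedged sketch, not the paper's pseudocode: it checks only (C.1)–(C.3), and it takes conductance to be cut edges divided by the smaller side's volume:

```python
def sweep(graph, q, phi, b):
    """Scan prefixes S_j of vertices sorted by q(i)/deg(i) descending and
    return the first prefix meeting (C.1)-(C.3), or None.
    (C.4) is omitted for brevity."""
    deg = {v: len(graph[v]) for v in graph}
    mu_V = sum(deg.values())                  # mu(V) = 2m
    order = sorted(q, key=lambda v: q[v] / deg[v], reverse=True)
    S, vol, cut = set(), 0, 0
    for v in order:
        S.add(v)
        vol += deg[v]
        # an edge to a neighbor already in S stops being a cut edge;
        # an edge to a neighbor outside S becomes one
        cut += sum(-1 if w in S else 1 for w in graph[v])
        smaller = min(vol, mu_V - vol)
        cond = cut / smaller if smaller > 0 else 1.0
        if cond <= phi and vol <= (5 / 6) * mu_V and vol >= 2 ** b:
            return S                          # (C.1), (C.2), (C.3) all hold
    return None
```

On two triangles joined by a single edge, with $q$ concentrated on one triangle, the sweep recovers that triangle (volume $7$, one cut edge, conductance $1/7$).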

Of course, it will not be easy to show that all of these are likely to be satisfied; in fact, that is the duty of (N.3). Let us do two easy things first, namely showing (N.1) and (N.2). Since both of them assume that a non-empty set is returned, we automatically get (C.1)-(C.4) as part of the assumptions for free.

## Proof of (N.1) and (N.2)

In fact, (N.1) immediately follows from (C.1) and (C.2) so there is nothing to say. We want to show that (N.2) follows from (C.3) and (C.4).

It is a well-known fact in random walk theory that, starting from the stationary distribution restricted to $S$, after $t$ steps the amount of probability mass that stays inside $S$ is lower bounded by $1-t\Phi_G(S)/2$. (In fact, it is not hard to strengthen this to $(1-\Phi_G(S)/2)^t$, which was used by [Gharan-Trevisan’12] to obtain a different local partitioning algorithm.) As a consequence, by Markov’s inequality, there is a set of vertices of volume at least $\mu(S)/2$ such that, if we start from any single one of them (rather than the stationary distribution on $S$), the probability mass that leaves $S$ is at most $t\Phi_G(S)$. Define this set to be $S^g$.
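This retention bound can be sanity-checked numerically. Below is a small self-contained experiment of my own (graph and names are illustrative): it runs a lazy walk from the stationary distribution restricted to $S$ on two triangles joined by one edge, and verifies that the mass remaining in $S$ after $t$ steps is at least $1 - t\Phi_G(S)/2$:

```python
def lazy_step(graph, q):
    # one step of the lazy walk M = (I + A D^{-1}) / 2, no truncation
    new_q = {v: 0.5 * q.get(v, 0.0) for v in graph}
    for u, p in q.items():
        for w in graph[u]:
            new_q[w] += 0.5 * p / len(graph[u])
    return new_q

# two triangles joined by the single edge {2, 3}; S is the left triangle
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
S = {0, 1, 2}
vol_S = sum(len(graph[v]) for v in S)        # vol(S) = 7
phi_S = 1 / vol_S                            # one cut edge, so Phi(S) = 1/7
q = {v: len(graph[v]) / vol_S for v in S}    # stationary restricted to S
for t in range(1, 6):
    q = lazy_step(graph, q)
    mass_in_S = sum(q[v] for v in S)
    # retention bound 1 - t*Phi/2 (tiny slack for floating point)
    assert mass_in_S >= 1 - t * phi_S / 2 - 1e-12
```

At $t=1$ the bound is tight here ($13/14$ on both sides); for larger $t$ mass flows back into $S$ and the inequality is strict.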

Next, we show (N.2). If we start from such a vertex $v\in S^g$, then because there is a $\log m$ gap between the conductance requirement on $S$ in (N.2) and the total number of random-walk steps we make, the probability mass that leaves $S$ is upper bounded by $1/\log m$. On the other hand, by (C.4), the first (sorted by normalized probabilities) $2^b$ (in terms of volume) vertices output by Nibble each have normalized probability at least $1/(\log m \cdot 2^b)$; this means at most a constant fraction of them can come from $\overline{S}$. By playing with the constants, one can ensure that at most half comes from outside, i.e., (N.2) is satisfied: $\mu(C\cap S)\geq 2^{b-1}$.

## High Level Idea for (N.3)

Before going into the details of the proof of (N.3), I would love to provide my own overview of it. Since this is a very technically involved proof, I am not sure how helpful my overview will be, but let me give it a try:

• First of all, the total of $\log^2 m / \phi^2$ random-walk steps should be decomposed into $\log m$ epochs, each consisting of $\log m/\phi^2$ steps.
• Claim 1: there exists a good epoch such that for all $t$ in that epoch, (C.4) is satisfied. Intuitively, this is an epoch in which the first (in terms of volume) $2^b$ vertices are all reached with decent probability, and at the same time $b$ is chosen so that the probability mass is moving slowly at this moment. In other words, two things hold simultaneously: the random walk mixes well inside some range of volume $2^b$, and it slows down in terms of walking from inside to outside.
• Claim 2: within such an epoch, there are still plenty of choices among the $\log m/\phi^2$ steps at which to stop. One can show that if the stopping conditions (C.1)-(C.3) are never all satisfied during this entire epoch, then the random walk mixes very quickly on the entire graph. This contradicts the choice of epoch, which essentially says: you should have stopped the random walk earlier.

## Notation on Integrated Probability

Now we need to define the integrated probability. A similar notion was first introduced by [Lovasz-Simonovits’90,’93]. For a probability vector $q_t$, assume for notational simplicity that the vertices are already sorted by normalized probability, i.e., $q_t(1)/deg(1) \geq q_t(2)/deg(2) \geq \cdots$. I remark that from now on I will not distinguish between truncated probabilities and the real probabilities.

Then, define $I(q_t,\cdot)$ to be a piecewise-linear (concave) function with end points $I(q_t,0)=0$, $I(q_t,deg(1))=q_t(1)$, $I(q_t,deg(1)+deg(2))=q_t(1)+q_t(2)$ and so on. We would always have $I(q_t,\mu(V))=1$ if we never did truncations. As an exercise, (C.4) above can be re-written as $I_x(q_t,2^b)\geq \frac{1}{\log m \cdot 2^b}$ where $I_x$ is the derivative of $I$.
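The curve $I(q_t,\cdot)$ is easy to compute directly. Here is a minimal sketch of my own (the name `ls_curve` and the dict representation are mine): sort vertices by normalized probability, accumulate, and interpolate linearly inside each vertex's degree segment:

```python
def ls_curve(q, deg, x):
    """Evaluate the piecewise-linear curve I(q, x) at volume x: vertices are
    scanned in decreasing order of q(i)/deg(i)."""
    acc_p, acc_vol = 0.0, 0.0
    for v in sorted(q, key=lambda v: q[v] / deg[v], reverse=True):
        if x <= acc_vol + deg[v]:
            # linear interpolation inside this vertex's degree segment
            return acc_p + q[v] * (x - acc_vol) / deg[v]
        acc_p += q[v]
        acc_vol += deg[v]
    return acc_p
```

For instance, with $q=(0.5, 0.3, 0.2)$ on vertices of degrees $(1,1,2)$, the end points are $I(1)=0.5$, $I(2)=0.8$, $I(4)=1$, and one can check concavity at interior points.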

It is not hard to verify that $I(q_{t+1},x)\leq I(q_t,x)$ for any $x$. More interestingly,

Lemma 2.9 (informal). For “any” $x$, denoting $\hat{x}=\min\{x,2m-x\}$, we have $I(q_{t+1},x)\leq \frac{1}{2}\big( I(q_t,x-2\phi \hat{x}) + I(q_t,x+2\phi \hat{x})\big)$.

Here by “any” I mean those values of $x$ that correspond to end points of the piecewise-linear function $I$, and $\phi$ is the conductance of the sweep set corresponding to that end point.

This lemma was used by [Lovasz-Simonovits] to prove that, if a graph has conductance at least $\phi$, then this integrated vector $I$ converges quickly to the linear function from $(0,0)$ to $(2m,1)$; the convergence rate is $(1-\phi^2/2)^t$.
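Quantitatively, iterating Lemma 2.9 for $t$ steps (each step replaces the curve's value by the midpoint of a chord of a concave function) yields the bound below; this is my paraphrase of the [Lovasz-Simonovits] statement, and it is the form that reappears in the Claim 2 section:

$$ I(q_t, x) \;\le\; \frac{x}{2m} \;+\; \sqrt{\hat{x}}\left(1-\frac{\phi^2}{2}\right)^{t}, \qquad \hat{x}=\min\{x,\,2m-x\}. $$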

## Refining $S^g$

Now, Spielman and Teng do the following. Let $t_1:=\log m/\phi^2$ and define $t_h := h\cdot t_1$ for $h = 0,1,\dots,\ell+1$, where $\ell:=\Theta(\log m)$, so that each interval $[t_{h-1}, t_h]$ is an epoch.

For each $h$, define $x_h$ to be the real number such that $I(q_{t_h},x_h) = \frac{h+1}{\log m}$. In other words, we have $\log m$ epochs, and for the $h$-th epoch we shoot for the probability benchmark $\frac{h+1}{\log m}$; $x_h$ is the corresponding cut-off volume.

By definition, $x_h > x_{h-1}$. Now, if $x_\ell\geq 2m/\log m$ then define $h_v := \ell+1$; otherwise define $h_v := \min\{ h : x_h \leq 2 x_{h-1}\}$. It is not hard to show that $h_v$ is well defined. We will soon see that $[t_{h_v-1}, t_{h_v}]$ is the good epoch (for vertex $v$) that we are looking for.
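The case split defining $h_v$ can be transcribed directly. A small sketch of my own (the name `choose_h` is mine; I use the natural log, since the base only affects constants):

```python
import math

def choose_h(xs, m):
    """Pick the good-epoch index h_v from the cut-off volumes xs[0..ell].
    Raises ValueError if no h qualifies, but the post argues h_v is always
    well defined for the actual sequences arising in the proof."""
    ell = len(xs) - 1
    if xs[ell] >= 2 * m / math.log(m):
        return ell + 1
    # first epoch whose cut-off volume at most doubles
    return min(h for h in range(1, ell + 1) if xs[h] <= 2 * xs[h - 1])
```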

Now, for each vertex $v\in S^g$, we compute its $x_{h_v-1}$; supposing this real value lies in $[2^b, 2^{b+1}]$, we put $v$ into $S^g_b$. See the picture below for the definition of $h_v$, at least in the case $x_{h_v} \leq 2 x_{h_v-1}$.

## Claim 1

Starting from a vertex $v\in S_b^g$, to show that $[t_{h_v-1}, t_{h_v}]$ is a good epoch, we want to show that for all $t\in (t_{h_v-1},t_{h_v}]$, condition (C.4) is satisfied. (Recall that (C.4) is independent of the choice of $j$.) There are two possibilities:

• If $h_v$ satisfies $x_{h_v}\leq 2x_{h_v-1}$, then to move to the next benchmark one needs to at most double the volume. This means the slope $I_x(q_t,\cdot)$ must be at least $\frac{1}{\log m \cdot 2^b}$ in this range, which further implies that $I_x(q_t,2^b)$ is large, i.e., (C.4) is satisfied. In detail, $I_x (q_t, x_{h_v-1}) \approx \big( I(q_t, x_{h_v}) - I(q_t,x_{h_v-1}) \big) / (x_{h_v}-x_{h_v-1})$, but the numerator is almost $\frac{1}{\log m}$ by the definition of the $x_h$'s, while the denominator is almost $x_{h_v-1}\approx 2^b$ by the definition of $S_b^g$.
• If $x_\ell \geq 2m/\log m$, this means the last benchmark, which is a constant fraction $O(1)<1$ of the probability mass, is reached very late, at volume $x_\ell$. By concavity, $I_x(q_t,x_\ell) \geq (1-O(1)) / 2m = \Omega(1/m)$. This easily implies (C.4).

So far, I hope I have convinced you that (C.4) always holds within the good epoch $t\in (t_{h_v-1},t_{h_v}]$.

## Claim 2

This is the more involved claim. Basically, if (C.1) is never true, then using Lemma 2.9 one can show that the random walk mixes very quickly during the epoch $[t_{h-1},t_h]$ (letting $h=h_v$), and at the end of this epoch, $I(q_{t_h},x)<\frac{3x}{5m}+\sqrt{\hat{x}}\Big(1-\phi^2/2\Big)^{t_1}$. This contradicts many things. For instance, if $x_\ell<2m/\log m$, then by plugging in $x=x_{h}\leq x_\ell < 2m/\log m$, one sees that $I(q_{t_h},x)$ is $O(1/\log m)$, contradicting the definition of $x_{h}$, which says that $I(q_{t_{h}},x_{h}) \geq (h+1)/\log m$. (The other case, when $h=\ell+1$, is similar.)

Even though (C.1) can sometimes be true, one can similarly show that letting (C.2) or (C.3) fail is not a good idea either, because again the random walk would mix quickly. I will not cover those two cases.

## Summary

The proof of Nibble in [ST08a] is, in my understanding, twice as difficult as that in [ST04], while the latter provides a weaker conductance guarantee. I do not have time to write a separate post on [ST04].

To sum up, the high-level idea of Nibble is to perform a random walk; half of the starting vertices in $S$ (by volume, forming $S^g$) are good. For each of them there exists an epoch in which the speed at which the probability mass moves forward slows down, and during this epoch, (C.4) is satisfied. One then uses the fact that this epoch is long enough: if the conductance were large all the time, the probability mass could not have moved that slowly, yielding a contradiction.

The last remark is that we do not know $S$, or even $S_b^g$, in advance. Therefore, by randomly selecting a vertex $v\in V$ from the whole graph, randomly selecting a radius $b$ (with probability proportional to $2^{-b}$), and then running Nibble, one can construct a new procedure RandomNibble. RandomNibble guarantees to cover each set $S$ satisfying (N.2) by a constant expected number of vertices, and its expected running time is only $O(\log^7 m / \phi^4)$. Next, by running RandomNibble again and again, shaving off sparse cuts and continuing with the remaining graph, one obtains a good balanced-cut pseudo-approximation, known as the procedure Partition. For the detailed definition of Partition and its application to sparsifiers, see my other posts on [ST08b]: part I, part II.
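The random sampling in RandomNibble can be sketched as follows. This is my own hedged sketch: I sample $v$ with probability proportional to its degree (as I recall the paper does), and the signature I assume for the `nibble` callback is illustrative, not fixed by [ST08a]:

```python
import math
import random

def random_nibble(graph, phi, nibble, rng=random):
    """Sketch of RandomNibble: pick v with Pr[v] proportional to deg(v), pick
    b in {1, ..., ceil(log2 m)} with Pr[b] proportional to 2^{-b}, then run
    the given nibble(graph, v, phi, b) procedure."""
    vertices = list(graph)
    m = sum(len(graph[v]) for v in vertices) // 2     # number of edges
    v = rng.choices(vertices, weights=[len(graph[u]) for u in vertices])[0]
    lmax = max(1, math.ceil(math.log2(m)))
    b = rng.choices(range(1, lmax + 1),
                    weights=[2.0 ** -i for i in range(1, lmax + 1)])[0]
    return nibble(graph, v, phi, b)
```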