## SVD and Eigenvectors for Big Data

It is commonly believed that a PC can handle SVD only for matrices up to a few thousand by a few thousand. Textbooks also suggest that it is unwise to compute singular vectors one by one. In this post, I'll refute both statements. In particular, on a PC, within 10 seconds and without any parallelism, we can find the top 20 eigenvectors of a 260k x 260k matrix with 1.2 million nonzero entries. Moreover, the algorithm simply invokes rank-1 SVD 20 times.

# 1. A Toy Problem and An Obvious Solution

Since SVD reduces to the eigenvector problem, I'll only describe the latter for simplicity. Given a symmetric, positive semidefinite (PSD) matrix $A\in\mathbb{R}^{d\times d}$, we wish to find its leading eigenvector. Call this toy problem 1-PCA.

Starting from a random vector $v_0$, the power method repeatedly multiplies by $A$ and normalizes; after $T$ iterations, the resulting unit vector $v_T \propto A^T v_0$ is a good approximation to 1-PCA. More specifically, denoting by $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ the eigenvalues of $A$ and by $\nu_1,\dots,\nu_d\in\mathbb{R}^d$ the corresponding eigenvectors, we have:

• If $T\geq \Omega\big( \frac{1}{gap} \log (1/\varepsilon) \big)$, where $gap := \frac{\lambda_1 - \lambda_2}{\lambda_1}$, then the output satisfies $\| v_T - \nu_1 \|_2 \leq \varepsilon$.
• If $T\geq \Omega \big( \frac{1}{\delta} \log(1/\delta) \big)$, then the output satisfies $v_T^\top A v_T \geq (1-\delta) \lambda_1$.
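As a concrete sketch, the power method above can be written in a few lines of numpy (matrix size and iteration count below are illustrative, not from the post):

```python
import numpy as np

def power_method(A, T, rng):
    """Repeatedly multiply by A and normalize; returns a unit vector
    whose Rayleigh quotient v^T A v approaches lambda_1."""
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(T):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

# Tiny PSD example: A = B B^T is symmetric positive semidefinite.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T
v = power_method(A, T=1000, rng=rng)
lam1 = np.linalg.eigvalsh(A)[-1]   # exact top eigenvalue, for reference
print(v @ A @ v / lam1)            # close to 1
```

Each iteration is one matrix-vector multiplication, which is the only way the algorithm touches $A$.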

The first statement is stronger since $v_T$ converges to the eigenvector $\nu_1$, but it relies on the (relative) eigengap $gap=\frac{\lambda_1-\lambda_2}{\lambda_1} \leq 1$ being large. In particular, if $\lambda_1=\lambda_2$ the first statement is useless. This is why we often hear that “the power method relies on the eigengap and therefore is not always useful”. I disagree: if the eigengap is small or even zero, then the problem of “finding $\nu_1$” is itself an uninteresting question. For instance, if two eigenvectors have nearly the same eigenvalue, then there is no need whatsoever in real life to distinguish between them. In other words, only problems that have stable solutions are meaningful.

To this extent, the second statement is a stable one: regardless of the size of the eigengap, we always converge to a unit vector $v_T$ whose Rayleigh quotient $v_T^\top A v_T$ approaches the optimum. This is a meaningful solution to a meaningful question.

We call the first statement the gap-dependent one and the second the gap-free one.

# 2. Two Less Obvious Solutions to 1-PCA

The power method is fast because it reduces the problem to matrix-vector multiplications, which are highly efficient when the original matrix $A$ has a sparse representation. There are two further improvements one can make beyond the power method.

• Acceleration. One can use the Lanczos method to reduce the number of iterations quadratically, from $T=\Omega\big( \frac{1}{\delta} \log (1/\delta) \big)$ to $T=\Omega\big( \frac{1}{\sqrt{\delta}} \log (1/\delta) \big)$, in the gap-free case.
• Stochasticity. If $A=\frac{1}{n}\sum_{i=1}^n A_i$ is decomposed into pieces, one can compute matrix-vector multiplications stochastically (i.e., use $A_i w$ to approximate $A w$ for any vector $w$). For instance, if each $A_i$ has sparsity $O(d)$, then the total running time can be improved from $T\times nd = \tilde{O}(nd / \sqrt{\delta})$ (for Lanczos) to roughly $\tilde{O}(nd + n^{3/4}d / \sqrt{\delta})$. This was first obtained by this paper and this paper.
Opinions differ on whether this result is tight. At the same time, unless the matrix is truly huge (billions of entries or more), the current stochastic solvers remain slower than the Lanczos method in practice (in my own experience).

The above improvements also apply to the gap-dependent case, but we omit that for simplicity.
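In practice one rarely implements Lanczos by hand: SciPy's `eigsh`, for example, wraps ARPACK's implicitly restarted Lanczos and touches the matrix only through matrix-vector products, so sparsity is exploited automatically. A minimal sketch on a random sparse PSD matrix (size and density are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Random sparse PSD matrix A = B B^T.
B = sp.random(500, 500, density=0.01, format="csr", random_state=1)
A = (B @ B.T).tocsr()

# which="LA" asks for the largest algebraic eigenvalue; ARPACK's
# Lanczos iteration only uses matrix-vector products with A.
vals, vecs = eigsh(A, k=1, which="LA")
print(vals[0])   # lambda_1 of A
```

This is the 1-PCA building block that the rest of the post calls repeatedly.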

# 3. Back to the General k-PCA Problem

Next we study k-PCA, the problem of finding the top $k$ eigenvectors of the matrix $A$. Think of $k$ as small compared to the dimension $d$. What can we do now?

• How about invoking 1-PCA $k$ times? After each eigenvector $v_i$ is obtained, project $A$ onto $v_i^\bot$ and perform 1-PCA again. We call this the LazyPCA approach.
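The projection step in this bullet never requires forming the projected matrix explicitly: it suffices to wrap the matvec of $A$ as $(I - V V^\top) A (I - V V^\top)$. A minimal numpy sketch (the helper name `deflate` is mine):

```python
import numpy as np

def deflate(matvec, vs):
    """Wrap a matvec w -> A w so it computes (I - V V^T) A (I - V V^T) w,
    i.e. A with the (orthonormal) directions in vs projected out."""
    def project(w):
        for v in vs:
            w = w - (v @ w) * v
        return w
    return lambda w: project(matvec(project(w)))

# Example: A has eigenvalues 5 > 3 > ...; after projecting out the top
# eigenvector, the leading eigenvalue of the deflated operator is 3.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5, 0.1]) @ Q.T
Ap = deflate(lambda w: A @ w, [Q[:, 0]])

M = np.column_stack([Ap(e) for e in np.eye(6)])   # densify A' for checking
print(np.linalg.eigvalsh(M)[-1])                  # close to 3.0
```

Each application of the deflated operator costs one matvec with $A$ plus $O(d)$ work per projected direction.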

If one looks into textbooks or research papers, people often say that LazyPCA “does not work well”. The argument generally goes: “we must compute each $v_i$ almost perfectly, but that forces the 1-PCA algorithm to run very slowly and to depend on all of the first $k$ eigengaps”. Because of this general belief, many direct algorithms such as the block power method and the block Lanczos method have been developed for k-PCA, and they are considered the “right” methods in textbooks.

# 4. Surprisingly, General Belief is Wrong

In a recent result, we showed that LazyPCA (the approach of calling 1-PCA $k$ times) surprisingly runs faster than all direct algorithms.

Before presenting why this is true, let us recall what the (gap-dependent) performance of a direct algorithm looks like. If there is an eigengap $gap=\frac{\lambda_k-\lambda_{k+1}}{\lambda_k}$ between the $k$-th and $(k+1)$-th eigenvalues, then block Lanczos finds a subspace $V \in \mathbb{R}^{d\times k}$ that is $\varepsilon$-close to the span of $\nu_1,\dots,\nu_k$ using $O(\frac{k}{\sqrt{gap}} \log(1/\varepsilon))$ matrix-vector multiplications, plus some overhead.

Now consider LazyPCA. We claim that most 1-PCA algorithms produce a vector $v_1$ that almost certainly lies in the span of $\nu_1,\dots,\nu_k$. If Lanczos is used as this 1-PCA algorithm, the complexity is $\tilde{O}(1/\sqrt{gap})$ matrix-vector multiplications, where $gap = \frac{\lambda_k-\lambda_{k+1}}{\lambda_k}$. This complexity is independent of the gaps among the first $k$ eigenvalues themselves.

Next, by the Cauchy interlacing lemma, the new matrix $A'$ (defined as $A$ with the direction of $v_1$ projected out) has its first $k-1$ eigenvalues “interlacing” between the first $k$ eigenvalues of $A$, and its last $d-k$ eigenvalues almost the same as those of $A$. See the picture below:

In other words, the $(k-1)$-th eigengap of this next matrix $A'$ is no smaller than the old value $gap$, so we can call 1-PCA on $A'$ and repeat this procedure $k$ times. The total number of matrix-vector multiplications is $\tilde{O}(k/\sqrt{gap})$, the same as block Lanczos.
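Putting the pieces together, here is a hedged sketch of the whole procedure (not the authors' code): `eigsh` serves as the 1-PCA subroutine, and a `LinearOperator` applies the deflated matrix $A'$ without ever forming it.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def lazy_pca(A, k, rng):
    """LazyPCA sketch: run 1-PCA (Lanczos via eigsh) k times, each time
    on A with the previously found directions projected out."""
    d = A.shape[0]
    vs = []
    for _ in range(k):
        def matvec(w):
            # Apply (I - V V^T) A (I - V V^T) w without forming A'.
            for v in vs:
                w = w - (v @ w) * v
            w = A @ w
            for v in vs:
                w = w - (v @ w) * v
            return w
        op = LinearOperator((d, d), matvec=matvec, dtype=np.float64)
        _, vec = eigsh(op, k=1, which="LA", v0=rng.standard_normal(d))
        vs.append(vec[:, 0])
    return np.column_stack(vs)

rng = np.random.default_rng(3)
B = rng.standard_normal((100, 100))
A = B @ B.T                      # dense PSD test matrix
V = lazy_pca(A, k=5, rng=rng)
print(np.diag(V.T @ A @ V))      # approximately the top 5 eigenvalues of A
print(np.linalg.eigvalsh(A)[::-1][:5])
```

Replacing the dense `A @ w` with a sparse or stochastic matvec changes nothing else in the routine, which is exactly why the framework composes so cleanly with the 1-PCA improvements of Section 2.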

# 5. Therefore…

Performing 1-Lanczos $k$ times thus gives a LazyPCA algorithm that requires the same number of matrix-vector multiplications as block k-Lanczos. If one includes the overhead beyond matrix-vector multiplications, then LazyPCA runs even faster than block Lanczos. For the 260k x 260k matrix I mentioned at the beginning, LazyPCA finds the first 20 eigenvectors in 10 seconds, while block Lanczos requires more than 40 seconds.

## Several more surprising consequences

• The same LazyPCA approach also applies to the gap-free setting, with more involved proofs. That is, even if there is no eigengap between $\lambda_k$ and $\lambda_{k+1}$, the approach still finds $k$ unit vectors $v_1,\dots,v_k$ satisfying $v_i^\top A v_i \geq (1-\delta)\lambda_i$, in time even faster than block Lanczos. This relies on a version of Wedin's theorem that we proved, which replaces the use of the interlacing lemma above. See our paper for the statement and proof.
• If the 1-PCA algorithm is stochastic (see Section 2 above), then the LazyPCA approach gives the first stochastic and accelerated performance for the k-PCA problem. In contrast, no stochastic version of the block Lanczos method is known.
• The LazyPCA framework applies not only to PCA/SVD but also to other eigen-flavored problems, including canonical correlation analysis (CCA) and generalized eigenvectors, where it outperforms all known iterative methods. See this paper.

## The punchline again

If one ever wants to compute the top $k$ eigenvectors / singular vectors / correlation vectors / generalized eigenvectors, just compute the top 1 repeatedly, $k$ times! Isn't this simple?