SVD and Eigenvectors for Big Data

People generally believe that a PC can handle SVD only for matrices up to thousands by thousands. Textbooks also suggest that it is unwise to compute singular vectors one by one. In this post, I’ll refute both statements. In particular, on a PC, within 10 seconds and without any parallelism, we can find the top 20 eigenvectors of a 260k x 260k sparse matrix with 1.2 million nonzero entries. Moreover, the algorithm is simply to invoke rank-1 SVD 20 times.

1. A Toy Problem and An Obvious Solution

Since SVD reduces to the eigenvector problem, I’ll only describe the latter for simplicity. Given a symmetric, positive semidefinite (PSD) matrix A\in\mathbb{R}^{d\times d}, we wish to find its leading eigenvector. Call this toy problem 1-PCA.

Starting from a random vector v_0, the power method repeatedly multiplies by A and normalizes; after T iterations, the resulting unit vector v_T \propto A^T v_0 is a good approximation to 1-PCA. More specifically, denoting by \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0 the eigenvalues of A and by \nu_1,\dots,\nu_d\in\mathbb{R}^d the corresponding eigenvectors, we have:

  • If T\geq \Omega\big( \frac{1}{gap} \log (1/\varepsilon) \big), where gap := \frac{\lambda_1 - \lambda_2}{\lambda_1}, then v_T satisfies \| v_T - \nu_1 \|_2 \leq \varepsilon.
  • If T\geq \Omega \big( \frac{1}{\delta} \log(1/\delta) \big), then v_T satisfies v_T^\top A v_T \geq (1-\delta) \lambda_1.

The first statement is stronger since v_T converges to the eigenvector \nu_1, but it also relies on the (relative) eigengap gap=\frac{\lambda_1-\lambda_2}{\lambda_1} \leq 1 being large. In particular, if \lambda_1=\lambda_2 the first statement is useless. This is why we often hear “the power method relies on the eigengap and therefore is not always useful”. I disagree with this, because if the eigengap is small or even zero, then the problem of “finding \nu_1” is itself not an interesting question. For instance, if two eigenvectors have essentially the same eigenvalue, then there is no need whatsoever in real life to distinguish between them. In other words, only problems that have stable solutions are meaningful.

To this extent, the second statement is a stable one: regardless of the size of the eigengap, we can always converge to a unit vector v_T whose Rayleigh quotient v_T^\top A v_T converges to the optimum \lambda_1. This is a meaningful solution to a meaningful question.

We call the first statement the gap-dependent guarantee and the second the gap-free guarantee.
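
To make this concrete, here is a minimal sketch of the power method in Python/NumPy. The matrix, the iteration count T, and the random seed are placeholders for illustration; this is not the code behind the experiments mentioned above.

    import numpy as np

    def power_method(A, T, rng=None):
        """Approximate the leading eigenvector of a symmetric PSD matrix A
        by T rounds of multiply-and-normalize."""
        rng = np.random.default_rng() if rng is None else rng
        v = rng.standard_normal(A.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(T):
            v = A @ v                    # one matrix-vector multiplication
            v /= np.linalg.norm(v)
        return v

    # toy usage on a random PSD matrix
    rng = np.random.default_rng(0)
    B = rng.standard_normal((100, 100))
    A = B @ B.T                          # symmetric PSD
    v = power_method(A, T=200, rng=rng)
    print(v @ A @ v, np.linalg.eigvalsh(A)[-1])  # Rayleigh quotient vs. true lambda_1

The Rayleigh quotient printed at the end is exactly the gap-free quantity in the second statement: it approaches \lambda_1 even when the eigengap is tiny.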

2. Two Less Obvious Solutions to 1-PCA

The power method is very fast because it reduces the problem to matrix-vector multiplications, which can be highly efficient if the matrix A has a sparse representation. There are two further improvements one can make beyond the power method.

  • Acceleration. One can use the Lanczos method to reduce the number of iterations quadratically, from T=\Omega\big( \frac{1}{\delta} \log (1/\delta) \big) to T=\Omega\big( \frac{1}{\sqrt{\delta}} \log (1/\delta) \big), in the gap-free case. (A minimal sketch using an off-the-shelf Lanczos solver appears at the end of this section.)
  • Stochasticity. If A=\frac{1}{n}\sum_{i=1}^n A_i is decomposed into pieces, one can compute the matrix-vector multiplications stochastically (i.e., use A_i w to approximate A w for a vector w). For instance, if each A_i has sparsity O(d), then the total running time can be improved from T\times nd = \tilde{O}(nd / \sqrt{\delta}) (for Lanczos) to roughly \tilde{O}(nd + n^{3/4}d / \sqrt{\delta}). This was first obtained by this paper and this paper.
    People have formed diverse opinions about whether this result is tight. At the same time, unless the matrix is truly huge (billions of entries or more), current stochastic solvers remain slower than the Lanczos method in practice (in my own experience).

The above improvements also apply to the gap-dependent case, but I omit the details for simplicity.
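
As an illustration of the accelerated route, the Lanczos method is available off the shelf. The sketch below uses scipy.sparse.linalg.eigsh (an ARPACK-based Lanczos solver) on a random sparse PSD matrix; the matrix and its size are placeholders, not the data from the experiment mentioned above.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    # placeholder: a random sparse symmetric PSD matrix
    d = 10000
    M = sp.random(d, d, density=1e-4, format='csr', random_state=0)
    A = M @ M.T                        # symmetric PSD, still sparse

    # top eigenvalue/eigenvector via Lanczos; only matrix-vector products are needed
    lam, v = eigsh(A, k=1, which='LA')
    print(lam[0], v.shape)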

3. Back to the General k-PCA Problem

Next we study k-PCA, the problem of finding the top k eigenvectors of a matrix A. Think of k as a small quantity compared to the dimension d. What can we do now?

  • How about invoking 1-PCA k times? After each eigenvector v_i is obtained, project A onto the subspace orthogonal to v_i and perform 1-PCA again. We call this the LazyPCA approach.

If one looks into textbooks or research papers, people often say that LazyPCA “does not work well”. The argument generally goes: “we have to compute each v_i nearly perfectly, which forces the 1-PCA algorithm to run very slowly and to depend on all of the first k eigengaps.” Because of this general belief, many direct algorithms, such as the block power method or the block Lanczos method, have been developed for k-PCA and are considered the “right” methods in textbooks; a minimal sketch of the block power method follows below.
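
For reference, here is a minimal sketch of the block power method, the simplest of these direct algorithms; it keeps a d-by-k block orthonormal with a QR factorization after every multiplication. The matrix, k, and T are placeholders.

    import numpy as np

    def block_power_method(A, k, T, rng=None):
        """Approximate the top-k eigenvectors of a symmetric PSD matrix A:
        multiply a d-by-k block by A and re-orthonormalize, T times."""
        rng = np.random.default_rng() if rng is None else rng
        V = rng.standard_normal((A.shape[0], k))
        V, _ = np.linalg.qr(V)
        for _ in range(T):
            V = A @ V                    # k matrix-vector multiplications
            V, _ = np.linalg.qr(V)       # keep the block orthonormal
        return V                         # columns approximately span nu_1, ..., nu_k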

4. Surprisingly, General Belief is Wrong

In a recent result, we showed that LazyPCA (the approach of calling 1-PCA k times) surprisingly runs faster than all direct algorithms.

Before presenting why this is true, let us recall what the (gap-dependent) performance of a direct algorithm looks like. If there is an eigengap gap=\frac{\lambda_k-\lambda_{k+1}}{\lambda_k} between the k-th and (k+1)-th eigenvalues, then block Lanczos finds a subspace V \in \mathbb{R}^{d\times k} that is \varepsilon-close to the span of \nu_1,\dots,\nu_k using O\big(\frac{k}{\sqrt{gap}} \log(1/\varepsilon)\big) matrix-vector multiplications, plus some overhead.

Now consider LazyPCA. We claim that most 1-PCA algorithms produce a vector v_1 that lies almost entirely in the span of \nu_1,\dots,\nu_k. If Lanczos is used as the 1-PCA algorithm, the complexity is \tilde{O}(1/\sqrt{gap}) matrix-vector multiplications, where gap = \frac{\lambda_k-\lambda_{k+1}}{\lambda_k}. This complexity is independent of the gaps among the first k eigenvalues themselves.

Next, by the Cauchy interlacing lemma, the new matrix A' (defined as A with the direction of v_1 projected out) has its first k-1 eigenvalues interlacing between the first k eigenvalues of A, and its last d-k eigenvalues almost identical to those of A. See the figure below:

[Figure: k-svd-interlacing, showing the eigenvalues of A' interlacing between the top k eigenvalues of A]

In other words, the (k-1)-th eigengap of this next matrix A' is no smaller than the old value gap, so we can call 1-PCA on A' and repeat this procedure k times. The total number of matrix-vector multiplications is \tilde{O}(k/\sqrt{gap}), the same as block Lanczos.
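
Here is a minimal sketch of this LazyPCA procedure, using scipy's Lanczos solver eigsh as the 1-PCA subroutine and a LinearOperator to apply the deflated matrix (I - V V^T) A (I - V V^T) without ever forming it. It illustrates the idea only; it is not the code behind the timings reported in this post, and eigsh is just one possible choice of 1-PCA solver.

    import numpy as np
    from scipy.sparse.linalg import eigsh, LinearOperator

    def lazy_pca(A, k):
        """Find k approximate top eigenvectors of a symmetric PSD matrix A by
        calling 1-PCA (Lanczos) k times, deflating the directions found so far."""
        d = A.shape[0]
        V = np.zeros((d, 0))
        for _ in range(k):
            def matvec(w, V=V):
                w = w - V @ (V.T @ w)          # project out previous eigenvectors
                w = A @ w                      # multiply by A
                return w - V @ (V.T @ w)       # project again to stay in their complement
            A_deflated = LinearOperator((d, d), matvec=matvec, dtype=A.dtype)
            _, v = eigsh(A_deflated, k=1, which='LA')   # 1-PCA on the deflated matrix
            V = np.hstack([V, v])
        return V

    # usage on any symmetric PSD matrix A (dense array or scipy sparse matrix):
    #   V = lazy_pca(A, k=20)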

5. Therefore…

Performing 1-Lanczos k times thus gives a LazyPCA algorithm that requires the same number of matrix-vector multiplications as block k-Lanczos. If one includes the overhead beyond matrix-vector multiplications, LazyPCA runs even faster than block Lanczos. For the 260k x 260k matrix mentioned at the beginning, LazyPCA finds the first 20 eigenvectors in 10 seconds, whereas block Lanczos requires more than 40 seconds.

Several more surprising consequences

  • The same LazyPCA approach also applies to the gap-free setting, with a more involved proof. That is, even if there is no eigengap between \lambda_k and \lambda_{k+1}, the approach still finds k unit vectors v_1,\dots,v_k satisfying v_i^\top A v_i \geq (1-\delta)\lambda_i, in time even faster than block Lanczos. This requires a version of Wedin’s theorem that we proved, which replaces the use of the interlacing lemma above. See our paper for the statement and proof.
  • If the 1-PCA algorithm is stochastic (see Section 2 above), then the LazyPCA approach gives the first stochastic & accelerated performance for the k-PCA problem. In contrast, there is no known stochastic version of the block Lanczos method.
  • This LazyPCA framework applies not only to PCA/SVD but also to other eigen-flavored problems, including canonical correlation analysis (CCA) and generalized eigenvectors, and it outperforms all known iterative methods. See this paper.

The punchline again

If one ever wants to compute the top k eigenvectors / singular vectors / correlation vectors / generalized eigenvectors, just compute the top one repeatedly, k times! Isn’t this simple?
