People generally believe that a PC can handle SVD for matrices only up to thousands by thousands. Textbooks also suggest that it is not wise to compute singular vectors one by one. In this post, I’ll refute both statements. In particular, on a PC within 10 seconds without even parallelism, we can find the top 20 eigenvectors of a 260k x 260k matrix with 1.2 million entries. Moreover, the algorithm is simply to invoke rank-1 SVD 20 times.
1. A Toy Problem and An Obvious Solution
Since SVD reduces to the eigenvector problem, I’ll only describe the latter for simplicity. Given matrix that is symmetric, positive semidefinite (PSD), we wish to find the leading eigenvector of . Call this toy problem 1-PCA.
From a random vector , one can apply power method to compute and normalize it. The resulting vector is a good approximation to 1-PCA. More specifically, denoting by the eigenvalues of and the corresponding eigenvectors, then we have:
- If where then it satisfies .
- If then it satisfies
The first statement is stronger since converges to the eigenvector , but it also relies on the (relative) eigengap being large. In particular, if the first statement is useless. This is why we often hear “power method relies on eigengap and therefore is not always useful”. I disagree with this, because if the eigengap is small or even zero, then the problem of “finding ” itself is a non-interesting question. For instance, if there are two eigenvectors that give relatively the same eigenvalue, then there is no need what-so-ever in real life to distinguish the two. In other words, only problems that have stable solutions are meaningful.
To this extend, the second statement is a stable one: regardless of the size of the eigengap, we can always converge to a unit vector whose Rayleigh quotient converges to optimal. This is a meaningful solution to a meaningful question.
We call the first statement the gap-dependent and the second statement the gap-free.
2. Two Less Obvious Solutions to 1-PCA
Power method is very fast because it reduces the problem to matrix-vector multiplications, which can be highly efficient if the original matrix has a sparse representation. There are two more improvements one can do beyond power method.
- Acceleration. One can use Lanczos method to reduce the number of iterations quadratically from to for the gap-free case.
- Stochasticity. If is decomposed into pieces, one can compute matrix-vector multiplications stochastically (i.e., to use to approximate for any vector ). For instance, if each is of sparsity then the total running time can be improved from (of Lanczos) to roughly . This is first obtained by this paper and this paper.
People formed diverse opinions about whether the result is tight. At the same time, unless the matrix is very huge (more than billions of entries), the current stochastic solver remains slower than the Lanczos method in practice (my own experience).
The above improvements also hold for the gap-dependent case, but ignored for simplicity.
3. Back to the General k-PCA Problem
Next we study k-PCA, the problem of finding the top eigenvectors of matrix . Think about k being a small quantity as compared to the dimension d. What can we do now?
- How about invoking 1-PCA times? After each eigenvector is obtained, try to project to and perform 1-PCA again? We call this the LazyPCA approach.
If one looks into textbooks or research papers, people often say that LazyPCA “does not work well”. Their argument generally looks like “we have to compute close to perfect, but that will cost the 1-PCA algorithm to run very slowly and depend on all the first eigengaps“. Because of this general belief, many direct algorithms such as the block power method or block Lanczos method have been developed for k-PCA and are considered the “right” methods in textbooks.
4. Surprisingly, General Belief is Wrong
In a recent result, we showed that LazyPCA (the approach of calling 1-PCA times) surprisingly runs faster than all direct algorithms.
Before presenting why this is true, let us recall how the (gap-dependent) performance looks like for a direct algorithm. If there is an eigengap between the k-th and (k+1)-th eigenvalues, then block Lanczos finds a subspace that is close to the span of using matrix-vector multiplications plus some overhead.
Now consider LazyPCA. We claim that most 1-PCA algorithms produce a vector that almost certainly lies in the span of . If Lanczos is used as this 1-PCA algorithm, the complexity is matrix-vector multiplications where . This complexity is independent from the gaps between the first k eigenvalues themselves.
Next, by the Cauchy interlacing lemma, the new matrix ——which is defined as but projecting out the dimension of —— has its first eigenvalues “interlacing” between the first eigenvalues of , and has its last eigenvalues almost the same as those of . See the picture below:
In other words, the (k-1)-th eigengap of this next matrix is no smaller than the old value , so we can continue to call 1-PCA on and repeat this procedure for times. The total number of matrix-vector multiplications is , the same as block Lancsoz.
Performing 1-Lanczos times thus gives a LazyPCA algorithm that requires the same number of matrix-vector multiplications as the block k-Lanczos. If one includes the overhead besides matrix-vector multiplications, then LazyPCA runs even faster than block Lanczos. For the 260k x 260k matrix I mentioned at the beginning, LazyPCA finds the first 20 eigenvectors in 10sec but block Lanczos requires more than 40sec.
Several more surprising consequences
- The same LazyPCA approach also applies to gap-free settings but with more involved proofs. That is, even if there is no eigengap between and , the approach still finds k unit vectors satisfying in time even faster than block Lanczos. This requires a version of Wedin’s theorem that we proved and replaces the use of the interlacing lemma above. See our paper for the statement and proof.
- If the 1-PCA algorithm is stochastic (see Section 2 above), then the LazyPCA approach gives the first stochastic & accelerated performance for the k-PCA problem. In contrast, there is no known stochastic version of the block Lanczos method.
- This LazyPCA framework not only applies to PCA/SVD, but also other eigen-flavored problems, including canonical correlation analysis (CCA) and generalized eigenvectors. It outperforms all known iterative methods. See this paper.
The punchline again
If one ever wants to compute the top k eigenvectors / singularvectors / correlation vectors / generalized eigenvectors, just compute the top 1 repeatedly for k times! Isn’t this simple?