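When minimizing a convex function $f(x)$ using first-order methods, if full gradients $\nabla f(x)$ are too costly to compute at each iteration, there are two alternatives that can reduce this per-iteration cost. One is to use a (random) coordinate gradient $\nabla_i f(x)$, that is, the $i$-th partial derivative for a random coordinate $i$, and the other is to use a stochastic gradient $\tilde{\nabla} f(x)$ satisfying $\mathbb{E}[\tilde{\nabla} f(x)] = \nabla f(x)$.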
I focus on coordinate (gradient) descent (CD) in this tutorial, and leave stochastic gradient descent (SGD) to a future post. Part of this survey has also appeared in my conference talk at ICML.[[Remark: CD and SGD have fundamentally different analyses behind them, although I do sometimes see applied researchers confuse one with the other. For instance, one may ask: if I define $\tilde{\nabla} f(x) = n \nabla_i f(x) e_i$ for a uniformly random coordinate $i$ (with $e_i$ the $i$-th standard basis vector), then don’t I automatically reduce CD to SGD? While this is a true statement, it only shows (1) one can use SGD to solve CD, but I promise you that analysis does not provide the tightest convergence rate, and (2) this reduction is not reversible.]]
Review of Gradient Descent (GD)
Gradient descent is usually analyzed when the function $f$ is smooth with respect to some parameter $L$. One of the (almost-)equivalent definitions of smoothness is to say that $\|\nabla^2 f(x)\|_2 \le L$ for every $x$, that is, the largest eigenvalue of the Hessian at any point is at most $L$.
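Smoothness implies that if one moves from $x$ to $x' = x - \frac{1}{L}\nabla f(x)$, then it must satisfy $f(x') \le f(x) - \frac{1}{2L}\|\nabla f(x)\|_2^2$. In other words, the objective not only has to decrease, but has to decrease by at least $\frac{1}{2L}\|\nabla f(x)\|_2^2$. Using this property plus a few lines of proof, one can turn this into a convergence statement: after applying this update $T$ times, it satisfies $f(x_T) - f(x^*) \le O\big(\frac{L\|x_0 - x^*\|_2^2}{T}\big)$.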
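As a sanity check, here is a minimal sketch of the gradient step and the decrease it guarantees, on a small quadratic $f(x) = \frac{1}{2}x^\top A x - b^\top x$. The matrix `A`, the vector `b`, the starting point and the helper names are all my own illustrative choices, not part of any library; for this `A` the largest eigenvalue is 3, so I take `L = 3`.

```python
def matvec(A, x):
    """Dense matrix-vector product for small examples."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def f(A, b, x):
    """Quadratic objective f(x) = 0.5 * x^T A x - b^T x."""
    Ax = matvec(A, x)
    return 0.5 * sum(xi * yi for xi, yi in zip(x, Ax)) - sum(bi * xi for bi, xi in zip(b, x))

def grad(A, b, x):
    """Full gradient A x - b."""
    return [yi - bi for yi, bi in zip(matvec(A, x), b)]

def gd_step(A, b, x, L):
    """One gradient step x' = x - (1/L) * grad f(x)."""
    return [xi - gi / L for xi, gi in zip(x, grad(A, b, x))]

# Hypothetical test problem: the eigenvalues of A are 1 and 3, so L = 3.
A = [[2.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
L = 3.0

x = [5.0, -3.0]
slack = []  # (guaranteed value) - (actual value); should never be negative
for _ in range(50):
    g = grad(A, b, x)
    x_next = gd_step(A, b, x, L)
    guaranteed = f(A, b, x) - sum(gi * gi for gi in g) / (2.0 * L)
    slack.append(guaranteed - f(A, b, x_next))
    x = x_next
```

Every entry of `slack` should be non-negative up to rounding, which is exactly the per-step decrease claimed above, and `x` should end up near the minimizer $(1/3, 1/3)$.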
One-Lined Proof of Coordinate Descent (CD)
We say $f$ is coordinate-smooth with respect to parameter $L_c$ if $|\nabla^2_{ii} f(x)| \le L_c$ for each coordinate $i$ and every $x$. Suppose $f$ has $n$ coordinates, then it is easy to verify that $L_c \le L \le n L_c$. [[Recall, an $n \times n$ all-one matrix has maximum eigenvalue $n$ but all diagonal entries are 1.]]
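Coordinate-smoothness implies that for every $i$, if I move from $x$ to $x' = x - \frac{1}{L_c}\nabla_i f(x) e_i$, then it satisfies $f(x') \le f(x) - \frac{1}{2L_c}|\nabla_i f(x)|^2$. In particular, if I choose $i$ uniformly at random in $[n]$, then I have the guarantee
$$\mathbb{E}[f(x')] \le f(x) - \frac{1}{2L_c}\mathbb{E}_i\big[|\nabla_i f(x)|^2\big] = f(x) - \frac{1}{2nL_c}\|\nabla f(x)\|_2^2 .$$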
Plugging this inequality into the same “few lines of proof” as in GD, we have $\mathbb{E}[f(x_T)] - f(x^*) \le O\big(\frac{nL_c\|x_0 - x^*\|_2^2}{T}\big)$.
To sum up, coordinate descent (CD) has a number of iterations that is $\frac{nL_c}{L} \ge 1$ times greater than GD. However, since each iteration of CD is usually $n$ times faster to compute than GD, in the end one obtains a speed up factor $\frac{L}{L_c} \in [1, n]$.
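To make the comparison concrete, here is a minimal sketch of randomized CD on a quadratic whose Hessian is the identity plus the all-one matrix. The problem instance and the function names are hypothetical choices of mine; for this matrix every diagonal entry is 2 and the largest eigenvalue is $n + 1$.

```python
import random

def f(A, b, x):
    """Quadratic objective f(x) = 0.5 * x^T A x - b^T x."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return 0.5 * sum(xi * yi for xi, yi in zip(x, Ax)) - sum(bi * xi for bi, xi in zip(b, x))

def coord_grad(A, b, x, i):
    """i-th partial derivative of f; costs O(n) instead of O(n^2) for the full gradient."""
    return sum(a * xj for a, xj in zip(A[i], x)) - b[i]

def coordinate_descent(A, b, x0, L_c, steps, seed=0):
    """Randomized CD: pick i uniformly at random, move only coordinate i by 1/L_c."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        i = rng.randrange(len(x))
        x[i] -= coord_grad(A, b, x, i) / L_c
    return x

# Hypothetical instance: A = identity + all-one matrix.
n = 4
A = [[2.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
b = [1.0] * n
L_c = 2.0      # largest diagonal entry of A
L = n + 1.0    # largest eigenvalue of A

x_cd = coordinate_descent(A, b, [0.0] * n, L_c, steps=1000)
speedup = L / L_c  # the L / L_c factor from the summary above
```

Here $L_c = 2 \le L = 5 \le nL_c = 8$, and `x_cd` should approach the minimizer whose entries all equal $0.2$. The predicted speed up factor $L / L_c$ is $2.5$ for this instance.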
Review of Accelerated Gradient Descent (AGD)
To turn the $O(1/T)$ convergence of GD into the so-called accelerated rate $O\big(\frac{L\|x_0 - x^*\|_2^2}{T^2}\big)$, one has to perform Nesterov’s acceleration scheme which, in my personal view, is a simple linear coupling of gradient and mirror descent. In this linear-coupling framework, let me quickly point out the key lemmas used in AGD:
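- gradient descent: $f(y_{k+1}) \le f(x_{k+1}) - \frac{1}{2L}\|\nabla f(x_{k+1})\|_2^2$;
- mirror descent: $\alpha \langle \nabla f(x_{k+1}), z_k - u \rangle \le \frac{\alpha^2}{2}\|\nabla f(x_{k+1})\|_2^2 + \frac{1}{2}\|z_k - u\|_2^2 - \frac{1}{2}\|z_{k+1} - u\|_2^2$.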
It is not even necessary in this tutorial to define precisely what $x_k, y_k, z_k$ and $\alpha$ are; interested readers can find them here. The key point is that the AGD convergence proof combines (known as linear coupling) the above two inequalities, resulting in
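- coupling: $\alpha^2 L f(y_{k+1}) - (\alpha^2 L - \alpha) f(y_k) + \frac{1}{2}\|z_{k+1} - u\|_2^2 - \frac{1}{2}\|z_k - u\|_2^2 \le \alpha f(u)$.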
Now using a few lines of proof, one can turn this inequality into a convergence statement saying $f(y_T) - f(x^*) \le O\big(\frac{L\|x_0 - x^*\|_2^2}{T^2}\big)$.
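For readers who prefer code to lemmas, here is a minimal sketch of AGD written as a linear coupling, under assumptions of my own: the Euclidean norm, the step sizes $\alpha_{k+1} = \frac{k+2}{2L}$ and coupling weights $\tau_k = \frac{1}{\alpha_{k+1}L}$, and a hypothetical diagonal quadratic as the test function. The function name `agd` and the test instance are illustrative only.

```python
def agd(grad, x0, L, T):
    """Linear-coupling AGD sketch: y takes gradient steps, z takes mirror steps,
    and x is a convex combination (coupling) of the two."""
    y = list(x0)
    z = list(x0)
    for k in range(T):
        alpha = (k + 2) / (2.0 * L)   # mirror-descent step size alpha_{k+1}
        tau = 1.0 / (alpha * L)       # coupling weight, equals 2 / (k + 2)
        x = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        g = grad(x)
        y = [xi - gi / L for xi, gi in zip(x, g)]       # gradient step from x
        z = [zi - alpha * gi for zi, gi in zip(z, g)]   # mirror step from z
    return y

# Hypothetical instance: f(x) = 0.5 * sum_i d_i * x_i^2, minimized at 0 with f(0) = 0.
d = [1.0, 0.1, 0.01]
L = 1.0  # largest d_i

def quad_value(x):
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

def quad_grad(x):
    return [di * xi for di, xi in zip(d, x)]

T = 50
x0 = [1.0, 1.0, 1.0]
y_T = agd(quad_grad, x0, L, T)
bound = 2.0 * L * sum(xi * xi for xi in x0) / T ** 2  # 2 L ||x0 - x*||^2 / T^2
```

With these step sizes, telescoping the coupling inequality gives $f(y_T) - f(x^*) \le 2L\|x_0 - x^*\|_2^2 / T^2$, so `quad_value(y_T)` should not exceed `bound`.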
Three-Lined Proof of Accelerated Coordinate Descent (ACD)
The aforementioned gradient and mirror descent lemmas have their trivial generalizations in the coordinate-descent setting, namely, ignoring the expectation sign,
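- coordinate gradient descent: $f(y_{k+1}) \le f(x_{k+1}) - \frac{1}{2nL_c}\|\nabla f(x_{k+1})\|_2^2$;
- coordinate mirror descent: $\alpha \langle \nabla f(x_{k+1}), z_k - u \rangle \le \frac{n\alpha^2}{2}\|\nabla f(x_{k+1})\|_2^2 + \frac{1}{2}\|z_k - u\|_2^2 - \frac{1}{2}\|z_{k+1} - u\|_2^2$.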
[[We have in fact already proved the first item in the one-lined proof section; the second item is as simple to prove as the first.]]
Next, if we do the same linear coupling as in the AGD case, we get
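- coupling: $\alpha^2 n^2 L_c f(y_{k+1}) - (\alpha^2 n^2 L_c - \alpha) f(y_k) + \frac{1}{2}\|z_{k+1} - u\|_2^2 - \frac{1}{2}\|z_k - u\|_2^2 \le \alpha f(u)$.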
We emphasize that the only difference between the above coupling inequality and that in AGD is that $L$ is replaced with $n^2 L_c$. Therefore, I claim we are done because the same “few lines of proof” in the AGD case implies a convergence statement $\mathbb{E}[f(y_T)] - f(x^*) \le O\big(\frac{n^2 L_c\|x_0 - x^*\|_2^2}{T^2}\big)$.
Again, to sum up, accelerated coordinate descent (ACD) has a number of iterations that is $\sqrt{n^2 L_c / L} = n\sqrt{L_c / L} \ge 1$ times greater than AGD. However, since each iteration of ACD is usually $n$ times faster than AGD, in the end one obtains a speed up factor $\sqrt{L / L_c} \in [1, \sqrt{n}]$.
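Here is a minimal sketch of ACD with uniform sampling, obtained by the same coupling with $L$ replaced by $n^2 L_c$. The step sizes $\alpha_{k+1} = \frac{k+2}{2n^2L_c}$, the mirror step of length $n\alpha$ along the sampled coordinate, the function name `acd` and the test instance are assumptions of mine for illustration. I form $x$ explicitly at $O(n)$ cost per iteration, which is fine for a dense toy problem; a careful implementation would avoid this.

```python
import random

def acd(coord_grad, x0, L_c, T, seed=0):
    """Accelerated coordinate descent sketch via linear coupling, uniform sampling."""
    rng = random.Random(seed)
    n = len(x0)
    y = list(x0)
    z = list(x0)
    for k in range(T):
        alpha = (k + 2) / (2.0 * n * n * L_c)  # step size with L replaced by n^2 L_c
        tau = 1.0 / (alpha * n * n * L_c)      # coupling weight, equals 2 / (k + 2)
        x = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        i = rng.randrange(n)                   # uniformly random coordinate
        g_i = coord_grad(x, i)
        y = list(x)
        y[i] -= g_i / L_c                      # coordinate gradient step from x
        z[i] -= n * alpha * g_i                # coordinate mirror step from z
    return y

# Hypothetical instance: f(x) = 0.5 * x^T A x - b^T x with A = identity + all-one matrix.
n = 4
A = [[2.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
b = [1.0] * n
L_c = 2.0

def f(x):
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return 0.5 * sum(xi * yi for xi, yi in zip(x, Ax)) - sum(bi * xi for bi, xi in zip(b, x))

def partial(x, i):
    return sum(a * xj for a, xj in zip(A[i], x)) - b[i]

f_star = -0.4  # value at the minimizer, whose entries all equal 0.2
y_T = acd(partial, [0.0] * n, L_c, T=5000)
gap = f(y_T) - f_star
```

The guarantee is only in expectation over the sampled coordinates, so a single run is not a proof of anything, but `gap` should be tiny on this instance.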
If one extends first-order methods to slightly more involved settings, a few extra lines of analysis may be necessary for coordinate descent. Let me point out a few:
- Suppose each $|\nabla^2_{ii} f(x)| \le L_i$ for a different smoothness parameter $L_i$, then what can we do?
  - This is trivial in CD: pick each coordinate $i$ with prob. proportional to $L_i$, and the final convergence depends on the average of the $L_i$ rather than $L_c$, which is their maximum.
  - This is not trivial in ACD (see this paper): the optimal choice is to pick coordinate $i$ with prob. proportional to $\sqrt{L_i}$. Then, replace $n^2 L_c$ with $\big(\sum_i \sqrt{L_i}\big)^2$ (which is never larger, and strictly smaller unless all $L_i$ are equal) in the final convergence statement.
  [[This can be seen easily from the linear coupling argument above, and very surprisingly, all previous results before my work use either the uniform distribution or probabilities proportional to $L_i$. They are not as good as the square-root distribution.]]
- If $f$ is also known to be strongly convex, then one can turn all the convergence rates mentioned in this post (resp. for GD, CD, AGD, ACD) into their linear-convergence formats. I refrain from doing so because it is trivial to do so in a black-box manner, see the strong convexity section of this paper. Even if one wants to do it in a white-box manner, it is not hard and the ACD result is included here.
- One may require the function to be in a composite form $f(x) + \psi(x)$, where $\psi$ is some possibly non-smooth but well-structured function, such as the $\ell_1$ norm. It is by now essentially folklore in optimization that one can turn a first-order method that works for $f(x)$ into a so-called proximal first-order method for $f(x) + \psi(x)$. This is not trivial, but not hard. For instance, FISTA is such a proximal generalization of AGD.
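To put numbers on the non-uniform sampling item above, here is a small sketch that builds the two sampling distributions and compares the constants appearing in the convergence statements. The values in `L_list` and all function names are hypothetical, chosen only to make the gap visible.

```python
import math
import random

def sampling_probs(L_list, power):
    """Probabilities proportional to L_i ** power (power = 1 for CD, 0.5 for ACD)."""
    weights = [Li ** power for Li in L_list]
    total = sum(weights)
    return [w / total for w in weights]

def cd_constants(L_list):
    """(uniform sampling: n * max L_i, sampling proportional to L_i: sum of L_i)."""
    return len(L_list) * max(L_list), sum(L_list)

def acd_constants(L_list):
    """(uniform sampling: n^2 * max L_i, square-root sampling: (sum of sqrt(L_i))^2)."""
    n = len(L_list)
    return n * n * max(L_list), sum(math.sqrt(Li) for Li in L_list) ** 2

def sample_coordinate(probs, rng):
    """Draw one coordinate index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical instance with one badly-conditioned coordinate.
L_list = [100.0, 1.0, 1.0, 1.0]
p_cd = sampling_probs(L_list, 1.0)
p_acd = sampling_probs(L_list, 0.5)
i = sample_coordinate(p_acd, random.Random(0))
```

For this instance `cd_constants` gives 400 versus 103, and `acd_constants` gives 1600 versus 169, so the more skewed the $L_i$ are, the more the non-uniform distributions help.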