Coordinate Descent in One Line, or Three if Accelerated

When minimizing a convex function f(x) using first-order methods, if full gradients \nabla f(x) are too costly to compute at each iteration, there are two alternatives that can reduce this per-iteration cost. One is to use a (random) coordinate gradient \nabla_i f(x), and the other is to use a stochastic gradient \tilde{\nabla} f(x) satisfying \mathbb{E}[\tilde{\nabla} f(x)] = \nabla f(x).

I focus on coordinate (gradient) descent (CD) in this tutorial, and leave stochastic gradient descent (SGD) to another post in the future. Part of this survey has also appeared in my conference talk at ICML.[[Remark: CD and SGD have fundamentally different analyses behind them, although sometimes I do see applied researchers confuse one with the other. For instance, one may ask: if I define \tilde{\nabla} f(x) := n \nabla_i f(x) e_i for a uniformly random i, don’t I automatically reduce CD to SGD? While this is a true statement, it only shows that (1) one can apply the SGD analysis to CD, but I promise you that analysis does not provide the tightest convergence rate, and (2) this reduction is not reversible.]]
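
For concreteness, here is a tiny NumPy check (my own toy example on a quadratic f(x) = \frac{1}{2} x^\top A x - b^\top x; the variable names are mine, not from the talk) that the estimator from the remark is indeed unbiased:

import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M                          # PSD Hessian of f(x) = 0.5 * x'Ax - b'x
b = rng.standard_normal(n)
x = rng.standard_normal(n)
full_grad = A @ x - b                # \nabla f(x)

# expectation over a uniformly random coordinate i of n * \nabla_i f(x) * e_i
I = np.eye(n)
expectation = np.mean([n * full_grad[i] * I[i] for i in range(n)], axis=0)
print(np.allclose(expectation, full_grad))   # True: E[\tilde{\nabla} f(x)] = \nabla f(x)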

Review of Gradient Descent (GD)

Gradient descent is usually analyzed when the function f(x) is smooth with respect to some parameter L. One of the (almost-)equivalent definitions of smoothness is to say that \|\nabla^2 f(x)\| \leq L, that is, the largest eigenvalue of the Hessian at any point is at most L.

Smoothness implies that if one moves from x to x' := x - \frac{1}{L} \nabla f(x), then it must satisfy f(x) - f(x') \geq \frac{1}{2L} \|\nabla f(x)\|^2. In other words, the objective not only has to decrease, but has to decrease by at least \frac{1}{2L} \|\nabla f(x)\|^2. Using this property plus a few lines of proof, one can turn it into a convergence statement: after applying this update T times, it satisfies f(x_T) - f(x^*) \leq O \big( \frac{L \|x_0-x^*\|^2}{T} \big) .
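
For concreteness, here is a minimal NumPy sketch of this update on the quadratic f(x) = \frac{1}{2} x^\top A x - b^\top x, whose Hessian is A (my own toy example; the function name and test problem are mine, not from the post):

import numpy as np

def gradient_descent(A, b, x0, T):
    """Gradient descent with step size 1/L on f(x) = 0.5 * x'Ax - b'x."""
    L = np.linalg.eigvalsh(A).max()   # smoothness parameter: largest eigenvalue of the Hessian A
    x = x0.copy()
    for _ in range(T):
        x = x - (A @ x - b) / L       # x' := x - (1/L) * grad f(x)
    return x

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)               # positive definite, so the minimizer x* is unique
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
print(np.linalg.norm(gradient_descent(A, b, np.zeros(n), T=2000) - x_star))  # shrinks as T grows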

One-Lined Proof of Coordinate Descent (CD)

We say f(x) is coordinate-smooth with respect to parameter L_c if \nabla^2_{ii} f(x) \leq L_c for each coordinate i. If f(x) has n coordinates, then it is easy to verify that L_c \leq L \leq n L_c. [[Recall that the all-ones matrix has maximum eigenvalue n but all diagonal entries equal to 1.]]
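
A quick numerical check of these inequalities (my own), using the all-ones example from the remark plus a diagonal example where the two parameters coincide:

import numpy as np

n = 4
J = np.ones((n, n))                   # all-ones Hessian: diagonal entries 1, largest eigenvalue n
D = np.diag([1.0, 2.0, 3.0, 4.0])     # diagonal Hessian: here L equals L_c
for H in (J, D):
    L = np.linalg.eigvalsh(H).max()   # global smoothness L
    Lc = np.diag(H).max()             # coordinate smoothness L_c
    print(Lc <= L <= n * Lc)          # True in both cases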

Coordinate-smoothness implies that for every i, if I move from x to x' := x - \frac{1}{L_c} \nabla_i f(x) e_i (that is, only coordinate i is updated), then it satisfies f(x) - f(x') \geq \frac{1}{2L_c} (\nabla_i f(x))^2. In particular, if I choose i uniformly at random in [n], then since \mathbb{E}_i[(\nabla_i f(x))^2] = \frac{1}{n} \|\nabla f(x)\|^2, I have the guarantee

  • f(x) - \mathbb{E}[f(x')] \geq \frac{1}{2L_c n} \| \nabla f(x) \|^2.

Plugging this inequality into the same “few lines of proof” as in GD, we obtain \mathbb{E}[f(x_T)] - f(x^*) \leq O \big( \frac{n L_c \|x_0-x^*\|^2}{T} \big) .

To sum up, coordinate descent (CD) requires a factor \frac{n L_c}{L} \in [1, n] more iterations than GD. However, since each iteration of CD is usually n times cheaper to compute than one of GD, in the end the total running time shrinks by a factor \frac{L_c}{L} \in \big[\frac{1}{n}, 1\big] compared to GD.
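
Here is the matching CD sketch on the same quadratic (again my own illustration; note that the coordinate gradient A[i] @ x - b[i] costs O(n), versus O(n^2) for the full gradient, which is exactly where the n-fold per-iteration saving comes from):

import numpy as np

def coordinate_descent(A, b, x0, T, seed=0):
    """Randomized coordinate descent with step size 1/L_c on f(x) = 0.5 * x'Ax - b'x."""
    rng = np.random.default_rng(seed)
    n = len(b)
    Lc = np.diag(A).max()             # coordinate smoothness: largest diagonal entry of A
    x = x0.copy()
    for _ in range(T):
        i = rng.integers(n)           # coordinate chosen uniformly at random
        g_i = A[i] @ x - b[i]         # coordinate gradient grad_i f(x), an O(n) computation
        x[i] -= g_i / Lc              # x' := x - (1/L_c) * grad_i f(x) * e_i
    return x

In the comparison above, running n*T iterations of this sketch costs roughly as much as T iterations of the GD sketch, which is the sense in which the two methods are being compared.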

Review of Accelerated Gradient Descent (AGD)

To turn the 1/T convergence of GD into the so-called accelerated rate 1/T^2, one has to perform Nesterov’s acceleration scheme which, in my personal view, is a simple linear coupling of gradient and mirror descent. In this linear-coupling framework, let me quickly point out the key lemmas used in AGD:

  • gradient descent: f(x_{k+1}) - f(y_{k+1} ) \geq \frac{1}{2L} \|\nabla f(x_{k+1})\|^2;
  • mirror descent: \alpha \langle \nabla f(x_{k+1}), z_k - x^* \rangle \leq \frac{\alpha^2}{2} \|\nabla f(x_{k+1})\|^2 + \frac{1}{2}\|z_k - x^* \|^2 - \frac{1}{2}\|z_{k+1}-x^* \|^2

It is not even necessary in this tutorial to define precisely what x_k, y_k, z_k are — interested readers can find them here. The key point is that the AGD convergence proof combines the above two inequalities (this is the linear coupling), resulting in

  • coupling: \alpha \langle \nabla f(x_{k+1}), z_k - x^* \rangle \leq \alpha^2 L \big(f(x_{k+1}) - f(y_{k+1})\big) + \frac{1}{2}\|z_k - x^*\|^2 - \frac{1}{2}\|z_{k+1}-x^* \|^2 .

Now using a few lines of proof, one can turn this inequality into a convergence statement saying f(y_T) - f(x^*) \leq O \big( \frac{L \|x_0-x^*\|^2}{T^2} \big) .
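
To make the three sequences concrete, here is a minimal linear-coupling implementation of AGD on the same quadratic (my own sketch; the schedule \alpha_{k+1} = \frac{k+2}{2L} and \tau_k = \frac{1}{\alpha_{k+1} L} is one standard choice in the linear-coupling viewpoint, and other equivalent forms of AGD exist):

import numpy as np

def accelerated_gd(A, b, x0, T):
    """AGD as a linear coupling of gradient and mirror descent, on f(x) = 0.5 * x'Ax - b'x."""
    L = np.linalg.eigvalsh(A).max()
    y = x0.copy()
    z = x0.copy()
    for k in range(T):
        alpha = (k + 2) / (2.0 * L)   # mirror-descent step size
        tau = 1.0 / (alpha * L)       # coupling weight, equals 2/(k+2)
        x = tau * z + (1 - tau) * y   # x_{k+1} linearly couples the two sequences
        g = A @ x - b                 # grad f(x_{k+1})
        y = x - g / L                 # gradient-descent step (first bullet above)
        z = z - alpha * g             # mirror-descent step, Euclidean case (second bullet above)
    return y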

Three-Lined Proof of Accelerated Coordinate Descent (ACD)

The aforementioned gradient and mirror descent lemmas have straightforward generalizations to the coordinate-descent setting, namely (ignoring expectation signs),

  • coordinate gradient descent: f(x_{k+1}) - f(y_{k+1} ) \geq \frac{1}{2 n L_c} \|\nabla f(x_{k+1})\|^2;
  • coordinate mirror descent: \alpha \langle \nabla f(x_{k+1}), z_k - x^* \rangle \leq \frac{\alpha^2 n}{2} \|\nabla f(x_{k+1})\|^2 + \frac{1}{2}\|z_k - x^*\|^2 - \frac{1}{2}\|z_{k+1}-x^*\|^2

[[We have in fact already proved the first item in the one-lined proof section; the second item is just as simple to prove.]]
Next, if we do the same linear coupling as in the AGD case, we get

  • coupling: \alpha \langle \nabla f(x_{k+1}), z_k - x^* \rangle \leq \alpha^2 n^2 L_c \big(f(x_{k+1}) - f(y_{k+1})\big) + \frac{1}{2}\|z_k - x^*\|^2 - \frac{1}{2}\|z_{k+1}-x^*\|^2 .
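
Spelling out the substitution (ignoring expectation signs as before): the coordinate gradient descent lemma rearranges to \|\nabla f(x_{k+1})\|^2 \leq 2 n L_c \big(f(x_{k+1}) - f(y_{k+1})\big), and plugging this into the right-hand side of the coordinate mirror descent lemma turns \frac{\alpha^2 n}{2} \|\nabla f(x_{k+1})\|^2 into \frac{\alpha^2 n}{2} \cdot 2 n L_c \big(f(x_{k+1}) - f(y_{k+1})\big) = \alpha^2 n^2 L_c \big(f(x_{k+1}) - f(y_{k+1})\big), which is exactly the coupling inequality above.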

We emphasize that the only difference between the above coupling inequality and that in AGD is that L is replaced with n^2 L_c. Therefore, I claim we are done because the same “few lines of proof” in the AGD case implies a convergence statement f(y_T) - f(x^*) \leq O \big( \frac{n^2 L_c \|x_0-x^*\|^2}{T^2} \big) .

Again, to sum up, accelerated coordinate descent (ACD) requires a factor \frac{n \sqrt{L_c}}{\sqrt{L}} \in [\sqrt{n}, n] more iterations than AGD. However, since each iteration of ACD is usually n times cheaper to compute than one of AGD, in the end the total running time shrinks by a factor \frac{\sqrt{L_c}}{\sqrt{L}} \in \big[\frac{1}{\sqrt{n}}, 1\big] compared to AGD.
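
Here is the ACD analogue of the AGD sketch above (again my own sketch, written only to mirror the two coordinate lemmas: L is replaced by n^2 L_c in the schedule, the gradient step touches one coordinate with step size 1/L_c, and the mirror step feeds in the unbiased estimator n \nabla_i f(x_{k+1}); I make no claim that these are the tuned constants from the ACD literature):

import numpy as np

def accelerated_cd(A, b, x0, T, seed=0):
    """Accelerated coordinate descent sketch on f(x) = 0.5 * x'Ax - b'x."""
    rng = np.random.default_rng(seed)
    n = len(b)
    Lc = np.diag(A).max()                     # coordinate smoothness L_c
    y = x0.copy()
    z = x0.copy()
    for k in range(T):
        alpha = (k + 2) / (2.0 * n * n * Lc)  # the AGD schedule with L replaced by n^2 * L_c
        tau = 1.0 / (alpha * n * n * Lc)      # equals 2/(k+2)
        x = tau * z + (1 - tau) * y
        i = rng.integers(n)                   # uniformly random coordinate
        g_i = A[i] @ x - b[i]                 # coordinate gradient grad_i f(x), O(n) per iteration
        y = x.copy()
        y[i] -= g_i / Lc                      # coordinate gradient-descent step
        z = z.copy()
        z[i] -= alpha * n * g_i               # coordinate mirror-descent step with estimator n * grad_i f(x)
    return y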

Other Extensions

If one extends first-order methods to slightly more involved settings, a few extra lines of analysis may be necessary for coordinate descent. Let me point out a few:

  • Suppose each coordinate has its own smoothness parameter L_i \leq L_c, that is, \nabla^2_{ii} f(x) \leq L_i for each i; what can we do then?
    • This is trivial in CD: pick coordinate i with probability proportional to L_i, and the final convergence depends on the average of the L_i rather than on L_c, which is their maximum.
    • This is not trivial in ACD (see this paper): the optimal choice is to pick coordinate i with probability proportional to \sqrt{L_i}. Then, replace n^2 L_c with (\sum_i \sqrt{L_i})^2 (which is never larger) in the final convergence statement; see the sketch after this list.
      [[This can be seen easily from the linear coupling argument above, and very surprisingly, all previous results before my work used either the uniform distribution or probabilities proportional to L_i; neither is as good as the square-root distribution.]]
  • If f(x) is also known to be strongly convex, then one can turn all the convergence rates mentioned in this post (resp. for GD, CD, AGD, ACD) into their linear-convergence forms. I refrain from doing so because it can be done in a black-box manner; see the strong convexity section of this paper. Even if one wants to do it in a white-box manner, it is not hard, and the ACD result is included here.
  • One may want to minimize an objective in the composite form f(x) + \psi(x), where \psi(x) is some possibly non-smooth but well-structured function, such as the \ell_1 norm. It is now essentially folklore in optimization that one can turn a first-order method that works for f(x) into a so-called proximal first-order method for f(x) + \psi(x). This is not trivial, but not hard either. For instance, FISTA is such a proximal generalization of AGD.
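
As promised above, here is a sketch of the two sampling rules from the first bullet, on the same quadratic (mine; it shows only how a coordinate is drawn and which step size it uses, not the full parameter schedule from the paper):

import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M.T @ M + np.diag(np.arange(1.0, n + 1))  # Hessian with unequal diagonal entries
b = rng.standard_normal(n)

L_i = np.diag(A)                              # per-coordinate smoothness L_i = A_ii
p_cd = L_i / L_i.sum()                        # CD rule: pick i with probability proportional to L_i
p_acd = np.sqrt(L_i) / np.sqrt(L_i).sum()     # ACD rule: pick i with probability proportional to sqrt(L_i)

x = np.zeros(n)
i = rng.choice(n, p=p_acd)                    # draw one coordinate under the ACD rule
x[i] -= (A[i] @ x - b[i]) / L_i[i]            # the coordinate step uses the local step size 1/L_i

# the improved constant is never larger than the uniform-sampling one:
print(np.sqrt(L_i).sum() ** 2 <= n ** 2 * L_i.max())   # True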