Faster Than SGD 2: the Katyusha Acceleration

SGD is well-known for large-scale optimization. In my mind, there are so-far two fundamentally different improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. The following picture predicts —in theory— the performance difference between the naive SGD, the variance-reduced SGD, and the accelerated SGD:
wolaHere, T is the number of stochastic iterations and a full-pass of the dataset corresponds to n iterations. Some simple conclusions from the picture includes:

  1. As long as T is more than a few multiples of n, then the naive SGD should never be used. The red curve is always slower than variance reduction / acceleration.
  2. Accelerated SGD is no slower than variance-reduced SGD in all parameter regimes. The blue and purple curves touch exactly at the same point when T is a few copies of n. Ignoring the log factor, if T = 100 n, then accelerated SGD in theory is 10 times faster than variance-reduced SGD.

More Formally

Consider again the famous composite convex minimization problem

\min_{x\in \mathbb{R}^d} \Big\{ F(x) := f(x) + \psi(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \Big\} \enspace.

Assume for simplicity that each f_i(x) is L-smooth, and the function F(x) is not strongly convex. (These assumptions can be safely removed.) Then, the convergence rates of the vanilla SGD, variance-reduced SGD (best result in this paper), and accelerated SGD (best result in this paper) are respectively:

T=O\left( \frac{V \|x_0-x^*\|^2}{\varepsilon^2} \right), \quad T=O\left( n \log \varepsilon^{-1} + \frac{L \|x_0-x^*\|^2}{\varepsilon} \right) \quad T=O\left( n \log \varepsilon^{-1} + \frac{\sqrt{n L } \|x_0-x^*\|}{\sqrt{\varepsilon}} \right)

Here, V is a variance upper bound on \mathbb{E}_i \big[ \|\nabla f_i(x) - \nabla f(x)\|^2 \big]. The relationship between V and L depends on the structure of f(x); in general, V can be up to L^2 \|x_0 - x^*\|^2.

The Main Algorithmic Idea

Recall that the theory of acceleration is first introduced by Nesterov and studied in full-gradient and coordinate-gradient settings. The main idea is to use momentum, sometimes referred to as Nesterov momentum.

Informally speaking, instead of moving in the negative-gradient direction x_{t+1} = x_t - \eta \nabla f(x), one can move to x_{t+1} = x_t - \eta \nabla f(x_t) + \alpha (x_t - x_{t-1}) for some momentum parameter \alpha \in [0,1]. This reduces the convergence rate from 1/\varepsilon to 1/\sqrt{\varepsilon}, which is quadratically faster. (The actual algorithm is a big more involved.)

Unfortunately, Nesterov momentum does not work when stochastic gradients are present. There exist datasets (see the end of the post) for which momentum fails to give the 1/\sqrt{\varepsilon} convergence. The underlying reason is because the error incurred by stochastic gradients somehow get blown up by Nesterov momentum.

In this recent paper, we found a fix to this issue at least for convex optimization. At a high-level, on top of Nesterov momentum, we introduced a new Katyusha momentum, serving as a negative momentum to cancel out the error accumulated by Nesterov momentum.

Informally, in each of the first n iterations, we compute x_{t+1}' = x_t - \eta \tilde{\nabla} f(x_t) + \alpha (x_t - x_{t-1}) where \tilde{\nabla}f(x_t) is the variance-reduced stochastic gradient (see this post). Then, define the true next point x_{t+1} := \frac{1}{2} (x_0 + x_{t+1}'). As illustrated by the picture below, we are essentially putting a magnet around the starting point x_0, and retract to it back to x_0 with a rate 1/2.katyusha1.pngThen, once every n iterations, we move the magnet to x_n, x_{2n}, \dots. See the picture belowkatyusha2The black vectors are negative stochastic-gradient directions, and the green vectors are Nesterov momentum directions. We called these orange vectors the Katyusha momentum, which —seemingly making negative progress— in fact stabilize the algorithm. This novel idea ensures that the accelerated convergence rate 1/\sqrt{\varepsilon} is obtained firmly. This is the first known 1/\sqrt{\varepsilon} convergence rate among stochastic methods.

PS: the actual algorithm is a bit more involved than the pictures. See my paper for details.

An Experimental Illustration

The following simple experiment (on dataset RCV1) illustrates the comparison between variance-reduced SGD (red), my Katyusha method (black), and the naive “accelerated” method that does not use Katyusha momentum (dashed, tuned with the best parameter). It is clear from this picture that, without Katyusha momentum, the accelerated 1/\sqrt{\varepsilon} convergence rate may not be achievable.


What is Next?

This Katyusha method seems to be the first method that converges in the 1/\sqrt{\varepsilon} rate among stochastic methods for convex composite minimization.

When I wrote the paper, I conjectured that the Katyusha momentum should work in a bigger agenda and perhaps everyone working in large-scale optimization should “give Katyusha a hug.” I’m glad to see this week that a group of researchers already applied it to ADMM, and making accelerated stochastic convergence possible there.

How about non-convex optimization? Well, the naive Nesterov-accelerated SGD has been applied to deep learning, and works reasonably well. I somehow believe that the Katyusha-accelerated SGD works (at least for certain settings) for deep learning as well. I haven’t performed any numerical experiment yet. In general, there is no theory of acceleration for non-convex optimization, at least to this date.

This entry was posted in Optimization. Bookmark the permalink.

1 Response to Faster Than SGD 2: the Katyusha Acceleration

  1. maxmpmp says:

    it’s my honor to read your paper–Katyusha,and the effect of the method impressed me. can you share your code in blog,plz?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s