## Faster Than SGD 2: the Katyusha Acceleration

SGD is well-known for large-scale optimization. In my mind, there have so far been two fundamentally different improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. The following picture predicts, in theory, the performance difference between naive SGD, variance-reduced SGD, and accelerated SGD. Here, $T$ is the number of stochastic iterations, and a full pass over the dataset corresponds to $n$ iterations. Some simple conclusions from the picture include:

1. As long as $T$ is more than a few multiples of $n$, naive SGD should never be used: the red curve is always slower than variance reduction / acceleration.
2. Accelerated SGD is no slower than variance-reduced SGD in all parameter regimes. The blue and purple curves touch exactly when $T$ is a few multiples of $n$. Ignoring the log factor, if $T = 100 n$, then accelerated SGD is in theory $10$ times faster than variance-reduced SGD.

# More Formally

Consider again the famous composite convex minimization problem $\min_{x\in \mathbb{R}^d} \Big\{ F(x) := f(x) + \psi(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \Big\} \enspace.$
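To make this concrete, here is a minimal sketch of one instance of this composite problem: least-squares losses $f_i$ plus an L1 regularizer $\psi$ (i.e., the lasso). All data and the regularization weight `lam` are hypothetical choices for illustration, not from the paper.

```python
import numpy as np

# A concrete instance of the composite problem (hypothetical data):
#   f_i(x) = 1/2 (a_i^T x - b_i)^2   (smooth least-squares losses)
#   psi(x) = lam * ||x||_1           (non-smooth L1 regularizer)
rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
lam = 0.1

def f_i(x, i):
    return 0.5 * (A[i] @ x - b[i]) ** 2

def grad_f_i(x, i):
    return (A[i] @ x - b[i]) * A[i]

def psi(x):
    return lam * np.abs(x).sum()

def F(x):
    # F(x) = (1/n) sum_i f_i(x) + psi(x)
    return np.mean([f_i(x, i) for i in range(n)]) + psi(x)
```

Each $f_i$ here is smooth with $L = \|a_i\|^2$, while $\psi$ is convex but non-smooth, which is exactly the structure the composite formulation is designed for.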

Assume for simplicity that each $f_i(x)$ is $L$-smooth and that $F(x)$ is not strongly convex. (These assumptions can be safely removed.) Then, the convergence rates of vanilla SGD, variance-reduced SGD (best result in this paper), and accelerated SGD (best result in this paper) are respectively: $T=O\left( \frac{V \|x_0-x^*\|^2}{\varepsilon^2} \right), \quad T=O\left( n \log \varepsilon^{-1} + \frac{L \|x_0-x^*\|^2}{\varepsilon} \right), \quad T=O\left( n \log \varepsilon^{-1} + \frac{\sqrt{n L} \|x_0-x^*\|}{\sqrt{\varepsilon}} \right) \enspace.$

Here, $V$ is a variance upper bound on $\mathbb{E}_i \big[ \|\nabla f_i(x) - \nabla f(x)\|^2 \big]$. The relationship between $V$ and $L$ depends on the structure of $f(x)$; in general, $V$ can be up to $L^2 \|x_0 - x^*\|^2$.
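To get a feel for these bounds, one can plug sample values into the three complexity expressions (dropping constants). The parameter values below are hypothetical, chosen only to illustrate the regimes; `R2` stands for $\|x_0 - x^*\|^2$.

```python
import math

# Evaluate the three complexity bounds (constants ignored, all values hypothetical).
def t_sgd(V, R2, eps):
    # naive SGD: V * R^2 / eps^2
    return V * R2 / eps ** 2

def t_vr(n, L, R2, eps):
    # variance-reduced SGD: n log(1/eps) + L R^2 / eps
    return n * math.log(1 / eps) + L * R2 / eps

def t_acc(n, L, R2, eps):
    # accelerated SGD: n log(1/eps) + sqrt(n L) R / sqrt(eps)
    return n * math.log(1 / eps) + math.sqrt(n * L) * math.sqrt(R2) / math.sqrt(eps)

n, L, R2, eps = 10_000, 1.0, 1.0, 1e-6
V = L ** 2 * R2  # the generic worst-case variance bound mentioned above
print(t_sgd(V, R2, eps), t_vr(n, L, R2, eps), t_acc(n, L, R2, eps))
```

With these (illustrative) numbers, the naive SGD bound dominates by orders of magnitude, and the accelerated bound beats the variance-reduced one, consistent with the picture above.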

# The Main Algorithmic Idea

Recall that the theory of acceleration was first introduced by Nesterov and studied in the full-gradient and coordinate-gradient settings. The main idea is to use momentum, sometimes referred to as Nesterov momentum.

Informally speaking, instead of moving in the negative-gradient direction $x_{t+1} = x_t - \eta \nabla f(x_t)$, one can move to $x_{t+1} = x_t - \eta \nabla f(x_t) + \alpha (x_t - x_{t-1})$ for some momentum parameter $\alpha \in [0,1]$. This improves the convergence rate from $1/\varepsilon$ to $1/\sqrt{\varepsilon}$, a quadratic speed-up. (The actual algorithm is a bit more involved.)
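The informal momentum update above can be sketched as follows, on a toy quadratic $f(x) = \frac{1}{2}\|x\|^2$. The step size `eta` and momentum `alpha` are illustrative choices, not tuned constants from any paper, and this heavy-ball-style loop is only the informal update, not Nesterov's full scheme.

```python
import numpy as np

# Minimal sketch of the momentum update described above (heavy-ball form).
def momentum_gd(grad, x0, eta=0.1, alpha=0.9, iters=200):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        # gradient step plus momentum in the direction of the previous move
        x_next = x - eta * grad(x) + alpha * (x - x_prev)
        x_prev, x = x, x_next
    return x

grad = lambda x: x  # gradient of f(x) = 1/2 ||x||^2
x = momentum_gd(grad, np.ones(5))
```

On this quadratic the iterates spiral into the minimizer at the origin; with momentum the contraction factor per step depends on $\sqrt{\alpha}$-type quantities rather than the plain gradient-descent rate, which is the source of the quadratic speed-up.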

Unfortunately, Nesterov momentum does not work when stochastic gradients are present. There exist datasets (see the end of the post) on which momentum fails to give the $1/\sqrt{\varepsilon}$ convergence. The underlying reason is that the error incurred by stochastic gradients gets blown up by Nesterov momentum.

In this recent paper, we found a fix to this issue, at least for convex optimization. At a high level, on top of Nesterov momentum, we introduced a new Katyusha momentum, serving as a negative momentum that cancels out the error accumulated by Nesterov momentum.

Informally, in each of the first $n$ iterations, we compute $x_{t+1}' = x_t - \eta \tilde{\nabla} f(x_t) + \alpha (x_t - x_{t-1})$, where $\tilde{\nabla}f(x_t)$ is the variance-reduced stochastic gradient (see this post). Then, we define the true next point $x_{t+1} := \frac{1}{2} (x_0 + x_{t+1}')$. As illustrated by the picture below, we are essentially putting a magnet around the starting point $x_0$ and retracting each step back toward $x_0$ at a rate of $1/2$. Then, once every $n$ iterations, we move the magnet to $x_n, x_{2n}, \dots$.

In the picture below, the black vectors are negative stochastic-gradient directions, and the green vectors are Nesterov momentum directions. We call the orange vectors the Katyusha momentum: although they seemingly make negative progress, they in fact stabilize the algorithm. This novel idea ensures that the accelerated convergence rate $1/\sqrt{\varepsilon}$ is obtained firmly. This is the first known $1/\sqrt{\varepsilon}$ convergence rate among stochastic methods.

PS: the actual algorithm is a bit more involved than the pictures. See my paper for details.
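As a rough illustration only (again, not the actual algorithm from the paper), the informal update can be sketched on a least-squares objective: an SVRG-style variance-reduced gradient, a momentum step, and the "magnet" retraction $x_{t+1} = \frac{1}{2}(x_0 + x_{t+1}')$, with the magnet moved once every $n$ iterations. All parameter choices (`eta`, `alpha`, `epochs`) are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the informal update described above, on
# min_x (1/2n) ||Ax - b||^2.  Not the full Katyusha algorithm.
def katyusha_sketch(A, b, eta=0.05, alpha=0.2, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = x_prev = np.zeros(d)
    snapshot = x.copy()  # the "magnet" x_0
    for _ in range(epochs):
        # full gradient at the magnet, used for variance reduction
        full_grad = A.T @ (A @ snapshot - b) / n
        for _ in range(n):
            i = rng.integers(n)
            # variance-reduced stochastic gradient at x
            g = ((A[i] @ x - b[i]) - (A[i] @ snapshot - b[i])) * A[i] + full_grad
            # Nesterov-style step with momentum
            x_next = x - eta * g + alpha * (x - x_prev)
            # Katyusha "magnet": retract halfway back toward the snapshot
            x_next = 0.5 * (snapshot + x_next)
            x_prev, x = x, x_next
        snapshot = x.copy()  # every n iterations, move the magnet
    return x
```

The retraction looks like it wastes progress, but it bounds how far the iterate can drift from the snapshot within an epoch, which is exactly the stabilizing role the orange vectors play in the picture.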

# An Experimental Illustration

The following simple experiment (on the dataset RCV1) illustrates the comparison between variance-reduced SGD (red), my Katyusha method (black), and a naive “accelerated” method that does not use Katyusha momentum (dashed, tuned with the best parameters). It is clear from this picture that, without Katyusha momentum, the accelerated $1/\sqrt{\varepsilon}$ convergence rate may not be achievable.

# What is Next?

This Katyusha method seems to be the first stochastic method that converges at the $1/\sqrt{\varepsilon}$ rate for convex composite minimization.

When I wrote the paper, I conjectured that Katyusha momentum should work in a bigger agenda, and that perhaps everyone working in large-scale optimization should “give Katyusha a hug.” I was glad to see this week that a group of researchers has already applied it to ADMM, making accelerated stochastic convergence possible there.

How about non-convex optimization? Well, naive Nesterov-accelerated SGD has been applied to deep learning and works reasonably well. I somewhat believe that Katyusha-accelerated SGD works for deep learning as well, at least in certain settings, though I haven’t performed any numerical experiments yet. In general, there is no theory of acceleration for non-convex optimization, at least to date.