## Generalization for Stochastic / Online Convex Learning

I haven’t touched learning theory for a couple of months, but I always keep an eye on the field. Partly because I am looking for a learning theory course project, I decided to take a brief look at what Prof. Shai Shalev-Shwartz has been working on recently.

Shai is famous for his innovations in online learning, including the PEGASOS algorithm, the concept of inverse time dependency, and most recently a result on the zero-one loss (COLT 2010 best paper). His group is also good at summarizing important contemporary results and adding insightful discussions and comparisons. This time, I will briefly introduce the following paper.

## “Stochastic Convex Optimization” in COLT 2009

It is well known that there can be a large gap between the empirical objective $\hat{F}(w)=\hat{\mathbb{E}}[f(w;z)]=\frac{1}{n}\sum_{i=1}^n f(w;z_i)$ and the expected objective $F(w)=\mathbb{E}_Z[f(w;Z)]$. We write $\hat{w}$ for the minimizer of the former and $w^*$ for the minimizer of the latter.
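To make the gap concrete, here is a toy sketch of my own (not from the paper): for the squared loss $f(w;z)=(w-z)^2$ with $z\sim\mathcal{N}(0,1)$, the empirical minimizer $\hat{w}$ is the sample mean while $w^*=0$, so the excess risk $F(\hat{w})-F(w^*)$ can be computed in closed form and watched shrink with $n$.

```python
import numpy as np

# Toy example: f(w; z) = (w - z)^2 with z ~ N(0, 1).
# The empirical minimizer w_hat is the sample mean; the expected
# minimizer is w* = 0, and F(w) = E[(w - z)^2] = w^2 + 1.

rng = np.random.default_rng(0)

def excess_risk(n):
    z = rng.normal(size=n)
    w_hat = z.mean()                 # argmin of the empirical objective F_hat
    F = lambda w: w ** 2 + 1.0       # expected objective F(w)
    return F(w_hat) - F(0.0)         # excess risk F(w_hat) - F(w*)

for n in (10, 100, 1000, 10000):
    print(n, excess_risk(n))
```

The excess risk here equals $\hat{w}^2$, which concentrates at the $1/n$ scale, matching the strongly convex rate discussed below.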

This paper analyzes three nested classes of problems (see the above figure). The first and smallest class is the one with uniform convergence: $\sup_{w\in\mathcal{W}}\left| F(w)-\hat{F}(w)\right| \rightarrow 0$, i.e., the gap between the two objectives vanishes as $n\rightarrow \infty$ uniformly over $\mathcal{W}$. A strictly larger class is the one learnable by the empirical minimizer: $F(\hat{w})-F(w^*) \rightarrow 0$, which only requires convergence at the point $\hat{w}$. Finally, the largest class of learnable problems is defined by: there exists a rule for choosing $\tilde{w}$ from the samples such that $F(\tilde{w})-F(w^*) \rightarrow 0$.
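The uniform-convergence condition can also be checked numerically in a simple case. The following Monte-Carlo sketch (again my own illustration, not the paper's) estimates $\sup_{w}\left|F(w)-\hat{F}(w)\right|$ on a grid for $f(w;z)=(w-z)^2$ with $z\sim\mathrm{Uniform}[0,1]$, where $F(w)=w^2-w+\tfrac{1}{3}$ in closed form:

```python
import numpy as np

# Estimate sup_w |F(w) - F_hat(w)| over w in [0, 1] on a grid.
# For z ~ Uniform[0, 1]: F(w) = E[(w - z)^2] = w^2 - w + 1/3.

rng = np.random.default_rng(2)
w_grid = np.linspace(0.0, 1.0, 201)
F = w_grid ** 2 - w_grid + 1.0 / 3.0       # expected objective on the grid

def sup_gap(n):
    z = rng.uniform(size=n)
    # empirical objective F_hat(w) = mean_i (w - z_i)^2, broadcast over the grid
    F_hat = ((w_grid[:, None] - z[None, :]) ** 2).mean(axis=1)
    return np.abs(F - F_hat).max()

for n in (10, 100, 1000, 10000):
    print(n, sup_gap(n))
```

For this well-behaved problem the supremum shrinks like $\sqrt{1/n}$; the point of the paper is that such uniform behavior fails for some learnable convex problems.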

All generalized linear problems (including SVM, logistic regression, and many other supervised learning formulations) fall in the uniform-convergence class. However, even though most convex problems (satisfying convexity, Lipschitzness, boundedness, etc.) are learnable, some of them do not enjoy uniform convergence at all, and some guarantee only the point-wise convergence at the empirical minimizer rather than full uniform convergence.

- In the largest class of learnable problems, a stochastic algorithm achieves a $\sqrt{\frac{1}{n}}$ rate for convex functions and a $\frac{1}{n}$ rate for strongly convex functions.
- In the smallest class, the one with uniform convergence, generalized linear problems enjoy the same $\sqrt{\frac{1}{n}}$ vs. $\frac{1}{n}$ rates for convex / strongly convex losses.
- Neither of the above two results is new, but the paper proves that all strongly convex problems are learnable by the empirical minimizer.
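As a sketch of the strongly convex $\frac{1}{n}$ rate, here is a minimal SGD example of my own (not the paper's algorithm): $f(w;z)=\tfrac{1}{2}(w-z)^2$ is 1-strongly convex in $w$, and the classical step size $\eta_t = \frac{1}{\lambda t}$ with $\lambda = 1$ recovers the fast rate.

```python
import numpy as np

# SGD on f(w; z) = (w - z)^2 / 2, which is 1-strongly convex in w.
# With step size eta_t = 1 / (lambda * t) and lambda = 1, the iterate
# after n steps equals the running sample mean of the z's.

rng = np.random.default_rng(1)

def sgd_excess_risk(n, mu=2.0):
    # z ~ N(mu, 1); the expected objective F(w) = (w - mu)^2 / 2 + 1/2
    # is minimized at w* = mu.
    w = 0.0
    for t in range(1, n + 1):
        z = rng.normal(loc=mu)
        grad = w - z                  # gradient of f(w; z) in w
        w -= grad / t                 # eta_t = 1 / (lambda * t), lambda = 1
    return 0.5 * (w - mu) ** 2        # excess risk F(w) - F(w*)

for n in (10, 100, 1000):
    print(n, sgd_excess_risk(n))
```

The expected excess risk here is $\tfrac{1}{2n}$, i.e., the $\frac{1}{n}$ rate for strongly convex losses mentioned above.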

Overall, this paper summarizes the kinds of generalization guarantees developed over the past few years, which makes it a good reference for learning theorists. As a supplement, one might also refer to the following two papers: