Survey & Experiment: Towards the Learning Accuracy

This survey has been merged into a technical report, see here.

1. Introduction

A generic learning problem can be regarded as an optimization over parameter $\mathbf{w} \in \mathcal{W}$, and the loss function is given by $f(\mathbf{w};\theta)$ where $\theta$ is a given sample. An empirical objective $\hat{F}(\mathbf{w})$ can be written as

$\hat{F}(\mathbf{w})=\hat{\mathbb{E}}[f(\mathbf{w}; \theta)]=\frac{1}{n}\sum_{i=1}^n f(\mathbf{w};\theta_i)\enspace,$

here $\theta_1,\theta_2,\ldots,\theta_n$ are a sequence of observed samples.

It is normally assumed that these samples are i.i.d. drawn from some unknown distribution $D$, and therefore the stochastic objective $F(\mathbf{w})$ is often more desirable:

$F(\mathbf{w})=\mathbb{E}_{\theta \leftarrow D}[f(\mathbf{w};\theta)] \enspace.$

For example, if we take $f(\mathbf{w}; \theta) = \max\{0,1-y\langle \mathbf{w},\theta \rangle\} + \frac{\sigma}{2}\|\mathbf{w}\|_2^2$ we will arrive at the famous SVM, with a weighted L2 norm regularizer. The kernel trick can also be adopted which allows a non-linear prediction: $f(\mathbf{w}; \theta) = \max\{0,1-y\langle \mathbf{w},\phi(\theta) \rangle\} + \frac{\sigma}{2}\|\mathbf{w}\|_2^2$.

Trial 1. To obtain a good learning accuracy, the first trial is to optimize over the empirical objective $\hat{F}(\mathbf{w})$. We normally assume that $f(\cdot;\theta)$ is a convex function because we can then utilize optimization techniques like interior point method or gradient descent to conquer the minimization of the empirical objective. Recent results in for example [DSSST10, SSSS07] have shown that in many cases, using the stochastic gradient descent experiences the fastest time complexity in minimizing the empirical objective, and [ZCW+09] has generalized this in the extent of using kernels and (batched) parallel computation. I will briefly describe the main framework of [DSSST10] in Section 2 because it embraces all previously known first-order algorithms as special cases.

Trial 2. To better understand the learning accuracy to future samples, we need to establish the connection between the stochastic objective $F(\mathbf{w})$ and empirical objective $\hat{F}(\mathbf{w})$. This is called the generalization. A recent paper [SSSSS09] completed a thorough classification over the types of learning problems, and tells us how well each of them guarantees a good generalization error bound. The main results of this paper is provided in Section 3 as a good reference, but this is only the second trial.

As a partial summary, in the above two trials, when $f$ is convex both results have a bound proportional to $\frac{1}{\sqrt{T}}$. In Step 1, this $T$ is the number of iterations (which is also proportional to the runtime) and the bound is the difference between the optimal solution and the one generated by the SGD. In Step 2, this $T$ is the number of samples and the bound we are interested is (imprecisely) the difference over the stochastic and empirical objective. However, if we further require $f$ to be strongly-convex, by for instance adding a regularized term, both bounds immediately decrease to $\frac{1}{T}$ instead of $\frac{1}{\sqrt{T}}$, and this partially explains why we add a regularizer from a theoretical point of view.

Trial 3. Now comes to the third trial, an attempt to bound the stochastic loss instead of the stochastic objective. As explained above, we often add a regularizer (e.g. $r(\mathbf{w})=\frac{\sigma}{2}\|\mathbf{w}\|^2$) to the loss function $f(\mathbf{w};\theta)=l(\mathbf{w};\theta)+r(\mathbf{w})$. Therefore, even if we achieve a close-to-optimal solution for the stochastic objective $F(\mathbf{w})$, it is still far away from the stochastic loss

$L(\mathbf{w})=\mathbb{E}_{\theta \leftarrow D}[l(\mathbf{w};\theta)]\enspace.$

To further build a connection between these two quantities, [SSS08,ZCZ+09] used a so-called oracle inequality and deduced a final bound for this stochastic loss, with respect to the running time of a program, and the number of training samples $n$. This work was recognized by the best paper awards of ICML 2008 and ICDM 2009, and briefly described in Section 4.

Trial 4. However, how well such loss function characterizes the word “accuracy” remains a problem. One can feel free to use any convex loss functions (like hinge loss or logistic loss), but they do not reflect the accuracy at all. In the paper of [SSSS10] , they consider the following non-convex objective:

$f(\mathbf{w};\theta=(x,y))=\left| \frac{1}{2}\textrm{sgn}(\langle \mathbf{w}, \phi(x)\rangle + \frac{1}{2} - y \right|$

Which is exactly the definition of accuracy (for two-class classification), and $y \in \{0,1\}$.

Though incorporating the traditional Rademacher generalization bound one can still obtain good generalization guarantee, the empirical optimization becomes hard (Appendix A of [SSSS10]). Realizing such difficulty, [SSSS10] studied improper learning and constructed a larger class of classifier. Not only the empirical optimization in the new concept class is convex, the minimizer is also good enough so that original zero-one objective problem is learnable using such minimizer. This paper was recognized by the best paper award of COLT 2010, and this report is also going to analyze its practical value against public data sets in Section 5.

2. Trial 1: Empirical Optimization

Mainly being an intro to this paper [DSSST10].

3. Trial 2: Stochastic Optimization

See the ﻿previous post, mainly about the result in this paper [SSSSS09].

4. Trial 3: Stochastic Loss Optimization

About some introduction to [SSSS07] and [ZCZ+09].

5. Trial 4: Stochastic Zero-One Loss Optimization

Contains a thorough study of [SSSS10] with experimental results. Also mentioned here.

6. References

[DSSST10] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, 2010.
[SSS08] Shai Shalev-Shwartz and Nathan Srebro. Svm optimization: inverse dependence on training set size. In ICML, pages 928–935, 2008.
[SSSS07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated subgradient solver for svm. In ICML, pages 807–814, 2007.
[SSSS10] Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the zero-one loss. In COLT, 2010.
[SSSSS09] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009.
[ZCW+09] Zeyuan Allen Zhu, Weizhu Chen, Gang Wang, Chenguang Zhu, and Zheng Chen. P-packsvm: Parallel primal gradient descent kernel svm. In ICDM, pages 677–686, 2009.
[ZCZ+09] Zeyuan Allen Zhu, Weizhu Chen, Chenguang Zhu, Gang Wang, Haixun Wang, and Zheng Chen. Inverse time dependency in convex regularized learning. In ICDM, pages 667–676, 2009.