## Learning Zero-One Loss Objective – The Experiment

(Continued from the previous post.)

Although the complexity depends crucially on $L$, when $L$ is a constant (e.g. $1$) there is still reason to believe that the approximation of the 0-1 loss by $\varphi_{\mathrm{sig}}$ is more accurate than the hinge loss. However, even for constant $L$, the sample complexity $m=\Omega(B/\epsilon^2)=\Omega(1/\epsilon^3)$ has a cubic dependence on $1/\epsilon$. At the same time, the kernel stochastic gradient descent algorithm (even run in a distributed manner) requires $O(m^2)$ time. This creates a dilemma: we cannot set $m$ too large, or the empirical optimization will not finish in endurable time; nor can we set $m$ too small, or the generalization guarantee becomes too weak.
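To see why a sigmoidal surrogate can approximate the 0-1 loss better than the hinge, one can compare the three losses numerically. The sketch below uses an assumed form $\varphi_{\mathrm{sig}}(z) = 1/(1+e^{Lz})$ with slope parameter $L$; the exact $\varphi_{\mathrm{sig}}$ of the post may differ, but any such surrogate is bounded in $(0,1)$ and approaches the 0-1 step as $L$ grows, while the hinge grows without bound on the negative side.

```python
import numpy as np

def zero_one(z):
    """True 0-1 loss as a function of the margin z = y * f(x)."""
    return (z <= 0).astype(float)

def hinge(z):
    """SVM surrogate: grows linearly for badly misclassified points."""
    return np.maximum(0.0, 1.0 - z)

def phi_sig(z, L=1.0):
    """Assumed sigmoidal surrogate with slope parameter L (illustrative;
    the post's exact phi_sig may differ)."""
    return 1.0 / (1.0 + np.exp(L * z))

# Sample margins away from the discontinuity at z = 0.
z = np.linspace(-4.0, 4.0, 8)
gap_sig = np.abs(phi_sig(z, L=5.0) - zero_one(z)).max()
gap_hinge = np.abs(hinge(z) - zero_one(z)).max()
# For a steep sigmoid the worst-case gap to the 0-1 loss is small,
# while the hinge is off by up to 1 + |z| for large negative margins.
```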

Consider some $m$ of medium size, just large enough to be trained in several minutes, say. Although the zero-one loss does not come with a good generalization guarantee, it is a more desirable loss function than the hinge loss. So, given a training set of medium size, if we run an SVM against a zero-one loss minimizer, who wins this tug-of-war?

### Configuration

In the interest of fairness, the same stochastic gradient descent routine, P-packSVM [ZCW+09], has been adopted for all experiments, on an Intel quad-core CPU machine (four cores, roughly a 4x speed-up). For the empirical optimization of the zero-one loss, two approaches are implemented: ZeroOne refers to classical mirror descent on the zero-one objective, and ZeroOne-reg refers to the regularized zero-one objective solved with strongly-convex mirror descent. Finally, PEGASOS refers to the empirical optimizer for the regularized SVM objective.
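The per-iteration updates of the three optimizers can be sketched for a linear model as follows. This is an assumed simplification (Euclidean mirror map, the sigmoid surrogate from above, hypothetical parameter names); the actual implementations are kernelized and run inside P-packSVM.

```python
import numpy as np

def sigmoid_loss_grad(w, x, y, L=1.0):
    """Gradient of the assumed surrogate phi(z) = 1/(1+exp(L z)) at the
    margin z = y <w, x>; phi'(z) = -L * s * (1 - s) with s = phi(z)."""
    z = y * np.dot(w, x)
    s = 1.0 / (1.0 + np.exp(L * z))
    return -L * s * (1.0 - s) * y * x

def zeroone_step(w, x, y, t, T):
    # "ZeroOne": plain (Euclidean) mirror descent, fixed 1/sqrt(T) rate.
    return w - (1.0 / np.sqrt(T)) * sigmoid_loss_grad(w, x, y)

def zeroone_reg_step(w, x, y, t, lam=0.1):
    # "ZeroOne-reg": add (lam/2)||w||^2; strong convexity permits a
    # 1/(lam * t) decaying step size.
    g = sigmoid_loss_grad(w, x, y) + lam * w
    return w - g / (lam * t)

def pegasos_step(w, x, y, t, lam=0.1):
    # PEGASOS: regularized hinge loss, subgradient step of size 1/(lam * t).
    g = lam * w - (y * x if y * np.dot(w, x) < 1 else 0.0)
    return w - g / (lam * t)
```

All three consume one sample per iteration; they differ only in the surrogate loss, the presence of the regularizer, and the step-size schedule.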

The three datasets Splice, Web and Adult, provided by the libSVM project team [Fan], are used.

| Dataset | # Training / Testing Samples | # Features |
|---------|------------------------------|------------|
| Splice  | 1000 / 2175                  | 60         |
| Web     | 2477 / 47272                 | 300        |
| Adult   | 1605 / 30956                 | 123        |

### Results

From Table 1, one can see that when $m$ is on the order of $1000$, it is still unclear which method outperforms the others. On the Adult dataset, the traditional SVM trainer PEGASOS significantly beats the zero-one loss optimizers, while on Splice and Web, the zero-one loss does slightly better.

Another experiment worth conducting concerns the convergence rates of the three methods. From Figure 2 one can see that, because ZeroOne uses a learning rate of $1/\sqrt{T}$ (where $T$ is the total number of iterations), its accuracy curve is smoother than those of PEGASOS and ZeroOne-reg, which use a $1/t$-decaying learning rate justified by strong convexity. However, because the generalization guarantee for the 0-1 loss is weaker than that of the SVM, ZeroOne-reg converges much more slowly than PEGASOS.
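The smooth-versus-noisy trajectory difference can be reproduced on a toy problem. The sketch below runs SGD on a one-dimensional strongly convex objective $f(w)=\tfrac{\lambda}{2}w^2$ with additive gradient noise, comparing a fixed $1/\sqrt{T}$ rate against a Pegasos-style $1/(\lambda t)$ decay; it only illustrates the two schedules, not the actual experiment.

```python
import numpy as np

def sgd_1d(schedule, T=2000, lam=0.5, seed=0):
    """SGD on f(w) = (lam/2) w^2 with unit-variance gradient noise.
    `schedule(t, T, lam)` returns the step size at iteration t."""
    rng = np.random.default_rng(seed)
    w = 5.0
    for t in range(1, T + 1):
        g = lam * w + rng.normal()  # noisy gradient of f at w
        w -= schedule(t, T, lam) * g
    return w

# Fixed 1/sqrt(T) rate (ZeroOne-style): small constant steps, hence a
# smooth trajectory, but only O(1/sqrt(T))-type convergence.
w_fixed = sgd_1d(lambda t, T, lam: 1.0 / np.sqrt(T))

# Decaying 1/(lam * t) rate (Pegasos-style, valid under strong convexity):
# large early steps and a noisier trajectory, but O(1/T)-type convergence.
w_decay = sgd_1d(lambda t, T, lam: 1.0 / (lam * t))
```

Both runs end near the minimizer $w^*=0$; plotting the full trajectories instead of the endpoints shows the fixed-rate curve varying far less from step to step.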
