## Learning Zero-One Loss Objective – The Theory

(This thread comes from a preliminary version of a subsection of my ongoing survey, which will be published soon.)

Not satisfied with the result in the previous section, [SSSS10] made an interesting attempt at learning the 0-1 objective function directly. In classification problems with halfspace classifiers, the following objective function is more desirable than any convex objective (regularized or not), such as the hinge loss or logistic loss:

$f(\mathbf{w};\mathbf{\theta}=(\mathbf{x},y))=\left| \frac{1}{2}\textrm{sgn}(\langle \mathbf{w}, \phi(\mathbf{x})\rangle) + \frac{1}{2} - y \right|$

Notice that the label $y\in \{0,1\}$.
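This objective is just the misclassification indicator in disguise. A minimal sketch, taking $\phi$ to be the identity map (an illustrative simplification):

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Average |(sgn(<w, x>) + 1)/2 - y| over the sample; labels y are in {0, 1}.

    Note np.sign(0) == 0, so a point exactly on the boundary contributes 0.5.
    """
    preds = (np.sign(X @ w) + 1) / 2  # maps the sign in {-1, +1} to {0, 1}
    return np.mean(np.abs(preds - y))

# Toy data: w = (1, -1) classifies all three points correctly.
w = np.array([1.0, -1.0])
X = np.array([[2.0, 1.0], [1.0, 2.0], [3.0, 0.0]])
y = np.array([1, 0, 1])
print(zero_one_loss(w, X, y))   # 0.0: no mistakes
print(zero_one_loss(-w, X, y))  # 1.0: every prediction is flipped
```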

## The Theory

If we define $\varphi_{0-1}(a)=\frac{1}{2}(\textrm{sgn}(a)+1)$, the above objective function is characterized by the following concept class

$H_{\varphi_{0-1}} = \{ \mathbf{x} \rightarrow \varphi_{0-1}(\langle \mathbf{w}, \phi(\mathbf{x}) \rangle) \} \enspace,$

and we are interested in optimizing the following stochastic objective:

$F(h) = \mathbb{E}_{(x,y)\sim D} [| h(x)-y |], \quad h \in H_{\varphi_{0-1}} \enspace.$ (1)

The first step of this approach is to approximate the concept class $H_{\varphi_{0-1}}$ using Lipschitz continuous functions. Define

$\varphi_{\mathrm{sig}}(a) = \frac{1}{1+\exp(-4La)}\enspace,$

which is $L$-Lipschitz continuous and approximates $\varphi_{0-1}$ well. ([SSSS10] also analyzed two other approximating functions, which are omitted here for lack of space.) We consider the following concept class for the stochastic objective (Eq. (1)):

$H_{\varphi_{\textrm{sig}}} = \{ \mathbf{x} \rightarrow \varphi_{\textrm{sig}}(\langle \mathbf{w}, \phi(\mathbf{x}) \rangle) \} \enspace.$
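A quick numerical sketch of this transfer function (the choice $L = 10$ below is illustrative): its derivative is largest at $a = 0$, where it equals $4L \cdot \frac{1}{4} = L$, which is why it is $L$-Lipschitz, and for large $L$ it closely tracks the 0-1 transfer.

```python
import numpy as np

def phi_sig(a, L=10.0):
    """Sigmoid transfer phi_sig(a) = 1 / (1 + exp(-4 L a)); L-Lipschitz."""
    return 1.0 / (1.0 + np.exp(-4.0 * L * a))

print(phi_sig(0.0))    # 0.5: the value at the decision boundary
print(phi_sig(0.5))    # close to 1: far on the positive side
print(phi_sig(-0.5))   # close to 0: far on the negative side
```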

One advantage of this approximation is that generalization results such as the Rademacher complexity bound [BM03] now apply. Indeed, the empirical minimizer,

$\hat{\mathbf{w}} = \mathop{\textrm{argmin}}_{\mathbf{w}} \hat{F}(\mathbf{w}) = \mathop{\textrm{argmin}}_{\mathbf{w}} \frac{1}{m} \sum_{i=1}^m | \varphi_{\textrm{sig}}(\langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle) - y_i | \enspace,$

gives a generalization error bound $F(\hat{\mathbf{w}}) - \min_{\mathbf{w}} F(\mathbf{w}) < \epsilon$ when $m=\tilde{\Omega}(L^2/\epsilon^2)$. However, this empirical minimization is computationally hard, since the objective is not convex.
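The non-convexity is easy to verify numerically. A minimal sketch (the choices $L = 1$ and a single training point are illustrative): with $x = 1$, $y = 1$, the empirical loss as a function of the scalar $w$ is $1 - \varphi_{\textrm{sig}}(w)$, and its value at a midpoint can exceed the average of the endpoint values, violating convexity.

```python
import numpy as np

def phi_sig(a, L=1.0):
    return 1.0 / (1.0 + np.exp(-4.0 * L * a))

def loss(w):
    # One sample x = 1 with label y = 1, so the loss is |phi_sig(w) - 1|.
    return abs(phi_sig(w) - 1.0)

w1, w2 = -1.0, 0.0
midpoint = loss((w1 + w2) / 2)       # value at the midpoint
chord = (loss(w1) + loss(w2)) / 2    # average of the endpoint values
print(midpoint > chord)  # True: the midpoint lies above the chord, so not convex
```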

To overcome this difficulty, a new concept class is introduced: $H_B = \{ \mathbf{x} \rightarrow \langle \mathbf{w}, \psi(\mathbf{x})\rangle : \| \mathbf{w} \|^2 \leq B \} \enspace,$ which differs from $H_{\varphi_{0-1}}$ and $H_{\varphi_{\textrm{sig}}}$ in two ways. First, it no longer applies a 0-1 transfer function in the prediction; the traditional inner-product (halfspace) prediction is used instead. Second, it employs a new feature map $\psi$, whose kernel is defined by ($\nu$ can be chosen to be $1/2$ for ease of presentation):

$\langle \psi(\mathbf{x}), \psi(\mathbf{x}') \rangle = \frac{1}{1-\nu \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle} \enspace.$
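As a concrete (hypothetical) instance, take the base kernel $\langle \phi(\mathbf{x}), \phi(\mathbf{x}')\rangle$ to be a Gaussian RBF kernel, whose values lie in $(0, 1]$, so with $\nu = 1/2$ the denominator stays away from zero:

```python
import numpy as np

def base_kernel(x, xp, gamma=1.0):
    """Base kernel k(x, x') = <phi(x), phi(x')>; RBF is an illustrative choice."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def psi_kernel(x, xp, nu=0.5, gamma=1.0):
    """Composed kernel <psi(x), psi(x')> = 1 / (1 - nu * k(x, x'))."""
    return 1.0 / (1.0 - nu * base_kernel(x, xp, gamma))

x = np.array([0.0, 0.0])
xp = np.array([1.0, 0.0])
print(psi_kernel(x, x))    # 2.0, since k(x, x) = 1 and 1 / (1 - 0.5) = 2
print(psi_kernel(x, xp))   # smaller: the base similarity decays with distance
```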

Choosing $B=\Omega(\exp(L\log(\frac{L}{\epsilon})))$ large enough, [SSSS10] proved that $H_B$ approximately contains $H_{\varphi_{\textrm{sig}}}$, so we can study the learning problem directly in $H_B$. The proof is quite general: it only requires the Lipschitz continuity of $\varphi_{\textrm{sig}}$ and a Chebyshev polynomial approximation technique.

One big benefit of this conversion is that the new problem is convex and can be optimized empirically via, for instance, stochastic gradient descent. Note that due to the boundedness of $H_B$, using Rademacher complexity bounds again [BM03,KST08], a sample complexity of $\tilde{\Omega}(B/\epsilon^2)$ can be deduced.
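A minimal sketch of such a projected stochastic subgradient method, treating $\psi(\mathbf{x})$ as a precomputed feature vector (in practice one would work with the kernel directly; the toy data, step sizes, and $B$ below are all illustrative choices):

```python
import numpy as np

def sgd_abs_loss(Psi, y, B, steps=3000, eta=0.5, seed=0):
    """Projected SGD for min E|<w, psi(x)> - y| subject to ||w||^2 <= B."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Psi.shape[1])
    for t in range(steps):
        i = rng.integers(len(y))
        # Subgradient of the absolute loss |<w, psi_i> - y_i| at the iterate.
        g = np.sign(Psi[i] @ w - y[i]) * Psi[i]
        w = w - eta / np.sqrt(t + 1) * g
        norm = np.linalg.norm(w)
        if norm ** 2 > B:  # project back onto the ball of squared radius B
            w *= np.sqrt(B) / norm
    return w

# Toy data: three feature vectors with labels in {0, 1}; w = (1, 0) is optimal.
Psi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
w = sgd_abs_loss(Psi, y, B=10.0)
print(np.mean(np.abs(Psi @ w - y)))  # small: the iterate is near the optimum
```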

Notice that the procedure above is improper learning: to learn $H_{\varphi_{0-1}}$ we actually work with the larger concept class $H_B$, and the returned classifier $h\in H_B$ is close to the optimal classifier in $H_{\varphi_{0-1}}$. The overall time and sample complexity is $\mathrm{poly}(\exp(L \log(\frac{L}{\epsilon})))$, which is exponential in $L$; however, [SSSS10] also showed that a polynomial dependence on $L$ is impossible, unless some NP-hard problem is in P.

## The Experiment

See the follow-up post.