(This thread comes from a preliminary version of a subsection of my ongoing survey, which will be published soon.)

Not satisfied with the result in the previous section, [SSSS10] made an interesting attempt towards learning 0-1 objective functions. In classification problems with a halfspace classifier, the following objective function is more desirable than any other (regularized or not) convex objective (e.g., hinge loss, logistic loss):

$$\min_{\|w\|\le 1}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbf{1}[\,y\,\langle w,x\rangle\le 0\,]\big]\,.$$

Notice that the label $y\in\{-1,+1\}$.

## The Theory

If we define $\varphi_{0\text{-}1}(a) = \mathbf{1}[a\le 0]$, the above objective function is characterized by the following concept class

$$\mathcal{W} = \big\{\, x \mapsto \langle w,x\rangle \;:\; \|w\|\le 1 \,\big\}\,,$$

and we are interested in optimizing the following stochastic objective:

$$\min_{\|w\|\le 1}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\varphi_{0\text{-}1}\big(y\,\langle w,x\rangle\big)\big]\,. \qquad (1)$$

The first step of this attempt is to approximate the concept class using Lipschitz continuous functions. Define the sigmoid transfer function

$$\varphi_{\mathrm{sig}}(a) = \frac{1}{1+e^{4La}}\,,$$

which is $L$-Lipschitz continuous and approximates $\varphi_{0\text{-}1}$ well away from the origin. (In [SSSS10] two other approximate transfer functions were also analyzed, but for lack of space they are omitted here.) We consider the following concept class for the stochastic objective (Eq. (1)):

$$\mathcal{W}_{\mathrm{sig}} = \big\{\, x \mapsto \varphi_{\mathrm{sig}}(\langle w,x\rangle) \;:\; \|w\|\le 1 \,\big\}\,.$$
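As a quick numeric sanity check (an illustration of mine, not from [SSSS10]; the choice $L=3$ is arbitrary), the sigmoid transfer $\varphi_{\mathrm{sig}}(a)=1/(1+e^{4La})$ agrees with the 0-1 transfer up to $e^{-4L|a|}$ once the margin is bounded away from zero, while its slope never exceeds $L$:

```python
import math

def phi_01(a):
    """0-1 transfer: counts a mistake whenever the signed margin a <= 0."""
    return 1.0 if a <= 0 else 0.0

def phi_sig(a, L=3.0):
    """Sigmoid transfer 1/(1 + exp(4*L*a)); its steepest slope, at a = 0, is L."""
    return 1.0 / (1.0 + math.exp(4.0 * L * a))

# phi_sig matches phi_01 up to exp(-4*L*|a|) away from the origin:
for a in (-0.5, -0.25, 0.25, 0.5):
    assert abs(phi_sig(a) - phi_01(a)) <= math.exp(-4.0 * 3.0 * abs(a))

# Crude Lipschitz check: finite-difference slopes on a grid never exceed L.
pts = [i / 1000.0 for i in range(-2000, 2000)]
max_slope = max(abs(phi_sig(b) - phi_sig(a)) / (b - a) for a, b in zip(pts, pts[1:]))
assert max_slope <= 3.0 + 1e-9
```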

One advantage of such an approximation is that theorems like the Rademacher generalization bound [BM03] now apply. Indeed, the empirical minimizer

$$\hat{w} = \operatorname*{arg\,min}_{\|w\|\le 1}\; \frac{1}{m}\sum_{i=1}^{m} \varphi_{\mathrm{sig}}\big(y_i\,\langle w,x_i\rangle\big)$$

achieves generalization error at most $\epsilon$ when $m=\Omega(L^2/\epsilon^2)$. However, it has been pointed out that this empirical minimization is computationally hard, since the objective is not convex.
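The sample complexity here follows from a standard contraction argument; a back-of-the-envelope sketch (constants and the high-probability confidence term suppressed, assuming $\|x\|\le 1$): the Rademacher complexity of the unit-ball linear class satisfies

$$\mathfrak{R}_m\big(\{\,x\mapsto\langle w,x\rangle : \|w\|\le 1\,\}\big)\;\le\;\frac{1}{\sqrt{m}}\,,$$

and since $\varphi_{\mathrm{sig}}$ is $L$-Lipschitz, the contraction lemma gives $\mathfrak{R}_m \le L/\sqrt{m}$ for the composed loss class. The generalization gap is then $O(L/\sqrt{m})$, and requiring $L/\sqrt{m}\le\epsilon$ yields $m=\Omega(L^2/\epsilon^2)$.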

To overcome this difficulty, a new concept class is introduced:

$$\mathcal{H}_B = \big\{\, x \mapsto \langle v, \psi(x)\rangle \;:\; \|v\|\le B \,\big\}\,,$$

and its difference from $\mathcal{W}$ or $\mathcal{W}_{\mathrm{sig}}$ is two-fold. First, it no longer uses a 0-1 function in the prediction; the traditional halfspace classification using an inner product is adopted. Second, it enables a new kernel $K$ with feature map $\psi$, which is defined as ($\nu$ can be chosen to be $1/2$ for ease of presentation):

$$K(x,x') = \langle \psi(x), \psi(x')\rangle = \frac{1}{1-\nu\,\langle x,x'\rangle}\,.$$
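The kernel $1/(1-\nu\langle x,x'\rangle)$ can be expanded as a geometric series, exposing the (infinite) polynomial feature map behind it; this is what makes polynomial-approximation arguments applicable. A minimal sketch of mine (function names and $\nu=1/2$ are illustrative choices):

```python
def K(x, xp, nu=0.5):
    """The kernel 1/(1 - nu*<x, x'>), for ||x||, ||x'|| <= 1 and nu in (0, 1)."""
    dot = sum(a * b for a, b in zip(x, xp))
    return 1.0 / (1.0 - nu * dot)

def K_series(x, xp, nu=0.5, n_terms=80):
    """The same kernel via its geometric series sum_{n>=0} (nu*<x, x'>)^n,
    i.e. an implicit feature map containing all monomials of every degree."""
    dot = sum(a * b for a, b in zip(x, xp))
    return sum((nu * dot) ** n for n in range(n_terms))

x, xp = (0.6, 0.8), (0.96, -0.28)       # two unit vectors
assert abs(K(x, xp) - K_series(x, xp)) < 1e-12
assert abs(K(x, x) - 2.0) < 1e-12       # on the unit sphere, K(x, x) = 1/(1 - nu)
```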

Choosing $B$ large enough, [SSSS10] proved that $\mathcal{H}_B$ approximately includes $\mathcal{W}_{\mathrm{sig}}$, and therefore we can directly study the learning problem in $\mathcal{H}_B$. The proof only requires the Lipschitz continuity of $\varphi_{\mathrm{sig}}$ and a Chebyshev polynomial approximation technique, and is very general.

One big benefit of this conversion is that the new problem is *convex* and can be empirically optimized via, for instance, stochastic gradient descent. Note that, due to the boundedness of the kernel $K$, using the Rademacher complexity bound again [BM03, KST08], a sample complexity of $O(B^2/\epsilon^2)$ can be deduced.
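As a toy illustration of such a convex, kernelized optimization (this is not the algorithm of [SSSS10]; the hinge loss, step size, and data set below are my own choices), one can run SGD in the RKHS of the kernel above, representing the iterate by its dual coefficients and projecting onto the norm ball of radius $B$:

```python
def kernel(x, xp, nu=0.5):
    # K(x, x') = 1/(1 - nu*<x, x'>)
    dot = sum(a * b for a, b in zip(x, xp))
    return 1.0 / (1.0 - nu * dot)

def kernel_sgd(xs, ys, B=10.0, eta=0.1, epochs=50):
    """SGD on the (convex) hinge loss over {f : ||f||_H <= B}, with the
    iterate represented as f(x) = sum_i alpha[i] * K(xs[i], x)."""
    alpha = [0.0] * len(xs)
    f = lambda x: sum(a * kernel(xi, x) for a, xi in zip(alpha, xs) if a != 0.0)
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * f(x) < 1.0:          # hinge subgradient step
                alpha[i] += eta * y
        # Project back onto the RKHS ball of radius B.
        norm = sum(ai * aj * kernel(xi, xj)
                   for ai, xi in zip(alpha, xs)
                   for aj, xj in zip(alpha, xs)) ** 0.5
        if norm > B:
            alpha = [a * B / norm for a in alpha]
    return alpha, f

# A tiny separable toy set: label = sign of the first coordinate.
xs = [(0.9, 0.1), (0.8, -0.2), (-0.9, 0.1), (-0.7, -0.3)]
ys = [1, 1, -1, -1]
alpha, f = kernel_sgd(xs, ys)
assert all(y * f(x) > 0 for x, y in zip(xs, ys))
```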

Notice that the procedure above is *improper* learning: to learn $\mathcal{W}_{\mathrm{sig}}$ we actually work with the larger concept class $\mathcal{H}_B$, and the returned classifier lies in $\mathcal{H}_B$ while being close to the optimal classifier in $\mathcal{W}_{\mathrm{sig}}$. Furthermore, the overall time and sample complexity is $\exp\big(O\big(L\log\frac{L}{\epsilon}\big)\big)$. This bound is exponential w.r.t. $L$, but [SSSS10] also showed that a polynomial dependence on $L$ is impossible, unless some NP-hard problem is in P.

## The Experiment

See the follow-up post.