Learning the Zero-One Loss Objective – The Theory

(This post comes from a preliminary version of a subsection of my ongoing survey, which will be published soon.)

Not satisfied with the result in the previous section, [SSSS10] made an interesting attempt toward learning the 0-1 objective function. In classification problems with a halfspace classifier, the following objective function is more desirable than any (regularized or unregularized) convex objective (e.g., hinge loss, logistic loss):

f(\mathbf{w};\mathbf{\theta}=(\mathbf{x},y))=\left| \frac{1}{2}\textrm{sgn}(\langle \mathbf{w}, \phi(\mathbf{x})\rangle) + \frac{1}{2} - y \right|

Notice that the label y\in \{0,1\}.
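As a concrete sanity check, the 0-1 objective above can be computed directly. A minimal sketch, taking \phi to be the identity feature map for simplicity:

```python
import numpy as np

def zero_one_loss(w, x, y):
    """0-1 loss of a halfspace classifier, for labels y in {0, 1}.

    phi is taken to be the identity feature map for simplicity."""
    # (1/2) sgn(<w, x>) + 1/2 maps the sign in {-1, +1} to a prediction in {0, 1}
    pred = 0.5 * np.sign(np.dot(w, x)) + 0.5
    return abs(pred - y)

w = np.array([1.0, -2.0])
print(zero_one_loss(w, np.array([3.0, 1.0]), 1))  # correctly classified -> 0.0
print(zero_one_loss(w, np.array([3.0, 1.0]), 0))  # misclassified -> 1.0
```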

The Theory

If we define \varphi_{0-1}(a)=\frac{1}{2}(\textrm{sgn}(a)+1), the above objective function is characterized by the following concept class

H_{\varphi_{0-1}} = \{ \mathbf{x} \rightarrow \varphi_{0-1}(\langle \mathbf{w}, \phi(\mathbf{x}) \rangle) \} \enspace,

and we are interested in optimizing the following stochastic objective:

F(h) = \mathbb{E}_{(x,y)\sim D} [| h(x)-y |], \quad h \in H_{\varphi_{0-1}} \enspace. (1)

The first step of this attempt is to approximate the concept class H_{\varphi_{0-1}} by Lipschitz continuous functions. Define

\varphi_{\mathrm{sig}}(a) = \frac{1}{1+\exp(-4La)}\enspace,

which is L-Lipschitz continuous and approximates \varphi_{0-1} well. (In [SSSS10] two other approximating functions are also analyzed, but they are omitted here for lack of space.) We consider the following concept class for the stochastic objective (Eq. (1)):

H_{\varphi_{\textrm{sig}}} = \{ \mathbf{x} \rightarrow \varphi_{\textrm{sig}}(\langle \mathbf{w}, \phi(\mathbf{x}) \rangle) \} \enspace.
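The two transfer functions can be compared numerically. A small sketch (the choice L = 10 is an arbitrary illustration):

```python
import numpy as np

def phi_01(a):
    # exact 0-1 transfer function: (sgn(a) + 1) / 2
    return 0.5 * (np.sign(a) + 1.0)

def phi_sig(a, L=10.0):
    # L-Lipschitz sigmoid surrogate: the slope at the origin is exactly L
    return 1.0 / (1.0 + np.exp(-4.0 * L * a))

a = np.linspace(-1.0, 1.0, 5)
print(phi_01(a))                 # values: 0, 0, 0.5, 1, 1
print(np.round(phi_sig(a), 4))   # close to phi_01 away from the origin
```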

One advantage of this approximation is that it allows standard results such as the Rademacher generalization bound [BM03] to apply. Indeed, the empirical minimizer,

\hat{\mathbf{w}} = \mathop{\textrm{argmin}}_{\mathbf{w}} \hat{F}(\mathbf{w}) = \mathop{\textrm{argmin}}_{\mathbf{w}} \frac{1}{m} \sum_{i=1}^m | \varphi_{\textrm{sig}}(\langle \mathbf{w}, \mathbf{x}_i \rangle) - y_i | \enspace,

gives a generalization error bound F(\hat{\mathbf{w}}) - \min_{\mathbf{w}} F(\mathbf{w}) \leq \epsilon when m=\tilde{\Omega}(L^2/\epsilon^2). However, empirical minimization here is computationally hard, since the objective is not convex.
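The non-convexity can be glimpsed by evaluating \hat{F} on toy data. A sketch with hypothetical 1-D data (the objective saturates away from the origin, which is exactly why it is not convex):

```python
import numpy as np

def phi_sig(a, L=10.0):
    return 1.0 / (1.0 + np.exp(-4.0 * L * a))

def empirical_risk(w, X, y, L=10.0):
    # hat F(w) = (1/m) sum_i | phi_sig(<w, x_i>) - y_i |
    return np.mean(np.abs(phi_sig(X @ w, L) - y))

# hypothetical 1-D data, labeled by the sign of the single feature
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = (X[:, 0] > 0).astype(float)

# risk flattens near 1 for very wrong w and near 0 for very good w
for w in (-1.0, 0.0, 1.0):
    print(w, empirical_risk(np.array([w]), X, y))
```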

To overcome this difficulty, a new concept class is introduced: H_B = \{ \mathbf{x} \rightarrow \langle \mathbf{w}, \psi(\mathbf{x})\rangle : \| \mathbf{w} \|^2 \leq B \} \enspace, which differs from H_{\varphi_{0-1}} and H_{\varphi_{\textrm{sig}}} in two ways. First, it no longer applies a 0-1 transfer function in the prediction; the traditional inner-product (halfspace) prediction is used instead. Second, it employs a new feature map \psi, whose induced kernel is defined as follows (\nu can be chosen to be 1/2 for ease of presentation):

\langle \psi(\mathbf{x}), \psi(\mathbf{x}') \rangle = \frac{1}{1-\nu \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle} \enspace.
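Given a Gram matrix of the base kernel \langle \phi(\mathbf{x}), \phi(\mathbf{x}')\rangle, the induced kernel of \psi is straightforward to compute. A minimal sketch, assuming the base kernel is normalized so that \nu \langle \phi(\mathbf{x}), \phi(\mathbf{x}')\rangle < 1:

```python
import numpy as np

def psi_kernel(K_phi, nu=0.5):
    """Kernel induced by psi from base-kernel values K_phi = <phi(x), phi(x')>.

    Assumes nu * K_phi < 1, e.g. a normalized base kernel
    (|<phi(x), phi(x')>| <= 1) with nu = 1/2."""
    return 1.0 / (1.0 - nu * K_phi)

# Gram matrix of a normalized linear base kernel on unit vectors
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(np.round(psi_kernel(X @ X.T), 3))  # diagonal entries: 1 / (1 - 0.5) = 2
```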

Choosing B=\Omega(\exp(L\log(\frac{L}{\epsilon}))) large enough, [SSSS10] proved that H_B approximately contains H_{\varphi_{\textrm{sig}}}, so we can study the learning problem directly in H_B. The proof requires only the Lipschitz continuity of \varphi_{\textrm{sig}} and a Chebyshev (polynomial) approximation argument, and is quite general.
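The flavor of the Chebyshev argument can be illustrated by fitting \varphi_{\textrm{sig}} with Chebyshev polynomials on [-1, 1]; roughly speaking, the degree (and coefficient size) needed for a given accuracy is what drives how large B must be. A rough numerical sketch (L = 4 and the grid size are arbitrary choices):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

L = 4.0  # arbitrary Lipschitz parameter for illustration
a = np.linspace(-1.0, 1.0, 400)
target = 1.0 / (1.0 + np.exp(-4.0 * L * a))  # phi_sig on [-1, 1]

errs = {}
for d in (5, 15, 30):
    coef = C.chebfit(a, target, d)                       # least-squares Chebyshev fit
    errs[d] = np.max(np.abs(C.chebval(a, coef) - target))
    print(d, errs[d])  # the error shrinks as the degree grows
```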

One big benefit of this conversion is that the new problem is convex and can be optimized empirically via, for instance, stochastic gradient descent. Note that, owing to the boundedness of H_B, applying a Rademacher complexity bound again [BM03, KST08] yields a sample complexity of \tilde{\Omega}(B/\epsilon^2).
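A hedged sketch of such an optimization: projected stochastic subgradient descent on the convex objective over H_B, with \mathbf{w} represented in the span of \{\psi(\mathbf{x}_j)\}. This is an illustration only, not the exact procedure or parameters of [SSSS10]; the toy data, B, step size, and epoch count are all assumptions:

```python
import numpy as np

def psi_kernel(K_phi, nu=0.5):
    # kernel induced by psi: <psi(x), psi(x')> = 1 / (1 - nu <phi(x), phi(x')>)
    return 1.0 / (1.0 - nu * K_phi)

def kernel_sgd(K, y, B=25.0, eta=0.05, epochs=100, seed=0):
    """Projected stochastic subgradient descent on
    (1/m) sum_i |<w, psi(x_i)> - y_i|  subject to  ||w||^2 <= B,
    representing w = sum_j alpha_j psi(x_j) (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in rng.permutation(m):
            # subgradient of |<w, psi(x_i)> - y_i| w.r.t. w is sign(.) * psi(x_i)
            alpha[i] -= eta * np.sign(K[i] @ alpha - y[i])
            sq = alpha @ K @ alpha           # ||w||^2 in the RKHS
            if sq > B:                        # project back onto the B-ball
                alpha *= np.sqrt(B / sq)
    return alpha

# hypothetical toy data: unit vectors, label 1 iff the first coordinate is positive
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
K = psi_kernel(X @ X.T)
alpha = kernel_sgd(K, y)
print(np.mean(np.abs(K @ alpha - y)))  # empirical risk, down from 0.5 at alpha = 0
```

The projection step keeps the iterate inside the norm ball \|\mathbf{w}\|^2 \leq B, which is what the Rademacher bound for H_B relies on.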

Notice that the above procedure is improper learning: to learn H_{\varphi_{0-1}} we actually work with the larger concept class H_B, and the returned classifier h\in H_B is close (in objective value) to the optimal classifier in H_{\varphi_{0-1}}. The overall time and sample complexity is \mathrm{poly}(\exp(L \log(\frac{L}{\epsilon}))). This bound is exponential in L, but [SSSS10] also showed that a polynomial dependence on L is impossible unless some NP-hard problem is in P.

The Experiment

See the follow-up post.

