Learning Algorithms for DT, DNF, AC0 and r-juntas

This is a section of my reading notes on Prof. Ryan O’Donnell’s “Analysis of Boolean Functions”. I’ve summarized the learning algorithms in this post, which also serves as a personal reference.

The results are summarized below; for each class we list the running time from random examples and from membership queries.

  • poly-size DTs: {n^{O(\log(n/\varepsilon))}} from random examples (Theorem 5); {\mathrm{poly}(n/\varepsilon)} from membership queries (Theorem 6).
  • poly-size DNFs: {n^{O(\log(n/\varepsilon))}} from random examples (Theorem 7); {\mathrm{poly}(n/\varepsilon)} from membership queries ({n^{O(\log\log n)}} in Theorem 11).
  • poly-size depth-{d} circuits: {n^{O(\log^{d-1}(n/\varepsilon))}} from random examples (Theorem 13); the same from membership queries.
  • {\log(n)}-juntas: {n^{0.704\log(n)+O(1)}} from random examples (Theorem 14); {\mathrm{poly}(n)} from membership queries (implied by Theorem 6).

1.1. Tools

To learn a function {f} that is {\varepsilon}-close to some {\chi_S}, where {\varepsilon<\frac{1}{4}-c}, requires only {O(n\log \frac{n}{\delta})} queries. Proof: Use local decoding on each input {e_i}. \Box
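To make the local-decoding step concrete, here is a minimal Python sketch (my own illustration, not from the notes); the name `query_f`, the fixed trial count, and the lack of failure-probability bookkeeping are simplifying assumptions.

```python
import random

def decode_character(query_f, n, trials=200):
    """If f is (1/4 - c)-close to some character chi_S, then for a uniformly
    random x the product f(x) * f(x with coordinate i flipped) equals
    chi_S(e_i), i.e. -1 if i is in S and +1 otherwise, except with
    probability at most 2*(1/4 - c) < 1/2.  A majority vote over
    O(log(n/delta)) trials therefore recovers membership of each i."""
    S = set()
    for i in range(n):                 # query_f: hypothetical oracle returning f(x) in {-1,1}
        votes = 0
        for _ in range(trials):
            x = [random.choice((-1, 1)) for _ in range(n)]
            x_flip = list(x)
            x_flip[i] = -x_flip[i]
            votes += query_f(x) * query_f(x_flip)
        if votes < 0:                  # majority of products were -1, so i belongs to S
            S.add(i)
    return S
```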

Theorem 1 (Goldreich-Levin) Given query access to {f:\{-1,1\}^n \rightarrow [-1,1]}, there is a {\mathrm{poly}(n,\frac{1}{\gamma})\cdot\log\frac{1}{\delta}}-time algorithm that, with probability at least {1-\delta}, outputs a list {L=\{S_1,\dots,S_m\}} such that

  1. if {|\hat{f}(S)|\geq \gamma} then {S \in L};
  2. if {S\in L} then {|\hat{f}(S)|\geq\frac{\gamma}{2}}.

Proof: The proof starts with a sampling-based method for estimating {W(\mathcal{S}):=\sum_{U\in \mathcal{S}}\hat{f}(U)^2} for buckets of the form {\mathcal{S}=(a_1,\dots,a_k,\star,\dots,\star)}, i.e. all sets {U} with {U\cap[k]=S}, where {S=\{i\leq k : a_i=1\}}. Writing {I=[k]}, the key identity is

\displaystyle W(\mathcal{S}) = \mathbb{E}_x\big[\hat{f}_{x\rightarrow \bar{I}}(S)^2\big] \enspace,

where {f_{x\rightarrow \bar{I}}} denotes {f} with the coordinates outside {I} fixed to {x}; the right-hand side can be estimated by sampling.

Next, run the split-into-halves algorithm: start with {(\star,\dots,\star)} and estimate its weight; if the weight is at least {\gamma^2/2}, keep the bucket and split it further, otherwise discard it. Eventually only fully specified sets {S_1,\dots,S_m} remain, and these form the list. \Box
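A rough Python sketch of this bucket-splitting search (my own illustration; `query_f`, the Monte-Carlo weight estimator, and the fixed sample count are simplifying assumptions, and no confidence bookkeeping is done):

```python
import random

def estimate_bucket_weight(query_f, n, k, S, samples=2000):
    """Estimate W(S) = sum of hat{f}(U)^2 over sets U agreeing with the bucket
    (a_1,...,a_k,*,...,*), i.e. U with U cap [k] = S.  Uses the identity
    W = E_{x,y,z}[ f(y,x) chi_S(y) * f(z,x) chi_S(z) ], where x fixes the
    last n-k coordinates and y, z are independent settings of the first k."""
    total = 0.0
    for _ in range(samples):
        x = [random.choice((-1, 1)) for _ in range(n - k)]
        y = [random.choice((-1, 1)) for _ in range(k)]
        z = [random.choice((-1, 1)) for _ in range(k)]
        chi_y = chi_z = 1
        for i in S:
            chi_y *= y[i]
            chi_z *= z[i]
        total += query_f(y + x) * chi_y * query_f(z + x) * chi_z
    return total / samples

def goldreich_levin(query_f, n, gamma, samples=2000):
    """Split-into-halves search: keep a bucket only while its estimated
    Fourier weight is at least gamma^2 / 2; buckets that survive all the
    way to k = n are the candidate large coefficients."""
    out, stack = [], [(0, frozenset())]            # (number of fixed coordinates, S)
    while stack:
        k, S = stack.pop()
        if estimate_bucket_weight(query_f, n, k, S, samples) < gamma ** 2 / 2:
            continue
        if k == n:
            out.append(S)
        else:
            stack.append((k + 1, S))                   # coordinate k not in U
            stack.append((k + 1, S | frozenset([k])))  # coordinate k in U
    return out
```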

Lemma 2 (Learning via spectral concentration) If a collection {\mathcal{S}} on which the function {f} is {\varepsilon}-concentrated is known, then one can output some {h} that is {\varepsilon}-close to {f} in time {\mathrm{poly}(|\mathcal{S}|,1/\varepsilon,n)\log(1/\delta)} using random samples only.

If such a collection {\mathcal{S}} is not known explicitly but a bound {|\mathcal{S}|\leq M} on its size is, then one can learn {f} in time {\mathrm{poly}(M,1/\varepsilon,n)} using membership queries.

Proof: The former is by estimating each {\hat{f}(S)}, {S\in\mathcal{S}}, to within {\pm \sqrt{\varepsilon}/(4\sqrt{|\mathcal{S}|})} in time {\mathrm{poly}(|\mathcal{S}|,1/\varepsilon,n)}, and the latter is by using the Goldreich-Levin Theorem 1 to obtain a good collection {\mathcal{S}'} on which {f} is {2\varepsilon}-concentrated. \Box
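The first part is the usual estimate-and-round procedure; here is a minimal Python sketch of the idea (my own, with hypothetical names and no accuracy or confidence bookkeeping):

```python
def chi(S, x):
    """Character chi_S(x) = prod_{i in S} x_i, for x in {-1,1}^n."""
    val = 1
    for i in S:
        val *= x[i]
    return val

def learn_from_concentration(examples, collection):
    """Given random labeled examples (x, f(x)) and a collection on which f is
    assumed eps-concentrated, estimate each hat{f}(S) empirically and return
    the hypothesis h = sign( sum_S est(S) * chi_S )."""
    est = {S: sum(y * chi(S, x) for x, y in examples) / len(examples)
           for S in collection}

    def h(x):
        val = sum(c * chi(S, x) for S, c in est.items())
        return 1 if val >= 0 else -1

    return h
```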

Lemma 3 (Learning via influence) The class {\mathcal{C}=\{f:\mathbb{I}(f)\leq d\}} is learnable in time {n^{O(d/\varepsilon)}} using random samples.

Proof: Since {\mathbb{I}(f)=\sum_S |S|\hat{f}(S)^2}, if {\mathbb{I}(f)\leq d} then {\sum_{|S|>d/\varepsilon} \hat{f}(S)^2 \leq \varepsilon}, i.e. {f} is {\varepsilon}-concentrated on the {n^{O(d/\varepsilon)}} sets of size at most {d/\varepsilon}, and Lemma 2 applies. \Box
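Spelled out, the concentration step is a Markov-type bound on the spectral weights:

\displaystyle \sum_{|S|>d/\varepsilon} \hat{f}(S)^2 \leq \sum_{S}\frac{|S|}{d/\varepsilon}\,\hat{f}(S)^2 = \frac{\varepsilon}{d}\,\mathbb{I}(f) \leq \varepsilon \enspace.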

Lemma 4 (Learning via {L_1} norm) Any class of functions {\mathcal{C}=\{f : \|f\|_2^2\leq 1\text{ and }\|\hat{f}\|_1 \leq s\}} is learnable in time {\mathrm{poly}(n,s,1/\varepsilon)} using membership queries.

Proof: {f} is {\varepsilon}-concentrated on {\mathcal{S}=\{S \subseteq [n] : |\hat{f}(S)|\geq \frac{\varepsilon}{\|\hat{f}\|_1}\}}. Notice that {|\mathcal{S}| \leq \Big(\frac{\|\hat{f}\|_1}{\varepsilon}\Big)^2}, and then one can use Lemma 2. \Box
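Explicitly, the concentration holds because

\displaystyle \sum_{S\notin\mathcal{S}}\hat{f}(S)^2 \leq \max_{S\notin\mathcal{S}}|\hat{f}(S)|\cdot\sum_{S}|\hat{f}(S)| \leq \frac{\varepsilon}{\|\hat{f}\|_1}\cdot\|\hat{f}\|_1 = \varepsilon \enspace.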

1.2. Learning DT

Theorem 5 Decision trees of size {s} are learnable from random examples in time {n^{O(\log(s/\varepsilon)/\varepsilon)}}, which is {n^{O(\log n)}} for {s=\mathrm{poly}(n)} and constant {\varepsilon}.

Proof: Such a tree is {\varepsilon}-close to a decision tree {g} of depth at most {\log(s/\varepsilon)} (truncate the paths that are reached with probability less than {\varepsilon/s}), and then {\mathbb{I}(g)\leq \log(s/\varepsilon)} since depth bounds total influence; now use Lemma 3. \Box

Theorem 6 Decision trees of size {s} are learnable from membership queries in time {\mathrm{poly}(n,s,1/\varepsilon)}. This also applies to parity decision trees (trees whose nodes query parities of the input bits).

Proof: Let {P} be a root-to-leaf path in the decision tree, and {\mathbf{1}_P(x)} its {0/1} indicator function. The Fourier expansion of {\mathbf{1}_P} is {\sum_{S\subseteq V} \pm 2^{-d} \chi_S}, where {V} is the set of variables queried on {P} and {d=|V|}.
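For example, for the depth-{2} path {P} that follows {x_1=1} and then {x_2=-1},

\displaystyle \mathbf{1}_P(x) = \frac{1+x_1}{2}\cdot\frac{1-x_2}{2} = \frac{1}{4}\big(1 + x_1 - x_2 - x_1x_2\big) \enspace,

so {\|\hat{\mathbf{1}}_P\|_1 = 1}; every root-to-leaf path contributes exactly {1}.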

This means that {\|\hat{f}\|_1 = \big\| \sum_P \pm\hat{\mathbf{1}}_P \big\|_1 \leq \sum_P \| \hat{\mathbf{1}}_P \|_1 = s} (summing over the {s} root-to-leaf paths, with signs given by the leaf labels), and then apply Lemma 4. \Box

1.3. Learning DNF

Theorem 7 DNFs of size {s} are learnable from random examples in time {n^{O(\log(s/\varepsilon)/\varepsilon)}}, which is {n^{O(\log n)}} for {s=\mathrm{poly}(n)} and constant {\varepsilon}.

Proof: A DNF of size {s} is {\varepsilon}-close to one of width {w=\log(s/\varepsilon)} (drop the wider terms, each of which is satisfied with probability at most {\varepsilon/s}), and a width-{w} DNF has total influence {\mathbb{I}(f)\leq 2w}, so one can directly apply Lemma 3. \Box

We can actually do better via membership queries. We first need two structural lemmas, Lemma 9 and Lemma 10; the overall idea is similar to the {L_1}-norm method, because Lemma 10 is essentially an {L_1} bound.

Lemma 8 (Håstad’s Switching Lemma) Let {f:\{-1,1\}^n\rightarrow \{-1,1\}} be computable by a DNF of width {w}. If we apply a random restriction with {\star}-probability {\rho} to {f}, then

\displaystyle \Pr_{(I,X)}[ \mathrm{DT-depth}(f_{X\rightarrow \bar{I}})>d ] \leq (5\rho w)^d \enspace.

Lemma 9 DNFs of width {w} are {\varepsilon}-concentrated on degree up to {O(w\log(\frac{1}{\varepsilon}))}.

Proof: Using Håstad’s switching lemma with {\rho=\frac{1}{10w}}, we have {\mathbb{E}_{(I,X)}\big[\sum_{S\subseteq I,|S|>d} \hat{f}_{X\rightarrow \bar{I}}(S)^2\big] \leq 2^{-d}}, since a function of decision-tree depth at most {d} carries no Fourier weight above degree {d}. After some further calculation, one can show that for all {d\geq 5}:

\displaystyle \sum_{|U|\geq 20dw}\hat{f}(U)^2 \leq 2^{-d+1} \enspace.

Finally, taking {d=\log(1/\varepsilon)} gives the claim. \Box

Lemma 10 If {f} is a width-{w} DNF, then

\displaystyle \sum_{|U|\leq O(w\log{\frac{1}{\varepsilon}})} |\hat{f}(U)| \leq w^{O(w\log \frac{1}{\varepsilon})} \enspace.

Proof: The proof is via a more general inequality, {\sum_U \Big(\frac{1}{20w}\Big)^{|U|} |\hat{f}(U)| \leq 2}, which can also be proved using Håstad’s switching lemma. \Box
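Given that inequality, the lemma follows: for any degree bound {k=O(w\log\frac{1}{\varepsilon})},

\displaystyle \sum_{|U|\leq k} |\hat{f}(U)| \leq (20w)^{k}\sum_{U}\Big(\frac{1}{20w}\Big)^{|U|}|\hat{f}(U)| \leq 2\,(20w)^{k} = w^{O(w\log \frac{1}{\varepsilon})} \enspace.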

Theorem 11 DNFs of size {\mathrm{poly}(n)} are learnable from membership queries in time {\Big(\frac{n}{\varepsilon}\Big)^{O(\log\log(n/\varepsilon)\cdot \log\frac{1}{\varepsilon})}}, which is {n^{O(\log\log n)}} for constant {\varepsilon}.

Proof: First reduce the DNF to width {w=O(\log\frac{n}{\varepsilon})} (as in Theorem 7); then it is {\varepsilon}-concentrated on some collection of size {w^{O(w\log\frac{1}{\varepsilon})}}, which can be defined as follows:

\displaystyle \mathcal{S} = \big\{ U : |U|\leq O(w\log \frac{1}{\varepsilon}), |\hat{f}(U)| \geq \frac{\varepsilon}{w^{O(w\log\frac{1}{\varepsilon})}} \big\} \enspace.
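Indeed, within the low-degree range Lemma 10 bounds the weight that {\mathcal{S}} misses:

\displaystyle \sum_{U\notin\mathcal{S},\,|U|\leq O(w\log\frac{1}{\varepsilon})} \hat{f}(U)^2 \leq \frac{\varepsilon}{w^{O(w\log\frac{1}{\varepsilon})}}\cdot\sum_{|U|\leq O(w\log\frac{1}{\varepsilon})}|\hat{f}(U)| \leq \varepsilon \enspace,

while Lemma 9 bounds the weight above degree {O(w\log\frac{1}{\varepsilon})} by {\varepsilon}, so {f} is {O(\varepsilon)}-concentrated on {\mathcal{S}}.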

Given this concentration, the rest of the proof is essentially a copy of Lemma 4, using the two structural lemmas above. \Box

1.4. Learning {\mathrm{AC}^0}

The learning result for {\mathrm{AC}^0} needs the following lemma:

Lemma 12 (LMN) If {f} is computable by a size {\leq s}, depth {\leq D} and bottom fan-in {\leq w} circuit, then

\displaystyle \sum_{|U|\geq(10w)^D} \hat{f}(U)^2 \leq O(s\cdot w^{-w}) \enspace.

The proof is not fully given in the lecture notes. The first step is to choose {\star}-probability {(\frac{1}{10w})^{D-1}}, and then apply Håstad’s switching lemma {D-1} times, each time to the bottom two layers of the circuit (which form a DNF or CNF).

Theorem 13 {\mathrm{AC}^0} circuits of size {s} and depth {D} are learnable from random examples in time {n^{O(\log n)^D}}.

Proof: Using the LMN lemma, one can show that the influence {\mathbb{I}(f)\leq [O(\log s)]^D} (and this could actually be improved to {[O(\log s)]^{D-1}}). This is because:

\displaystyle \mathbb{I}(f)=\sum_U |U|\hat{f}(U)^2 = \sum_{r\geq 0}\,\sum_{|U|>r} \hat{f}(U)^2 \leq O(\log s)^D + \sum_{r\geq O(\log s)^D}\,\sum_{|U|>r} \hat{f}(U)^2 \enspace,

while the last sum can be bounded using the LMN lemma with different choices of {w}, after a truncation that reduces the bottom fan-in to {\log(\frac{s}{\varepsilon})}. \Box

1.5. Learning {r}-juntas

Notice that we only need to identify the {r} relevant variables; once they are found, {O(r2^r)} random examples suffice to observe every setting of the relevant variables and hence recover the entire truth table with high probability.

A simple {\mathrm{poly}(n,2^r)\cdot n^r}-time algorithm using random examples is as follows. Since each nonzero Fourier coefficient {\hat{f}(S)} of an {r}-junta is a multiple of {2^{-r}}, it suffices to estimate {\hat{f}(S)} for all {|S|\leq r} up to accuracy {\frac{2^{-r}}{4}}. Also notice that if {\hat{f}(S)\neq 0}, then all variables in {S} are relevant.
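A minimal Python sketch of this brute-force search (my own illustration; it reuses the `chi` helper from the sketch under Lemma 2, and the sample handling is left naive):

```python
from itertools import combinations

def find_relevant_variables(examples, n, r):
    """Estimate hat{f}(S) for every |S| <= r from random labeled examples.
    For an r-junta every nonzero coefficient is a multiple of 2^{-r}, so an
    estimate accurate to +/- 2^{-r}/4 decides whether it is nonzero; every
    variable in a set with a nonzero coefficient is relevant."""
    relevant = set()
    threshold = 2.0 ** (-r) / 2      # halfway between 0 and the smallest nonzero value
    for d in range(1, r + 1):
        for S in combinations(range(n), d):
            est = sum(y * chi(S, x) for x, y in examples) / len(examples)
            if abs(est) > threshold:
                relevant.update(S)
    return relevant
```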

A more advanced algorithm is as follows:

  • If there is an algorithm running in time {n^\alpha\,\mathrm{poly}(n,2^r)} that can find one relevant variable, then in the same time bound one can find all relevant variables.
    Suppose we have found that coordinate {i} is relevant using {M} examples; then consider the two restrictions {f_{-1\rightarrow i}} and {f_{1\rightarrow i}}. Using another {2M\log(1/\delta)} examples one obtains, with high probability, {M} samples with {x_i=+1} (resp. {-1}). In total we need at most {2^r M \log(1/\delta)} samples, so the running time is still {n^\alpha\,\mathrm{poly}(n,2^r)}.
  • If {f} has a non-zero degree-{\leq d} term in its Fourier expansion, we can estimate all coefficients of degree at most {d} up to accuracy {\frac{2^{-r}}{4}} and detect it in time {n^d\,\mathrm{poly}(n,2^r)}.
  • If {f} has degree {e} over {\mathbb{F}_2}, then it can be learned in time {n^{we}\,\mathrm{poly}(n)}, where {w\approx 2.376} is the matrix multiplication exponent. This is because {f} must be a multilinear polynomial over {\mathbb{F}_2}, and all {n^e} low-degree coefficients can be learned by solving a linear system of equations (see the sketch after this list).
  • If {f} satisfies {\hat{f}(S)=0} for all {0<|S|< d}, then {f} has degree at most {r-d} over {\mathbb{F}_2}. This follows by adding to {f} the parity over {[r]}, which turns every degree-{i} Fourier term into a degree-{(r-i)} term, while the parity itself has degree {1} over {\mathbb{F}_2}.
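To make the {\mathbb{F}_2} equation-solving step in the third bullet concrete, here is a naive Python sketch (my own; it uses plain Gaussian elimination rather than fast matrix multiplication, takes inputs in {\{0,1\}}, and assumes the linear system is consistent):

```python
import itertools
import numpy as np

def solve_gf2(A, b):
    """Gaussian elimination over GF(2); returns one solution of A x = b."""
    A, b = A.copy() % 2, b.copy() % 2
    m, ncols = A.shape
    row, pivots = 0, []
    for col in range(ncols):
        piv = next((r for r in range(row, m) if A[r, col]), None)
        if piv is None:
            continue
        A[[row, piv]], b[[row, piv]] = A[[piv, row]], b[[piv, row]]   # swap rows
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    x = np.zeros(ncols, dtype=np.uint8)
    for r, col in enumerate(pivots):
        x[col] = b[r]
    return x

def learn_low_f2_degree(examples, n, e):
    """Learn a multilinear polynomial of degree <= e over GF(2): each labeled
    example (x, f(x)) with x in {0,1}^n gives one linear equation in the
    unknown monomial coefficients c_S, |S| <= e."""
    monomials = [S for d in range(e + 1) for S in itertools.combinations(range(n), d)]
    A = np.zeros((len(examples), len(monomials)), dtype=np.uint8)
    b = np.array([y for _, y in examples], dtype=np.uint8)
    for i, (x, _) in enumerate(examples):
        for j, S in enumerate(monomials):
            A[i, j] = all(x[k] for k in S)   # value of the monomial prod_{k in S} x_k
    c = solve_gf2(A, b)
    return {S: 1 for S, cS in zip(monomials, c) if cS}
```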

Choosing {d=\frac{wr}{w+1}} (which balances the {n^d} and {n^{w(r-d)}} running times of the two cases) gives the following theorem.

Theorem 14 The class of {r}-junta over {n} bits can be learned from random examples in time {n^{wr/(w+1)}\mathrm{poly}(n,2^r)}.
