## Learning Algorithms for DT, DNF, AC0 and r-juntas

This is a section of my reading notes on Prof. Ryan O'Donnell's *Analysis of Boolean Functions*. I've summarized the learning algorithms in this post, which also serves as a personal reference.

| Class | Random Examples | Membership Queries |
| --- | --- | --- |
| poly-size DTs | ${n^{O(\log(n/\varepsilon))}}$ (Theorem 5) | ${\mathrm{poly}(n/\varepsilon)}$ (Theorem 6) |
| poly-size DNFs | ${n^{O(\log(n/\varepsilon))}}$ (Theorem 7) | ${\mathrm{poly}(n/\varepsilon)}$ (${n^{O(\log\log n)}}$ in Theorem 11) |
| poly-size depth-${d}$ circuits | ${n^{O(\log^{d-1}(n/\varepsilon))}}$ (Theorem 13) | same |
| ${\log(n)}$-juntas | ${n^{0.704\log(n)+O(1)}}$ (Theorem 14) | ${\mathrm{poly}(n)}$ (implied by Theorem 6) |

## 1.1. Tools

Learning a function ${f}$ that is ${\varepsilon}$-close to some ${\chi_S}$, for ${\varepsilon<\frac{1}{4}-c}$, requires only ${O(n\log \frac{n}{\delta})}$ queries. Proof: Use local decoding on each coordinate: for every ${i}$, the product ${\chi_S(x)\chi_S(x\oplus e_i)}$ equals ${-1}$ exactly when ${i\in S}$, so a majority vote of ${f(x)f(x\oplus e_i)}$ over random ${x}$ recovers the bit "${i\in S}$", since ${f}$ disagrees with ${\chi_S}$ on either point with probability at most ${2\varepsilon<\frac{1}{2}-2c}$. $\Box$
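A minimal sketch of this decoder in Python, assuming query access to ${f:\{-1,1\}^n\rightarrow\{-1,1\}}$ as a callable on ${\pm 1}$ lists; the trials parameter stands in for the ${O(\log\frac{n}{\delta})}$ repetitions per coordinate:

```python
import random

def decode_parity(f, n, trials=100):
    """Recover S from query access to f, assuming f is (1/4 - c)-close to
    the parity chi_S.  Since chi_S(x) * chi_S(x ^ e_i) = -1 iff i in S,
    a majority vote over random x locally decodes each coordinate."""
    S = set()
    for i in range(n):
        votes = 0
        for _ in range(trials):
            x = [random.choice([-1, 1]) for _ in range(n)]
            x_flipped = x[:]
            x_flipped[i] = -x[i]
            # correct in expectation: f errs on either point w.p. < 1/2
            votes += f(x) * f(x_flipped)
        if votes < 0:
            S.add(i)
    return S

# e.g. decode_parity(lambda x: x[0] * x[2], n=5) returns {0, 2}
```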

Theorem 1 (Goldreich-Levin) Given query access to ${f:\{-1,1\}^n \rightarrow [-1,1]}$, there is a ${\mathrm{poly}(n,\frac{1}{\gamma},\log\frac{1}{\delta})}$-time algorithm that, with probability at least ${1-\delta}$, outputs a list ${L=\{S_1,\dots,S_m\}}$ such that

1. if ${|\hat{f}(S)|\geq \gamma}$ then ${S \in L}$;
2. if ${S\in L}$ then ${|\hat{f}(S)|\geq\frac{\gamma}{2}}$.

Proof: The proof starts with a sampling-based estimator for ${W(\mathcal{S}):=\sum_{U\in \mathcal{S}}\hat{f}(U)^2}$, where ${\mathcal{S}=(a_1,\dots,a_k,\star,\dots,\star)}$ denotes the bucket of all sets ${U}$ whose membership pattern on the first ${k}$ coordinates is ${(a_1,\dots,a_k)}$ and is undetermined elsewhere. Writing ${S=\{i\leq k : a_i=1\}}$ and letting ${\bar{I}=\{k+1,\dots,n\}}$ be the ${\star}$-positions, we have

$\displaystyle W(\mathcal{S}) = \mathbb{E}_x\big[\hat{f}_{x\rightarrow \bar{I}}(S)^2\big] \enspace,$

and since ${\hat{f}_{x\rightarrow \bar{I}}(S)=\mathbb{E}_y[f(y,x)\chi_S(y)]}$, this expectation can be estimated by sampling with query access.

Next, run the split-in-half algorithm: start with the single bucket ${(\star,\dots,\star)}$ and estimate its weight. If the estimate exceeds ${\gamma^2/2}$, keep the bucket and split it further (fixing the next ${\star}$ position both ways); otherwise discard it. Since disjoint buckets have total weight at most ${\|f\|_2^2\leq 1}$, only ${O(1/\gamma^2)}$ buckets survive at each level, and after ${n}$ levels only the singleton buckets ${S_1,\dots,S_m}$ remain. $\Box$
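A runnable sketch of this bucket-splitting procedure, under the same assumptions as above (query access to ${f}$ as a Python callable on ${\pm 1}$ lists); the fixed sample count stands in for the ${\mathrm{poly}}$ bound from the analysis, and failure-probability bookkeeping is omitted:

```python
import random

def estimate_bucket_weight(f, n, k, S, samples=2000):
    # W(bucket) = E_x[ fhat_{x -> tail}(S)^2 ], where x fixes the last n - k
    # coordinates; two independent draws y1, y2 of the first k coordinates
    # give an unbiased estimator of the squared Fourier coefficient.
    total = 0.0
    for _ in range(samples):
        x = [random.choice([-1, 1]) for _ in range(n - k)]
        y1 = [random.choice([-1, 1]) for _ in range(k)]
        y2 = [random.choice([-1, 1]) for _ in range(k)]
        chi1 = chi2 = 1
        for i in S:
            chi1 *= y1[i]
            chi2 *= y2[i]
        total += f(y1 + x) * chi1 * f(y2 + x) * chi2
    return total / samples

def goldreich_levin(f, n, gamma):
    # A bucket at level k is the family of sets whose membership is decided
    # on the first k coordinates; it is identified with S = {i < k : a_i = 1}.
    live = [frozenset()]  # level 0: the single bucket (star, ..., star)
    for k in range(n):
        nxt = []
        for S in live:
            for T in (S, S | {k}):  # split on coordinate k
                if estimate_bucket_weight(f, n, k + 1, T) >= gamma ** 2 / 2:
                    nxt.append(T)
        live = nxt
    return live  # the surviving singleton buckets S_1, ..., S_m
```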

Lemma 2 (Learning via spectral concentration) Given a collection ${\mathcal{S}}$ on which the function ${f}$ is ${\varepsilon}$-concentrated, one can output some ${h}$ that is ${\varepsilon}$-close to ${f}$ in time ${\mathrm{poly}(|\mathcal{S}|,1/\varepsilon,n)\cdot\log(1/\delta)}$ using random examples only.

If ${\mathcal{S}}$ is not known explicitly but a bound ${|\mathcal{S}|=M}$ is, then one can still learn ${f}$ in time ${\mathrm{poly}(M,1/\varepsilon,n)}$ using membership queries.

Proof: The former is by estimating each ${\hat{f}(S)}$, ${S\in\mathcal{S}}$, to within ${\pm \sqrt{\varepsilon}/(4\sqrt{|\mathcal{S}|})}$ in time ${\mathrm{poly}(|\mathcal{S}|,1/\varepsilon,n)}$ and outputting the sign of the resulting approximating polynomial; the latter is by running Goldreich-Levin (Theorem 1) with ${\gamma=\sqrt{\varepsilon/M}}$ to obtain a collection ${\mathcal{S}'}$ on which ${f}$ is ${2\varepsilon}$-concentrated. $\Box$
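A sketch of the random-examples part in Python, assuming examples come as (x, f(x)) pairs with ${\pm 1}$ entries and the collection is a list of index tuples (fixed sample size, no confidence bookkeeping):

```python
from math import prod

def learn_from_concentration(examples, collection):
    # Empirically estimate fhat(S) = E[f(x) chi_S(x)] for each S in the
    # collection, then output the sign of the low-weight approximation.
    coeffs = {}
    for S in collection:
        est = sum(y * prod(x[i] for i in S) for x, y in examples)
        coeffs[S] = est / len(examples)

    def h(x):
        val = sum(c * prod(x[i] for i in S) for S, c in coeffs.items())
        return 1 if val >= 0 else -1

    return h
```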

Lemma 3 (Learning via influence) The class ${\mathcal{C}=\{f:\mathbb{I}(f)\leq d\}}$ is learnable in time ${n^{O(d/\varepsilon)}}$ using random samples.

Proof: Since ${\mathbb{I}(f)=\sum_S |S|\hat{f}(S)^2}$, Markov's inequality gives ${\sum_{|S|>d/\varepsilon} \hat{f}(S)^2 \leq \frac{\varepsilon}{d}\,\mathbb{I}(f) \leq \varepsilon}$, i.e., ${f}$ is ${\varepsilon}$-concentrated on degree up to ${d/\varepsilon}$; now apply Lemma 2 with ${|\mathcal{S}|=n^{O(d/\varepsilon)}}$. $\Box$

Lemma 4 (Learning via ${L_1}$ norm) Any class of functions ${\mathcal{C}=\{f : \|f\|_2^2\leq 1\text{ and }\|\hat{f}\|_1 \leq s\}}$ is learnable in time ${\mathrm{poly}(s,1/\varepsilon)}$ using membership queries.

Proof: ${f}$ is actually ${\varepsilon}$-concentrated on ${\mathcal{S}=\{S \subseteq [n] : |\hat{f}(S)|\geq \frac{\varepsilon}{\|\hat{f}\|_1}\}}$. Notice that ${|\mathcal{S}| \leq \Big(\frac{\|\hat{f}\|_1}{\varepsilon}\Big)^2}$, so Lemma 2 applies (with membership queries, as ${\mathcal{S}}$ is not known explicitly). $\Box$
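Spelling out the concentration claim: the Fourier weight outside ${\mathcal{S}}$ is at most

$\displaystyle \sum_{S\notin\mathcal{S}} \hat{f}(S)^2 \;\leq\; \max_{S\notin\mathcal{S}}|\hat{f}(S)| \cdot \sum_{S}|\hat{f}(S)| \;\leq\; \frac{\varepsilon}{\|\hat{f}\|_1}\cdot\|\hat{f}\|_1 \;=\; \varepsilon \enspace.$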

## 1.2. Learning DT

Theorem 5 Decision trees of size ${s}$ are learnable from random examples in time ${n^{O(\log(s/\varepsilon)/\varepsilon)}}$, which is ${n^{O(\log n)}}$ for ${s=\mathrm{poly}(n)}$ and constant ${\varepsilon}$.

Proof: Such a DT is ${\varepsilon}$-close to some DT ${g}$ of depth at most ${\log(s/\varepsilon)}$, whence ${\mathbb{I}(g)\leq \log(s/\varepsilon)}$, and Lemma 3 applies. $\Box$
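In more detail: a fixed path of length ${\ell}$ is followed by a random input with probability ${2^{-\ell}}$, so truncating all of the at most ${s}$ paths at depth ${\log(s/\varepsilon)}$ changes ${f}$ on at most an ${s\cdot 2^{-\log(s/\varepsilon)}=\varepsilon}$ fraction of inputs; and a depth-${k}$ tree has Fourier degree at most ${k}$, so

$\displaystyle \mathbb{I}(g)=\sum_S |S|\hat{g}(S)^2 \leq k\sum_S \hat{g}(S)^2 = k \enspace.$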

Theorem 6 Decision trees of size ${s}$ are learnable from membership queries in time ${\mathrm{poly}(n,s,1/\varepsilon)}$. This also applies to decision trees that branch on parities.

Proof: Let ${P}$ be a path in the decision tree and ${\mathbf{1}_P(x)}$ its indicator function. The Fourier expansion of ${\mathbf{1}_P}$ has the form ${\sum_{S\subseteq V} \pm 2^{-d} \chi_S}$, where ${V}$ is the set of variables queried along ${P}$ and ${d=|V|}$.
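Concretely, if ${b_i\in\{-1,1\}}$ is the value that ${x_i}$ must take along ${P}$, then

$\displaystyle \mathbf{1}_P(x) = \prod_{i\in V}\frac{1+b_i x_i}{2} = \sum_{S\subseteq V}\Big(\prod_{i\in S}b_i\Big)\,2^{-d}\,\chi_S(x) \enspace,$

so ${\|\hat{\mathbf{1}}_P\|_1 = 2^{d}\cdot 2^{-d} = 1}$.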

Writing ${f=\sum_P (\pm 1)\,\mathbf{1}_P}$ as a sum over its at most ${s}$ paths, this means ${\|\hat{f}\|_1 \leq \sum_P \|\hat{\mathbf{1}}_P\|_1 \leq s}$; now adopt Lemma 4. $\Box$

## 1.3. Learning DNF

Theorem 7 DNFs of size ${s}$ are learnable from random examples in time ${n^{O(\log(s/\varepsilon)/\varepsilon)}}$, which is ${n^{O(\log n)}}$ for ${s=\mathrm{poly}(n)}$ and constant ${\varepsilon}$.

Proof: A DNF of size ${s}$ is ${\varepsilon}$-close to one truncated to width ${w=\log(s/\varepsilon)}$, and a width-${w}$ DNF has influence ${\mathbb{I}(f)\leq 2w}$, so one can directly apply Lemma 3. $\Box$
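For the truncation step: a term with more than ${w}$ literals is satisfied by at most a ${2^{-w}}$ fraction of inputs, so deleting all such terms changes the function on at most an

$\displaystyle s\cdot 2^{-w} = s\cdot\frac{\varepsilon}{s} = \varepsilon$

fraction of inputs.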

We can actually do better via membership queries. We first need two structural lemmas, Lemma 9 and Lemma 10; the overall idea is similar to the ${L_1}$-norm method, since Lemma 10 is essentially an ${L_1}$ bound.

Lemma 8 (Håstad's Switching Lemma) Let ${f:\{-1,1\}^n\rightarrow \{-1,1\}}$ be computable by a DNF of width ${w}$. Applying a random restriction with ${\star}$-probability ${\rho}$ (each coordinate stays free independently with probability ${\rho}$, and is otherwise fixed uniformly at random),

$\displaystyle \Pr_{(I,X)}[ \mathrm{DT-depth}(f_{X\rightarrow \bar{I}})>d ] \leq (5\rho w)^d \enspace.$

Lemma 9 DNFs of width ${w}$ are ${\varepsilon}$-concentrated on degree up to ${O(w\log(\frac{1}{\varepsilon}))}$.

Proof: Using Håstad's switching lemma with ${\rho=\frac{1}{10w}}$, we have ${\mathbb{E}_{(I,X)}\big[\sum_{S\subseteq I,|S|>d} \hat{f}_{X\rightarrow \bar{I}}(S)^2\big] \leq 2^{-d}}$, since the inner sum vanishes unless ${\mathrm{DT-depth}(f_{X\rightarrow\bar{I}})>d}$ and is always at most ${1}$. After some further calculations, one can show that for all ${d\geq 5}$:

$\displaystyle \sum_{|U|\geq 20dw}\hat{f}(U)^2 \leq 2^{-d+1} \enspace.$

Finally, taking ${d=\log(1/\varepsilon)}$ gives the claim. $\Box$

Lemma 10 If ${f}$ is a width-${w}$ DNF, then

$\displaystyle \sum_{|U|\leq O(w\log{\frac{1}{\varepsilon}})} |\hat{f}(U)| \leq w^{O(w\log \frac{1}{\varepsilon})} \enspace.$

Proof: This follows from the more general inequality ${\sum_U \big(\frac{1}{20w}\big)^{|U|} |\hat{f}(U)| \leq 2}$, which can also be proved using Håstad's switching lemma. $\Box$
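Indeed, restricting the sum to ${|U|\leq m}$ for ${m=O(w\log\frac{1}{\varepsilon})}$:

$\displaystyle \sum_{|U|\leq m} |\hat{f}(U)| \leq (20w)^{m}\sum_{U}\Big(\frac{1}{20w}\Big)^{|U|}|\hat{f}(U)| \leq 2\,(20w)^{m} = w^{O(w\log\frac{1}{\varepsilon})} \enspace.$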

Theorem 11 DNFs of size ${\mathrm{poly}(n)}$ are learnable from membership queries in time ${\Big(\frac{n}{\varepsilon}\Big)^{O(\log\log(n/\varepsilon)\cdot \log\frac{1}{\varepsilon})}=n^{O(\log\log n)}}$.

Proof: First reduce the DNF to width ${w=O(\log\frac{n}{\varepsilon})}$ as in Theorem 7; the resulting function is ${\varepsilon}$-concentrated on a collection of size ${w^{O(w\log\frac{1}{\varepsilon})}}$, which can be defined as follows:

$\displaystyle \mathcal{S} = \big\{ U : |U|\leq O(w\log \frac{1}{\varepsilon}), |\hat{f}(U)| \geq \frac{\varepsilon}{w^{O(w\log\frac{1}{\varepsilon})}} \big\} \enspace.$

The rest is essentially a copy of the proof of Lemma 4, with Lemma 9 bounding the degree and Lemma 10 playing the role of the ${L_1}$ bound. $\Box$

## 1.4. Learning ${\mathrm{AC}^0}$

The proof needs the following lemma:

Lemma 12 (LMN) If ${f}$ is computable by a size ${\leq s}$, depth ${\leq D}$ and bottom fan-in ${\leq w}$ circuit, then

$\displaystyle \sum_{|U|\geq(10w)^D} \hat{f}(U)^2 \leq O(s\cdot w^{-w}) \enspace.$

The proof is not given in full in the lecture notes. The first step is to choose ${\star}$-probability ${(\frac{1}{10w})^{D-1}}$ and to apply Håstad's switching lemma ${D-1}$ times, each time to the bottom two layers of the circuit (which form a DNF or CNF formula).

Theorem 13 ${\mathrm{AC}^0}$ circuits of size ${s=\mathrm{poly}(n)}$ and depth ${D}$ are learnable from random examples in time ${n^{O(\log n)^D}}$.

Proof: Using the LMN lemma, one can show that the influence ${\mathbb{I}(f)\leq [O(\log s)]^D}$ (and this could actually be improved to ${[O(\log s)]^{D-1}}$). This is because:

$\displaystyle \mathbb{I}(f)=\sum_U |U|\hat{f}(U)^2 = \sum_{r=0}^{n-1}\sum_{|U|>r} \hat{f}(U)^2 \leq O(\log s)^D + \sum_{r=O(\log s)^D}^{n-1} \sum_{|U|>r} \hat{f}(U)^2 \enspace,$

while the last sum can be bounded by applying LMN with different choices of ${w}$, after a truncation reducing the bottom fan-in to ${\log(\frac{s}{\varepsilon})}$. $\Box$

## 1.5. Learning ${r}$-juntas

Notice that we only need to identify the ${r}$ relevant variables: once they are known, ${O(r2^r)}$ further random examples suffice to see every assignment of the relevant variables with high probability, and hence to learn the entire truth table.

A simple ${\mathrm{poly}(n,2^r)\cdot n^r}$-time algorithm using random examples is as follows. Since each nonzero Fourier coefficient ${\hat{f}(S)}$ of an ${r}$-junta is a multiple of ${2^{-r}}$, it suffices to estimate ${\hat{f}(S)}$ for all ${|S|\leq r}$ up to accuracy ${\frac{2^{-r}}{4}}$. Notice also that if ${\hat{f}(S)\neq 0}$, then all variables in ${S}$ are relevant.
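A minimal sketch of this in Python, assuming examples come as (x, f(x)) pairs with ${\pm 1}$ entries and that enough examples are supplied for the ${\frac{2^{-r}}{4}}$ estimation accuracy to hold:

```python
import itertools
from math import prod

def find_relevant_vars(examples, n, r):
    # Estimate fhat(S) for every |S| <= r; since nonzero coefficients of an
    # r-junta are multiples of 2^{-r}, an estimate above 2^{-r}/2 certifies
    # that every variable in S is relevant.
    relevant = set()
    for size in range(1, r + 1):
        for S in itertools.combinations(range(n), size):
            est = sum(y * prod(x[i] for i in S) for x, y in examples)
            if abs(est / len(examples)) > 2 ** (-r) / 2:
                relevant |= set(S)
    return relevant
```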

A more advanced algorithm is as follows:

• If there is an algorithm running in time ${n^\alpha\, \mathrm{poly}(n,2^r)}$ that can find one relevant variable, then within the same time bound one can find all relevant variables.
Suppose we have found that coordinate ${i}$ is relevant using ${M}$ examples; then consider the two restrictions ${f_{-1\rightarrow i}}$ and ${f_{1\rightarrow i}}$. Using another ${2M\log(1/\delta)}$ examples one gets ${M}$ samples with ${x_i=+1}$ (resp. ${-1}$), and one recurses on each restriction. After ${k}$ variables have been found this costs ${2^k M \log(1/\delta)\leq 2^r M\log(1/\delta)}$ samples, so the total time is still ${n^\alpha\, \mathrm{poly}(n,2^r)}$.
• If ${f}$ has a non-zero term of degree ${\leq d}$ in its Fourier expansion, we can estimate all coefficients of degree up to ${d}$ to accuracy ${\frac{2^{-r}}{4}}$ and detect one in time ${n^d\,\mathrm{poly}(n,2^r)}$.
• If ${f}$ has degree ${e}$ over ${\mathbb{F}_2}$, then one can learn it in time ${n^{\omega e}\,\mathrm{poly}(n)}$, where ${\omega\approx 2.376}$ is the matrix multiplication exponent. This is because ${f}$ is a multilinear polynomial over ${\mathbb{F}_2}$ with at most ${n^e}$ coefficients, which can all be recovered via equation solving (see the sketch after this list).
• If ${f}$ satisfies ${\hat{f}(S)=0}$ for all ${0<|S|< d}$, then ${f}$ has degree at most ${r-d}$ over ${\mathbb{F}_2}$. This follows by multiplying ${f}$ by the parity ${\chi_{[r]}}$, which turns every degree-${i}$ Fourier term into a degree-${(r-i)}$ term, while over ${\mathbb{F}_2}$ this is just adding the parity ${x_1\oplus\cdots\oplus x_r}$, which has degree ${1}$.
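A minimal sketch of the ${\mathbb{F}_2}$ equation-solving step in Python, assuming examples come as (x, f(x)) pairs with 0/1 entries; plain Gaussian elimination is used here, so this sketch runs in cubic rather than fast-matrix-multiplication time:

```python
import itertools

def learn_f2_polynomial(examples, n, e):
    # Unknowns: one F2 coefficient per monomial of degree <= e.  Each example
    # (x, f(x)) contributes one linear equation over F2.
    monos = [S for size in range(e + 1)
             for S in itertools.combinations(range(n), size)]
    rows = [[int(all(x[i] for i in S)) for S in monos] + [y]
            for x, y in examples]
    m = len(monos)
    pivot_row, pivots = 0, []
    for col in range(m):
        # find a pivot for this column, if any
        for r in range(pivot_row, len(rows)):
            if rows[r][col]:
                rows[pivot_row], rows[r] = rows[r], rows[pivot_row]
                break
        else:
            continue  # free variable: its coefficient defaults to 0
        # eliminate the column from every other row (mod 2)
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[pivot_row])]
        pivots.append(col)
        pivot_row += 1
    return {monos[c]: rows[i][m] for i, c in enumerate(pivots)}
```

With enough random examples the solution is unique with high probability, since a non-zero degree-${e}$ polynomial over ${\mathbb{F}_2}$ is non-zero on at least a ${2^{-e}}$ fraction of inputs.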

Choosing ${d=\frac{\omega r}{\omega+1}}$ and combining the cases above gives the following theorem.
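The choice of ${d}$ balances the two branches: either ${f}$ has a non-zero Fourier coefficient of degree less than ${d}$, in which case a relevant variable is found in time ${n^{d}\,\mathrm{poly}(n,2^r)}$, or ${f}$ has ${\mathbb{F}_2}$-degree at most ${r-d}$ and is learned in time ${n^{\omega(r-d)}\,\mathrm{poly}(n)}$. Equating the exponents:

$\displaystyle d = \omega(r-d) \quad\Longleftrightarrow\quad d = \frac{\omega r}{\omega+1} \enspace.$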

Theorem 14 The class of ${r}$-juntas over ${n}$ bits can be learned from random examples in time ${n^{\omega r/(\omega+1)}\,\mathrm{poly}(n,2^r)}$.