This is a section of my reading notes on Prof. Ryan O'Donnell's *Analysis of Boolean Functions*; I've summarized the learning algorithms in this post, which also serves as a personal reference.
To learn a parity $\chi_S$ given a function $f$ that is $(1/4 - \delta)$-close to it requires only $O(n \log n)$ queries. Proof: Use local decoding on each input $e_i$: for a random $y$, both $f(y) = \chi_S(y)$ and $f(y \oplus e_i) = \chi_S(y \oplus e_i)$ hold w.p. at least $1/2 + 2\delta$, in which case $f(y) f(y \oplus e_i) = \chi_S(e_i)$ reveals whether $i \in S$; a majority vote over $O(\log n)$ random $y$'s decides each coordinate w.h.p.
Theorem 1 (Goldreich–Levin) Given query access to $f : \{-1,1\}^n \to \{-1,1\}$ and $\tau > 0$, there is a $\mathrm{poly}(n, 1/\tau)$-time algorithm that outputs a list $L = \{U_1, \dots, U_\ell\}$ such that
- if $|\hat{f}(U)| \geq \tau$ then $U \in L$;
- if $U \in L$ then $|\hat{f}(U)| \geq \tau/2$,
where each guarantee holds w.p. at least $0.99$.
Proof: The proof starts with an estimation method (by sampling) for the bucket weights $\mathbf{W}^{S|\bar{J}} = \sum_{T \subseteq \bar{J}} \hat{f}(S \cup T)^2$, where $[n] = J \sqcup \bar{J}$ and $S \subseteq J$. Actually,
$$\mathbf{W}^{S|\bar{J}} = \mathop{\mathbf{E}}_{x, x' \sim \{-1,1\}^{J},\ y \sim \{-1,1\}^{\bar{J}}} \big[ f(x,y)\, f(x',y)\, \chi_S(x)\, \chi_S(x') \big],$$
so each bucket weight can be estimated empirically from queries to $f$.
Next, do the split-into-half algorithm: starting with the single bucket $\mathbf{W}^{\emptyset|[n]} = \sum_U \hat{f}(U)^2 \leq 1$, repeatedly move one coordinate from $\bar{J}$ into $J$, splitting each surviving bucket into two, and estimate each new bucket's weight. If the estimate is larger than $\tau^2/2$ then preserve the bucket (and further split it); otherwise remove it. Since the total Fourier weight is at most $1$, at most $O(1/\tau^2)$ buckets survive at each of the $n$ levels, and eventually only sets $U$ with $|\hat{f}(U)| \geq \tau/2$ (up to estimation error) are left.
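As a toy illustration (my own sketch, not from the book), here is the bucket-splitting procedure on a tiny domain, using exact bucket weights computed by brute force in place of the sampling estimates; the function name `goldreich_levin` and the bitmask encoding of sets are just this sketch's conventions:

```python
from itertools import product

def fourier_coeffs(f, n):
    # brute-force \hat f(S) = E_x[f(x) chi_S(x)]; S encoded as a bitmask
    pts = list(product([-1, 1], repeat=n))
    coeffs = {}
    for S in range(1 << n):
        tot = 0.0
        for x in pts:
            chi = 1
            for i in range(n):
                if S >> i & 1:
                    chi *= x[i]
            tot += f(x) * chi
        coeffs[S] = tot / len(pts)
    return coeffs

def goldreich_levin(f, n, tau):
    # a bucket = all U agreeing with prefix p on coordinates < j
    coeffs = fourier_coeffs(f, n)
    def weight(p, j):
        mask = (1 << j) - 1
        return sum(c * c for U, c in coeffs.items() if U & mask == p)
    buckets = [0]                      # start with one bucket: the empty prefix
    for j in range(n):
        nxt = []
        for p in buckets:
            for b in (0, 1):           # split on whether coordinate j is in U
                q = p | (b << j)
                if weight(q, j + 1) >= tau ** 2 / 2:
                    nxt.append(q)      # heavy bucket: keep splitting
        buckets = nxt
    return buckets                     # single sets U with weight >= tau^2/2
```

For instance, on $f = x_0 x_1$ over three bits with $\tau = 1/2$ the only survivor is the bitmask of $\{0,1\}$, and on $\mathrm{Maj}_3$ (all of whose squared coefficients are $1/4$) all four support sets survive.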
Lemma 2 (Learning via spectral concentration) If we know a collection $\mathcal{F}$ on which the function $f$ is $\epsilon$-concentrated, then we can output some $h$ that is $O(\epsilon)$-close to $f$ in time $\mathrm{poly}(|\mathcal{F}|, n, 1/\epsilon)$ using random samples only.
If $\mathcal{F}$ is not known explicitly, but $f$ is known to be $\epsilon$-concentrated on some collection of size at most $M$, then we can learn $f$ in time $\mathrm{poly}(M, n, 1/\epsilon)$ using membership queries.
Proof: The former is by estimating each $\hat{f}(S)$, $S \in \mathcal{F}$, to within $\pm\sqrt{\epsilon/|\mathcal{F}|}$ and outputting $h = \mathrm{sgn}\big(\sum_{S \in \mathcal{F}} \tilde{f}(S) \chi_S\big)$, which takes time $\mathrm{poly}(|\mathcal{F}|, n, 1/\epsilon)$; the latter is by first using Goldreich–Levin (Theorem 1) with $\tau = \sqrt{\epsilon/M}$ to get a good collection that $f$ is $O(\epsilon)$-concentrated on.
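A minimal sketch of the random-samples case (my own toy code; the name `learn_from_concentration` is made up for illustration): estimate each coefficient in the known collection empirically, then output the sign of the resulting polynomial:

```python
import random
from itertools import combinations, product

def chi(S, x):
    # chi_S(x) = prod of x_i over i in S
    out = 1
    for i in S:
        out *= x[i]
    return out

def learn_from_concentration(samples, family):
    # empirical \hat f(S) for each S in the known collection...
    est = {S: sum(y * chi(S, x) for x, y in samples) / len(samples)
           for S in family}
    # ...then h = sgn(sum_S ~f(S) chi_S)
    def h(x):
        v = sum(c * chi(S, x) for S, c in est.items())
        return 1 if v >= 0 else -1
    return h

# toy run: f = x0*x1 on 4 bits, which is concentrated on degree <= 2
random.seed(0)
f = lambda x: x[0] * x[1]
samples = [(x, f(x)) for x in
           (tuple(random.choice([-1, 1]) for _ in range(4))
            for _ in range(2000))]
family = [tuple(S) for k in (0, 1, 2) for S in combinations(range(4), k)]
h = learn_from_concentration(samples, family)
```

With 2000 samples the single true coefficient dominates the estimation noise on the other sets, so the sign rounds to $f$ exactly.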
Lemma 3 (Learning via total influence) The class $\{f : \mathbf{I}[f] \leq t\}$ is learnable in time $n^{O(t/\epsilon)}$ using random samples.
Proof: Because $\mathbf{I}[f] = \sum_S |S|\, \hat{f}(S)^2 \geq k \sum_{|S| > k} \hat{f}(S)^2$, the function $f$ is $\epsilon$-concentrated on degree up to $k$ if $k \geq t/\epsilon$; then apply Lemma 2 to $\mathcal{F} = \{S : |S| \leq k\}$, of size $O(n^k)$.
Lemma 4 (Learning via the $\ell_1$ norm) Any class of functions with $\|\hat{f}\|_1 \leq s$ is learnable in time $\mathrm{poly}(n, s, 1/\epsilon)$ using membership queries.
Proof: $f$ is actually $\epsilon$-concentrated on $\mathcal{F} = \{S : |\hat{f}(S)| \geq \epsilon/s\}$, since the remaining sets contribute at most $\max_{S \notin \mathcal{F}} |\hat{f}(S)| \cdot \|\hat{f}\|_1 \leq \epsilon$. Notice that $|\mathcal{F}| \leq s^2/\epsilon$, and then one can use Lemma 2 (the membership-query version, with $M = s^2/\epsilon$).
1.2. Learning DT
Theorem 5 Decision trees of size $s$ are learnable from random examples in time $n^{O(\log(s/\epsilon)/\epsilon)}$.
Proof: Such a DT is $\epsilon$-close to a DT with depth at most $\log(s/\epsilon)$, obtained by truncating deep paths (each of the $\leq s$ truncated paths is followed w.p. at most $\epsilon/s$), and a depth-$k$ DT has total influence at most $k$; then use Lemma 3.
Theorem 6 Decision trees of size $s$ are learnable from membership queries in time $\mathrm{poly}(n, s, 1/\epsilon)$. This also applies to decision trees on parities.
Proof: Let $P$ be a path on the decision tree, and $\mathbf{1}_P$ be its indicator function. The Fourier expansion of $\mathbf{1}_P$ looks like $\prod_{i \in J_P} \frac{1 + b_i x_i}{2}$, where $J_P$ is the subset of variables queried on $P$ and $b_i \in \{-1, 1\}$ records which branch $P$ takes at $x_i$.
This means that $\|\widehat{\mathbf{1}_P}\|_1 = 1$; writing $f = \sum_P f(P)\, \mathbf{1}_P$ over the $s$ root-to-leaf paths gives $\|\hat{f}\|_1 \leq \sum_P \|\widehat{\mathbf{1}_P}\|_1 = s$, and then adopt Lemma 4.
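A quick numerical sanity check of this proof (toy code of mine, brute force over a 3-bit domain): a path indicator has Fourier $\ell_1$ norm exactly $1$, and a 4-leaf tree has $\|\hat{f}\|_1 \leq 4$ (here it is exactly $2$):

```python
from itertools import product

def fourier(f, n):
    # brute-force Fourier transform over {-1,1}^n; S is a 0/1 indicator tuple
    pts = list(product([-1, 1], repeat=n))
    coeffs = {}
    for S in product([0, 1], repeat=n):
        tot = 0.0
        for x in pts:
            chi = 1
            for i in range(n):
                if S[i]:
                    chi *= x[i]
            tot += f(x) * chi
        coeffs[S] = tot / len(pts)
    return coeffs

def l1(coeffs):
    # Fourier l1 norm: sum of |\hat f(S)|
    return sum(abs(c) for c in coeffs.values())

# indicator of the path "x0 = 1, then x1 = 1" (0/1-valued)
path = lambda x: 1.0 if x[0] == 1 and x[1] == 1 else 0.0
# a 4-leaf decision tree: query x0; the subtrees return x1 resp. x2
tree = lambda x: x[1] if x[0] == 1 else x[2]
```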
1.3. Learning DNF
Theorem 7 DNFs of size $s$ are learnable from random examples in time $n^{O(\log(s/\epsilon)/\epsilon)}$.
Proof: DNFs of size $s$ can be truncated to width $w = \log(s/\epsilon)$ at the cost of an $\epsilon$ error (each dropped wide term is satisfied w.p. at most $\epsilon/s$), while a width-$w$ DNF has total influence at most $2w$, so one can directly adopt Lemma 3.
We can actually do better via membership queries. We need two structural lemmas (Lemma 9 and Lemma 10) first, and the overall idea is similar to the $\ell_1$-norm method (Lemma 4) because Lemma 10 is essentially an $\ell_1$ bound.
Lemma 8 (Håstad's Switching Lemma) Let $f$ be computable by a width-$w$ DNF. When we apply a random restriction with $*$-probability $\delta$ to the function, then
$$\Pr_{J, z}\big[\mathrm{DT}(f_{J|z}) \geq k\big] \leq (5\delta w)^k,$$
where $\mathrm{DT}$ denotes decision-tree depth.
Lemma 9 DNFs of width $w$ are $\epsilon$-concentrated on degree up to $O(w \log(1/\epsilon))$.
Proof: Using Håstad's switching lemma with $\delta = \frac{1}{10w}$, we have $\Pr[\mathrm{DT}(f_{J|z}) \geq k] \leq 2^{-k}$. After some further calculations (a function of decision-tree depth $< k$ has no Fourier weight above degree $k$), one can show that for all $k$:
$$\mathbf{W}^{\geq k}[f] \leq 2 \cdot 2^{-\Omega(k/w)}.$$
At last, letting $k = O(w \log(1/\epsilon))$ we get the answer.
Lemma 10 If $f$ is a width-$w$ DNF, then for every $k$,
$$\sum_{|S| = k} |\hat{f}(S)| \leq (O(w))^k.$$
Proof: The proof is via a more general inequality, bounding $\sum_{|S|=k} |\hat{f}(S)|$ for any function whose random restrictions have small decision-tree depth, which can also be proved using Håstad's switching lemma.
Theorem 11 DNFs of size $s$ are learnable from membership queries in quasi-polynomial time; for $s = \mathrm{poly}(n)$ and constant $\epsilon$, the running time is $n^{O(\log\log n)}$ (Mansour's algorithm).
Proof: First reduce DNFs to width $w = O(\log(s/\epsilon))$, and then the function is $\epsilon$-concentrated on some collection $\mathcal{F}$ of size $w^{O(w \log(1/\epsilon))}$. This set can be defined as follows:
$$\mathcal{F} = \big\{ S : |S| \leq O(w \log(1/\epsilon)),\ |\hat{f}(S)| \geq \epsilon/(O(w))^{|S|} \big\}.$$
The proof is essentially a copy of Lemma 4 using the previous two structural lemmas.
1.4. Learning AC0

The proof of the main theorem of this subsection (Theorem 13) needs the following lemma:
Lemma 12 (LMN) If $f$ is computable by a size-$s$, depth-$d$ circuit with bottom fan-in at most $w$, then
$$\mathbf{W}^{\geq k}[f] \leq 2 \cdot 2^{-\Omega\left(k / (w \cdot O(\log s)^{d-2})\right)}.$$
The proof is not fully provided in the lecture notes. A first step is to choose $*$-probability $\delta = \Theta(1/\log s)$ per stage, and then use Håstad's switching lemma $d-1$ times, each time applied to the bottom two layers of the circuit (which form a width-bounded DNF or CNF formula).
Theorem 13 $\mathsf{AC}^0$ circuits of size $s$ and depth $d$ are learnable from random examples in time $n^{O(\log(s/\epsilon))^{d-1}/\epsilon}$.
Proof: Using the LMN lemma, one can show that the influence satisfies $\mathbf{I}[f] \leq O(\log s)^{d}$ (and this could actually be improved to $O(\log s)^{d-1}$); then apply Lemma 3. This is because:
$$\mathbf{I}[f] = \sum_k k \cdot \mathbf{W}^{=k}[f] \leq k_0 + \sum_{k > k_0} k \cdot \mathbf{W}^{=k}[f],$$
while the last term can be bounded by summing the LMN tail bound over the choices of $k > k_0 = O(\log s)^{d-1}$, after a truncation to reduce the bottom fan-in to $O(\log s)$.
1.5. Learning juntas
Notice that we only need to identify the $k$ relevant variables; after that, we only need $O(2^k (k + \log(1/\delta)))$ further random examples, and with high probability every one of the $2^k$ relevant input patterns appears, so we can learn the entire truth table.
A simple algorithm that runs in time $n^{O(k)}$ with random examples is as follows. Since each non-zero Fourier coefficient of the $k$-junta is a multiple of $2^{1-k}$, it suffices to estimate $\hat{f}(S)$ for all $|S| \leq k$ up to accuracy $2^{-k}$. Also notice that if $\hat{f}(S) \neq 0$, all variables in the set $S$ are relevant.
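The $n^{O(k)}$ routine above can be sketched as follows (toy code of mine; the name `find_relevant` and the sample count are just this sketch's choices): estimate every coefficient of degree $1$ to $k$ and collect the variables of every set whose estimate is clearly non-zero:

```python
import random
from itertools import combinations

def chi(S, x):
    # chi_S(x) = prod of x_i over i in S
    out = 1
    for i in S:
        out *= x[i]
    return out

def find_relevant(samples, n, k):
    # non-zero coefficients of a k-junta are multiples of 2^(1-k), so an
    # estimate accurate to well below 2^(-k) separates zero from non-zero
    thresh = 2.0 ** (-k)
    relevant = set()
    for d in range(1, k + 1):
        for S in combinations(range(n), d):
            est = sum(y * chi(S, x) for x, y in samples) / len(samples)
            if abs(est) > thresh:
                relevant |= set(S)   # every variable of S is relevant
    return relevant

# toy run: a 2-junta on 6 bits, f = AND(x1, x4) in the +-1 convention
random.seed(0)
f = lambda x: 1 if x[1] == 1 and x[4] == 1 else -1
samples = [(x, f(x)) for x in
           (tuple(random.choice([-1, 1]) for _ in range(6))
            for _ in range(3000))]
```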
A more advanced algorithm is as follows:
- If there is an algorithm running in time $T$ that can find one relevant variable, then in essentially the same time (up to $2^{O(k)}$ factors) one can find all relevant variables.
Suppose we have found that coordinate $i$ is relevant using $m$ examples; then consider the two restrictions $f|_{x_i = 1}$ and $f|_{x_i = -1}$, each again a junta. Using another $O(m)$ examples one has enough samples with $x_i = 1$ (resp. $x_i = -1$) to recurse on each restriction. In sum we need $2^{O(k)} \cdot m$ samples, so the total time is still $2^{O(k)} \cdot T$.
- If $f$ has a non-zero Fourier coefficient of degree at most $t$ (other than $\hat{f}(\emptyset)$), we can estimate all degree-$\leq t$ coefficients up to accuracy $2^{-k}$, and detect a relevant variable in time $n^{t} \cdot \mathrm{poly}(n, 2^k)$.
- If $f$ has degree at most $d$ over $\mathbb{F}_2$, then one can learn it in time roughly $n^{\omega d}$, where $\omega < 2.38$ is the best matrix multiplication exponent. This is because $f$ has to be a multilinear polynomial over $\mathbb{F}_2$, and all of its (at most $n^d$) monomial coefficients can be learned via solving a system of linear equations over $\mathbb{F}_2$.
- If $f$ satisfies $\hat{f}(S) = 0$ for all $0 < |S| < t$, then $f$ has degree at most $k - t$ over $\mathbb{F}_2$. This is through adding to $f$ a parity over all $k$ relevant variables (i.e. multiplying by $\chi_{[k]}$), so that all Fourier terms of degree $\geq t$ become terms of degree $\leq k - t$, while the parity itself is of degree $1$ over $\mathbb{F}_2$.
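The reduction in the first bullet (finding all relevant variables given a routine for finding one) can be sketched as follows; here the stand-in `find_one` simply looks for a coordinate correlated with the label, which happens to suffice for this toy target but is not the general subroutine:

```python
import random

def find_one(samples, skip, thresh=0.25):
    # stand-in for "find one relevant variable": a coordinate (not in `skip`)
    # noticeably correlated with the label
    n = len(samples[0][0])
    for i in range(n):
        if i in skip:
            continue
        corr = sum(y * x[i] for x, y in samples) / len(samples)
        if abs(corr) > thresh:
            return i
    return None

def find_all(samples, skip=frozenset()):
    # recurse on the two restrictions x_i = 1 and x_i = -1
    i = find_one(samples, skip)
    if i is None:
        return set()
    pos = [(x, y) for x, y in samples if x[i] == 1]
    neg = [(x, y) for x, y in samples if x[i] == -1]
    return {i} | find_all(pos, skip | {i}) | find_all(neg, skip | {i})

# toy run: f = OR(x0, x2) in the +-1 convention, a 2-junta on 4 bits
random.seed(0)
f = lambda x: -1 if x[0] == -1 and x[2] == -1 else 1
samples = [(x, f(x)) for x in
           (tuple(random.choice([-1, 1]) for _ in range(4))
            for _ in range(4000))]
```

Note that each recursive call roughly halves the sample pool, which is why the sample count only needs to grow by a $2^{O(k)}$ factor overall.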
Choosing $t = \frac{\omega}{\omega + 1} k$ (balancing $n^{t}$ against $n^{\omega(k - t)}$) gets the following theorem.
Theorem 14 The class of $k$-juntas over $n$ bits can be learned from random examples in time $n^{\frac{\omega}{\omega+1} k} \cdot \mathrm{poly}(n, 2^k) \approx n^{0.704 k}$.
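The equation-solving step from the $\mathbb{F}_2$-degree bullet can be sketched as follows (my own toy implementation, with plain Gaussian elimination standing in for fast matrix multiplication): each sample gives one linear equation over $\mathbb{F}_2$ in the unknown monomial coefficients.

```python
from itertools import combinations, product

def monomials(n, d):
    # all multilinear monomials of degree <= d, as variable index tuples
    return [S for k in range(d + 1) for S in combinations(range(n), k)]

def solve_gf2(rows, rhs):
    # Gaussian elimination over GF(2); returns one solution of a consistent system
    m = [row[:] + [b] for row, b in zip(rows, rhs)]
    ncols = len(rows[0])
    pivots, r = [], 0
    for c in range(ncols):
        pr = next((i for i in range(r, len(m)) if m[i][c]), None)
        if pr is None:
            continue
        m[r], m[pr] = m[pr], m[r]
        for i in range(len(m)):
            if i != r and m[i][c]:
                m[i] = [a ^ b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
    sol = [0] * ncols
    for i, c in enumerate(pivots):
        sol[c] = m[i][ncols]         # free variables stay 0
    return sol

def learn_f2_poly(samples, n, d):
    # unknowns: coefficients of the <= n^d multilinear monomials of degree <= d
    mons = monomials(n, d)
    rows = [[int(all(x[i] == 1 for i in S)) for S in mons] for x, _ in samples]
    coef = solve_gf2(rows, [y for _, y in samples])
    def h(x):
        v = 0
        for cS, S in zip(coef, mons):
            if cS and all(x[i] == 1 for i in S):
                v ^= 1
        return v
    return h

# toy run over the full 4-bit cube: f = x0*x1 + x2 over GF(2)
f = lambda x: (x[0] & x[1]) ^ x[2]
samples = [(x, f(x)) for x in product((0, 1), repeat=4)]
h = learn_f2_poly(samples, 4, 2)
```

Since the target has $\mathbb{F}_2$-degree $2$, the system is consistent and any solution agrees with $f$ on every sampled input.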