Exercise 5.8: Lower Bounds for $\ell_0$-"Balls"
(a)
By Exercise 5.2, it is sufficient to derive a lower bound on the packing number, which is also what we will need in (b). The basic idea we employ is a combination of a Gilbert-Varshamov like argument for packing of Hamming cubes, and subsequent embedding of the obtained packing into \(\mathbb{T}^d(s)\) by rescaling.
Starting with the Hamming cube packing, we invoke lemma 4.10 from (Massart, 2006), which—for any \(1 \leq s_0 \leq \frac{d}{2}\)—establishes existence of a packing \(m_1 , \ldots , m_{N_0} \in \lbrace 0, 1 \rbrace_{s_0}^d\) with \(\lbrace 0, 1 \rbrace_{s_0}^d = \lbrace m \in \lbrace 0, 1 \rbrace^d \colon \| m \|_0 = s_0 \rbrace\), such that:
- For every \(i \neq j\), \(\| m_i - m_j \|_0 \geq 2 \bigl( 1 - \tfrac{1}{\sqrt{2}} \bigr) s_0 \, .\)
- \(\log N_0 \geq c s_0 \log \frac{d}{s_0}\), with \(c \approx \tfrac{1}{20}\) a universal constant.
The argument used by Massart is Gilbert-Varshamov like:
Sketch proof: Any \(\varepsilon\)-packing of \(\lbrace 0 , 1 \rbrace_{s_0}^n\) must satisfy \(\lbrace 0 , 1 \rbrace_{s_0}^d \subseteq \bigcup_{i \in [N_0]} B_{\varepsilon}(m_i)\) in order to be maximal, implying \(|\lbrace 0 , 1 \rbrace_{s_0}^d| \leq \sum_{i \in [N_0]} |B_{\varepsilon}(m_i)|\), with the balls defined w.r.t. the (Hamming) metric \(\| \cdot \|_0\). If we were to go down the vanilla Gilbert-Varshamov route, we would now upper bound the r.h.s. by \(N_0 \max_{i \in N_0} |B_{\varepsilon} (m_i)|\), thus obtaining a lower bound for \(N_0\).
The argument of Massart has similar flavour, but is more refined. Setting \(\varepsilon = 2(1 - \tfrac{1}{\sqrt{2}}) s_0\), one can observe \(y \in B_\varepsilon(m_i)\) iff \(|\lbrace i \colon y_i = m_i \rbrace| \geq \tfrac{s_0}{\sqrt{2}}\), implying \begin{align} 1 \leq \sum_{i \in [N_0]} \sum_{y \in \lbrace 0 , 1 \rbrace_{s_0}^d} \frac{1}{|\lbrace 0 , 1 \rbrace_{s_0}^d|} \ind \lbrace |\lbrace i \colon y_i = m_i \rbrace| \geq \tfrac{s_0}{\sqrt{2}} \rbrace = N_0 \, \P \bigl(X \geq \tfrac{s_0}{\sqrt{2}}\bigr) \, , \end{align} where \(X\) follows a hypergeometric distribution which draws \(s_0\) times from a population of size \(d\) with \(s_0\) positives. \(\P \bigl(X \geq \tfrac{s_0}{\sqrt{2}}\bigr)\) can then be controlled using the Chernoff bound, where Massart employs a nice argument based on the work of Aldous who shows that the convex conjugate of the hypergeometric MGF is lower bounded by the binomial one with appropriately chosen parameters.
The result of Massart is almost what we need except for the absence of the \(s_0 \log e\) term. Note that the packing number of \(\mathbb{T}^d(s)\) is at least twice that of \(\lbrace 0, 1\rbrace_{s_0}^d\) since for any \(i \in [N_0]\), we can embed both \(\theta_{i} = \tfrac{m_i}{\sqrt{s_0}}\) and \(\theta_{2i} = -\tfrac{m_i}{\sqrt{s_0}}\), while preserving \begin{align} \| \theta_i - \theta_j \|_2 \geq \sqrt{ \tfrac{1}{s_0} 2 (1 - \tfrac{1}{\sqrt{2}}) s_0 } = \sqrt{2 - \sqrt{2}} \gt \tfrac{1}{\sqrt{2}} \, , \end{align} for any \(i \neq j\). Hence \(2 N_0\) lower bounds the desired \(\tfrac{1}{\sqrt{2}}\)-packing number. Choosing \(s_0 = \lceil \tfrac{s}{2} \rceil\), and using the change of logarithm base formula, we obtain \begin{align} \log_2 N &\geq 1 + \frac{c s}{2} \log_2 \frac{2 d}{s} - \frac{c s}{2} \log_2 \frac{s + 1}{s} \newline &\overset{\text{(i)}}{\geq} 1 + \frac{c s}{2} \log_2 \frac{2 d}{s} - \frac{c}{2 \log 2} \newline &\overset{\text{(ii)}}{>} \frac{c s}{2} \log_2 \frac{2 d}{s} \, , \end{align} where (i) follows by \(\log_2 (1 + 1 / x) = \frac{\log(1 + 1 / x)}{\log 2}\) combined with \(1 + 1/x \leq e^{1/x}\), and (ii) by using the explicit value for \(c\) from Massart and comparing the resulting last term to the leading \(1\). Hence \(\log_2 N \gtrsim s (1 + \log_2 \tfrac{d}{s})\), i.e., \begin{align}\label{eq:packing_bound} \log N \gtrsim s \bigl(\log 2 + \log \frac{d}{s}\bigr) \gtrsim s \log \frac{ed}{s} \, , \end{align} with the last inequality by \(s \leq d\) combined with \(0 < \log 2 < 1\).
(a) Alternative Solution
See the note about the Gilbert–Varshamov bound.
(b)
Defining \(X_\theta = \langle \theta, w \rangle\), \(w \sim \gauss(0, I_d)\), \(\lbrace X_\theta \rbrace_{\theta \in \mathbb{T}^d(s)}\) is a centred Gaussian process. We can therefore apply Sudakov minoration (Theorem 5.30) to obtain \begin{align} \mathcal{G}(\mathbb{T}^d(s)) = \E \bigl[ \sup_{\theta \in \mathbb{T}^d(s)} X_\theta \bigr] \geq \sup_{\delta > 0} \frac{\delta}{2} \sqrt{\log M_\rho(\delta ; \mathbb{T}^{d}(s))} \, , \end{align} where \(\rho(\theta_i , \theta_j)^2 = \E [(\langle w, \theta_i - \theta_j \rangle)^2] = \| \theta_i - \theta_j \|_2^2\). All we need is therefore to substitute the lower bound on the packing number from Equation \eqref{eq:packing_bound} with \(\delta = \frac{1}{\sqrt{2}}\).