
Exercise 4.6: Too Many Linear Classifiers


For any fixed matrix \(X = [ x_1; \ldots; x_n ] \in \R^{n \times d}\) and any vector \(\varepsilon = ( \varepsilon_1 , \ldots , \varepsilon_n ) \in \{ -1, 1 \}^n\), consider the linear system
\begin{align} X \alpha = \varepsilon \, . \end{align}
Since \(d \geq n\) by assumption, \(X\) can have full row rank \(n\); whenever it does (that is, whenever the points \(x_1, \ldots, x_n\) are linearly independent), the system admits at least one solution \(\alpha \in \R^d\) for every such \(\varepsilon\). Because each \(\varepsilon_i \in \{ -1, 1 \}\), we have \(\varepsilon \neq 0\), so \(\alpha = 0\) cannot be a solution. Hence \(\theta = \alpha / \| \alpha \|_2\) is well defined, and since rescaling by a positive constant does not change signs, \(\sign(\langle x, \theta \rangle) = \sign(\langle x , \alpha \rangle)\) for all \(x\); in particular, \(\sign(\langle x_i, \theta \rangle) = \sign(\varepsilon_i) = \varepsilon_i\) for every \(i\). Now every \(f \in \mathscr{F}\) takes values in \(\{ -1, 1 \}\), so each empirical average is at most \(1\) in absolute value, and the classifier just constructed attains this bound. Therefore
\begin{align} \E_\varepsilon \biggl[ \sup_{f \in \mathscr{F}} \biggl| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \biggr| \biggr] = \frac{1}{n} \E_\varepsilon \biggl[ \sum_{i=1}^n \varepsilon_i \sign(\varepsilon_i) \biggr] = 1 \, ,
\end{align}
where the last step uses \(\varepsilon_i \sign(\varepsilon_i) = |\varepsilon_i| = 1\). By Proposition 4.12, this shows that the class of linear classifiers is too rich in the high-dimensional regime (\(d \geq n\)), so unpenalised empirical risk minimisation over this class will lead to overfitting.
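As a quick numerical sanity check, here is a minimal NumPy sketch of the construction above. The sizes \(n, d\), the Gaussian design, and the seed are illustrative assumptions, not part of the exercise; any \(X\) with full row rank would do. The minimum-norm solution returned by `np.linalg.lstsq` plays the role of \(\alpha\).

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 30  # high-dimensional regime: d >= n (hypothetical sizes)

# Rows of X are the points x_1, ..., x_n; a Gaussian draw has
# full row rank with probability one, so X @ alpha = eps is solvable.
X = rng.standard_normal((n, d))
eps = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs

# Minimum-norm solution of the underdetermined system X @ alpha = eps.
alpha, *_ = np.linalg.lstsq(X, eps, rcond=None)
theta = alpha / np.linalg.norm(alpha)  # unit-norm direction

# The unit-norm linear classifier interpolates the signs exactly ...
assert np.all(np.sign(X @ theta) == eps)

# ... so the empirical Rademacher average equals 1 for this draw.
print(np.mean(eps * np.sign(X @ theta)))  # 1.0
```

Since positive rescaling preserves signs, normalising \(\alpha\) to unit norm changes nothing in the check, mirroring the argument in the text.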

Published on 30 October 2020.