Exercise 12.20: Support Vector Machines and Kernel Methods
(a)
Since the linear span of \(\{ k(\cdot, x_i) \}_{i = 1}^n\) is finite dimensional, it is closed. We can therefore decompose any \(f \in \Hb\) into two orthogonal components \(f = f_\parallel + f_\bot\), where \(f_\parallel\) lies in the span and \(f_\bot\) in its orthogonal complement. By the reproducing property, \begin{align} f (x_i) = \langle f, k(\cdot, x_i) \rangle = \langle f_\parallel, k(\cdot, x_i) \rangle \, , \end{align} for all \(i \in [n]\), so \(f_\bot\) does not affect the data-fit term. It enters only through the regulariser via \(\| f \|_\Hb^2 = \| f_\parallel \|_\Hb^2 + \| f_\bot \|_\Hb^2\), which is minimised by taking \(f_\bot = 0\). Hence any minimiser has the form \(f = \sum_{i=1}^n \alpha_i k(\cdot, x_i)\) for some \(\alpha \in \R^n\).
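To make the orthogonality argument concrete, here is a minimal numerical sketch. All choices (Gaussian kernel, bandwidth, random points) are illustrative rather than fixed by the exercise: we build a function with a component outside \(\mathrm{span}\{ k(\cdot, x_i) \}\) and check that this component vanishes at every training point while still carrying RKHS norm.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def k(a, b, gamma=0.5):
    """Gaussian (RBF) kernel k(a, b) = exp(-gamma ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

n = 5
X = rng.normal(size=(n, 2))          # training inputs x_1, ..., x_n
x_star = rng.normal(size=2)          # an extra point outside the span
Z = np.vstack([X, x_star])           # basis points for the ambient span

G = np.array([[k(a, b) for b in Z] for a in Z])  # Gram matrix on Z
K = G[:n, :n]                                    # K_ij = k(x_i, x_j)

# Take f = k(., x_star), i.e. coefficient vector e_{n+1} in the Z-basis.
c = np.zeros(n + 1)
c[-1] = 1.0

# f_parallel: the orthogonal projection onto span{k(., x_i)} is the
# kernel interpolant of the values f(x_i) = k(x_i, x_star).
beta = np.linalg.solve(K, G[:n, -1])
c_par = np.concatenate([beta, [0.0]])
c_bot = c - c_par

# f_bot vanishes at every training point, so it cannot change the losses...
print(np.allclose(G[:n] @ c_bot, 0.0))    # True
# ...but it does carry RKHS norm, so the regulariser prefers f_bot = 0.
print(c_bot @ G @ c_bot > 1e-12)          # True
\end{verbatim}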
(b)
Let \(\xi = [ \xi_1, \ldots, \xi_n ] \in \R^n\) be an auxiliary variable that upper bounds the per-observation hinge losses: \(\xi_i \geq 0\) and \(\xi_i \geq 1 - y_i f(x_i)\) for all \(i \in [n]\); at the optimum, \(\xi_i = \max \lbrace 0, 1 - y_i f(x_i) \rbrace\). Using these constraints, substituting \(f = \sum_{i=1}^n \alpha_i k(\cdot, x_i)\) for some \(\alpha \in \R^n\) by (a), and rescaling the objective by the positive constant \(1 / \lambda_n\) (which leaves the minimiser unchanged), the primal problem \begin{align} \min_{f} \biggl\lbrace \frac{1}{n}\sum_{i=1}^n \max \lbrace 0, 1 - y_i f(x_i) \rbrace + \frac{\lambda_n}{2} \| f \|_\Hb^2 \biggr\rbrace \end{align} becomes \begin{align} \min_{\alpha, \xi} \biggl\lbrace \frac{1}{\lambda_n n} \sum_{i=1}^n \xi_i + \frac{1}{2} \alpha^\top K \alpha \biggr\rbrace \, , \end{align} where \(K_{ij} = k(x_i, x_j)\), so that \(f(x_i) = (K \alpha)_i\) and \(\| f \|_\Hb^2 = \alpha^\top K \alpha\).

Define \(\eta = 1 / (\lambda_n n)\) and \(\mathbf{1} = [1, \ldots, 1] \in \R^n\). Introducing dual variables \(\mu , \nu \in \R^n\) with \(\mu_i \leq 0\) (for the constraints \(\xi_i \geq 0\)) and \(\nu_i \geq 0\) (for the constraints \(\xi_i \geq 1 - y_i (K\alpha)_i\)), the Lagrangian reads \begin{align} \mathcal{L}(\alpha, \xi, \mu, \nu) &= \eta \xi^\top \mathbf{1} + \tfrac{1}{2} \alpha^\top K \alpha + \mu^\top \xi - \nu^\top (\xi - \mathbf{1}) - (\nu \odot y)^\top K \alpha \\ &= \nu^\top \mathbf{1} + \xi^\top (\eta \mathbf{1} + \mu - \nu) + \tfrac{1}{2} \alpha^\top K \alpha - (\nu \odot y)^\top K \alpha \, , \end{align} where \(\odot\) denotes the Hadamard product. Setting the derivatives with respect to the primal variables to zero gives \begin{align} \frac{\partial \mathcal{L}}{\partial \alpha} = K \alpha - K (\nu \odot y) = 0 \, , \qquad \frac{\partial \mathcal{L}}{\partial \xi} = \eta \mathbf{1} + \mu - \nu = 0 \, . \end{align} The first condition is satisfied by \(\alpha = \nu \odot y\) (any solution yields the same \(K\alpha\), hence the same \(f\)); substituting both conditions into the Lagrangian yields the dual function \begin{align} \mathcal{L}(\nu) = \sum_{i=1}^n \nu_i - \frac{1}{2} \nu^\top \widetilde{K} \nu \, , \end{align} where \(\widetilde{K}_{ij} = y_i y_j k(x_i, x_j)\). We can eliminate the \(\mu\) variables by introducing the inequality constraint \(\nu_i \leq \eta\) for all \(i\), which follows from \(\eta \mathbf{1} + \mu - \nu = 0\) combined with \(\mu_i \leq 0\). Since also \(\nu_i \geq 0\) by definition, the dual problem maximises \(\mathcal{L}(\nu)\) over \(\nu_i \in [0, 1 / (\lambda_n n)]\), which is the desired result.
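As a numerical sanity check of the derivation, the sketch below solves the box-constrained dual with scipy's L-BFGS-B on toy data (the data, kernel bandwidth \(\gamma\), and \(\lambda_n\) are arbitrary illustrative choices), recovers \(\alpha = \nu \odot y\) from the stationarity condition, and verifies that the rescaled primal objective matches the dual objective at the optimum, as strong duality requires.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Toy two-class data with labels y_i in {-1, +1}.
n = 40
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, 2)),
               rng.normal(+1.0, 1.0, (n // 2, 2))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

gamma, lam = 0.5, 0.1          # kernel bandwidth and lambda_n (arbitrary)
sq = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
K = np.exp(-gamma * sq)        # K_ij = k(x_i, x_j)
Kt = np.outer(y, y) * K        # tilde K_ij = y_i y_j k(x_i, x_j)
eta = 1.0 / (lam * n)          # box constraint: 0 <= nu_i <= eta

def neg_dual(nu):
    """Negative dual objective: minimising this maximises L(nu)."""
    return 0.5 * nu @ Kt @ nu - nu.sum()

def neg_dual_grad(nu):
    return Kt @ nu - np.ones(n)

res = minimize(neg_dual, np.zeros(n), jac=neg_dual_grad,
               bounds=[(0.0, eta)] * n, method="L-BFGS-B")
nu = res.x

# Stationarity K alpha = K (nu . y) is satisfied by alpha = nu . y.
alpha = nu * y
f_train = K @ alpha            # f(x_i) = sum_j alpha_j k(x_i, x_j)

# Rescaled primal and dual objectives agree at the optimum
# (strong duality for this convex QP).
primal = (eta * np.maximum(0.0, 1.0 - y * f_train).sum()
          + 0.5 * alpha @ K @ alpha)
dual = nu.sum() - 0.5 * nu @ Kt @ nu
print(primal, dual)            # equal up to solver tolerance
\end{verbatim}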