Kernel Machines I

36-465/665, Spring 2021

6 April 2021 (Lecture 18)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \]

Housekeeping

Previously

Previously (2)

What’s left?

Kernel machines

Unpacking the kernel

Linear methods with nonlinear features

We might want a lot of nonlinear features

We might want a lot of nonlinear features (2)

We might want a lot of nonlinear features (3)

Explicitly using lots of features would be slow

Back to linear methods

The dual view

Back to features

Features vs. kernels

Rademacher complexity of kernel machines

Think about what this would mean for linear models

What sorts of things can we do with kernel machines?

Summing up

Backup: a matrix identity

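For any matrix \(\mathbf{v}\) and any scalar \(\lambda\), \[ \mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})^{-1} = (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1} \mathbf{v} \] when both inverses exist (which they always do when \(\lambda > 0\), since \(\mathbf{v}^T\mathbf{v}\) and \(\mathbf{v}\mathbf{v}^T\) are positive semi-definite). The two identity matrices generally differ in size: if \(\mathbf{v}\) is \(n \times p\), the \(\mathbf{I}\) on the left is \(p \times p\) and the one on the right is \(n \times n\).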

Proof: \[\begin{eqnarray} \mathbf{v}\mathbf{v}^T\mathbf{v} + \lambda \mathbf{v} & = & \mathbf{v}\mathbf{v}^T\mathbf{v} + \lambda \mathbf{v}\\ (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})\mathbf{v} & = & \mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})\\ (\mathbf{v}\mathbf{v}^T + \lambda\mathbf{I})^{-1} (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})\mathbf{v} & = & (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1}\mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})\\ \mathbf{v} & = & (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1}\mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})\\ \mathbf{v} (\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})^{-1} & = & (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1}\mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})^{-1}\\ \mathbf{v} (\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})^{-1} & = & (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1}\mathbf{v} ~ \Box \end{eqnarray}\]
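The identity above can also be checked numerically. The following is a minimal sketch, assuming NumPy is available. The function name `push_through_sides`, the matrix dimensions, the random seed and the value of \(\lambda\) are all illustrative choices, not part of the lecture.

```python
import numpy as np


def push_through_sides(v, lam):
    """Return both sides of v (v^T v + lam I)^{-1} = (v v^T + lam I)^{-1} v.

    v is an n x p array; lam is a scalar for which both inverses exist.
    (Hypothetical helper, written only to illustrate the identity.)
    """
    n, p = v.shape
    # Left-hand side: invert the p x p matrix v^T v + lam I.
    left = v @ np.linalg.inv(v.T @ v + lam * np.eye(p))
    # Right-hand side: invert the n x n matrix v v^T + lam I.
    right = np.linalg.inv(v @ v.T + lam * np.eye(n)) @ v
    return left, right


# Illustrative data: a random 5 x 3 matrix and lam = 0.5 (assumed values).
rng = np.random.default_rng(0)
v = rng.normal(size=(5, 3))
left, right = push_through_sides(v, lam=0.5)

# The two sides should agree up to floating-point error.
print(np.allclose(left, right))
```

Both sides are \(n \times p\) matrices, but the left side inverts a \(p \times p\) matrix and the right side an \(n \times n\) one, so in practice one would pick whichever inverse is smaller.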

References