\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \]

Housekeeping

Unfortunately cannot hold office hours today after class
Will hold make-up office hours after class on Thursday

Previously

Decision problems: state \(Y\), actions \(A\), loss \(\Loss(y,a)\), information \(X\), strategies \(s: X \mapsto A\), risk \(\Risk(s) \equiv \Expect{\Loss(Y, s(X))}\), class of strategies \(\ModelClass\), optimal strategy \(\OptimalModel = \argmin_{s \in \ModelClass}{\Risk(s)}\)
- When actions are predictions, we talk about “models”
Data \((X_1, Y_1), \ldots (X_n, Y_n)\), empirical risk \(\EmpRisk(s) \equiv n^{-1}\sum_{i=1}^{n}{\Loss(Y_i, s(X_i))}\), empirical risk minimizer \(\hat{s} \equiv \argmin_{s \in \ModelClass}{\EmpRisk(s)}\)
- Also regularized estimates, like \(\argmin_{s \in \ModelClass}{\EmpRisk(s) + \lambda \Penalty(s)}\) or \(\argmin_{s: \Penalty(s) \leq c}{\EmpRisk(s)}\)
We want to control \(\Risk(\hat{s}) - \EmpRisk(\hat{s})\) (“generalization error bound”) and \(\Risk(\hat{s}) - \Risk(\OptimalModel)\) (“oracle inequality”)

Previously (2)

Uniform convergence of risks, \(\max_{s \in \ModelClass}{|\Risk(s) - \EmpRisk(s)|} \rightarrow 0\), gives us both generalization error bounds and oracle inequalities
- Asymptotic approximations give us a point estimate of generalization error (“optimism”)
- Algorithmic stability alone just gives us a generalization error bound
Uniform convergence had two ingredients:
- Deviation inequalities: Markov \(\Rightarrow\) Chernoff \(\Rightarrow\) Hoeffding \(\Rightarrow\) bounded-difference
- Capacity control: Rademacher \(\Rightarrow\) growth function \(\Rightarrow\) VC dimension
Model selection: with multiple classes \(\ModelClass_1, \ModelClass_2, \ldots \ModelClass_q\), pick the one which will generalize best
- “Generalizes best” might mean \(\min_{k}{\Risk(\OptimalModel_k)}\) (harder)
- Or it might mean \(\min_{k}{\Risk(\hat{s}_k)}\) (easier)
- This is again penalized optimization, but now over a discrete variable (\(k\)), and there are more good choices for the penalty
- Model averaging sometimes gets around picking one model class

What’s left?

Looking at how this applies to particular types of models
This week: “kernel machines”
Next week: “random feature machines”, a.k.a. “kitchen sinks”
Both of these are about using linear methods on nonlinear transformations (“features”) of the data

Kernel machines

A kernel \(K(x, z)\) takes two input values and gives a real number, with the following restrictions:
- Symmetry: \(K(x, z) = K(z, x)\)
- Cauchy-Schwarz: \(K^2(x,z) \leq K(x,x) K(z,z)\)
- Mercer: For any \(x_1, x_2, \ldots x_n\), make \(\mathbf{K}\) the \(n\times n\) matrix \(K_{ij} = K(x_i, x_j)\). Then \(\mathbf{K}\) is positive semi-definite a.k.a. non-negative definite, meaning that \(v \cdot \mathbf{K} v \geq 0\) for any vector \(v\)
  - Equivalently, for any nice function \(f\), \(\int{K(x_1,x_2) f(x_1) f(x_2) dx_1 dx_2} \geq 0\)
\(K(x,z)\) is some kind of measure of how similar \(x\) and \(z\) are
Kernel machines are strategies of the form \(s(x) = \sum_{i=1}^{m}{\alpha_i K(x, x_i)}\), for some \(m\) and some set of centers (or knots) \(x_i\)
- Or maybe we apply a threshold to this sum when we need a discrete output
- Often, but not always, \(m=n\), number of training points, and centers \(x_i=\) the training vectors
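A minimal sketch of the definitions above, assuming Python with NumPy and the Gaussian kernel with unit bandwidth as the example \(K\). The helper names (`gaussian_kernel`, `gram_matrix`, `kernel_machine`) and the weights `alpha` are illustrative choices, not part of the lecture. It checks the three kernel properties numerically on a handful of points and evaluates one kernel machine.

```python
import numpy as np

def gaussian_kernel(x, z):
    """K(x, z) = exp(-||x - z||^2 / 2): Gaussian kernel, unit bandwidth."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-0.5 * np.dot(diff, diff)))

def gram_matrix(kernel, points):
    """The n x n matrix with entries K_ij = K(x_i, x_j)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])

def kernel_machine(kernel, centers, alpha):
    """Return the strategy s(x) = sum_i alpha_i K(x, x_i)."""
    def s(x):
        return sum(a * kernel(x, c) for a, c in zip(alpha, centers))
    return s

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))          # six arbitrary 2-D points
K = gram_matrix(gaussian_kernel, points)

# Symmetry: K(x, z) = K(z, x)
is_symmetric = np.allclose(K, K.T)
# Mercer: all eigenvalues of the kernel matrix are >= 0 (up to rounding)
min_eigenvalue = np.linalg.eigvalsh(K).min()
# Cauchy-Schwarz: K(x, z)^2 <= K(x, x) K(z, z) for every pair
diag = np.diag(K)
cauchy_schwarz_ok = bool(np.all(K**2 <= np.outer(diag, diag) + 1e-12))

# A kernel machine with the six points as centers and arbitrary weights
alpha = np.array([1.0, -0.5, 0.25, 0.0, 2.0, -1.0])
s = kernel_machine(gaussian_kernel, points, alpha)
value_at_first_center = s(points[0])      # same as (K @ alpha)[0]
```

Evaluating the machine at all the centers at once is just the matrix-vector product `K @ alpha`, which is why the kernel matrix keeps showing up.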

Unpacking the kernel

If we make all the assumptions, then for some constants \(\lambda_j\) and basis functions \(\phi_j\) \[ K(x,z) = \sum_{j=1}^{\infty}{\lambda_j \phi_j(x) \phi_j(z)} \]
These are eigenvalues and eigenfunctions of a linear integral operator \[ \int{K(x, z) \phi_j(x) dx} = \lambda_j \phi_j(z) \]
We can choose the eigenfunctions so that \(\int{\phi_j(x) \phi_{j^{\prime}}(x) dx} = \delta_{j j^{\prime}}\) (orthonormal basis)
Non-negativity means \(\lambda_j \geq 0\)
So a kernel model is \[\begin{eqnarray} \sum_{i=1}^{m}{\alpha_i K(x, x_i)} & = & \sum_{i=1}^{m}{\alpha_i \sum_{j=1}^{\infty}{\lambda_j \phi_j(x) \phi_j(x_i)}}\\ & = & \sum_{j=1}^{\infty}{\lambda_j \left(\sum_{i=1}^{m}{\alpha_i \phi_j(x_i)}\right) \phi_j(x)}\\ & = & \sum_{j=1}^{\infty}{\beta_j \phi_j(x)} \end{eqnarray}\] where \(\beta_j \equiv \lambda_j \sum_{i=1}^{m}{\alpha_i \phi_j(x_i)}\)
\(\Rightarrow\) Kernel models are linear models in transformations of the original variables

Linear methods with nonlinear features

Think about linear regression
- Intercepts are annoying, so assume \(\Expect{Y} = 0\), \(\Expect{X} = 0\)
We’re used to writing this as \[ s(x) = \sum_{j=1}^{p}{\beta_j x^{(j)}} = \beta \cdot x \]
- \(x^{(j)} =\) coordinate \(j\) of vector \(x\) (because we’re using \(x_j\) for data point \(j\))
The math doesn’t care about where the coordinates of \(x\) come from, so we can also do \[ s(x) = \sum_{j=1}^{r}{\beta_j \phi_j(x)} = \beta \cdot \phi(x) \]
- The transformations \(\phi_j\) are features
- \(\phi: X \mapsto \mathbb{R}^r\) transforms data into \(r\)-dimensional feature vectors
- \(s(x)\) is linear in the features, but nonlinear in \(x\)
Same trick works for linear classifiers, linear dimensionality reduction (PCA), etc., etc.

We might want a lot of nonlinear features

We might want a lot of nonlinear features (2)

Impossible to solve this classification problem with a linear method in the original coordinates

We might want a lot of nonlinear features (3)

Easy to solve linearly if we have \(x_1^2\) and \(x_2^2\)

But we don’t know in advance what nonlinear features we’ll need, and often we’ll need more features than input coordinates
- Imagine \(x\) is 10 or 100 dimensional to start with…
Solution: use many nonlinear features \(\phi_1, \phi_2, \ldots\) and estimate which ones we need in a particular situation
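A small sketch of the classification example, assuming Python with NumPy. The data are a hypothetical stand-in for the missing figure (one class on a circle of radius 0.5, the other on a circle of radius 1.5), and the classifier is the sign of a least-squares fit, chosen only because it is the simplest linear method to write down.

```python
import numpy as np

# Hypothetical data: class +1 on a small circle, class -1 on a larger one
theta = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
X = np.vstack([0.5 * circle, 1.5 * circle])
y = np.concatenate([np.ones(50), -np.ones(50)])

def least_squares_accuracy(features, labels):
    """Fit labels ~ intercept + linear function of the features by least
    squares, classify by the sign of the fit, and report training accuracy."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
    predictions = np.sign(design @ coef)
    return float(np.mean(predictions == labels))

# Original coordinates: the outer circle surrounds the inner one, so no
# straight line separates the classes and the accuracy must fall short of 1
accuracy_original = least_squares_accuracy(X, y)

# Squared coordinates: the sum of the two features is the squared radius,
# so the classes are separated by a straight line in feature space
accuracy_squared = least_squares_accuracy(X**2, y)
```

Here we happened to know which two features to add. The point of the next slides is what to do when we don't.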

Explicitly using lots of features would be slow

If we’re using \(r\) features, then \[ s(x) = \sum_{j=1}^{r}{\beta_j \phi_j(x)} \] takes (at least) \(O(r)\) time to calculate…
Do we really care about the feature values?
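To get a feel for how large \(r\) can become, here is a rough count in Python. Polynomial features are an assumption made only for illustration (the lecture does not commit to a particular feature family): the number of monomials of total degree at most \(d\) in \(p\) input coordinates is \(\binom{p+d}{d}\).

```python
from math import comb

def num_polynomial_features(p, degree):
    """Number of monomials of total degree <= degree in p variables,
    counting the constant term."""
    return comb(p + degree, degree)

# Feature counts for 10- and 100-dimensional inputs, degrees 2 and 5
counts = {(p, d): num_polynomial_features(p, d)
          for p in (10, 100) for d in (2, 5)}
# counts[(10, 5)] is 3003, counts[(100, 5)] is 96560646
```

Even at moderate degree, a 100-dimensional input gives tens of millions of features, each of which would have to be computed for every data point.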

Back to linear methods

An important but not obvious fact about linear methods: we can look at them as combinations of features, or as combinations of data points, and these representations are “dual” to each other
Go back to linear regression with ordinary least squares: \[ \hat{s}(x) = x \cdot \hat{\beta} = x (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y} \]
- Remember here \(x\) is a vector (\([1\times p]\)) and \(\mathbf{x}\) is the training data (\([n \times p]\)) \[\begin{eqnarray} x \cdot \hat{\beta} & = & x (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\\ & = & \left( x (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\right)^T\\ & = & \left((\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T \mathbf{y}\right)^T x^T\\ & = & \mathbf{y}^T \mathbf{x} (\mathbf{x}^T\mathbf{x})^{-1} x^T\\ & = & \mathbf{y}^T (\mathbf{x}\mathbf{x}^T)^{-1}\mathbf{x} x^T \end{eqnarray}\] where the second line holds because a scalar equals its own transpose, and the last line uses the matrix identity proved in the back-up slides (with \(\lambda = 0\), assuming both inverses exist)
Why this matters:
- \(\mathbf{x}\mathbf{x}^T\) is the \([n\times n]\) matrix of all the inner products for the training vectors
- \(\mathbf{x} x^T\) is the \([n \times 1]\) matrix of inner products of training vectors with the new vector \(x\)
\(\Rightarrow\) linear regression only needs the inner products, not the actual vectors
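A numerical sanity check of the two representations, sketched in Python with NumPy on made-up data. One assumption to flag: the sketch adds a small ridge term \(\lambda \mathbf{I}\) to both matrices, because with \(\lambda = 0\) and \(n > p\) the \([n\times n]\) matrix \(\mathbf{x}\mathbf{x}^T\) has rank at most \(p\) and cannot be inverted. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))      # training inputs, one row per data point
y = rng.normal(size=n)           # training responses
x_new = rng.normal(size=p)       # new point where we want a prediction
lam = 0.1                        # ridge term so that both inverses exist

# Feature view: a p-dimensional coefficient vector beta
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
primal_prediction = x_new @ beta_hat

# Data-point view: an n-dimensional weight vector alpha, which only
# touches the inputs through the inner products X X^T and X x_new
alpha_hat = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
dual_prediction = alpha_hat @ (X @ x_new)
```

The two predictions agree up to rounding error, even though one solves a \(p \times p\) system and the other an \(n \times n\) system.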

The dual view

In general, whenever we can write \[ \beta \cdot x \] we can also write \[ \alpha \cdot \mathbf{x} x^T \]
- \(\beta\) has as many dimensions as \(x\), but \(\alpha\) has as many dimensions as there are rows in \(\mathbf{x}\), i.e., \(n\)

Back to features

Sum over features \(\Leftrightarrow\) sum over training data points
Computing each feature \(\Leftrightarrow\) taking inner product with each training vector in the feature space
We don’t need to actually compute the features if we can take the inner products
The kernel gives us a way to take the inner products without first computing the features \[ K(x, z) = \sum_{j=1}^{\infty}{\lambda_j \phi_j(x) \phi_j(z)} \]
- It’s a weighted inner product, but that’s OK…
Example: Gaussian kernel \[ K(x, z) = \myexp{-\|x-z\|^2/2} = \sum_{j=0}^{\infty}{\frac{(-1)^j}{j!2^j} (\|x\|^2 + \|z\|^2 - 2x\cdot z)^{j}} \]
- polynomial terms out to all orders, without having to explicitly calculate them all
- less and less weight to higher and higher powers
- If we used a bandwidth, \(\myexp{-\|x-z\|^2/(2h^2)}\), we could control the relative weight of high powers
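To see the "inner product without computing the features" claim concretely, here is a sketch in Python with NumPy for one-dimensional inputs. It relies on the factorization \(\myexp{-(x-z)^2/2} = e^{-x^2/2} e^{-z^2/2} \sum_{j=0}^{\infty}{(xz)^j/j!}\), which gives the explicit features \(\phi_j(x) = e^{-x^2/2} x^j / \sqrt{j!}\). This is one convenient feature map, not necessarily the orthonormal eigenfunctions of the earlier expansion, and the function names are illustrative.

```python
import math
import numpy as np

def gaussian_features_1d(x, num_features):
    """First num_features explicit features of a scalar x:
    phi_j(x) = exp(-x^2 / 2) * x^j / sqrt(j!), for j = 0, 1, ..."""
    j = np.arange(num_features)
    factorials = np.array([math.factorial(int(k)) for k in j], dtype=float)
    return np.exp(-0.5 * x**2) * x**j / np.sqrt(factorials)

def gaussian_kernel_1d(x, z):
    """K(x, z) = exp(-(x - z)^2 / 2), computed directly."""
    return math.exp(-0.5 * (x - z) ** 2)

x, z = 0.7, -0.4
exact = gaussian_kernel_1d(x, z)

# Inner products of truncated feature vectors, for growing numbers of features
approximations = {
    r: float(gaussian_features_1d(x, r) @ gaussian_features_1d(z, r))
    for r in (1, 2, 5, 30)
}
errors = {r: abs(value - exact) for r, value in approximations.items()}
```

The truncated inner products approach the kernel value as features are added, while the kernel itself costs one subtraction, one squaring and one exponential.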

Features vs. kernels

When we write a kernel model \[ s(x) = \sum_{i=1}^{m}{\alpha_i K(x, x_i)} \] we’re also writing a nonlinear-features model \[ s(x) = \sum_{j=1}^{\infty}{\beta_j \phi_j(x)} \] without having to calculate infinitely many (or any) features
Clear computational advantages, but what about learning? Isn’t that a lot of dimensions?

Rademacher complexity of kernel machines

Kernel matrix \(\mathbf{K} =\) the \([n\times n]\) matrix where \(K_{ij} = K(x_i, x_j)\)
Use a restricted model class \(\ModelClass\) where \(s(x) = \beta \cdot \phi(x)\) and \(\|\beta\| \leq c\)
Also assume \(\Prob{K(X, X) \leq r^2}=1\)
Then \[ \EmpRademacher_n(\ModelClass) \leq \frac{c\sqrt{\tr{\mathbf{K}}}}{n} \leq \frac{c r}{\sqrt{n}} \] so \[ \Rademacher_n(\ModelClass) = \Expect{\EmpRademacher_n(\ModelClass)} \leq \frac{c r}{\sqrt{n}} \]
- We’ll prove this in HW 10
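A small numerical illustration of the bound, sketched in Python with NumPy. It assumes the definition of empirical Rademacher complexity without absolute values or a factor of 2, and uses the Cauchy-Schwarz fact that, for random signs \(\sigma_i\), the supremum of \(n^{-1}\sum_{i}{\sigma_i \beta\cdot\phi(x_i)}\) over \(\|\beta\| \leq c\) is \((c/n)\sqrt{\sigma \cdot \mathbf{K} \sigma}\). With only \(n=10\) points we can average over all \(2^{10}\) sign vectors exactly. The Gaussian kernel (so \(r=1\)), the data, and \(c=3\) are arbitrary illustrative choices.

```python
import itertools
import numpy as np

def gaussian_gram(points):
    """Kernel matrix K_ij = exp(-||x_i - x_j||^2 / 2) for the rows of points."""
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(2)
n, c = 10, 3.0
points = rng.normal(size=(n, 2))
K = gaussian_gram(points)
r = 1.0                          # Gaussian kernel: K(x, x) = 1 = r^2

# All 2^n sign vectors sigma, one per row
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
# sigma . K sigma for each sign vector
quad_forms = np.einsum("si,ij,sj->s", signs, K, signs)
# Exact average over sign vectors of the supremum (c/n) sqrt(sigma . K sigma)
empirical_rademacher = c / n * np.mean(np.sqrt(np.maximum(quad_forms, 0.0)))

trace_bound = c * np.sqrt(np.trace(K)) / n
worst_case_bound = c * r / np.sqrt(n)
```

For the Gaussian kernel \(\tr{\mathbf{K}} = n\), so the two bounds coincide; for kernels where \(K(x,x)\) varies, the trace bound can be strictly smaller.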

Think about what this would mean for linear models

Linear models are kernel machines with a very easy kernel function, \(K(x,z) = x \cdot z\)
\(K(x,x) \leq r^2\) would mean \(\|x\|^2 \leq r^2\): no data vector is too big
\(\|\beta\| \leq c\): the coefficient vector can’t be too big
- This is what we used ridge regression to ensure
Implication: if the data vectors can’t get too big, and the coefficient vector also can’t get too big, then the Rademacher complexity is \(O(1/\sqrt{n})\) (if not smaller)

What sorts of things can we do with kernel machines?

Everything we can do with linear models
Linear classifiers \(\Rightarrow\) kernel classifiers
Linear regression \(\Rightarrow\) kernel regression
Ridge regression \(\Rightarrow\) kernel ridge regression
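- Explicitly: \(s(x) = \mathbf{y}^T (\mathbf{x}\mathbf{x}^T + \lambda \mathbf{I})^{-1}\mathbf{x} x^T\), with the inner products \(\mathbf{x}\mathbf{x}^T\) and \(\mathbf{x} x^T\) replaced by the kernel matrix \(\mathbf{K}\) and the vector of \(K(x_i, x)\)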
- KRR has Rademacher complexity \(O(1/\sqrt{n})\) so it generalizes well
Kernel principal components, etc., etc.
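A minimal sketch of kernel ridge regression in Python with NumPy, obtained by swapping the inner products in the ridge formula for kernel values: \(\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1}\mathbf{y}\) and \(s(x) = \sum_{i}{\alpha_i K(x, x_i)}\). The Gaussian kernel, the noise-free sine-curve data, and \(\lambda = 10^{-3}\) are assumptions made for the example, not recommendations.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth=1.0):
    """Matrix of K(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 h^2)) for scalar inputs."""
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

# Toy training data: a sine curve observed without noise
x_train = np.linspace(0.0, 2.0 * np.pi, 30)
y_train = np.sin(x_train)
lam = 1e-3

# Dual weights: one alpha_i per training point
K = gaussian_gram(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def s(x_new):
    """Kernel ridge prediction s(x) = sum_i alpha_i K(x, x_i)."""
    return gaussian_gram(np.atleast_1d(x_new), x_train) @ alpha

train_residual = np.max(np.abs(s(x_train) - y_train))
prediction = s(1.0)[0]           # prediction at a point not in the training set
```

Nothing in the fitting step ever computes a feature vector; the only thing it needs from the inputs is the \([n\times n]\) kernel matrix.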

Summing up

Kernel models or machines have the form \(\sum_{i=1}^{m}{\alpha_i K(x, x_i)}\)
\(K\) has to satisfy the kernel properties
Those properties mean that \(K\) is an inner product in an (implicit) feature space
- That feature space may even have infinitely many dimensions
Linear methods really only need inner products, so
Kernel machines are linear models using many nonlinear features of the inputs, without having to calculate all those features
If we apply a ridge penalty, we have \(\Rademacher_n = O(1/\sqrt{n})\) and good generalization
Next time:
- picking kernels for the problem at hand
- other limits on complexity, based on “margin” and/or giving 0 weight to some training points

Backup: a matrix identity

For any matrix \(\mathbf{v}\) and any scalar \(\lambda\), \[ \mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})^{-1} = (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1} \mathbf{v} \] when both inverses exist

Proof: \[\begin{eqnarray} \mathbf{v}\mathbf{v}^T\mathbf{v} + \lambda \mathbf{v} & = & \mathbf{v}\mathbf{v}^T\mathbf{v} + \lambda \mathbf{v}\\ (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})\mathbf{v} & = & \mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})\\ (\mathbf{v}\mathbf{v}^T + \lambda\mathbf{I})^{-1} (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})\mathbf{v} & = & (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1}\mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})\\ \mathbf{v} & = & (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1}\mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})\\ \mathbf{v} (\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})^{-1} & = & (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1}\mathbf{v}(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})(\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})^{-1}\\ \mathbf{v} (\mathbf{v}^T\mathbf{v} + \lambda \mathbf{I})^{-1} & = & (\mathbf{v}\mathbf{v}^T + \lambda \mathbf{I})^{-1}\mathbf{v} ~ \Box \end{eqnarray}\]
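A quick numerical check of the identity, sketched in Python with NumPy on an arbitrary \(5 \times 3\) matrix with \(\lambda = 0.5\) (both values are illustrative). It is no substitute for the proof, but it is a cheap guard against transposition mistakes.

```python
import numpy as np

rng = np.random.default_rng(3)
v = rng.normal(size=(5, 3))      # any rectangular matrix will do
lam = 0.5                        # lam > 0 guarantees both inverses exist

# Left side: v (v^T v + lam I)^{-1}, which inverts a 3 x 3 matrix
left = v @ np.linalg.inv(v.T @ v + lam * np.eye(3))
# Right side: (v v^T + lam I)^{-1} v, which inverts a 5 x 5 matrix
right = np.linalg.inv(v @ v.T + lam * np.eye(5)) @ v

identity_holds = np.allclose(left, right)
```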

Kernel Machines I