\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \]

Previously

We want models/strategies with low risk \(=\) low expected loss on new data
Within each model class, we estimate by empirical risk minimization (ERM), or by penalized ERM
- Adding a penalty is equivalent to imposing a constraint (Lagrange)
Within a model class, we can control the risk
- Oracle inequality \(=\) how far is our risk from the best-in-class risk?
- Generalization error bound \(=\) how far is our risk from the empirical risk?
We now have ways to pick among model classes / pick how much to regularize
- Cross-validation (especially leave-one-out when time allows)
- Hold-out
- Large-sample approximations to LOOCV like \(n^{-1} \tr{\mathbf{j}\mathbf{k}^{-1}}\) or AIC \(d/n\) for \(d\)-parameter models

What we’re up to today

Model selection which will converge on the best attainable risk across all our models
Need to arrange the models in a hierarchy or “structure”
Then we use generalization error bounds to decide which model class to go to
It’s like using a penalty, but it’s one related to the goal of generalizing well

Penalties for model selection

We’ve used penalties for fitting within a model
We can also view penalties as how we do model selection

Examples of model selection penalties

Clearest for AIC, where we maximize \[ AIC_i = \log{p(z_{1:n}; \hat{\theta}_i)} - d_i \] across model classes \(i\)
- Maximum likelihood minus a penalty
Equivalently, we use the log probability loss and minimize \[ \hat{r}(\hat{\theta}_i) + d_i/n \]
- ERM (with the log loss) plus a penalty
- Penalty gets bigger as there are more parameters to estimate
Optimism is also a penalty: \[ \hat{r}(\hat{s}_i) + n^{-1}\tr{\mathbf{j}_i \mathbf{k}_{i}^{-1}} \]
- Penalty gets bigger as there’s more noise in the loss function (variance of the gradient \(\mathbf{j}\) grows) and as the minimum gets flatter/more vulnerable to that noise (Hessian/curvature matrix \(\mathbf{k}\) shrinks)
- Penalty gets bigger as the parameters get harder to estimate
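The equivalence between "maximize AIC" and "minimize ERM plus \(d_i/n\)" can be checked numerically. The following is a minimal sketch, not anything from the lecture itself: it assumes simulated data, a hierarchy of Gaussian polynomial regressions, and a parameter count of degree \(+ 2\) (coefficients plus noise variance). All names and settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: quadratic signal plus Gaussian noise (illustrative assumption)
n = 150
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)

def gaussian_loglik(degree):
    """Maximized Gaussian log-likelihood of a polynomial regression, and its parameter count."""
    coefs = np.polyfit(x, y, deg=degree)
    resid = y - np.polyval(coefs, x)
    sigma2 = np.mean(resid**2)  # maximum-likelihood estimate of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    n_params = degree + 2  # (degree + 1) coefficients, plus the variance
    return loglik, n_params

degrees = list(range(0, 8))
aic = []
penalized_risk = []
for deg in degrees:
    loglik, d = gaussian_loglik(deg)
    aic.append(loglik - d)                      # log-likelihood minus penalty
    penalized_risk.append(-loglik / n + d / n)  # empirical log loss plus d/n

best_by_aic = degrees[int(np.argmax(aic))]
best_by_risk = degrees[int(np.argmin(penalized_risk))]
```

Since the penalized risk is just \(-AIC_i/n\), the two criteria pick the same model class; only the sign and scale differ.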

Examples of model selection penalties (2)

We can also see CV and hold-out as penalties
Easier to write out for hold-out, where we split data into training set \(D_t\) and selection set \(D_s\) \[\begin{eqnarray} \HoldoutRisk(\hat{s}_i) & = & \EmpRisk(\hat{s}_i) + (\HoldoutRisk(\hat{s}_i) - \EmpRisk(\hat{s}_i))\\ \HoldoutRisk(\hat{s}_i) - \EmpRisk(\hat{s}_i) & = & \frac{1}{n_s}\sum_{j \in D_s}{\Loss(z_j, \hat{s}_i)} - \frac{1}{n_t}\sum_{j \in D_t}{\Loss(z_j, \hat{s}_i)} \end{eqnarray}\]
This is a random, data-dependent penalty
Penalty grows as the performance of the model on two different data sets gets further apart
This is like what we called “discrepancy” when we were building our way to the Rademacher complexity
- It’s also not quite the same as discrepancy
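To see the hold-out penalty as a number, here is a small sketch under assumed conditions: simulated data, squared-error loss, polynomial model classes, and a 50/50 random split. None of these choices come from the lecture; they are only there to make the decomposition concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy cubic signal (illustrative assumption)
n = 200
x = rng.uniform(-1, 1, size=n)
y = x**3 - x + rng.normal(scale=0.3, size=n)

# Random split into training set D_t and selection set D_s
perm = rng.permutation(n)
train, select = perm[: n // 2], perm[n // 2 :]

def holdout_penalty(degree):
    """Fit on D_t; return (empirical risk, hold-out risk, hold-out penalty)."""
    coefs = np.polyfit(x[train], y[train], deg=degree)
    emp_risk = np.mean((y[train] - np.polyval(coefs, x[train])) ** 2)
    holdout_risk = np.mean((y[select] - np.polyval(coefs, x[select])) ** 2)
    return emp_risk, holdout_risk, holdout_risk - emp_risk

# One (empirical risk, hold-out risk, penalty) triple per model class
results = {deg: holdout_penalty(deg) for deg in range(1, 9)}
```

The empirical risk can only go down as the degree grows, because the classes are nested; the penalty is random and can even be negative for a particular split.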

What’d be a good penalty?

We’d like to pick the model class \(\ModelClass_i\) which minimizes \(\Risk(\hat{s}_i)\)
What we have instead is \(\EmpRisk(\hat{s}_i)\)
How are they related? \[ \Risk(\hat{s}_i) = \EmpRisk(\hat{s}_i) + (\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i)) \]
- The fluctuation term \((\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i))\) will be bigger for higher-capacity, more complex models (all else being equal)
- The fluctuation will be smaller for larger \(n\) (all else being equal)

Good penalties approximate the over-fitting

The best penalty would be \((\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i))\)
AIC, optimism, etc., try to approximate \(\Expect{\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i)}\)
CV, holdout, are more direct estimates of \(\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i)\)
But this expression \(\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i)\) should look familiar

Model-selection penalties vs. generalization error bounds

The penalty we want is \[ \Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i) \]
A generalization error bound for \(\ModelClass_i\) says \[ \Prob{\Risk(\hat{s}_i) \geq \EmpRisk(\hat{s}_i) + g_i(n, \alpha)} \leq \alpha \]
- \(g_i\) rather than our old \(g\) as a reminder that different model classes will have different bounds
Equivalently \[ \Prob{\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i) \geq g_i(n, \alpha)} \leq \alpha \]
\(\Rightarrow\) Generalization error bounds are also penalties for model selection

Vapnik’s “Structural Risk Minimization”

Have a nested series of models \(\ModelClass_1 \subset \ModelClass_2 \subset \ldots \subset \ModelClass_q \subset \ldots\)
- Vapnik called this a “structure”, other people prefer “hierarchy”
\(\VCD(\ModelClass_i) =d_i < \infty\)
- But \(\VCD\) of the whole collection might be \(\infty\)
Each \(\ModelClass_i\) has generalization bound \(g_i(n, \alpha)\)
Pick the model class that optimizes the generalization bound
\(=\) minimize \[ \EmpRisk(\hat{s}_i) + g_i(n, \alpha) \]
\(=\) use \(g_i(n, \alpha)\) as the penalty

What does this typically look like?

A cartoon picture:

What does this typically look like? (2)

What happens when we get more data:

The VC bound penalties are \(O(\sqrt{\frac{\log{n/(d_i\alpha)}}{n/d_i}})\) so the penalties get weaker as \(n\) grows
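As a rough worked example, ignoring the constants in the bound and taking \(d_i = 10\), \(\alpha = 0.05\) purely for illustration: \[ n = 10^3: \sqrt{\frac{\log{(2\times 10^3)}}{10^2}} \approx 0.28, \qquad n = 10^5: \sqrt{\frac{\log{(2\times 10^5)}}{10^4}} \approx 0.035 \]
A hundred-fold increase in \(n\) shrinks this penalty by a bit less than a factor of ten, because of the logarithm in the numerator.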

SRM, slightly more concretely

Fix a hierarchy of models \(\ModelClass_1 \subset \ModelClass_2 \subset \ldots\)
Fix a maximum order \(q(n)\) (in advance of the data)
Fix \(\alpha_n \rightarrow 0\) (in advance of the data)
Get the data and find \(\hat{s}_i\), \(\EmpRisk(\hat{s}_i)\), \(g_i(n, \alpha_n)\), for \(i \in 1:q(n)\)
Return \(\hat{k} = \argmin_{i \in 1:q(n)}{\EmpRisk(\hat{s}_i) + g_i(n, \alpha_n)}\), and \(\hat{s}_{\hat{k}}\)
Claim: if \(q(n)\) grows slowly enough, and \(\alpha_n\) shrinks slowly enough, then \(\Risk(\hat{s}_{\hat{k}}) \rightarrow \min_{s \in \bigcup{\ModelClass_i}}{\Risk(s)}\) (with high probability)
- See backup slides for “slowly enough”
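The steps above can be sketched in code. This is only an illustration under several assumptions that are not in the lecture: the hierarchy is histogram classifiers on \(2^i\) equal-width bins of \([0,1)\) (nested, with VC dimension \(2^i\)), the loss is 0-1 loss, the data are simulated, and the penalty is the order-of-magnitude VC expression with its constant set to 1. Real VC bounds carry explicit constants, so the function `vc_penalty` is a stand-in, not a usable bound.

```python
import numpy as np

def vc_penalty(n, d, alpha):
    """Order-of-magnitude VC penalty; the constant 1 is an assumption."""
    return np.sqrt(np.log(n / (d * alpha)) / (n / d))

def fit_histogram_classifier(x, y, n_bins):
    """ERM over classifiers constant on n_bins equal-width bins: majority vote per bin."""
    bins = np.minimum((x * n_bins).astype(int), n_bins - 1)
    ones = np.bincount(bins, weights=y, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    labels = (2 * ones > counts).astype(int)  # ties and empty bins default to 0
    emp_risk = np.mean(labels[bins] != y)
    return labels, emp_risk

def srm(x, y, max_order, alpha):
    """Return the selected order k_hat, plus empirical risks and penalized scores."""
    n = len(y)
    emp_risks, scores = [], []
    for i in range(1, max_order + 1):
        d_i = 2**i  # VC dimension of the i-th class
        _, emp_risk = fit_histogram_classifier(x, y, d_i)
        emp_risks.append(emp_risk)
        scores.append(emp_risk + vc_penalty(n, d_i, alpha))
    k_hat = int(np.argmin(scores)) + 1  # orders are numbered from 1
    return k_hat, emp_risks, scores

# Hypothetical data: label is 1 on [0.25, 0.75), with 10% of labels flipped
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 1, size=n)
clean = ((x >= 0.25) & (x < 0.75)).astype(int)
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - clean, clean)

# max_order and alpha are fixed before looking at the data, as in the recipe
k_hat, emp_risks, scores = srm(x, y, max_order=6, alpha=0.05)
```

In this simulation the true decision boundary lines up with 4 bins, so one would expect the procedure to settle on \(i = 2\): finer histograms lower the empirical risk only slightly, while the penalty keeps growing with \(d_i\).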

SRM, pros and cons

Directness: If we want to ensure that the selected model will generalize, why not select the model for guaranteed generalization?
With some work-in-advance, you can guarantee convergence to the best possible risk
- … within the hierarchy of models you consider
You do need to do that prep work
The tighter the generalization error bound, the better
- VC bounds are for the worst possible data generating distribution so they are “conservative” (=cautious)
- Less conservative bounds may be more realistic and/or more forgiving of complexity
- You can still do SRM with (say) Rademacher or algorithmic-stability bounds
Cross-validation and holdout just need us to be able to fit the model and randomly split the data, no thinking required
- Human thought can be much more expensive than machine time or even human programming

Summing up

We can think of model selection as yet another penalized optimization problem
- Optimize best fit \(\EmpRisk(\hat{s}_i)\) \(+\) penalty for model complexity
The ideal penalty would be \(\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i)\)
AIC, optimism, and even CV and holdout are ways of estimating / approximating this penalty
Generalization error bounds are directly about \(\Risk(\hat{s}_i) - \EmpRisk(\hat{s}_i)\)
Structural risk minimization is selecting the model class that minimizes \(\EmpRisk\) \(+\) generalization bound
SRM converges on the risk of the best model in the whole hierarchy
SRM requires more, and more mathematical, prep work than just doing CV
Next time: could we avoid having to select a model at all?

Backup: Filling in a bit of detail on SRM

Each \(\ModelClass_i\) has a bound that’s violated with probability at most \(\alpha_n\)
We’re looking at \(q(n)\) models so the probability that any of them break their bounds is at most \(q(n) \alpha_n\)
We want to make sure that:
1. The probability that any bound gets broken \(\rightarrow 0\), and
2. The bound for each model class \(\rightarrow 0\)
(1) means that \(q(n) \alpha_n \rightarrow 0\)
As for (2), for \(\ModelClass_i\), \(g_i(n,\alpha_n)\) will be on the order of \(\sqrt{\frac{\log{(n/(d_i \alpha_n))}}{n/d_i}}\)
We want this to be \(o(1)\) even for the largest model, and the square root \(\rightarrow 0\) exactly when its argument does, so we need \[ \log{\frac{n}{d_{q(n)} \alpha_n}} = o(n/d_{q(n)}) \]
\(\log{(n/d_{q(n)})} = o(n/d_{q(n)})\) requires \(n/d_{q(n)} \rightarrow \infty\), or \(d_{q(n)} = o(n)\)
So we also need \[ -\log{\alpha_n} = o(n/d_{q(n)}) \] which we could say as \(\alpha_n = \myexp{-o(n/d_{q(n)})}\)
Going back to (1), we also need \(q(n) \alpha_n \rightarrow 0\), so \(q(n)\) needs to be small compared to \(\myexp{o(n/d_{q(n)})}\)
The maximum VC dimension has to grow slower than \(n\), so slowly that samples-per-dimension \(\rightarrow \infty\), and \(\alpha\) has to shrink at a rate that’s less than exponential in data per dimension
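One concrete choice that satisfies all of these conditions, offered only as an illustration (the specific rates are assumptions, not part of the argument above): take \(d_i = i\), \(q(n) = \lfloor \sqrt{n} \rfloor\) and \(\alpha_n = 1/n\). Then \[ q(n)\alpha_n \leq n^{-1/2} \rightarrow 0, \qquad \frac{n}{d_{q(n)}} \geq \sqrt{n} \rightarrow \infty, \qquad \log{\frac{n}{d_{q(n)} \alpha_n}} = \log{\frac{n^2}{\lfloor \sqrt{n} \rfloor}} \approx \frac{3}{2}\log{n} = o(\sqrt{n}) \]
So the chance of any bound failing vanishes, and the penalty for even the largest model class considered still goes to zero.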

Model Selection II, Mostly Structural Risk Minimization