\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \]

Housekeeping

HW 1 is due this evening at 6 pm (Pittsburgh time)
Also be sure to do HW 0 about course policies (on Canvas, auto-graded)

In our last episode

True risk \(r(s) \equiv \Expect{\Loss(Y, s(X))}\)
Empirical risk \(\EmpRisk(s) \equiv n^{-1}\sum_{i=1}^{n}{\Loss(y_i, s(x_i))}\)
Empirical risk minimization: find \(\hat{s}\) that minimizes \(\hat{r}\) and use it to predict
With IID data, \(\EmpRisk(s)\) is an unbiased estimate of \(\Risk(s)\) for any fixed \(s\)
and \(\EmpRisk(s) \rightarrow \Risk(s)\) as \(n\rightarrow\infty\) (law of large numbers, again for fixed \(s\))
and \(\Var{\EmpRisk(s)} = O(1/n)\) (again for fixed \(s\))
“Parameterize” strategies with vector \(\theta\); then \(\hat{\theta} \rightarrow \theta^*\), and usually \(\hat{\theta} = \theta^* + O(1/\sqrt{n})\)
But parameters don’t matter, predictions do…

Reminders about “the usual asymptotics”

Upshots of the Taylor expansion song and dance from last time:

\[\begin{eqnarray} \hat{\theta} & \rightarrow & \theta^* \\ \hat{\theta} & \approx & \theta^* - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*)\\ \mathbf{k} & \equiv & \nabla \nabla \Risk(\theta^*)\\ \Var{\hat{\theta}} & \approx & n^{-1} \mathbf{k}^{-1} \mathbf{j} \mathbf{k}^{-1}\\ \mathbf{j} & \equiv & \Var{\nabla \Loss(Y, s(X;\theta^*))}\\ \hat{\theta} & \rightsquigarrow & \mathcal{N}(\theta^*, n^{-1} \mathbf{k}^{-1} \mathbf{j} \mathbf{k}^{-1}) \end{eqnarray}\]

How do we find those magic matrices?

Approximations/estimates, since \(\EmpRisk \rightarrow \Risk\) and \(\hat{\theta} \rightarrow \theta^*\): \[\begin{eqnarray} \mathbf{k} & \equiv & \nabla\nabla \Risk(\theta^*)\\ & \approx & \nabla \nabla \Risk(\hat{\theta})\\ & \approx & \nabla \nabla \EmpRisk(\hat{\theta})\\ \mathbf{j} & \equiv & \Var{\nabla \Loss(Y, s(X;\theta^*))}\\ & \approx & \Var{\nabla \Loss(Y, s(X;\hat{\theta}))}\\ & \approx & \frac{1}{n}\sum_{i=1}^{n}{\left( \nabla \Loss(y_i, s(x_i;\hat{\theta}))\right) \left( \nabla \Loss(y_i, s(x_i;\hat{\theta}))\right)^T} \end{eqnarray}\]
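- (the last expression is the sample variance-covariance matrix of \(\nabla \Loss\), because the sample mean of \(\nabla \Loss\) at \(\hat{\theta}\) is \(\nabla \EmpRisk(\hat{\theta}) = 0\))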
Just need to compute \(\nabla \Loss\) and \(\nabla \nabla \Loss\)
- Often we did that to find \(\hat{\theta}\) in the first place
You’ll work through an example in HW 2
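A minimal sketch of this plug-in recipe, in Python with NumPy (the lecture does not prescribe a language). It assumes a linear model \(s(x;\theta) = x \cdot \theta\) with squared-error loss, so that \(\nabla \Loss = -2x(y - x\cdot\theta)\) and \(\nabla\nabla \Loss = 2xx^T\). The function name `sandwich_pieces` and the simulated data are made up for illustration.

```python
import numpy as np

def sandwich_pieces(X, y):
    """Plug-in estimates of k, j and Var[theta-hat] for least squares.

    Assumes squared-error loss (y - x.theta)^2 and a linear model.
    """
    n = X.shape[0]
    # ERM for squared error is ordinary least squares
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ theta_hat
    # Row i is the gradient of the loss at data point i, evaluated at theta-hat
    grads = -2.0 * X * resid[:, None]
    # k-hat: Hessian of the empirical risk at theta-hat
    k_hat = 2.0 * X.T @ X / n
    # j-hat: average outer product of the per-point gradients
    j_hat = grads.T @ grads / n
    k_inv = np.linalg.inv(k_hat)
    # Sandwich formula: n^{-1} k^{-1} j k^{-1}
    var_theta = k_inv @ j_hat @ k_inv / n
    return theta_hat, k_hat, j_hat, var_theta

# Simulated data (hypothetical): intercept 1, slope 2, unit-variance noise
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

theta_hat, k_hat, j_hat, var_theta = sandwich_pieces(X, y)
print("theta-hat:", theta_hat)
print("standard errors:", np.sqrt(np.diag(var_theta)))
```

With constant-variance noise, as simulated here, the sandwich variance should come out close to the textbook \(\sigma^2 (\mathbf{x}^T\mathbf{x})^{-1}\).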

But what about the predictions?

\(\theta^*\) is the best parameter value, so best predictions are \(s(x;\theta^*) = s^*(x)\)
- “Best” = risk-minimizing = expected loss minimizing
\(\hat{\theta}\) estimates \(\theta^*\), so \(s(x;\hat{\theta}) = \hat{s}(x)\) estimates the best prediction
Uncertainty about \(\theta^*\) \(\Rightarrow\) uncertainty about best predictions
Variance of \(\hat{\theta}\) \(\Rightarrow\) variance for best predictions
- “Propagation of error”, “uncertainty propagation”, “the delta method”
- Another problem in HW 2
- More fun with \(\nabla\)s and matrices

But what about the predictions? (2)

To preview the homework a little, use a Taylor expansion: \[\begin{eqnarray} s(x;\hat{\theta}) & \approx & s(x;\theta^*) + \nabla s(x;\theta^*) (\hat{\theta} - \theta^*)\\ & = & s(x;\theta^*) + \nabla s(x;\theta^*) O(1/\sqrt{n})\\ & = & s(x;\theta^*) + O(1/\sqrt{n}) \rightarrow s(x;\theta^*)\\ \end{eqnarray}\]

We’ll be more precise in the homework, rather than just \(O(1/\sqrt{n})\)
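As a hedged preview of what "more precise" looks like (this is the standard delta-method statement for a scalar prediction; the homework may organize it differently), combine the first-order expansion with the variance of \(\hat{\theta}\) from the usual asymptotics: \[ \Var{s(x;\hat{\theta})} \approx \nabla s(x;\theta^*) \cdot \left( \Var{\hat{\theta}} \nabla s(x;\theta^*) \right) \approx n^{-1} \nabla s(x;\theta^*) \cdot \left( \mathbf{k}^{-1} \mathbf{j} \mathbf{k}^{-1} \nabla s(x;\theta^*) \right) \]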

What about the risk?

Taylor expand again, to second order; the first-order term vanishes because \(\nabla \Risk(\theta^*) = 0\) at the minimum: \[\begin{eqnarray} \Risk(\hat{\theta}) & \approx & \Risk(\theta^*) + \frac{1}{2} (\hat{\theta} - \theta^*) \cdot \mathbf{k} (\hat{\theta} - \theta^*)\\ & = & \Risk(\theta^*) + O(1/\sqrt{n})O(1/\sqrt{n})\\ & = & \Risk(\theta^*) + O(1/n) \end{eqnarray}\]
Remember that in Lecture 2 we said \[ \Risk(s) = \Risk_0 + (\Risk(s^*) - \Risk_0) + (\Risk(s) - \Risk(s^*)) = \text{minimum risk} + \text{approximation error} + \text{estimation error} \]
Generically, estimation error for ERM is \(O(1/n)\)
- Details later today

Being more precise about the risk

For any fixed \(s\), \[ \Expect{\EmpRisk(s)} = \Risk(s) \]
Define \(\gamma(s) \equiv \EmpRisk(s) - \Risk(s)\), the deviation or fluctuation at \(s\)
\(\Expect{\gamma(s)} = 0\) for any fixed \(s\), and \(\Var{\gamma(s)} = O(1/n)\)
But what about \(\gamma(\hat{s})\)? \(\hat{s}\) is not a fixed strategy!!!

Empirical risk minimization is optimistic

Intuition: \(\hat{s}\) picked to fit this data, so it partly adapts to real patterns that will show up in new data, and partly to the accidents of the training data
Math: \[ \hat{s} = \argmin_{s \in S}{\left( \Risk(s) + \gamma(s) \right)} \]
Picking \(s\) to minimize \(\EmpRisk(s)\) is partly about finding the strategy with the smallest true risk, minimizing \(\Risk(s)\)
Picking \(s\) to minimize \(\EmpRisk(s)\) is partly about finding the luckiest strategy, one with very negative \(\gamma(s)\)
- Remember \(\Expect{\gamma(s)} = 0\) for each \(s\)
Implication: \(\Expect{\gamma(\hat{s})} < 0\)
Implication: \[\begin{eqnarray} \Expect{\EmpRisk(\hat{s})} & = & \Expect{\Risk(\hat{s})} + \Expect{\gamma(\hat{s})}\\ & < & \Expect{\Risk(\hat{s})} \end{eqnarray}\]
The strategy we pick by minimizing the in-sample loss will do worse on new data than it did on the training data (on average)

A toy example (1)

\(Y \sim \mathcal{N}(7, 1)\), \(n=20\) data points (tick marks on the axis):

A toy example (2)

Use squared error, with \(\ModelClass =\) all constants, so we just want to predict the expected value of \(Y\)
True risk when predicting \(\theta\) is \(1+(\theta-7)^2\) (why?)
- What if \(Y\) is Gaussian but with a different variance? What if it were non-Gaussian but with the same expectation and variance?
True risk is minimized at the expected value:

A toy example (3)

Empirical risk is minimized at the sample mean:

A toy example (4)

The difference between the two curves is the risk deviation \(\gamma(\theta)\):
\(\gamma(\theta)\) is really a random function (a stochastic process), and this is one draw from its distribution (one realization of the process)

A toy example (5)

Repeat the simulation many times, to evaluate \(\gamma(\theta)\) for any fixed \(\theta\)
- At each \(\theta\), get an (approximately) Gaussian distribution centered at 0 (why?)

A toy example (6)

Things look different for the \(\hat{\theta}\) chosen by empirical risk minimization:
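The toy example can be re-run as a short simulation. This is a sketch in Python with NumPy, not the code behind the original figures; the number of replications, the seed, and the choice of \(\theta = 7\) as the fixed comparison point are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 7.0, 1.0, 20, 20000

def true_risk(theta):
    # Squared-error risk of predicting the constant theta
    return sigma**2 + (theta - mu)**2

def emp_risk(y, theta):
    # y has one simulated data set per row; theta is a scalar or one value per row
    return np.mean((y - np.reshape(theta, (-1, 1)))**2, axis=1)

# Each row is an independent training set of n points
y = rng.normal(mu, sigma, size=(reps, n))

# Deviation at a fixed theta (here the true optimum, theta = 7)
theta_fixed = mu
gamma_fixed = emp_risk(y, theta_fixed) - true_risk(theta_fixed)

# Deviation at the ERM choice, the sample mean of each data set
theta_hat = y.mean(axis=1)
gamma_hat = emp_risk(y, theta_hat) - true_risk(theta_hat)

print("mean gamma at fixed theta:", gamma_fixed.mean())
print("mean gamma at theta-hat:  ", gamma_hat.mean())
```

For this model the expected deviation at \(\hat{\theta}\) works out to \(-2\sigma^2/n = -0.1\), while at the fixed \(\theta\) it is 0, so the second average should sit near \(-0.1\) and the first near 0.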

Estimating the true risk from the empirical risk

We want to know \(\Risk(\hat{s})\), and we know \(\EmpRisk(\hat{s})\)
- Equivalently, \(\Risk(\hat{\theta})\) and \(\EmpRisk(\hat{\theta})\)
We saw a little while ago that \[ \Risk(\hat{\theta}) \approx \Risk(\theta^*) + \frac{1}{2} (\hat{\theta} - \theta^*) \cdot \mathbf{k} (\hat{\theta} - \theta^*) \]
We don’t know \(\Risk(\theta^*)\) but \(\theta^*\) is fixed so \(\Risk(\theta^*) = \Expect{\EmpRisk(\theta^*)}\)
We don’t know \(\EmpRisk(\theta^*)\) but we can Taylor expand: \[\begin{eqnarray} \EmpRisk(\theta^*) & \approx & \EmpRisk(\hat{\theta}) + \frac{1}{2} (\theta^* - \hat{\theta}) \cdot \left( \nabla \nabla \EmpRisk(\hat{\theta}) (\theta^* - \hat{\theta})\right)\\ & = & \EmpRisk(\hat{\theta}) + \frac{1}{2} (\hat{\theta} - \theta^*) \cdot \left( \nabla \nabla \EmpRisk(\hat{\theta}) (\hat{\theta} - \theta^*)\right)\\ & \approx & \EmpRisk(\hat{\theta}) + \frac{1}{2} (\hat{\theta} - \theta^*) \cdot \left( \nabla \nabla \Risk(\hat{\theta}) (\hat{\theta} - \theta^*)\right)\\ & = & \EmpRisk(\hat{\theta}) + \frac{1}{2} (\hat{\theta} - \theta^*) \cdot \left( \mathbf{k} (\hat{\theta} - \theta^*)\right)\\ \end{eqnarray}\]

Estimating the true risk from the empirical risk (2)

Put the two Taylor series together \[ \Risk(\hat{\theta}) \approx \EmpRisk(\hat{\theta}) + (\hat{\theta} - \theta^*) \cdot \mathbf{k} (\hat{\theta} - \theta^*) \]
Now (see back-up) \[ \Expect{\Risk(\hat{\theta})} \approx \Expect{\EmpRisk(\hat{\theta})} + n^{-1}\tr{\left(\mathbf{j}\mathbf{k}^{-1}\right)} \]
- Remember \(\tr{\mathbf{a}} \equiv \sum_{i}{a_{ii}}=\) sum of the diagonal entries of \(\mathbf{a}\) \(=\) sum of the eigenvalues of \(\mathbf{a}\)
So we can estimate \[ \Risk(\hat{\theta}) \approx \EmpRisk(\hat{\theta}) + n^{-1}\tr{\left(\mathbf{j}\mathbf{k}^{-1}\right)} \]
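The back-up slides are not reproduced here. One standard route to the trace term, assuming \(\Expect{\hat{\theta}} \approx \theta^*\) and using the sandwich variance from the usual asymptotics, is: \[\begin{eqnarray} \Expect{(\hat{\theta} - \theta^*) \cdot \mathbf{k} (\hat{\theta} - \theta^*)} & = & \Expect{\tr{\left(\mathbf{k} (\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^T\right)}}\\ & = & \tr{\left(\mathbf{k} \Expect{(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^T}\right)}\\ & \approx & \tr{\left(\mathbf{k} \Var{\hat{\theta}}\right)}\\ & \approx & n^{-1}\tr{\left(\mathbf{k}\mathbf{k}^{-1}\mathbf{j}\mathbf{k}^{-1}\right)} = n^{-1}\tr{\left(\mathbf{j}\mathbf{k}^{-1}\right)} \end{eqnarray}\]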

Estimating the true risk from the empirical risk (3)

\[ \Risk(\hat{\theta}) \approx \EmpRisk(\hat{\theta}) + n^{-1}\tr{\left(\mathbf{j}\mathbf{k}^{-1}\right)} \]

An (asymptotically) unbiased estimate of the true risk based on the empirical risk
- Still true even if we’re more careful in our math
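The extra term is always \(> 0\) unless \(\mathbf{j} = 0\) (at a minimum \(\mathbf{k}\) is positive-definite, and \(\mathbf{j}\) is a variance matrix), so the empirical risk under-states the true risk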
The extra term measures how “flat” the minimum is (\(\mathbf{k}^{-1}\)) and how much noise there is in the function we’re optimizing (\(\mathbf{j}\))
Lots of other formulas are special cases of this:
- Akaike information criterion (AIC) for model selection with maximum likelihood
  - If we use the log loss and our model is right, then \(\tr{(\mathbf{j}\mathbf{k}^{-1})} =\) number of dimensions in \(\theta\)
- Mallows \(C_p\) for linear regression, \(\EmpRisk(\hat{\theta}) + 2\sigma^2 p/n\) (when we estimate \(p\) coefficients and the true noise around the regression line has variance \(\sigma^2\))
- Generalized Mallows \(C_p\) for linear smoothers, \(\EmpRisk(\hat{s}) + 2\sigma^2 \mathrm{df}(\hat{s})/n\)
  - \(\mathrm{df}(\hat{s}) \equiv \frac{1}{\sigma^2}\sum_{i=1}^{n}{\Cov{Y_i, s(X_i; \hat{\theta})}}\)
The special cases let us see the \(\tr{(\mathbf{j}\mathbf{k}^{-1})}\) term is something like “how flexible is the model?”
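The Mallows \(C_p\) special case can be checked numerically. The sketch below (Python with NumPy) assumes a correctly specified linear model with squared-error loss and constant noise variance; the values of `n`, `p` and `sigma` are arbitrary choices for illustration. It estimates \(\mathbf{j}\) and \(\mathbf{k}\) by plug-in and compares \(\tr{(\mathbf{j}\mathbf{k}^{-1})}\) with \(2\sigma^2 p\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 5000, 5, 1.5

# Simulated linear-regression data (hypothetical coefficients 1, 2, ..., p)
X = rng.normal(size=(n, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + sigma * rng.normal(size=n)

# ERM under squared error is least squares
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ theta_hat

# Per-point gradients of (y - x.theta)^2 at theta-hat
grads = -2.0 * X * resid[:, None]
j_hat = grads.T @ grads / n   # plug-in estimate of j
k_hat = 2.0 * X.T @ X / n     # plug-in estimate of k

penalty_trace = np.trace(j_hat @ np.linalg.inv(k_hat))
mallows_value = 2 * sigma**2 * p

print("tr(j k^-1) estimate:", penalty_trace)
print("2 sigma^2 p:        ", mallows_value)
```

The two numbers should agree up to sampling noise, which is the sense in which \(C_p\) is a special case of the general correction.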

The moral of all this math: approximation-estimation trade-off

Remember how we broke up the risk in Lecture 2: \[\begin{eqnarray} \Risk(\hat{s}) & = & \Risk_0 + (\Risk(\OptimalModel) - \Risk_0) + (\Risk(\hat{s}) - \Risk(\OptimalModel))\\ & = & \text{(optimal risk)} + \text{(approximation error)} + \text{(estimation error)} \end{eqnarray}\]
What happens if we have two sets of strategies, \(\ModelClass_1\) and \(\ModelClass_2\), and \(\ModelClass_1 \subset \ModelClass_2\)?
Usually two optimal models, \(\OptimalModel_1 \in \ModelClass_1\) and \(\OptimalModel_2 \in \ModelClass_2\), and two estimates, \(\hat{s}_1\) and \(\hat{s}_2\)
Because \(\ModelClass_1 \subset \ModelClass_2\):
1. \(\EmpRisk(\hat{s}_1) \geq \EmpRisk(\hat{s}_2)\): empirical risk can only get better by optimizing over more strategies
2. \(\Risk(\OptimalModel_1) \geq \Risk(\OptimalModel_2)\): true risk can only get better by optimizing over more strategies
3. \(\max_{s\in\ModelClass_1}{|\gamma(s)|} \leq \max_{s\in\ModelClass_2}{|\gamma(s)|}\): the maximum deviation can only get bigger by searching over more strategies
Thing (1) means ERM always prefers the bigger, more flexible model class
Thing (2) means that bigger, more flexible model classes will have smaller approximation error
Thing (3) means that bigger, more flexible model classes will have bigger estimation error
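Thing (1) is easy to see in a simulation with nested model classes. The sketch below (Python with NumPy) uses polynomials of increasing degree as \(\ModelClass_1 \subset \ModelClass_2 \subset \ldots\); the regression function \(\sin(3x)\), the noise level, and the sample sizes are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    # Hypothetical data-generating process: smooth curve plus Gaussian noise
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=n)
    return x, y

def design(x, degree):
    # Columns 1, x, x^2, ..., x^degree; each degree nests the previous one
    return np.vander(x, degree + 1, increasing=True)

x_train, y_train = make_data(30)
x_test, y_test = make_data(10000)   # large test set stands in for the true risk

train_risk, test_risk = [], []
for degree in range(0, 11):
    coef, *_ = np.linalg.lstsq(design(x_train, degree), y_train, rcond=None)
    train_risk.append(np.mean((y_train - design(x_train, degree) @ coef) ** 2))
    test_risk.append(np.mean((y_test - design(x_test, degree) @ coef) ** 2))

for degree, (tr_r, te_r) in enumerate(zip(train_risk, test_risk)):
    print(f"degree {degree:2d}  train {tr_r:.4f}  test {te_r:.4f}")
```

The training column can only go down as the degree grows, so ERM alone would always pick the largest class; the test column is what shows the trade-off between things (2) and (3).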

Over-fitting

Over-fitting = fitting a model that’s bigger (more flexible, more powerful) than the one which will predict best (given your data)
Over-fitting happens because optimism increases with model size
- Equivalently, because estimation error grows with model size
Some amount of optimism is built in to empirical risk minimization
- We asked the strategy to optimize performance on the training data, and it did so
ERM is very prone to over-fitting

Avoiding over-fitting

Good practical advice: constrain the strategies using what you know (or believe…) about the situation
Good practical advice: keep it simple
- Tension: sometimes we know things are complicated (e.g. language modeling)
Impose penalties, so we minimize \(\EmpRisk(s) + g(s)\) for some \(g(s)\) which is something like \(\Expect{|\gamma(s)|}\)
- Don’t take that too literally; we’ll come back to penalties later
Get tighter control on over-fitting
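One concrete, hedged instance of penalized ERM: choose among nested polynomial classes by minimizing empirical risk plus the Mallows-style penalty \(2\sigma^2 p/n\) from earlier. This Python sketch assumes \(\sigma^2\) is known, which is rarely true in practice; the function name `select_degree_cp` and the simulated data are hypothetical, and other penalties \(g(s)\) would slot in the same way.

```python
import numpy as np

def select_degree_cp(x, y, sigma2, max_degree):
    """Pick a polynomial degree by minimizing empirical risk + 2 sigma^2 p / n.

    sigma2 is the (assumed known) noise variance; p = degree + 1 coefficients.
    """
    n = len(y)
    scores = []
    for degree in range(max_degree + 1):
        A = np.vander(x, degree + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        emp_risk = np.mean((y - A @ coef) ** 2)
        penalty = 2.0 * sigma2 * (degree + 1) / n
        scores.append(emp_risk + penalty)
    return int(np.argmin(scores)), np.array(scores)

# Simulated data (hypothetical): sin(3x) plus noise with variance 0.09
rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, size=50)
y = np.sin(3 * x) + 0.3 * rng.normal(size=50)

best_degree, scores = select_degree_cp(x, y, sigma2=0.09, max_degree=10)
print("selected degree:", best_degree)
```

Unlike plain ERM, the penalized score does not automatically favor the largest class, because each extra coefficient has to buy a reduction in empirical risk of at least \(2\sigma^2/n\).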

Why the optimism is not the end of the story