Empirical Risk Minimization

36-465/665, Spring 2021

9 February 2021 (Lecture 3)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \]

In our previous episode

Decision problem with available actions \(A\), state \(Y\), information \(X\), loss function \(\Loss(y,a)\)
Set \(S\) of “available” strategies; each \(s \in S\) is a rule saying what action to take for each value of the information \(x\)
Risk of a strategy is its expected loss, \(\Risk(s) \equiv \Expect{\Loss(Y, s(X))}\)
Want to minimize the risk, \(s^* = \argmin_{s \in S}{\Risk(s)}\)

The central difficulty and a way out

True risk is an expectation, which needs the true data-generating distribution
Problem: We don’t know that distribution
Resource: we have training data \((x_1, y_1), \ldots, (x_n, y_n)\)
Approach: empirical risk \[ \EmpRisk(s) \equiv \frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i))} \]
- “For each training case, what action would the strategy \(s\) take, and how well would that have worked out?”
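
The empirical risk formula above can be sketched in a few lines of Python. This is only an illustration: the toy data, the squared-error loss, and the strategy `double` are assumptions made up for the example, not anything from the course.

```python
def empirical_risk(strategy, loss, data):
    """Average loss the strategy would have incurred on the training pairs (x, y)."""
    return sum(loss(y, strategy(x)) for x, y in data) / len(data)

def squared_loss(y, a):
    """Loss for taking action a when the state turns out to be y."""
    return (y - a) ** 2

# Hypothetical training data and a hypothetical strategy s(x) = 2x
data = [(0.0, 0.5), (1.0, 2.0), (2.0, 3.5)]
double = lambda x: 2.0 * x

risk_hat = empirical_risk(double, squared_loss, data)
print(risk_hat)  # (0.25 + 0 + 0.25) / 3
```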

Empirical risk converges on true risk

Assume the \((X_i, Y_i)\) are IID
Remember the law of large numbers: if \(Z_i\) are IID, and \(h\) is any fixed function (where \(\Expect{h(Z)}\) exists), then \[ \frac{1}{n}\sum_{i=1}^{n}{h(Z_i)} \rightarrow \Expect{h(Z)} \]
For a fixed strategy \(s\), \[ \EmpRisk(s) \rightarrow \Risk(s) \]
The central limit theorem also applies: \[ \EmpRisk(s) \rightsquigarrow \mathcal{N}(\Risk(s), \sigma^2_{s}/n) \] where \(\sigma^2_{s} \equiv \Var{\Loss(Y, s(X))}\)
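
A small simulation can make the LLN and CLT statements concrete. The sketch below assumes a made-up data-generating process (\(X\) standard Gaussian, \(Y = X\) plus independent standard Gaussian noise) and the fixed strategy \(s(x) = x\) under squared-error loss, so the true risk is 1. The standard error reported for each \(n\) is the plug-in estimate of \(\sigma_s/\sqrt{n}\).

```python
import math
import random

random.seed(0)

def sample_pair():
    # Hypothetical data-generating process: Y = X + standard Gaussian noise
    x = random.gauss(0.0, 1.0)
    y = x + random.gauss(0.0, 1.0)
    return x, y

def strategy(x):
    # One fixed strategy, chosen before seeing any data
    return x

def losses(n):
    out = []
    for _ in range(n):
        x, y = sample_pair()
        out.append((y - strategy(x)) ** 2)
    return out

results = {}
for n in (100, 10_000, 100_000):
    ls = losses(n)
    emp_risk = sum(ls) / n
    loss_var = sum((l - emp_risk) ** 2 for l in ls) / (n - 1)
    results[n] = (emp_risk, math.sqrt(loss_var / n))  # (estimate, standard error)

for n, (emp_risk, se) in results.items():
    print(n, round(emp_risk, 4), round(se, 4))
```

The empirical risk should settle near the true risk of 1, with a standard error shrinking like \(1/\sqrt{n}\).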

Empirical risk minimization

Empirical risk minimizer: \[ \hat{s} = \argmin_{s \in S}{\EmpRisk(s)} \]
Empirical risk minimization: Find \(\hat{s}\) and use it
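
When \(S\) is a small finite set, ERM is just "score every strategy on the training data and keep the best". A minimal sketch, assuming squared-error loss and three made-up candidate strategies (all names and data here are hypothetical):

```python
def empirical_risk(strategy, data):
    """Mean squared-error loss of the strategy on the training pairs (x, y)."""
    return sum((y - strategy(x)) ** 2 for x, y in data) / len(data)

# A hypothetical, very small strategy set S
candidates = {
    "zero": lambda x: 0.0,
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
}

# Hypothetical training data
data = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8)]

risks = {name: empirical_risk(s, data) for name, s in candidates.items()}
s_hat_name = min(risks, key=risks.get)  # the empirical risk minimizer within S
print(s_hat_name, risks[s_hat_name])
```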

You have already been using ERM

Using ordinary least squares to fit a linear regression
- Or a polynomial, etc., regression
Using maximum likelihood to estimate a probability distribution or a regression model
- See last problem in HW 1
Using nonlinear least squares in regression
On the other hand, some techniques from 402 or 462 are not ERM (e.g., nearest neighbors)
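
To see the "OLS is ERM" claim in action, the sketch below fits a regression through the origin two ways: with the closed-form least-squares slope, and by numerically minimizing the empirical squared-error risk. The data are invented, and the ternary search is just one simple minimizer that works because this risk is convex in \(\theta\).

```python
def emp_risk(theta, data):
    """Empirical squared-error risk of the strategy s(x; theta) = theta * x."""
    return sum((y - theta * x) ** 2 for x, y in data) / len(data)

def ols_slope(data):
    """Closed-form least-squares slope for regression through the origin."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

def minimize_1d(f, lo, hi, tol=1e-9):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# Hypothetical training data
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

theta_ols = ols_slope(data)
theta_erm = minimize_1d(lambda t: emp_risk(t, data), -10.0, 10.0)
print(theta_ols, theta_erm)  # the two should agree up to numerical tolerance
```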

Two issues with ERM

Does it work?
- How close is the empirical risk minimizer to the true risk minimizer?
- Does the ERM strategy approach the optimal one as \(n\rightarrow\infty\)?
- If so, how fast?
- How big is ERM’s estimation error?
How do we implement it?
- How do we actually find the minimum-risk strategy on a computer?
- How do we get the computer to solve the minimization problem fast?

We’ll come back to the second issue (implementation) around lecture 10
For today let’s focus on the first (does it work?)

“The usual asymptotics”

Big idea: empirical risk \(=\) true risk plus noise that gets small with \(n\)
\(\Rightarrow\) ERM \(=\) true risk minimizer plus noise that gets small with \(n\)
- Also called “small-noise asymptotics”
We’ll use some calculus to get
- Convergence of \(\hat{s}\) to \(s^*\)
- Fluctuations of \(\hat{s}\)

Calculus reminders about optimization

We want to minimize \(f(\theta)\)
Start with \(\theta\) being one-dimensional
First derivative \(f^{\prime}\), second derivative \(f^{\prime\prime}\)
If \(f^{\prime}(\theta^*) = 0\) (first-order condition) and \(f^{\prime\prime}(\theta^*) > 0\) (second-order condition), then \(\theta^*\) is a minimum
Quibbles:
- Maybe just a local (not global) minimum
- Can be a minimum where \(f^{\prime\prime}(\theta^*) = 0\) (like \(\theta^4\) at 0)
- Can be a minimum on the boundary even if \(f^{\prime} \neq 0\) (as in HW 1)
Nonetheless, a generic (local) minimum \(\theta^*\) is in the interior and has \(f^{\prime}(\theta^*) = 0\), \(f^{\prime\prime}(\theta^*) > 0\)

Calculus reminders about optimization (2)

Taylor expansion: \[\begin{eqnarray} f(\theta) & = & f(\theta^*) + f^{\prime}(\theta^*) (\theta-\theta^*) + \frac{1}{2}f^{\prime\prime}(\theta^*) (\theta-\theta^*)^2 + o((\theta-\theta^*)^2)\\ & = & f(\theta^*) + \frac{1}{2}f^{\prime\prime}(\theta^*) (\theta-\theta^*)^2 + o((\theta-\theta^*)^2) \end{eqnarray}\]
- “Local minima generically look like parabolas”
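
A quick numerical check of the parabola approximation, using the example function \(f(\theta) = e^{\theta} - \theta\) (chosen here purely for illustration). It has \(\theta^* = 0\), \(f(\theta^*) = 1\) and \(f^{\prime\prime}(\theta^*) = 1\). If the remainder really is \(o((\theta-\theta^*)^2)\), the error divided by \((\theta-\theta^*)^2\) should shrink as we approach the minimum.

```python
import math

def f(theta):
    # Illustrative function with a generic minimum at theta = 0
    return math.exp(theta) - theta

theta_star = 0.0
f_second = 1.0  # f''(theta) = exp(theta), evaluated at theta_star

def parabola(theta):
    """Second-order Taylor approximation around the minimum."""
    return f(theta_star) + 0.5 * f_second * (theta - theta_star) ** 2

ratios = []
for h in (0.5, 0.1, 0.02):
    err = abs(f(theta_star + h) - parabola(theta_star + h))
    ratios.append(err / h ** 2)  # should go to 0 as h goes to 0

print(ratios)
```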

Multivariable calculus reminders about optimization

With \(\theta\) multi-dimensional, we need all partial derivatives to be 0, or gradient 0: \[ \nabla f(\theta^*) = 0 \]
Role of \(f^{\prime\prime}\) is played by the matrix of 2nd partial derivatives or Hessian matrix, \(\nabla \nabla f\)
- \(\nabla \nabla f(\theta^*)\) should be a positive-definite matrix, meaning \(\vec{v} \cdot \left( \nabla \nabla f(\theta^*) \vec{v} \right) > 0\) for any vector \(\vec{v} \neq 0\)
We get \[\begin{eqnarray} f(\theta) & \approx & f(\theta^*) + \frac{1}{2} (\theta - \theta^*) \cdot \left( \nabla \nabla f(\theta^*) (\theta-\theta^*)\right)\\ \end{eqnarray}\]
- “Every slice through a generic local minimum looks like a parabola”
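
The second-order condition can be checked numerically. The sketch below assumes a two-parameter example, \(f(\theta_1, \theta_2) = \theta_1^2 + \theta_1\theta_2 + \theta_2^2\) (made up for illustration), approximates its Hessian at the origin by central finite differences, and tests positive-definiteness with Sylvester's criterion for symmetric \(2 \times 2\) matrices (positive top-left entry and positive determinant).

```python
def f(t1, t2):
    # Illustrative function with its minimum at (0, 0)
    return t1 ** 2 + t1 * t2 + t2 ** 2

def hessian_2d(f, t1, t2, h=1e-4):
    """Central finite-difference approximation to the 2x2 Hessian of f."""
    f11 = (f(t1 + h, t2) - 2 * f(t1, t2) + f(t1 - h, t2)) / h ** 2
    f22 = (f(t1, t2 + h) - 2 * f(t1, t2) + f(t1, t2 - h)) / h ** 2
    f12 = (f(t1 + h, t2 + h) - f(t1 + h, t2 - h)
           - f(t1 - h, t2 + h) + f(t1 - h, t2 - h)) / (4 * h ** 2)
    return [[f11, f12], [f12, f22]]

def is_positive_definite_2x2(m):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return m[0][0] > 0 and det > 0

H = hessian_2d(f, 0.0, 0.0)
print(H, is_positive_definite_2x2(H))  # exact Hessian is [[2, 1], [1, 2]]
```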

Back to ERM

\[ \hat{s} = \argmin_{s \in S}{\EmpRisk(s)} \]
- On the other hand \[ s^* = \argmin_{s \in S}{\Risk(s)} \]

We want to see whether / how fast \(\hat{s}\) approaches \(s^*\)
Each strategy \(s\) corresponds to a parameter vector \(\theta\)
If \(\hat{\theta} \rightarrow \theta^*\) then \(\hat{s} \rightarrow s^*\)
- Unless we’re stupid/perverse about parameterizing the rules
Convergence of vectors is easier to understand!

The empirical risk minimizer is the true minimizer plus noise

\(\hat{\theta}\) minimizes \(\EmpRisk\): \[\begin{eqnarray} 0 & = & \nabla \EmpRisk(\hat{\theta})\\ & \approx & \nabla \EmpRisk(\theta^*) + \nabla \nabla \EmpRisk(\theta^*) (\hat{\theta} - \theta^*) ~ (\text{Taylor expand around}\ \theta^*)\\ \hat{\theta} & \approx & \theta^* - \left( \nabla \nabla \EmpRisk(\theta^*) \right)^{-1} \nabla \EmpRisk(\theta^*) \end{eqnarray}\]
\(\theta^*\) is fixed, so \[\begin{eqnarray} \EmpRisk(\theta^*) & \rightarrow & \Risk(\theta^*) ~ \text{(LLN)}\\ \nabla \EmpRisk(\theta^*) & \rightarrow & \nabla \Risk(\theta^*) ~ \text{(usually)}\\ \nabla \nabla \EmpRisk(\theta^*) & \rightarrow & \nabla \nabla \Risk(\theta^*) \equiv \mathbf{k} ~ \text{(usually)} \end{eqnarray}\]
So \[ \hat{\theta} \approx \theta^* - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*) \]

Asymptotic convergence and unbiasedness

\[ \hat{\theta} \approx \theta^* - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*) \]

But \(\nabla \EmpRisk(\theta^*) \rightarrow \nabla \Risk(\theta^*) = 0\)
So \(\hat{\theta} \rightarrow \theta^*\) (asymptotic convergence)
Usually \(\Expect{\nabla \EmpRisk} = \nabla \Expect{\EmpRisk} = \nabla\Risk\), so also \[ \Expect{\hat{\theta}} \approx \theta^* - 0 = \theta^* \]
\(\therefore\) \(\hat{\theta}\) is (asymptotically) unbiased

Sandwich covariance

\[\begin{eqnarray} \hat{\theta} & \approx & \theta^* - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*)\\ \Var{\hat{\theta}} & \approx & \Var{\theta^* - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*)}\\ & = & \Var{\mathbf{k}^{-1} \nabla \EmpRisk(\theta^*)}\\ & = & \mathbf{k}^{-1} \Var{\nabla \EmpRisk(\theta^*)} \mathbf{k}^{-1} \end{eqnarray}\]

This is the “sandwich covariance matrix” for \(\hat{\theta}\)
- a.k.a. “sandwich variance”

Sandwich covariance (2)

Remember \(\EmpRisk\) is a sample average, so, with gradients taken in \(\theta\) and evaluated at \(\theta^*\), \[\begin{eqnarray} \Var{\nabla \EmpRisk(\theta^*)} & = & \Var{\nabla \left(\frac{1}{n}\sum_{i=1}^{n}{\Loss(Y_i, s(X_i;\theta^*))} \right)} \\ & = & \Var{\frac{1}{n}\sum_{i=1}^{n}{\nabla \Loss(Y_i, s(X_i;\theta^*))}}\\ & = & \frac{1}{n^2}n\Var{\nabla \Loss(Y, s(X;\theta^*))} ~ \text{(IID)}\\ & \equiv & \frac{\mathbf{j}}{n} \end{eqnarray}\] so \[ \Var{\hat{\theta}} \approx \frac{1}{n} \mathbf{k}^{-1} \mathbf{j} \mathbf{k}^{-1} \]
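
The sandwich formula can be checked by simulation. The sketch below is an illustration under invented assumptions: strategies \(s(x;\theta) = \theta x\), squared-error loss, \(X\) uniform on \([0,2]\), and \(Y = 1.5X + X Z\) with \(Z\) standard Gaussian, so the noise gets larger with \(x\). For this design \(\theta^* = 1.5\), \(\mathbf{k} = 2\Expect{X^2} = 8/3\), \(\mathbf{j} = 4\Expect{X^4} = 12.8\), and so \(\mathbf{k}^{-1}\mathbf{j}\mathbf{k}^{-1} = 1.8\). The code compares that value with the variance of \(\hat{\theta}\) across many simulated data sets, and with a plug-in sandwich estimate from a single data set.

```python
import random

random.seed(1)
THETA_STAR = 1.5  # true risk-minimizing slope in this hypothetical simulation

def draw_sample(n):
    data = []
    for _ in range(n):
        x = random.uniform(0.0, 2.0)
        y = THETA_STAR * x + x * random.gauss(0.0, 1.0)  # noise sd grows with x
        data.append((x, y))
    return data

def erm_slope(data):
    # Minimizer of the empirical squared-error risk for s(x; theta) = theta * x
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

def sandwich_variance(data):
    """Plug-in estimate of (1/n) k^{-1} j k^{-1} from one data set."""
    n = len(data)
    theta_hat = erm_slope(data)
    # k: second derivative of the empirical risk, 2 * mean(x^2)
    k_hat = 2.0 * sum(x * x for x, _ in data) / n
    # j: variance of the per-observation gradient -2 x (y - theta x);
    # its sample mean is exactly zero at theta_hat, so the mean square suffices
    j_hat = sum((-2.0 * x * (y - theta_hat * x)) ** 2 for x, y in data) / n
    return j_hat / (k_hat ** 2 * n)

n, reps = 500, 1000
estimates = [erm_slope(draw_sample(n)) for _ in range(reps)]
mean_est = sum(estimates) / reps
mc_variance = sum((t - mean_est) ** 2 for t in estimates) / (reps - 1)
plug_in = sandwich_variance(draw_sample(n))
theory = 1.8 / n  # (1/n) k^{-1} j k^{-1} for this design

print(mc_variance, plug_in, theory)
```

The Monte Carlo variance should sit close to the theoretical value, while the plug-in estimate from a single data set is noisier but of the same order.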

How far is \(\hat{\theta}\) from \(\theta^*\)?

Just saw \(\hat{\theta}\) is asymptotically unbiased
\(\Var{\hat{\theta}} = O(1/n)\)
So \(\Expect{\|\hat{\theta} - \theta^*\|^2} = O(1/n)\) (squared bias plus variance)
So \[ \hat{\theta} = \theta^* + O_P(1/\sqrt{n}) \] i.e., the typical size of the error shrinks like \(1/\sqrt{n}\)
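
The \(1/\sqrt{n}\) rate is easy to see in a simulation. This sketch assumes the simplest possible setting, invented for illustration: constant strategies \(s(x;\theta) = \theta\) under squared-error loss, with \(Y\) Gaussian with mean 3 and standard deviation 1, so \(\hat{\theta}\) is the sample mean. If the error is of order \(1/\sqrt{n}\), the root-mean-squared error multiplied by \(\sqrt{n}\) should stay roughly constant as \(n\) grows.

```python
import math
import random

random.seed(2)

def erm_constant(ys):
    # Under squared-error loss, the best constant action is the sample mean
    return sum(ys) / len(ys)

def rmse_of_erm(n, reps=1000, theta_star=3.0, sd=1.0):
    """Root-mean-squared error of the ERM estimate over repeated samples of size n."""
    sq_errors = []
    for _ in range(reps):
        ys = [random.gauss(theta_star, sd) for _ in range(n)]
        sq_errors.append((erm_constant(ys) - theta_star) ** 2)
    return math.sqrt(sum(sq_errors) / reps)

# RMSE times sqrt(n) should be roughly the same (about sd = 1) for every n
scaled = {n: rmse_of_erm(n) * math.sqrt(n) for n in (25, 100, 400)}
print(scaled)
```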

Gaussian fluctuations

The CLT holds for \(\EmpRisk(\theta)\) at any fixed \(\theta\), so usually there is also a CLT for \(\nabla \EmpRisk\): \[ \nabla \EmpRisk(\theta^*) \rightsquigarrow \mathcal{N}(0, \mathbf{j}/n) \]
So: \[ \hat{\theta} \rightsquigarrow \mathcal{N}(\theta^*, n^{-1} \mathbf{k}^{-1} \mathbf{j} \mathbf{k}^{-1}) \]
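
A worked special case may help as a sanity check. It is an illustration, not part of the original argument: take constant strategies \(s(x;\theta) = \theta\) with squared-error loss \(\Loss(y,a) = (y-a)^2\), and assume \(\Var{Y} = \tau^2 < \infty\) (the symbol \(\tau^2\) is introduced only for this example). Then \(\theta^* = \Expect{Y}\), \(\hat{\theta}\) is the sample mean, and \[\begin{eqnarray} \nabla \Loss(Y, \theta^*) & = & -2(Y - \theta^*)\\ \mathbf{k} & = & \nabla \nabla \Risk(\theta^*) = 2\\ \mathbf{j} & = & \Var{-2(Y - \theta^*)} = 4\tau^2\\ \frac{1}{n} \mathbf{k}^{-1} \mathbf{j} \mathbf{k}^{-1} & = & \frac{1}{n} \cdot \frac{1}{2} \cdot 4\tau^2 \cdot \frac{1}{2} = \frac{\tau^2}{n} \end{eqnarray}\]
- So in this case the general result reduces to the familiar CLT for a sample mean, \(\hat{\theta} \rightsquigarrow \mathcal{N}(\Expect{Y}, \tau^2/n)\)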

2 cheers for ERM

Hooray! ERM converges on the best-in-class parameters \(\theta^*\)
Hooray! ERM converges pretty fast in parametric problems, \(O(1/\sqrt{n})\)

Some reasons to be a bit more hesitant

Who cares about \(\theta^*\)? We care about \(s^*\)!
Is \(\Risk(\hat{s})\) converging on \(\Risk(s^*)\)? If so, how quickly?
Is \(\EmpRisk(\hat{s})\) a good estimate of \(\Risk(\hat{s})\)? How much worse could the true risk be than the empirical risk?
Can we guarantee anything about \(\Risk(\hat{s})\), without having to wait for the asymptotics to kick in?
- Do the asymptotics kick in at \(n=60\) or \(n=6000\) or \(n=6\times{10}^{23}\)?
We will start tackling the issue of “how good is the optimum we’ve found anyway?” next time
Making some non-asymptotic guarantees will pre-occupy us for a couple of weeks after that

Backup: generic minima look, locally, like parabolas

\(f(\theta)\) (solid) vs. \(f(\theta^*) + \frac{1}{2}f^{\prime\prime}(\theta^*) (\theta-\theta^*)^2\) (dashed) around the local minimum \(\theta^*\)