Empirical Risk Minimization

36-465/665, Spring 2021

9 February 2021 (Lecture 3)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \]

In our previous episode

The central difficulty and a way out

Empirical risk converges on true risk

Empirical risk minimization

You have already been using ERM

Two issues with ERM

  1. Does it work?
    • How close is the empirical risk minimizer to the true risk minimizer?
    • Does the ERM strategy approach the optimal one as \(n\rightarrow\infty\)?
    • If so, how fast?
    • How big is ERM’s estimation error?
  2. How do we implement it?
    • How do we actually find the minimum-risk strategy on a computer?
    • How do we get the computer to solve the minimization problem fast?
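As a rough illustration of the implementation question, here is a minimal sketch of ERM done numerically. Everything specific in it is an assumption made for the example, not something from the lecture: the one-parameter prediction rule \(y = \theta x\), squared-error loss, the simulated data, and the names `empirical_risk` and `erm_gradient_descent`. It minimizes the empirical risk by plain gradient descent and compares the answer with the closed-form least-squares slope.

```python
import random


def empirical_risk(theta, xs, ys):
    """Average squared-error loss of the rule y = theta * x on the sample."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def empirical_risk_gradient(theta, xs, ys):
    """Derivative of the empirical risk with respect to theta."""
    return sum(-2.0 * x * (y - theta * x) for x, y in zip(xs, ys)) / len(xs)


def erm_gradient_descent(xs, ys, theta0=0.0, step=0.1, iters=500):
    """Minimize the empirical risk by fixed-step gradient descent."""
    theta = theta0
    for _ in range(iters):
        theta -= step * empirical_risk_gradient(theta, xs, ys)
    return theta


# Simulated data (assumed for illustration): true slope 2, Gaussian noise.
rng = random.Random(465)
true_slope = 2.0
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [true_slope * x + rng.gauss(0.0, 0.5) for x in xs]

theta_hat = erm_gradient_descent(xs, ys)

# For this loss and rule the minimizer is also available in closed form.
theta_closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print("gradient descent:", theta_hat)
print("closed form:     ", theta_closed_form)
```

The fixed step size works here only because this empirical risk is a well-conditioned parabola; harder problems generally need a line search or a second-order method.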

“The usual asymptotics”

Calculus reminders about optimization

Calculus reminders about optimization (2)

Multivariable calculus reminders about optimization

Back to ERM

The empirical risk minimizer is
\[ \hat{s} = \argmin_{s}{\EmpRisk(s)} \]
On the other hand, the true risk minimizer is
\[ s^* = \argmin_{s}{\Risk(s)} \]

The empirical risk minimizer is the true minimizer plus noise

Asymptotic convergence and unbiasedness

\[ \hat{\theta} \approx \theta^* - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*) \]
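A sketch of where this approximation comes from, under assumptions that are not stated on this slide: \(\mathbf{k}\) is taken to be the Hessian of the true risk at its minimum, \(\mathbf{k} = \nabla\nabla\Risk(\theta^*)\); \(\hat{\theta}\) is an interior minimum of the empirical risk, so the gradient vanishes there; and \(\hat{\theta}\) is close enough to \(\theta^*\) for a first-order Taylor expansion of the gradient to be reasonable.

\[\begin{eqnarray} 0 = \nabla \EmpRisk(\hat{\theta}) & \approx & \nabla \EmpRisk(\theta^*) + \nabla\nabla \EmpRisk(\theta^*) (\hat{\theta} - \theta^*)\\ & \approx & \nabla \EmpRisk(\theta^*) + \mathbf{k} (\hat{\theta} - \theta^*)\\ \hat{\theta} - \theta^* & \approx & - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*) \end{eqnarray}\]

The second line replaces the empirical Hessian with its large-sample limit. If differentiation and expectation can be exchanged, then \(\Expect{\nabla \EmpRisk(\theta^*)} = \nabla \Risk(\theta^*) = 0\), so the noise term has mean zero to this order, which is the sense in which \(\hat{\theta}\) is approximately unbiased.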

Sandwich covariance

\[\begin{eqnarray} \hat{\theta} & \approx & \theta^* - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*)\\ \Var{\hat{\theta}} & \approx & \Var{\theta^* - \mathbf{k}^{-1} \nabla \EmpRisk(\theta^*)}\\ & = & \Var{\mathbf{k}^{-1} \nabla \EmpRisk(\theta^*)}\\ & = & \mathbf{k}^{-1} \Var{\nabla \EmpRisk(\theta^*)} \mathbf{k}^{-1} \end{eqnarray}\]
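The sandwich formula can be checked by simulation in a scalar case. The sketch below is an illustration under assumed conditions, not the lecture's own example: the rule \(y = \theta x\) with squared-error loss, standard Gaussian \(x\), independent Gaussian noise, and the hypothetical helper names `fit_slope`, `simulate` and `plug_in_sandwich`. In this setup \(\mathbf{k} = 2\Expect{x^2} = 2\) and \(\Var{\nabla \EmpRisk(\theta^*)} = 4\sigma^2/n\), so the sandwich variance is \(\sigma^2/n\).

```python
import random
import statistics


def fit_slope(xs, ys):
    """ERM for squared-error loss with the rule y = theta * x (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def simulate(n, theta_star, noise_sd, rng):
    """Draw n pairs with standard Gaussian x and Gaussian noise around theta_star * x."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [theta_star * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def plug_in_sandwich(xs, ys):
    """Estimate k^{-1} Var[gradient] k^{-1} from one sample (scalar case)."""
    n = len(xs)
    theta_hat = fit_slope(xs, ys)
    # Second derivative of the empirical risk, standing in for k.
    k_hat = 2.0 * sum(x * x for x in xs) / n
    # Per-observation gradients of the loss, evaluated at theta_hat.
    grads = [-2.0 * x * (y - theta_hat * x) for x, y in zip(xs, ys)]
    # Variance of the average of n such gradients.
    var_grad = statistics.variance(grads) / n
    return var_grad / k_hat ** 2


rng = random.Random(3)
theta_star, noise_sd, n, reps = 2.0, 0.5, 100, 2000

# Monte Carlo variance of theta_hat across independent replications.
estimates = [fit_slope(*simulate(n, theta_star, noise_sd, rng)) for _ in range(reps)]
mc_variance = statistics.variance(estimates)

# Population sandwich for this setup: k = 2, Var[gradient] = 4 sigma^2 / n.
k = 2.0
var_grad = 4.0 * noise_sd ** 2 / n
sandwich_variance = var_grad / k ** 2

# Plug-in version from a single, larger sample.
big_n = 2000
xs_big, ys_big = simulate(big_n, theta_star, noise_sd, rng)
plug_in_variance = plug_in_sandwich(xs_big, ys_big)

print("Monte Carlo variance (n=100):", mc_variance)
print("sandwich variance (n=100):   ", sandwich_variance)
print("plug-in sandwich (n=2000):   ", plug_in_variance)
```

The Monte Carlo and sandwich values should agree only approximately, since the formula is a large-sample approximation and the simulation itself is noisy.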

Sandwich covariance (2)

How far is \(\hat{\theta}\) from \(\theta^*\)?

Gaussian fluctuations

2 cheers for ERM

  1. Hooray! ERM converges on the best-in-class parameters \(\theta^*\)
  2. Hooray! ERM converges pretty fast in parametric problems: the estimation error shrinks like \(O(1/\sqrt{n})\)
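One way to see the \(1/\sqrt{n}\) rate is to quadruple the sample size and watch the typical error roughly halve. The sketch below assumes a deliberately simple case chosen for the illustration: predicting with a single constant under squared-error loss, where the empirical risk minimizer is the sample mean. The function names and simulation settings are hypothetical.

```python
import math
import random


def erm_constant(ys):
    """Under squared-error loss, the best constant prediction on a sample is its mean."""
    return sum(ys) / len(ys)


def rms_error(n, reps, mu, sd, rng):
    """Root-mean-square distance between the ERM estimate and the true optimum mu."""
    total_sq = 0.0
    for _ in range(reps):
        ys = [rng.gauss(mu, sd) for _ in range(n)]
        total_sq += (erm_constant(ys) - mu) ** 2
    return math.sqrt(total_sq / reps)


rng = random.Random(2021)
rms_small = rms_error(n=50, reps=1000, mu=1.0, sd=2.0, rng=rng)
rms_large = rms_error(n=200, reps=1000, mu=1.0, sd=2.0, rng=rng)

# With an O(1/sqrt(n)) rate, four times the data should give about half the error.
ratio = rms_small / rms_large
print("RMS error at n=50: ", rms_small)
print("RMS error at n=200:", rms_large)
print("ratio:", ratio)
```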

Some reasons to be a bit more hesitant

Backup: generic minima look, locally, like parabolas

\(f(\theta)\) (solid) vs. \(f(\theta^*) + \frac{1}{2}f^{\prime\prime}(\theta^*) (\theta-\theta^*)^2\) (dashed) around the local minimum \(\theta^*\)
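A small numerical check of the same idea, using a hypothetical function that is not the one plotted: \(f(\theta) = \theta - \log\theta\), which has its minimum at \(\theta^* = 1\) with \(f^{\prime\prime}(\theta^*) = 1\). If the parabola is the right local approximation, the gap between \(f\) and the parabola should be of third order in the distance from \(\theta^*\), so halving the distance should cut the gap by a factor of about 8.

```python
import math


def f(theta):
    """Hypothetical example function, minimized at theta = 1."""
    return theta - math.log(theta)


theta_star = 1.0
f_second = 1.0  # f''(theta) = 1 / theta**2, evaluated at theta_star = 1


def parabola(theta):
    """Second-order Taylor approximation of f around its minimum."""
    return f(theta_star) + 0.5 * f_second * (theta - theta_star) ** 2


def approximation_error(h):
    """Absolute gap between f and the parabola at distance h from the minimum."""
    return abs(f(theta_star + h) - parabola(theta_star + h))


err_far = approximation_error(0.1)
err_near = approximation_error(0.05)

# A third-order gap shrinks by roughly 2**3 = 8 when h is halved.
shrink_factor = err_far / err_near
print("error at h=0.1: ", err_far)
print("error at h=0.05:", err_near)
print("shrink factor:  ", shrink_factor)
```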