Stochastic Optimization

36-350
29 October 2014

Agenda

  • Why there's such a thing as too much data
  • How to make noise our friend
  • How to make slop our friend

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\ER}{f} \newcommand{\TrueR}{f_0} \newcommand{\ERM}{\hat{\theta}_n} \newcommand{\EH}{\widehat{\mathbf{H}}_n} \newcommand{\tr}[1]{\mathrm{tr}\left( #1 \right)} \newcommand{\argmin}{\mathop{\mathrm{argmin}}} \]

Optional reading: Bottou and Bousquet, “The Tradeoffs of Large Scale Learning”

Problems with Big Data

  • Typical statistical objective function, mean-squared error: \[ f(\theta) = \frac{1}{n}\sum_{i=1}^{n}{{\left( y_i - m(x_i,\theta)\right)}^2} \]

  • Getting a value of \( f \) is \( O(n) \), \( \nabla f \) is \( O(np) \), \( \mathbf{H} \) is \( O(np^2) \) (see the timing sketch after this list)

    • worse still if \( m \) slows down with \( n \)
  • Not bad when \( n=100 \) or even \( n=10^4 \), but if \( n={10}^9 \) or \( n={10}^{12} \) we don't even know which way to move
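
To make these costs concrete, here is a rough timing sketch (my own, not from the slides): one evaluation of a mean-squared-error objective for a toy linear model with \( p=10 \), timed at several sample sizes. The names mse, m, etc. are made up for this illustration; the elapsed time should grow roughly linearly in \( n \).

m <- function(x, theta) { as.vector(x %*% theta) }   # toy linear model, p = 10
mse <- function(theta, x, y) { mean((y - m(x, theta))^2) }
theta <- rep(1, 10)
for (n in c(1e4, 1e5, 1e6)) {
  x <- matrix(rnorm(n*10), nrow=n)
  y <- rnorm(n)
  print(system.time(mse(theta, x, y)))   # one evaluation already touches every data point
}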

Problems with Big Data

[cartoon: “sarcastic gradient descent”]

Sampling, an Alternative to Sarcastic Gradient Descent

  • Pick one data point \( I \) at random (uniform on \( 1:n \))

  • Loss there, \( {\left( y_I - m(x_I,\theta)\right)}^2 \), is random, but

\[ \Expect{{\left( y_I - m(x_I,\theta)\right)}^2} = f(\theta) \]

  • Generally, if \( f(\theta) = n^{-1}\sum_{i=1}^{n}{f_i(\theta)} \) and \( f_i \) are well-behaved, \[ \begin{eqnarray*} \Expect{f_I(\theta)} & = & f(\theta)\\ \Expect{\nabla f_I(\theta)} & = & \nabla f(\theta)\\ \Expect{\nabla^2 f_I(\theta)} & = & \mathbf{H}(\theta) \end{eqnarray*} \]

\( \therefore \) Don't optimize with all the data, optimize with random samples
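
A quick numerical check of this unbiasedness (my own sketch, with a made-up data set and candidate slope): averaging the one-point loss over many random draws of \( I \) recovers the full-data objective.

set.seed(42)
n <- 1000
x <- runif(n); y <- x + rnorm(n, 0, 0.1)     # toy data
theta <- 0.7                                 # an arbitrary candidate slope, m(x,theta) = theta*x
f.full <- mean((y - theta*x)^2)              # f(theta), averaging over all n points
I <- sample(1:n, size=1e5, replace=TRUE)     # many independent draws of the random index I
f.sampled <- mean((y[I] - theta*x[I])^2)     # Monte Carlo average of the one-point loss
c(full=f.full, sampled=f.sampled)            # the two should agree closely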

Stochastic Gradient Descent

Draw lots of one-point samples, let their noise cancel out:

  1. Start with initial guess \( \theta \), learning rate \( \eta \)
  2. While ((not too tired) and (making adequate progress))
    • At \( t^{\mathrm{th}} \) iteration, pick random \( I \) uniformly
    • Set \( \theta\leftarrow \theta - t^{-1}\eta\nabla f_I(\theta) \)
  3. Return final \( \theta \)

Shrinking the step size like \( 1/t \) ensures that the effect of the noise in each gradient dies down over the iterations

(Variants: put points in some random order, only check progress after going over each point once, adjust the \( 1/t \) rate, average the gradient over a small random batch of points (“mini-batch”), etc.)

Stochastic Gradient Descent

# Stochastic gradient descent: at each step, follow the (noisy) gradient at one
# randomly-chosen data point, with a step size shrinking like 1/t
stoch.grad.descent <- function(f,theta,df,max.iter=1e6,rate=1e-6) {
  for (t in 1:max.iter) {
    g <- stoch.grad(f,theta,df)    # noisy one-point gradient
    theta <- theta - (rate/t)*g    # decaying step size rate/t
  }
  return(theta)
}

# Gradient of the loss at a single randomly-chosen data point
stoch.grad <- function(f,theta,df) {
  stopifnot(require(numDeriv))     # for numerical gradients via grad()
  i <- sample(1:nrow(df),size=1)   # pick one data point uniformly at random
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  stoch.grad <- grad(noisy.f,theta)
  return(stoch.grad)
}
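
A hypothetical usage example (mine, not from the slides), fitting the regression-through-origin model used later in these notes, \( y = bx \) with true slope 1. The loss function, learning rate, and iteration count are hand-picked for this toy problem.

set.seed(36)
df <- data.frame(x=runif(1000))
df$y <- df$x + rnorm(1000, 0, 0.1)             # true slope 1, noise s.d. 0.1
sq.loss <- function(theta, data) { mean((data$y - theta*data$x)^2) }
stoch.grad.descent(sq.loss, theta=0, df=df, max.iter=1e4, rate=1)
# should return a slope estimate close to 1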

Stochastic Newton's Method

a.k.a. 2nd order stochastic gradient descent

  1. Start with initial guess \( \theta \)
  2. While ((not too tired) and (making adequate progress))
    • At \( t^{\mathrm{th}} \) iteration, pick uniformly-random \( I \)
    • \( \theta \leftarrow \theta - t^{-1}\mathbf{H}_{I}^{-1}(\theta) \nabla f_{I}(\theta) \)
  3. Return final \( \theta \)

+ all the Newton-ish tricks to avoid having to recompute the Hessian
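
A rough sketch of this in R (my own, following the pattern of stoch.grad.descent() above, with none of the Newton-ish tricks): one-point Hessians are often singular or badly conditioned, so treat this as illustrative rather than something to rely on.

stoch.newton <- function(f, theta, df, max.iter=100) {
  stopifnot(require(numDeriv))
  for (t in 1:max.iter) {
    i <- sample(1:nrow(df), size=1)                      # pick a random data point
    noisy.f <- function(theta) { f(theta, data=df[i,]) }
    g <- grad(noisy.f, theta)                            # one-point gradient
    H <- hessian(noisy.f, theta)                         # one-point Hessian
    theta <- theta - (1/t)*solve(H, g)                   # damped Newton step
  }
  return(theta)
}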

Stochastic Gradient Methods

  • Pros:

    • Each iteration is fast (and constant in \( n \))
    • Never need to hold all data in memory
    • Does converge eventually
  • Cons:

    • Noise does reduce precision — more iterations to get within \( \epsilon \) of optimum than non-stochastic GD or Newton

Often low computational cost to get within statistical error of the optimum

How Precise?

  • We're minimizing \( f \) and aiming at \( \hat{\theta} \)

  • \( f \) is a function of the data, which are full of useless details

    • e.g., \( f(\theta) = n^{-1}\sum_{i=1}^{n}{(y_i-m(x_i;\theta))^2} \)
  • We hope there's some true \( f_0 \), with minimum \( \theta_0 \)

    • e.g., \( f_0(\theta) = \Expect{(Y-m(X;\theta))^2} = \int{(y-m(x;\theta))^2 p(x,y) dx dy} \)
  • but we know \( f \neq f_0 \)

    • because we've only got a finite sample, which includes noise
  • Past some point, getting a better \( \hat{\theta} \) isn't helping us find \( \theta_0 \)

(why push optimization to \( \pm {10}^{-6} \) if \( f \) only matches \( f_0 \) to \( \pm 1 \)?)

How Precise?

  • An example
    • true model: \( Y=X + \) Gaussian noise with mean 0, s.d. 0.1, \( X \) uniform between 0 and 1
    • Try regression through the origin (\( Y=bX \), \( b \) unknown)
  • We'll plot the population objective function, and a bunch of sample objective functions from different \( n=30 \) draws

f0 <- function(b) { 0.1^2 + (1/3)*(b-1)^2 }   # population MSE: sigma^2 + (b-1)^2 * E[X^2]
f <- Vectorize(FUN=function(b,df) { mean((df$y - b*df$x)^2) }, vectorize.args="b")  # sample MSE
simulate_df <- function(n) {
  x <- runif(n)
  y <- x + rnorm(n,0,0.1)
  return(data.frame(x=x,y=y))
}

How Precise?

curve(f0(b=x), from=0, to=2)
invisible(replicate(100, curve(f(b=x, df=simulate_df(30)),
  add=TRUE, col="grey", lwd=0.1)))

[plot: population objective \( f_0 \) overlaid with 100 sample objective functions from \( n=30 \) draws (grey)]

How Precise?

  • Pretty generally:
    • \( \hat{\theta} - \theta_0 = O(1/\sqrt{n}) \) at best (see the simulation sketch after this list)
    • \( f(\hat{\theta}) - f(\theta_0) = O(1/n) \) at best
    • see bonus slides for gory details
  • Not much point in pushing an optimization past those limits
    • More work gets a better approximation to \( \hat{\theta} \)…
    • but the numerical precision is swamped by the statistical noise…
    • so the answer isn't really any more accurate or precise than if we'd stopped sooner
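
A quick simulation (mine), re-using f() and simulate_df() from the example above, suggests the \( O(1/\sqrt{n}) \) rate: the spread of the sample minimizer \( \hat{b} \) shrinks so that \( \sqrt{n}\,\mathrm{sd}(\hat{b}) \) stays roughly constant.

bhat <- function(df) { optimize(f, interval=c(0,2), df=df)$minimum }  # sample minimizer
for (n in c(30, 300, 3000)) {
  bs <- replicate(200, bhat(simulate_df(n)))
  cat("n =", n, " sd(bhat) =", signif(sd(bs),3),
      " sqrt(n)*sd(bhat) =", signif(sqrt(n)*sd(bs),3), "\n")
}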

How Precise?

  • We shouldn't care about differences in \( \theta \) much smaller than \( O(1/\sqrt{n}) \)…
    • the tol of the optimization
  • but what's the constant multiplying \( 1/\sqrt{n} \)?
  • Could do some heroic calculus + linear algebra
    • “sandwich covariance matrix” (see bonus slides; the answer involves the Hessian)
  • Can use tricks like the jackknife to get an idea of what the statistical uncertainty is
    • though the bootstrap is usually better than the jackknife (take 36-402)
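
Here is a minimal jackknife sketch (my own), re-using simulate_df() and the bhat() helper from the previous sketch: drop one point at a time, re-fit, and turn the spread of the leave-one-out estimates into a standard error.

set.seed(36)
df <- simulate_df(30)
b.full <- bhat(df)
b.loo <- sapply(1:nrow(df), function(i) { bhat(df[-i,]) })     # leave-one-out re-fits
jackknife.se <- sqrt(((nrow(df)-1)^2/nrow(df)) * var(b.loo))   # jackknife standard error
c(estimate=b.full, se=jackknife.se)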

Summary

  • Stochastic gradient descent etc. deliberately use samples of the data, rather than all of it at once
    • Gives up precision for speed and memory
    • Makes fitting models to enormous data sets computationally tractable
  • Since the function we're optimizing is noisy anyway, don't bother pushing the numerical precision much below the statistical uncertainty
    • Can estimate uncertainty by statistical theory
    • or by re-sampling methods

Asymptotics of Optimization

Have \( \ER \) (sample objective) but want to minimize \( \TrueR \) (population objective)

If \( \ER \) is an average over data points, then (law of large numbers) \[ \Expect{\ER(\theta)} = \TrueR(\theta) \] and (central limit theorem) \[ \ER(\theta) - \TrueR(\theta) = O(n^{-1/2}) \]

Asymptotics of Optimization

Do the opposite expansion to the one we used to derive Newton's method: \[ \begin{eqnarray*} \ERM & = & \argmin_{\theta}{\ER(\theta)}\\ \nabla \ER(\ERM) & = & 0\\ &\approx & \nabla \ER(\theta^*) + \EH(\theta^*)(\ERM-\theta^*)\\ \ERM & \approx & \theta^* - \EH^{-1}(\theta^*) \nabla\ER(\theta^*) \end{eqnarray*} \]

Asymptotics of Optimization

\[ \ERM \approx \theta^* - \EH^{-1}(\theta^*) \nabla\ER(\theta^*) \]

When does \( \EH^{-1}(\theta^*)\nabla\ER(\theta^*) \rightarrow 0 \)?

\[ \begin{eqnarray*} \EH(\theta^*) & \rightarrow & \mathbf{H}(\theta^*) ~ \mathrm{(by\ LLN)}\\ \nabla\ER(\theta^*) - \nabla \TrueR(\theta^*) & = & O(n^{-1/2}) ~ \mathrm{(by\ CLT)} \end{eqnarray*} \]

but \( \nabla \TrueR(\theta^*) = 0 \), since \( \theta^* \) minimizes \( \TrueR \)

\[ \begin{eqnarray*} \therefore \nabla\ER(\theta^*) & = & O(n^{-1/2})\\ \Var{\nabla \ER(\theta^*)} & \rightarrow & n^{-1} \mathbf{K}(\theta^*) ~ \mathrm{(CLT\ again)} \end{eqnarray*} \]

Asymptotics of Optimization

How much noise is there in \( \ERM \)?

\[ \begin{eqnarray*} \Var{\ERM} & = & \Var{\ERM-\theta^*}\\ & = & \Var{\EH^{-1}(\theta^*) \nabla\ER(\theta^*)}\\ & = & \EH^{-1}(\theta^*)\Var{\nabla\ER(\theta^*)} \EH^{-1}(\theta^*)\\ & \rightarrow & n^{-1} \mathbf{H}^{-1}(\theta^*) \mathbf{K}(\theta^*) \mathbf{H}^{-1}(\theta^*) \\ & = &O(pn^{-1}) \end{eqnarray*} \]

so \( \ERM - \theta^* = O(1/\sqrt{n}) \)
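
A numerical sanity check (mine) of this sandwich formula, on the regression-through-origin example from the main slides: there \( \theta^* = 1 \), the per-point Hessian is \( 2X^2 \), and the per-point gradient at \( \theta^* \) is \( -2X(Y-X) \). This re-uses f() and simulate_df() from above.

set.seed(36)
bhat <- function(df) { optimize(f, interval=c(0,2), df=df)$minimum }
big <- simulate_df(1e6)                           # large sample to approximate H and K
H <- mean(2*big$x^2)                              # H(theta*) = E[2 X^2]
K <- var(-2*big$x*(big$y - big$x))                # K(theta*) = Var of the one-point gradient
n <- 1000
sandwich.sd <- sqrt(K/(n*H^2))                    # sqrt of n^-1 H^-1 K H^-1 (scalar case)
empirical.sd <- sd(replicate(500, bhat(simulate_df(n))))
c(sandwich=sandwich.sd, empirical=empirical.sd)   # should agree closely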

Asymptotics of Optimization

How much noise is there in \( \TrueR(\ERM) \)?

\[ \begin{eqnarray*} \TrueR(\ERM) - \TrueR(\theta^*) & \approx & \frac{1}{2}(\ERM-\theta^*)^T \mathbf{H}(\theta^*) (\ERM-\theta^*)\\ \Expect{\TrueR(\ERM) - \TrueR(\theta^*)} & \approx & \frac{1}{2}\tr{\Var{(\ERM-\theta^*)}\mathbf{H}(\theta^*)} + \frac{1}{2} \Expect{\ERM -\theta^*}^T \mathbf{H}(\theta^*) \Expect{\ERM -\theta^*}\\ &= & \frac{1}{2}\tr{n^{-1} \mathbf{H}^{-1}(\theta^*) \mathbf{K}(\theta^*) \mathbf{H}^{-1}(\theta^*)\mathbf{H}(\theta^*) }\\ & = & \frac{1}{2}n^{-1}\tr{\mathbf{H}^{-1}(\theta^*)\mathbf{K}(\theta^*)}\\ \Var{\TrueR(\ERM)-\TrueR(\theta^*)} & \approx & \tr{\left(\mathbf{H}(\theta^*) \Var{\ERM-\theta^*} \mathbf{H}(\theta^*) \Var{\ERM-\theta^*}\right)}\\ & \rightarrow & n^{-2} \tr{\left(\mathbf{K}(\theta^*)\mathbf{H}^{-1}(\theta^*)\mathbf{K}(\theta^*)\mathbf{H}^{-1}(\theta^*)\right)}\\ & = & O(pn^{-2}) \end{eqnarray*} \]

Asymptotics of Optimization

The ideal case is well-specified maximum likelihood: then \( \mathbf{K} = \mathbf{H} \), and

\[ \begin{eqnarray*} \ERM & \approx & \theta^* - \EH^{-1}(\theta^*) \nabla\ER(\theta^*)\\ \Expect{\TrueR(\ERM) - \TrueR(\theta^*)} & \approx & \frac{1}{2}n^{-1} p\\ \Var{\ERM} &\approx & n^{-1} \mathbf{H}^{-1}(\theta^*) \approx n^{-1} \mathbf{H}^{-1}(\ERM) \\ \Var{\TrueR(\ERM)-\TrueR(\theta^*)} & \approx & n^{-2} p \end{eqnarray*} \]
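
A last check (mine): if we take the objective to be the true Gaussian negative log-likelihood (with the noise s.d. treated as known) rather than the raw squared error, then \( \mathbf{K} = \mathbf{H} \) for the toy regression-through-origin example, as the well-specified-likelihood case predicts. This again re-uses simulate_df() from the main slides.

set.seed(36)
sigma <- 0.1                                      # the true, known noise s.d.
big <- simulate_df(1e6)
score <- -big$x*(big$y - big$x)/sigma^2           # per-point gradient of -log-likelihood at theta* = 1
K <- var(score)
H <- mean(big$x^2)/sigma^2                        # expected per-point Hessian of -log-likelihood
c(K=K, H=H)                                       # should be nearly equal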