36-350
29 October 2014
\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\ER}{f} \newcommand{\TrueR}{f_0} \newcommand{\ERM}{\hat{\theta}_n} \newcommand{\EH}{\widehat{\mathbf{H}}_n} \newcommand{\tr}[1]{\mathrm{tr}\left( #1 \right)} \]
Optional reading: Bottou and Bousquet, “The Tradeoffs of Large Scale Learning”
Typical statistical objective function, mean-squared error: \[ f(\theta) = \frac{1}{n}\sum_{i=1}^{n}{{\left( y_i - m(x_i,\theta)\right)}^2} \]
Getting a value of \( f \) is \( O(n) \), \( \nabla f \) is \( O(np) \), \( \mathbf{H} \) is \( O(np^2) \)
Not bad when \( n=100 \) or even \( n=10^4 \), but if \( n={10}^9 \) or \( n={10}^{12} \), even a single gradient evaluation, just to learn which way to move, is already very costly
Pick one data point \( I \) at random (uniform on \( 1:n \))
Loss there, \( {\left( y_I - m(x_I,\theta)\right)}^2 \), is random, but
\[ \Expect{{\left( y_I - m(x_I,\theta)\right)}^2} = f(\theta) \]
\( \therefore \) Don't optimize with all the data, optimize with random samples
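A quick numerical check that averaging many single-point losses recovers the full-sample objective (a sketch; the toy linear model, slope value, and sample size below are made up for illustration):

set.seed(1)
n <- 1e4
x <- runif(n); y <- 2*x + rnorm(n, 0, 0.1)   # toy model: true slope 2, noise sd 0.1
theta <- 1.5                                  # evaluate the objective away from the truth
mean((y - theta*x)^2)                         # full-sample MSE, f(theta)
I <- sample(1:n, size=n, replace=TRUE)        # many independent one-point draws
mean((y[I] - theta*x[I])^2)                   # their average loss is close to f(theta)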
Draw lots of one-point samples, let their noise cancel out:
Shrinking the step-size like \( 1/t \) ensures that the effect of the noise in each gradient estimate dies down over time
(Variants: put points in some random order, only check progress after going over each point once, adjust \( 1/t \) rate, average a couple of random data points (“mini-batch”), etc.)
# Stochastic gradient descent: at step t, move against the gradient estimated
# from a single random data point, with step size shrinking like rate/t
stoch.grad.descent <- function(f,theta,df,max.iter=1e6,rate=1e-6) {
  for (t in 1:max.iter) {
    g <- stoch.grad(f,theta,df)
    theta <- theta - (rate/t)*g
  }
  return(theta)
}
# Estimate the gradient from one randomly chosen data point: numerically
# differentiate the objective evaluated at just that row (uses numDeriv::grad)
stoch.grad <- function(f,theta,df) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df),size=1)
  noisy.f <- function(theta) { return(f(theta, data=df[i,])) }
  stoch.grad <- grad(noisy.f,theta)
  return(stoch.grad)
}
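A minimal usage sketch for the two functions above (the mse objective, the simulated data, and the tuning choices rate=1, max.iter=1e4 are illustrative, not prescribed):

# The objective must have the signature f(theta, data), as stoch.grad assumes
mse <- function(theta, data) { mean((data$y - theta*data$x)^2) }
set.seed(2)
df <- data.frame(x=runif(500))
df$y <- 2*df$x + rnorm(500, 0, 0.1)   # true slope is 2
stoch.grad.descent(mse, theta=0, df=df, max.iter=1e4, rate=1)  # ends up close to 2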
Stochastic Newton's method, a.k.a. 2nd-order stochastic gradient descent
+ all the Newton-ish tricks to avoid having to recompute the Hessian
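For concreteness, a sketch of one such second-order update in the style of stoch.grad above (the name stoch.newton.step and its interface are invented for illustration):

stoch.newton.step <- function(f, theta, df, rate=1) {
  stopifnot(require(numDeriv))
  i <- sample(1:nrow(df), size=1)                      # one random data point, as before
  noisy.f <- function(theta) { f(theta, data=df[i,]) }
  g <- grad(noisy.f, theta)                            # noisy gradient estimate
  H <- hessian(noisy.f, theta)                         # noisy Hessian estimate
  # in practice H would be averaged over a mini-batch or replaced by a
  # quasi-Newton approximation, since a single point's Hessian can be singular
  return(theta - rate * solve(H, g))
}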
Pros:
Each iteration is fast, and its cost does not grow with \( n \)
Never need to hold all the data in memory at once
Does converge eventually
Cons:
The gradient noise reduces precision: more iterations to get within \( \epsilon \) of the optimum than non-stochastic gradient descent or Newton's method
Performance is sensitive to the step-size schedule
Often low computational cost to get within statistical error of the optimum
We're minimizing \( f \) and aiming at \( \hat{\theta} \)
\( f \) is a function of the data, which are full of useless details
We hope there's some true \( f_0 \), with minimum \( \theta_0 \)
but we know \( f \neq f_0 \)
Past some point, getting a better \( \hat{\theta} \) isn't helping us find \( \theta_0 \)
(why push optimization to \( \pm {10}^{-6} \) if \( f \) only matches \( f_0 \) to \( \pm 1 \)?)
# True (population) objective for the simulation: minimized at b=1, with
# minimum value 0.1^2, the irreducible noise variance
f0 <- function(b) { 0.1^2 + (1/3)*(b-1)^2 }
# Sample objective: mean squared error of slope b on a data frame df
f <- Vectorize(FUN=function(b,df) { mean((df$y - b*df$x)^2) }, vectorize.args="b")
# Simulate n points from the model y = x + Gaussian noise, x ~ Unif(0,1)
simulate_df <- function(n) {
  x <- runif(n)
  y <- x + rnorm(n,0,0.1)
  return(data.frame(x=x,y=y))
}
# Black curve: true objective; grey curves: sample objectives from 100
# independent data sets of size 30
curve(f0(b=x), from=0, to=2)
invisible(replicate(100, curve(f(b=x,df=simulate_df(30)),
                               add=TRUE, col="grey", lwd=0.1)))
(no point in setting the tol of the optimization much below the statistical scatter of the grey sample objectives around the black true one)

Have \( \ER \) (sample objective) but want to minimize \( \TrueR \) (population objective); write \( \theta^* \) for the minimizer of \( \TrueR \)
If \( \ER \) is an average over data points, then (law of large numbers) \[ \Expect{\ER(\theta)} = \TrueR(\theta) \] and (central limit theorem) \[ \ER(\theta) - \TrueR(\theta) = O(n^{-1/2}) \]
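A quick Monte Carlo check of the \( O(n^{-1/2}) \) rate, reusing f() and simulate_df() from the simulation above (sample sizes and replicate count are arbitrary choices):

theta <- 1   # evaluate both objectives at the true slope, where f0(1) = 0.01
for (n in c(30, 300, 3000)) {
  fhats <- replicate(200, f(b=theta, df=simulate_df(n)))
  cat("n =", n, "  sd of f(1):", signif(sd(fhats), 2), "\n")
}
# each tenfold increase in n shrinks the sd by about sqrt(10)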
Do the opposite expansion to the one we used to derive Newton's method: \[ \begin{eqnarray*} \ERM & = & \mathop{\mathrm{argmin}}_{\theta}{\ER(\theta)}\\ \nabla \ER(\ERM) & = & 0\\ &\approx & \nabla \ER(\theta^*) + \EH(\theta^*)(\ERM-\theta^*)\\ \ERM & \approx & \theta^* - \EH^{-1}(\theta^*) \nabla\ER(\theta^*) \end{eqnarray*} \]
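Checking the expansion on the running example, with \( \theta^* = 1 \) (a sketch using numDeriv; because the sample objective is exactly quadratic in b, the one-step formula here recovers the minimizer up to optimize()'s tolerance):

library(numDeriv)
df <- simulate_df(1000)
fhat <- function(b) { mean((df$y - b*df$x)^2) }              # sample objective
one.step <- 1 - as.numeric(grad(fhat, 1)) / as.numeric(hessian(fhat, 1))
exact <- optimize(fhat, interval=c(0, 2))$minimum            # direct minimization
c(one.step=one.step, exact=exact)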
\[ \ERM \approx \theta^* - \EH^{-1}(\theta^*) \nabla\ER(\theta^*) \]
When does \( \EH^{-1}(\theta^*)\nabla\ER(\theta^*) \rightarrow 0 \)?
\[ \begin{eqnarray*} \EH(\theta^*) & \rightarrow & \mathbf{H}(\theta^*) ~ \mathrm{(by\ LLN)}\\ \nabla\ER(\theta^*) - \nabla \TrueR(\theta^*) & = & O(n^{-1/2}) ~ \mathrm{(by\ CLT)} \end{eqnarray*} \]
but \( \nabla \TrueR(\theta^*) = 0 \), since \( \theta^* \) minimizes \( \TrueR \)
\[ \begin{eqnarray*} \therefore \nabla\ER(\theta^*) & = & O(n^{-1/2})\\ \Var{\nabla \ER(\theta^*)} & \rightarrow & n^{-1} \mathbf{K}(\theta^*) ~ \mathrm{(CLT\ again)} \end{eqnarray*} \]
How much noise is there in \( \ERM \)?
\[ \begin{eqnarray*} \Var{\ERM} & = & \Var{\ERM-\theta^*}\\ & = & \Var{\EH^{-1}(\theta^*) \nabla\ER(\theta^*)}\\ & = & \EH^{-1}(\theta^*)\Var{\nabla\ER(\theta^*)} \EH^{-1}(\theta^*)\\ & \rightarrow & n^{-1} \mathbf{H}^{-1}(\theta^*) \mathbf{K}(\theta^*) \mathbf{H}^{-1}(\theta^*) \\ & = &O(pn^{-1}) \end{eqnarray*} \]
so \( \ERM - \theta^* = O(1/\sqrt{n}) \)
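Monte Carlo check of the \( 1/\sqrt{n} \) rate on the running example, using the closed-form minimizer of the sample objective (sample sizes and replicate count are arbitrary choices):

theta.hat <- function(df) { sum(df$x*df$y) / sum(df$x^2) }   # argmin of mean((y-b*x)^2)
for (n in c(100, 400, 1600)) {
  ths <- replicate(500, theta.hat(simulate_df(n)))
  cat("n =", n, "  sd of theta-hat:", signif(sd(ths), 2), "\n")
}
# quadrupling n halves the sd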
How much noise is there in \( \TrueR(\ERM) \)?
\[ \begin{eqnarray*} \TrueR(\ERM) - \TrueR(\theta^*) & \approx & \frac{1}{2}(\ERM-\theta^*)^T \mathbf{H}(\theta^*) (\ERM-\theta^*)\\ \Expect{\TrueR(\ERM) - \TrueR(\theta^*)} & \approx & \frac{1}{2}\tr{\Var{\ERM-\theta^*}\mathbf{H}(\theta^*)} + \frac{1}{2} \Expect{\ERM -\theta^*}^T \mathbf{H}(\theta^*) \Expect{\ERM -\theta^*}\\ &= & \frac{1}{2}\tr{n^{-1} \mathbf{H}^{-1}(\theta^*) \mathbf{K}(\theta^*) \mathbf{H}^{-1}(\theta^*)\mathbf{H}(\theta^*) }\\ & = & \frac{1}{2}n^{-1}\tr{\mathbf{H}^{-1}(\theta^*)\mathbf{K}(\theta^*)}\\ \Var{\TrueR(\ERM)-\TrueR(\theta^*)} & \approx & \tr{\mathbf{H}(\theta^*) \Var{\ERM-\theta^*} \mathbf{H}(\theta^*) \Var{\ERM-\theta^*}}\\ & \rightarrow & n^{-2} \tr{\mathbf{K}(\theta^*)\mathbf{H}^{-1}(\theta^*)\mathbf{K}(\theta^*)\mathbf{H}^{-1}(\theta^*)}\\ & = & O(pn^{-2}) \end{eqnarray*} \]
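On the running example these quantities can be computed by hand: \( \mathbf{H}(\theta^*) = 2\Expect{x^2} = 2/3 \) and \( \mathbf{K}(\theta^*) = 4\Expect{x^2}(0.1)^2 = 1/75 \), so the predicted excess risk is \( \tr{\mathbf{H}^{-1}\mathbf{K}}/2n = 0.01/n \); a quick simulation (replicate count arbitrary) should roughly agree:

n <- 200
excess <- replicate(2000, { d <- simulate_df(n)
                            b.hat <- sum(d$x*d$y) / sum(d$x^2)   # minimizer of the sample objective
                            f0(b.hat) - f0(1) })                 # excess true risk
c(simulated=mean(excess), predicted=0.01/n)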
The ideal case is well-specified maximum likelihood: then \( \mathbf{K} = \mathbf{H} \), and
\[ \begin{eqnarray*} \ERM & \approx & \theta^* - \EH^{-1}(\theta^*) \nabla\ER(\theta^*)\\ \Expect{\TrueR(\ERM) - \TrueR(\theta^*)} & \approx & \frac{1}{2}n^{-1} p\\ \Var{\ERM} &\approx & n^{-1} \mathbf{H}^{-1}(\theta^*) \approx n^{-1} \mathbf{H}^{-1}(\ERM) \\ \Var{\TrueR(\ERM)-\TrueR(\theta^*)} & \approx & n^{-2} p \end{eqnarray*} \]
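For reference, the reason \( \mathbf{K} = \mathbf{H} \) there is the usual information identity (regularity conditions omitted; writing \( z \) for a generic data point and taking the per-observation loss to be \( -\log p(z;\theta) \)):

\[ \mathbf{K}(\theta^*) = \Var{\nabla \log p(z;\theta^*)} = \Expect{\nabla \log p(z;\theta^*)\, \nabla \log p(z;\theta^*)^T} = -\Expect{\nabla \nabla^T \log p(z;\theta^*)} = \mathbf{H}(\theta^*) \]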