Lecture 13: Optimization

Statistical Computing, 36-350

Monday November 2, 2015

Outline

Optional reading: take a look at the Convex Optimization class notes at http://www.stat.cmu.edu/~ryantibs/convexopt/

Important examples of optimization problems

Optimization problems

Given an objective function or criterion function \(f: \mathcal{D} \to \mathbb{R}\), find

\[ \theta^* = {\mathop{\mathrm{argmin}}}_{\theta} \; f(\theta) \]

Note that maximizing \(f\) is the same as minimizing \(-f\):

\[ {\mathop{\mathrm{argmax}}}_{\theta} \; f(\theta) = {\mathop{\mathrm{argmin}}}_{\theta} \;-f(\theta) \]

And if \(\phi\) is strictly increasing (e.g., \(\log\)), then

\[ {\mathop{\mathrm{argmin}}}_{\theta} \; f(\theta) = {\mathop{\mathrm{argmin}}}_{\theta}\; \phi(f(\theta)) \]

Considerations

In general, not possible to find exact minimums (or maximums) of optimization problems, but we will learn algorithms that can achieve arbitrarily good approximations, by forming iterative guesses at the minimum (or maximum)

Considerations:

You remember calculus, right?

Suppose \(x\) is one-dimensional and \(f\) is smooth. If \(x^*\) is a local minimum or local maximum of \(f\), then \[ f'(x^*) = 0 \] Is the converse true? (No! Could be a saddle point, e.g., \(f(x)=x^3\) at \(x=0\). Draw yourself an example)

How to tell the difference between local minimums and maximums? Check the second derivative. If \(x^*\) is a local minimum, then in addition to the above, \[ f''(x^*) \geq 0 \] and conversely, if \(f'(x^*)=0\) and \(f''(x^*)>0\), then \(x^*\) must be a local minimum. The opposite signs hold for local maximums

This all carries over to multiple dimensions: if now \(\theta\) is \(n\)-dimensional and \(f\) is smooth, then \[ \nabla f(\theta^*) = 0 \] at any local minimum or local maximum, where recall \(\nabla f\) is the gradient of \(f\), which contains all the partial derivatives: \[ \nabla f(\theta) = \left(\begin{array}{c} \frac{\partial f(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial f(\theta)}{\partial \theta_n} \end{array}\right) \]

What about the second derivative? First we have to recall the generalization of second derivatives to \(n\) dimensions. This is given by the Hessian of \(f\), written \(\nabla^2 f\). At a point \(\theta\), the Hessian \(\nabla^2 f(\theta)\) is a matrix containing all the mixed partial derivatives: \[ [\nabla^2 f(\theta)]_{ij} = \frac{\partial^2 f(\theta)}{\partial \theta_i \partial \theta_j} \]

At a local minimum \(\theta^*\), the Hessian is going to be positive semidefinite, meaning \[ v^T \nabla^2 f(\theta^*) v \geq 0, \;\;\; \text{for all $v$} \] (when this holds strictly for all \(v \neq 0\), we say the Hessian is positive definite). This roughly means that the function \(f\) is “curved upwards” around \(\theta^*\)

And the opposite sign will be true for a local maximum (the associated property is called negative semidefiniteness)
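As a quick numerical illustration (a minimal sketch; the function below is just a made-up example, and we use the grad() and hessian() functions from the numDeriv package), we can check these conditions at a point we know to be a minimum:

library(numDeriv) # for grad() and hessian()

# An illustrative function, minimized at (1,-3)
f.example = function(theta) { (theta[1]-1)^2 + 2*(theta[2]+3)^2 }

grad(f.example, c(1,-3))                  # gradient: (numerically) zero
eigen(hessian(f.example, c(1,-3)))$values # Hessian eigenvalues: all positive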

Gradients and changes to \(f\)

By definition, \[ f'(x_0) = \lim_{x\rightarrow x_0}{\frac{f(x)-f(x_0)}{x-x_0}} \] so for \(x\) close to \(x_0\), \[ f(x) \approx f(x_0) + f'(x_0)(x-x_0) \] That is, locally the function looks linear, so to minimize it, we move down the slope

A similar idea is true in higher dimensions: for \(\theta\) close to \(\theta_0\), \[ f(\theta) \approx f(\theta_0) + \nabla f(\theta_0)^T (\theta-\theta_0) \] so \(f\) looks linear in a small neighborhood, with \(\nabla f(\theta_0)\) pointing in the direction of fastest ascent. So to minimize \(f\), move in direction given by \(-\nabla f(\theta_0)\)

(Note: here \(a^T b = \sum_{i=1}^n a_i b_i\), the inner product or dot product of \(a\) and \(b\))

Gradient descent

Simple but fundamental algorithm for minimizing \(f\). Just repeatedly move in the direction of the negative gradient

  1. Start with initial guess \(\theta^{(0)}\), step size \(\eta\)
  2. For \(k=1,2,3,\ldots\):
    • Compute the gradient \(\nabla f(\theta^{(k-1)})\)
    • Check if gradient is close to zero; if so stop, otherwise continue
    • Update \(\theta^{(k)} = \theta^{(k-1)} - \eta \nabla f(\theta^{(k-1)})\)
  3. Return final \(\theta^{(k)}\) as approximate solution \(\theta^*\)

Gradient descent code

From last lecture, simple code to perform gradient descent. Check that it makes sense to you

library(numDeriv) # to use the grad() function

grad.descent = function(f, x0, max.iter=200, step.size=0.05,
  stopping.deriv=0.01, ...) {
  
  n = length(x0)
  xmat = matrix(0,nrow=n,ncol=max.iter)
  xmat[,1] = x0
  
  for (k in 2:max.iter) {
    # Calculate the gradient
    grad.cur = grad(f,xmat[,k-1],...) 
    
    # Should we stop?
    if (all(abs(grad.cur) < stopping.deriv)) {
      k = k-1; break
    }
    
    # Move in the opposite direction of the grad
    xmat[,k] = xmat[,k-1] - step.size * grad.cur
  }

  xmat = xmat[,1:k] # Trim
  return(list(x=xmat[,k], xmat=xmat, k=k))
}

Recall our example from last time

simpleFun = function(x) { 
  return(x[1]^2 + 1/3*x[2]^2)
}

x0 = c(-1.9,-1.9)
gd = grad.descent(simpleFun,x0,step.size=1/3)
gd$x
## [1] -5.448641e-10 -1.246994e-02
gd$k
## [1] 21

Let’s make a nice plot to visualize the results

surface = function(f, from.x=0, to.x=1, from.y=0, to.y=1, 
                   n.x=30, n.y=30, theta=5, phi=80, ...) {
  # Build the 2d grid
  x.seq = seq(from=from.x,to=to.x,length.out=n.x)
  y.seq = seq(from=from.y,to=to.y,length.out=n.y)
  plot.grid = expand.grid(x.seq,y.seq)
  z.vals = apply(plot.grid,1,f)
  z.mat = matrix(z.vals,nrow=n.x)
  
  # Plot with the persp function 
  orig.mar = par()$mar # Save the original margins
  par(mar=c(1,1,1,1))  # Make the margins small
  r = persp(x.seq,y.seq,z.mat,theta=theta,phi=phi,...)
  par(mar=orig.mar) # Restore the original margins
  invisible(r)
}

# Draw our simple function in 2d, and our gradient descent path on top
r = surface(simpleFun,from.x=-2,to.x=2,from.y=-2,to.y=2,xlab="",ylab="",zlab="")
points(trans3d(x0[1],x0[2],simpleFun(x0),r),col="red",cex=2)
lines(trans3d(gd$xmat[1,],gd$xmat[2,],apply(gd$xmat,2,simpleFun),r),
      lwd=4,col="red")
points(trans3d(gd$x[1],gd$x[2],simpleFun(gd$x),r),
       col="black",cex=2)

What’s a good step size?

gd = grad.descent(simpleFun,x0,step.size=0.001)
gd$x
## [1] -1.275649 -1.663865
gd$k
## [1] 200
gd = grad.descent(simpleFun,x0,step.size=2)
gd$x
## [1] 1.682222e+95 4.575293e-03
gd$k
## [1] 200

There is some theory about how to choose step sizes, depending on characteristics of \(f\), but in practice it is often more useful to choose the step size adaptively (empirically), as in the sketch below
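One simple adaptive strategy is backtracking: start each iteration with a biggish step size and halve it until the step gives a sufficient decrease in \(f\). Below is a minimal sketch of this, modifying our grad.descent() function; the function name, the halving rule, and the constant alpha=0.5 are all just illustrative choices (it relies on grad() from numDeriv, loaded above)

# Gradient descent with a backtracking (adaptive) step size, a rough sketch
grad.descent.backtrack = function(f, x0, max.iter=200, init.step=1,
  alpha=0.5, stopping.deriv=0.01, ...) {
  
  n = length(x0)
  xmat = matrix(0,nrow=n,ncol=max.iter)
  xmat[,1] = x0
  
  for (k in 2:max.iter) {
    grad.cur = grad(f,xmat[,k-1],...)
    if (all(abs(grad.cur) < stopping.deriv)) { k = k-1; break }
    
    # Halve the step size until it gives a sufficient decrease in f
    eta = init.step
    while (eta > 1e-10 &&
           f(xmat[,k-1] - eta*grad.cur, ...) >
             f(xmat[,k-1], ...) - alpha*eta*sum(grad.cur^2)) {
      eta = eta/2
    }
    xmat[,k] = xmat[,k-1] - eta*grad.cur
  }
  
  xmat = xmat[,1:k] # Trim
  return(list(x=xmat[,k], xmat=xmat, k=k))
}

grad.descent.backtrack(simpleFun, x0)$x # converges without us hand-tuning the step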

Pros and cons of gradient descent

Pros:

  • Each iteration is simple and cheap: all we need is the gradient
  • Fairly robust: with a small enough step size, it makes steady progress on a wide range of functions

Cons:

  • Can take many iterations to converge
  • The step size has to be chosen carefully (recall the examples above)
  • For nonconvex functions, it only finds a local minimum, which can depend on the starting point (see next)

Interlude: the role of convexity

When is it a good idea to go downhill? And is where we end up going to depend on where we started?

Let’s return to our simple example, and run a bunch more gradient descents starting from different places, to see where they end up

x0 = c(-1.9,-1.9)
gd0 = grad.descent(simpleFun,x0)
x1 = c(-0.1,-1.9)
gd1 = grad.descent(simpleFun,x1)
x2 = c(1,1.9)
gd2 = grad.descent(simpleFun,x2)
x3 = c(-1.9,1.1)
gd3 = grad.descent(simpleFun,x3)
x4 = c(1.9,0)
gd4 = grad.descent(simpleFun,x4)

# Draw our simple function in 2d, and all gradient descent paths on top
r = surface(simpleFun,from.x=-2,to.x=2,from.y=-2,to.y=2,
            xlab="",ylab="",zlab="")
lines(trans3d(gd0$xmat[1,],gd0$xmat[2,],apply(gd0$xmat,2,simpleFun),r),
      lwd=4,col="red")
lines(trans3d(gd1$xmat[1,],gd1$xmat[2,],apply(gd1$xmat,2,simpleFun),r),
      lwd=4,col="orange")
lines(trans3d(gd2$xmat[1,],gd2$xmat[2,],apply(gd2$xmat,2,simpleFun),r),
      lwd=4,col="purple")
lines(trans3d(gd3$xmat[1,],gd3$xmat[2,],apply(gd3$xmat,2,simpleFun),r),
      lwd=4,col="darkgreen")
lines(trans3d(gd4$xmat[1,],gd4$xmat[2,],apply(gd4$xmat,2,simpleFun),r),
      lwd=4,col="blue")

They all end up at the same place! That’s a relief

complicatedFun = function(x) {
  return((1/2*x[1]^2-1/4*x[2]^2+3)*cos(2*x[1]+1-exp(x[2])))
}

x0 = c(0.5,0.5)
gd0 = grad.descent(complicatedFun,x0,step.size=0.01)
x1 = c(-0.1,-1.3)
gd1 = grad.descent(complicatedFun,x1,step.size=0.01,max.iter=400)
x2 = c(-0.5,-1.3)
gd2 = grad.descent(complicatedFun,x2,step.size=0.01,max.iter=400)
x3 = c(-0.2,1.4)
gd3 = grad.descent(complicatedFun,x3,step.size=0.01,max.iter=400)
x4 = c(-0.5,-0.5)
gd4 = grad.descent(complicatedFun,x4,step.size=0.01,max.iter=400)

# Draw our complicated function in 2d, and all gradient descent paths on top
r = surface(complicatedFun,from.x=-2,to.x=2,from.y=-2,to.y=2,
            xlab="",ylab="",zlab="")
lines(trans3d(gd0$xmat[1,],gd0$xmat[2,],apply(gd0$xmat,2,complicatedFun),r),
      lwd=4,col="red")
lines(trans3d(gd1$xmat[1,],gd1$xmat[2,],apply(gd1$xmat,2,complicatedFun),r),
      lwd=4,col="orange")
lines(trans3d(gd2$xmat[1,],gd2$xmat[2,],apply(gd2$xmat,2,complicatedFun),r),
      lwd=4,col="purple")
lines(trans3d(gd3$xmat[1,],gd3$xmat[2,],apply(gd3$xmat,2,complicatedFun),r),
      lwd=4,col="darkgreen")
lines(trans3d(gd4$xmat[1,],gd4$xmat[2,],apply(gd4$xmat,2,complicatedFun),r),
      lwd=4,col="blue")

Oh no! What’s going on here? It seems like we end up at drastically different spots depending on where we started

The fundamental difference between these two examples is that of convexity. A convex function is one that is “bowl shaped”, or always “curved upwards”; more precisely, its second derivative (Hessian) matrix is always positive semidefinite
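We can probe this numerically (checking a handful of points is by no means a proof, just an illustration): look at the eigenvalues of the Hessian, computed with numDeriv's hessian() function, for our two example functions

library(numDeriv) # for the hessian() function

eigen(hessian(simpleFun, c(0,0)))$values      # both positive (true at every point)
eigen(hessian(complicatedFun, c(0,0)))$values # negative: not "curved upwards" here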

Back to Taylor series approximation

We saw that for \(x\) close to \(x_0\), we could write \[ f(x) \approx f(x_0) + f'(x_0)(x-x_0) \] This is called a first-order Taylor approximation, using the first derivative

Meanwhile, a second-order Taylor approximation uses the second derivative: \[ f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 \] This is a quadratic approximation to \(f\), and it is indeed a better (more accurate) approximation for \(x\) close to \(x_0\)

This carries over to the multivariate case: \[ f(\theta) \approx f(\theta_0) + \nabla f(\theta_0)^T(\theta-\theta_0) + \frac{1}{2}(\theta-\theta_0)^T \nabla^2 f(\theta_0) (\theta-\theta_0) \] The idea we’ll follow now is to try to make the quadratic on the right-hand side as small as possible, then Taylor expand again, and then repeat
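As a small concrete check (a sketch; the expansion point \(\theta_0\) and evaluation point \(\theta\) are arbitrary choices), we can compare the two approximations for our simpleFun from before, using numDeriv for the derivatives

library(numDeriv)

theta0 = c(1,1); theta = c(1.3,0.8); d = theta - theta0
g = grad(simpleFun, theta0); H = hessian(simpleFun, theta0)

simpleFun(theta)                                          # true value
simpleFun(theta0) + sum(g*d)                              # first-order approximation
simpleFun(theta0) + sum(g*d) + 0.5*drop(t(d) %*% H %*% d) # second-order approximation
# (simpleFun is itself quadratic, so the second-order approximation is exact here)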

Minimizing a quadratic

If we know \[ f(x) = ax^2 + bx + c \] (with \(a>0\)), then we can minimize \(f\) exactly: \[ x^* = -b/(2a) \]

If instead \[ f(x) = \frac{1}{2}a (x-x_0)^2 + b(x-x_0) + c \] then \[ x^* = x_0 - b/a \]

Similarly, in the multidimensional case: if \[ f(\theta) = \frac{1}{2}(\theta-\theta_0)^T A (\theta-\theta_0) + b^T(\theta-\theta_0) + c \] with \(A\) positive definite, then \[ \theta^* = \theta_0 - A^{-1} b \]

Newton’s method

More sophisticated than gradient descent, but also more accurate. Repeatedly form and minimize quadratic approximation to \(f\)

  1. Start with initial guess \(\theta^{(0)}\), step size \(\eta\)
  2. For \(k=1,2,3,\ldots\):
    • Compute the gradient \(g=\nabla f(\theta^{(k-1)})\), Hessian \(H=\nabla^2 f(\theta^{(k-1)})\)
    • Check if gradient is close to zero; if so stop, otherwise continue
    • Update \(\theta^{(k)} = \theta^{(k-1)} - \eta H^{-1} g\)
  3. Return final \(\theta^{(k)}\) as approximate solution \(\theta^*\)
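Here is a minimal sketch of Newton’s method in R, mirroring our grad.descent() code and using numDeriv’s grad() and hessian() functions. The function name and defaults are our own choices; note that we use solve(H, g) rather than explicitly forming \(H^{-1}\)

library(numDeriv) # for grad() and hessian()

newton.method = function(f, x0, max.iter=100, step.size=1,
  stopping.deriv=0.01, ...) {
  
  n = length(x0)
  xmat = matrix(0,nrow=n,ncol=max.iter)
  xmat[,1] = x0
  
  for (k in 2:max.iter) {
    # Calculate the gradient and Hessian
    grad.cur = grad(f,xmat[,k-1],...)
    hess.cur = hessian(f,xmat[,k-1],...)
    
    # Should we stop?
    if (all(abs(grad.cur) < stopping.deriv)) { k = k-1; break }
    
    # Newton update: move by -step.size * H^{-1} g
    xmat[,k] = xmat[,k-1] - step.size * solve(hess.cur, grad.cur)
  }
  
  xmat = xmat[,1:k] # Trim
  return(list(x=xmat[,k], xmat=xmat, k=k))
}

nm = newton.method(simpleFun, c(-1.9,-1.9))
nm$x # for a quadratic function like simpleFun, one Newton step hits the minimum exactly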

Pros and cons of Newton’s method

Pros:

  • Uses second-order information, so for well-behaved functions it needs far fewer iterations than gradient descent
  • Very fast convergence once we are close to the solution

Cons:

  • Each iteration is more complex and expensive: we must compute the Hessian and solve a linear system in it
  • Less robust: e.g., the Hessian can fail to be invertible, and the method can behave badly far from the solution

Other approaches

What else can we do? There is an enormous number of options

We will briefly cover stochastic gradients and proximal (thresholding) maps

If you’re interested beyond this, then you can browse through the class slides from http://www.stat.cmu.edu/~ryantibs/convexopt/

Optimization in R: optim()

Oftentimes it’s most convenient just to use one of R’s built-in optimization routines. These functions, like optim(), give you access to several optimization methods

optim(par, fn, gr, method, control, hessian)

Return contains the location (par) and the value (value) of the optimum, diagnostics, possibly the Hessian at convergence (hessian)

gmp = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F15/lectures/gmp.dat")
gmp$pop = gmp$gmp/gmp$pcgmp
library(numDeriv)
mse = function(theta) { mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2) }
grad.mse = function(theta) { grad(func=mse,x=theta) }
theta0=c(5000,0.15)
fit1 = optim(theta0,mse,grad.mse,method="BFGS",hessian=TRUE)
fit1
## $par
## [1] 6493.2563739    0.1276921
## 
## $value
## [1] 61853983
## 
## $counts
## function gradient 
##       63       11 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##              [,1]         [,2]
## [1,]      52.5021      4422071
## [2,] 4422071.3594 375729087979
eigen(fit1$hessian)$values
## [1] 3.757291e+11 4.573753e-01

Optimization in R: nls()

optim() is a general-purpose optimizer. nls() is for nonlinear least squares in particular, like the previous GMP example

nls(formula, data, start, control, algorithm, [[many other options]])

Returns an nls object, which you can pretty much use like an lm object (print, plot, get fitted values, make predictions, etc.)

fit2 = nls(pcgmp~y0*pop^a,data=gmp,start=list(y0=5000,a=0.1))
summary(fit2)
## 
## Formula: pcgmp ~ y0 * pop^a
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## y0 6.494e+03  8.565e+02   7.582 2.87e-13 ***
## a  1.277e-01  1.012e-02  12.612  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7886 on 364 degrees of freedom
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 1.819e-07
plot(pcgmp~pop,data=gmp,log="x")
curve(fit1$par[1]*x^fit1$par[2],add=TRUE,col="blue")
curve(coef(fit2)[1]*x^coef(fit2)[2],add=TRUE,col="red")

Can’t tell any difference between the two! That’s a good thing

Two advanced topics: stochastic methods, sparse methods

Suppose we are running a linear regression on \(p\) variables, and our criterion function is \[ f(\theta) = \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^T \theta)^2 \] the standard least squares loss, given responses \(y_i\) and predictors \(x_i\), \(i=1,\ldots n\)

If \(n,p\) are reasonable in size, then we can just solve this using linear algebra, as in \[ \theta^* = (X^T X)^{-1} X^T y \] where \(X\) is the \(n\times p\) predictor matrix (with \(i\)th row \(x_i\)), and \(y\) is the \(n\times 1\) response vector (with \(i\)th component \(y_i\))

But now suppose \(n\) is something like 100 million. It would be prohibitive to even form \(X^T X\), so the above solution is out of the question, and we might try gradient descent. However, gradient descent is also expensive, because the gradient \[ \nabla f(\theta) = \frac{1}{n} \sum_{i=1}^n x_i(x_i^T \theta - y_i) \] requires a sum over \(n\) things, which again costs a lot. What to do?

This is where stochastic methods come into play. They are widely used for large-scale statistical optimization problems in industry. The idea is simple: each time you need the gradient, replace it with a random estimate of it

That is, instead of updating according to \[ \theta - t \nabla f(\theta) = \theta - \frac{t}{n} \sum_{i=1}^n x_i(x_i^T \theta - y_i) \] we update according to \[ \theta - t g = \theta - \frac{t}{m} \sum_{i \in I} x_i (x_i^T \theta - y_i) \] Here \(I\) is a random subset of \(\{1,\ldots n\}\) of size \(m\) (where \(m\) is much, much smaller than \(n\)). In other words, we’ve replaced the full gradient by \[ g = \frac{1}{m} \sum_{i \in I} x_i(x_i^T \theta - y_i) \] an unbiased estimate of it, computed over just \(m\) data points. This is much cheaper when \(m\) is reasonably small

Stochastic methods can save a lot of computation; they generally make rapid progress in minimizing the criterion at the start, but then take longer to get to high-accuracy solutions
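Here is a rough sketch of stochastic gradient descent for the least squares problem above, on simulated data. The sample size, mini-batch size \(m\), step size \(t\), and number of iterations are all arbitrary illustrative choices

# Stochastic gradient descent for least squares, a sketch on simulated data
set.seed(1)
n = 1e5; p = 10
X = matrix(rnorm(n*p), n, p)
theta.true = rnorm(p)
y = drop(X %*% theta.true + rnorm(n))

m = 100              # mini-batch size, much smaller than n
t.step = 0.01        # step size (t in the formulas above)
theta = rep(0, p)    # initial guess

for (k in 1:2000) {
  I = sample(n, m)                                   # random subset of the data
  g = drop(t(X[I,]) %*% (X[I,] %*% theta - y[I]))/m  # stochastic gradient estimate
  theta = theta - t.step * g
}
max(abs(theta - theta.true)) # should be small: theta roughly recovers the truth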

Now let’s consider another variant: suppose \(p\), the number of predictors, is enormous—say, in the millions itself. For both statistical reasons and optimization reasons, we don’t want to optimize over all \(p\) regression parameters

Statistically, we will get bad estimates (high variance); optimization-wise, it will be very expensive and slow to converge

(You’ll learn a lot more about the statistical side in 36-401, and further in 36-402)

Idea: if we have 1 million regression parameters, then it’s unlikely that they’re all super important. So let’s ignore the small ones, basically, by just setting them equal to zero. This is called a sparse method, and we can work this directly into gradient descent:

  1. Start with initial guess \(\theta^{(0)}\), step size \(\eta\)
  2. For \(k=1,2,3,\ldots\):
    • Compute the gradient \(\nabla f(\theta^{(k-1)})\)
    • Check if gradient is close to zero; if so stop, otherwise continue
    • Update \(\theta^{(k)} = \theta^{(k-1)} - \eta \nabla f(\theta^{(k-1)})\)
    • Correct \(\theta^{(k)}\) by thresholding the small components to zero
  3. Return final \(\theta^{(k)}\) as approximate solution \(\theta^*\)

The addendum can be viewed as a special case of what is called proximal gradient descent. If we make all components of \(\theta\) smaller (at the same time that we threshold the small ones to zero), then this is actually guaranteed to converge, and will produce for us what is called a lasso solution!
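Below is a rough sketch of this on simulated data: a gradient step on the least squares criterion followed by soft thresholding, which shrinks every component toward zero and sets the small ones exactly to zero. The data, step size, and threshold level lambda are all arbitrary illustrative choices

# Proximal gradient descent (gradient step + soft thresholding), a sketch
soft.thresh = function(x, lam) { sign(x) * pmax(abs(x) - lam, 0) }

set.seed(1)
n = 1000; p = 50
X = matrix(rnorm(n*p), n, p)
theta.true = c(3, -2, 1.5, rep(0, p-3)) # only 3 truly nonzero coefficients
y = drop(X %*% theta.true + rnorm(n))

eta = 0.5        # step size
lambda = 0.1     # threshold level
theta = rep(0, p)

for (k in 1:200) {
  g = drop(t(X) %*% (X %*% theta - y))/n          # least squares gradient
  theta = soft.thresh(theta - eta*g, eta*lambda)  # step, then shrink and threshold
}
sum(theta != 0)      # only a handful of components remain nonzero
round(theta[1:5], 2) # close to the big true coefficients, shrunken a bit toward zero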

Summary

  1. In optimization, we think about tradeoffs: complexity of iteration vs number of iterations vs precision of approximation
    • Gradient descent: less complex iterations, more iterations in total, more robust
    • Newton: more complex iterations, but few of them for good functions, less robust
  2. Start with pre-built code like optim() or nls(), which are incredibly useful. Implement your own as needed
  3. Stochastic methods are critical when you have a lot of data; sparse methods are useful when you have a lot of parameters