Optimization and Its Algorithms

36-465/665, Spring 2021

11 March 2021 (Lecture 11)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \]

Previously

Optimization basics

Local vs. global minima

“The” minimum: value vs. location

Finding the minimum: optimization algorithms

So how do we build an optimization algorithm anyway?

Optimizing by equation-solving
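
To make "solve the equations" concrete, here is a minimal worked example (the quadratic objective is my illustration, not from the lecture): set the derivative of \(\ObjFunc\) to zero and solve.

\[ \ObjFunc(\optimand) = (\optimand - 3)^2, \qquad \frac{d\ObjFunc}{d\optimand} = 2(\optimand - 3) = 0 \quad \Longrightarrow \quad \optimand = 3 \]

Since \(d^2\ObjFunc/d\optimand^2 = 2 > 0\), this stationary point is in fact a minimum.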

Pros and cons of the solve-the-equations approach

Go back to the calculus

Constant-step-size gradient descent

Constant-step-size gradient descent (2)
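
As a concrete (and purely illustrative) sketch of the update rule \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - a \nabla \ObjFunc(\optimand^{(t)})\), assuming we can evaluate the gradient; the function names and test objective below are mine, not the lecture's:

    import numpy as np

    def gradient_descent(grad, theta0, a=0.1, max_iter=1000, tol=1e-8):
        """Constant-step-size gradient descent: theta <- theta - a * grad(theta)."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad(theta)
            if np.linalg.norm(g) < tol:      # stop once the gradient is (numerically) zero
                break
            theta = theta - a * g
        return theta

    # Illustrative objective M(theta) = ||theta - c||^2, whose gradient is 2*(theta - c)
    c = np.array([1.0, -2.0])
    print(gradient_descent(lambda th: 2 * (th - c), theta0=[0.0, 0.0]))  # converges to ~[1, -2]

Note that the step size \(a\) stays fixed across iterations, which is what distinguishes this scheme from the decaying-step-size methods later in the lecture.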

Gradient descent is basic, but powerful

Estimation error vs. optimization error

Estimation error vs. optimization error (2)

\[ \text{risk} = \text{minimal risk} + \text{approximation error} + \text{estimation error} + \text{optimization error} \]
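
Written out in the course's notation, with \(\hat{\optimand} = \argmin_{\optimand \in \OptDomain}{\EmpRisk(\optimand)}\) the exact empirical-risk minimizer (my shorthand for it) and \(\outputoptimand\) what the optimization algorithm actually returns, this is a telescoping sum:

\[ \Risk(s(\cdot;\outputoptimand)) = \Risk(\OptimalStrategy) + \left[\Risk(\OptimalModel) - \Risk(\OptimalStrategy)\right] + \left[\Risk(s(\cdot;\hat{\optimand})) - \Risk(\OptimalModel)\right] + \left[\Risk(s(\cdot;\outputoptimand)) - \Risk(s(\cdot;\hat{\optimand}))\right] \]

The four terms on the right are, in order, the minimal risk, the approximation error, the estimation error, and the optimization error; only the last one is under the optimization algorithm's control.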

Don’t bother optimizing more precisely than the noise in the data will support

Beyond gradient descent: Newton’s method
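
Newton's method replaces the fixed step with a step scaled by the inverse Hessian, \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - \Hessian^{-1}(\optimand^{(t)}) \nabla \ObjFunc(\optimand^{(t)})\). A minimal sketch, assuming we can compute (and solve a linear system with) the Hessian; the test problem is illustrative only:

    import numpy as np

    def newtons_method(grad, hessian, theta0, max_iter=100, tol=1e-10):
        """Newton's method: theta <- theta - h(theta)^{-1} grad(theta)."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad(theta)
            if np.linalg.norm(g) < tol:
                break
            # Solve h(theta) * step = g rather than forming the inverse explicitly
            step = np.linalg.solve(hessian(theta), g)
            theta = theta - step
        return theta

    # Illustrative quadratic M(theta) = 0.5 * theta' A theta - b' theta
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(newtons_method(lambda th: A @ th - b, lambda th: A, theta0=[0.0, 0.0]))  # minimizer A^{-1} b = [0.2, 0.4]

On an exactly quadratic objective a single Newton step lands on the minimum, which is one way to see why it converges so quickly near the optimum, and also why each step is expensive: it needs the \(\OptDim \times \OptDim\) Hessian.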

Pros of Newton’s method

Cons of Newton’s method

Gradient methods with big data

\[ \EmpRisk(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \theta))} \]
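
The practical issue the formula points to: \(\nabla \EmpRisk\) is itself an average over all \(n\) data points, so every step of gradient descent (or Newton's method) costs at least \(O(n)\) work. A sketch using squared-error loss and a linear model \(s(x;\optimand) = x^{\top}\optimand\), purely as an example:

    import numpy as np

    def empirical_risk_grad(theta, X, y):
        """Full-batch gradient of (1/n) sum_i (y_i - x_i' theta)^2: touches every row of X."""
        resid = X @ theta - y
        return (2.0 / len(y)) * (X.T @ resid)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(10_000, 5)), rng.normal(size=10_000)
    g = empirical_risk_grad(np.zeros(5), X, y)   # one gradient evaluation = one pass over all 10,000 rows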

Gradient methods with big data (2)


A way out: sampling is an unbiased estimate
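
To spell out why sampling gives an unbiased estimate: write \(\EmpRisk_{I}(\optimand) = \Loss(y_I, s(x_I; \optimand))\) for the loss at a single randomly-chosen data point (the usual reading of the symbol in the pseudocode below). If \(I\) is uniform on \(1:n\), then

\[ \Expect{\nabla \EmpRisk_{I}(\optimand)} = \sum_{i=1}^{n}{\frac{1}{n}\nabla \Loss(y_i, s(x_i; \optimand))} = \nabla \EmpRisk(\optimand) \]

so each single-point gradient is a noisy but unbiased stand-in for the full-data gradient, at \(1/n\) of the cost.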

Stochastic gradient descent

  1. Start with initial guess \(\optimand^{(0)}\), adjustment rate \(a\)
  2. While (not too tired) and (making adequate progress)
    1. At \(t^{\mathrm{th}}\) iteration, pick random \(I\) uniformly on \(1:n\)
    2. Set \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - \frac{a}{t}\nabla \EmpRisk_{I}(\optimand^{(t)})\)
  3. Return final \(\optimand\)
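
A runnable sketch of the pseudocode above, again using squared-error loss and a linear model to stand in for \(\Loss\) and \(s\); the simulated data and parameter values are my own illustration:

    import numpy as np

    def sgd(X, y, a=1.0, max_iter=10_000, rng=None):
        """Stochastic gradient descent with step size a/t, one random data point per step."""
        rng = np.random.default_rng() if rng is None else rng
        theta = np.zeros(X.shape[1])
        # (a fixed iteration budget stands in for the "not too tired / adequate progress" check)
        for t in range(1, max_iter + 1):
            i = rng.integers(len(y))                    # pick I uniformly at random from 1:n
            grad_i = 2 * (X[i] @ theta - y[i]) * X[i]   # gradient of the loss at that single point
            theta = theta - (a / t) * grad_i            # decaying step size a/t
        return theta

    rng = np.random.default_rng(42)
    X = rng.normal(size=(5_000, 3))
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=5_000)
    print(sgd(X, y, rng=rng))   # should land close to theta_true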

Stochastic gradient descent (2)

Pros and cons of stochastic gradient methods

More optimization algorithms

Why are there so many different optimization algorithms?

Summing up
