Optimization and Its Algorithms

36-465/665, Spring 2021

11 March 2021 (Lecture 11)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \]

Previously

Optimization basics

Local vs. global minima

“The” minimum: value vs. location

Finding the minimum: optimization algorithms

So how do we build an optimization algorithm anyway?

Optimizing by equation-solving
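
To make "solve the equations" concrete, here is a minimal worked example (the quadratic objective is my illustration, not from the lecture): set the derivative of \(\ObjFunc\) to zero and solve.

\[ \ObjFunc(\optimand) = (\optimand - 3)^2, \qquad \frac{d\ObjFunc}{d\optimand} = 2(\optimand - 3) = 0 \quad \Longrightarrow \quad \optimand = 3 \]

Since \(d^2\ObjFunc/d\optimand^2 = 2 > 0\), this stationary point is in fact a minimum.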

Pros and cons of the solve-the-equations approach

Go back to the calculus

Constant-step-size gradient descent

Constant-step-size gradient descent (2)
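
As a concrete (and purely illustrative) sketch of the update rule \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - a \nabla \ObjFunc(\optimand^{(t)})\), assuming we can evaluate the gradient; the function names and test objective below are mine, not the lecture's:

    import numpy as np

    def gradient_descent(grad, theta0, a=0.1, max_iter=1000, tol=1e-8):
        """Constant-step-size gradient descent: theta <- theta - a * grad(theta)."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad(theta)
            if np.linalg.norm(g) < tol:      # stop once the gradient is (numerically) zero
                break
            theta = theta - a * g
        return theta

    # Illustrative objective M(theta) = ||theta - c||^2, whose gradient is 2*(theta - c)
    c = np.array([1.0, -2.0])
    print(gradient_descent(lambda th: 2 * (th - c), theta0=[0.0, 0.0]))  # converges to ~[1, -2]

Note that the step size \(a\) stays fixed across iterations, which is what distinguishes this scheme from the decaying-step-size methods later in the lecture.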

Gradient descent is basic, but powerful

Estimation error vs. optimization error

Estimation error vs. optimization error (2)

\[ \text{risk} = \text{minimal risk} + \text{approximation error} + \text{estimation error} + \text{optimization error} \]
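
Written out in the course's notation, with \(\hat{\optimand} = \argmin_{\optimand \in \OptDomain}{\EmpRisk(\optimand)}\) the exact empirical-risk minimizer (my shorthand for it) and \(\outputoptimand\) what the optimization algorithm actually returns, this is a telescoping sum:

\[ \Risk(s(\cdot;\outputoptimand)) = \Risk(\OptimalStrategy) + \left[\Risk(\OptimalModel) - \Risk(\OptimalStrategy)\right] + \left[\Risk(s(\cdot;\hat{\optimand})) - \Risk(\OptimalModel)\right] + \left[\Risk(s(\cdot;\outputoptimand)) - \Risk(s(\cdot;\hat{\optimand}))\right] \]

The four terms on the right are, in order, the minimal risk, the approximation error, the estimation error, and the optimization error; only the last one is under the optimization algorithm's control.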

Don’t bother optimizing more precisely than the noise in the data will support

Beyond gradient descent: Newton’s method
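
Newton's method replaces the fixed step with a step scaled by the inverse Hessian, \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - \Hessian^{-1}(\optimand^{(t)}) \nabla \ObjFunc(\optimand^{(t)})\). A minimal sketch, assuming we can compute (and solve a linear system with) the Hessian; the test problem is illustrative only:

    import numpy as np

    def newtons_method(grad, hessian, theta0, max_iter=100, tol=1e-10):
        """Newton's method: theta <- theta - h(theta)^{-1} grad(theta)."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad(theta)
            if np.linalg.norm(g) < tol:
                break
            # Solve h(theta) * step = g rather than forming the inverse explicitly
            step = np.linalg.solve(hessian(theta), g)
            theta = theta - step
        return theta

    # Illustrative quadratic M(theta) = 0.5 * theta' A theta - b' theta
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(newtons_method(lambda th: A @ th - b, lambda th: A, theta0=[0.0, 0.0]))  # minimizer A^{-1} b = [0.2, 0.4]

On an exactly quadratic objective a single Newton step lands on the minimum, which is one way to see why it converges so quickly near the optimum, and also why each step is expensive: it needs the \(\OptDim \times \OptDim\) Hessian.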

Pros of Newton’s method

Cons of Newton’s method

Gradient methods with big data

\[ \EmpRisk(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \theta))} \]
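
The practical issue the formula points to: \(\nabla \EmpRisk\) is itself an average over all \(n\) data points, so every step of gradient descent (or Newton's method) costs at least \(O(n)\) work. A sketch using squared-error loss and a linear model \(s(x;\optimand) = x^{\top}\optimand\), purely as an example:

    import numpy as np

    def empirical_risk_grad(theta, X, y):
        """Full-batch gradient of (1/n) sum_i (y_i - x_i' theta)^2: touches every row of X."""
        resid = X @ theta - y
        return (2.0 / len(y)) * (X.T @ resid)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(10_000, 5)), rng.normal(size=10_000)
    g = empirical_risk_grad(np.zeros(5), X, y)   # one gradient evaluation = one pass over all 10,000 rows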

Gradient methods with big data (2)


A way out: sampling is an unbiased estimate
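
To spell out why sampling gives an unbiased estimate: write \(\EmpRisk_{I}(\optimand) = \Loss(y_I, s(x_I; \optimand))\) for the loss at a single randomly-chosen data point (the usual reading of the symbol in the pseudocode below). If \(I\) is uniform on \(1:n\), then

\[ \Expect{\nabla \EmpRisk_{I}(\optimand)} = \sum_{i=1}^{n}{\frac{1}{n}\nabla \Loss(y_i, s(x_i; \optimand))} = \nabla \EmpRisk(\optimand) \]

so each single-point gradient is a noisy but unbiased stand-in for the full-data gradient, at \(1/n\) of the cost.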

Stochastic gradient descent

  1. Start with initial guess \(\optimand^{(0)}\), adjustment rate \(a\)
  2. While (not too tired) and (making adequate progress)
    1. At \(t^{\mathrm{th}}\) iteration, pick random \(I\) uniformly on \(1:n\)
    2. Set \(\optimand^{(t+1)} \leftarrow \optimand^{(t)} - \frac{a}{t}\nabla \EmpRisk_{I}(\optimand^{(t)})\)
  3. Return final \(\optimand\)
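
A runnable sketch of the pseudocode above, again using squared-error loss and a linear model to stand in for \(\Loss\) and \(s\); the simulated data and parameter values are my own illustration:

    import numpy as np

    def sgd(X, y, a=1.0, max_iter=10_000, rng=None):
        """Stochastic gradient descent with step size a/t, one random data point per step."""
        rng = np.random.default_rng() if rng is None else rng
        theta = np.zeros(X.shape[1])
        # (a fixed iteration budget stands in for the "not too tired / adequate progress" check)
        for t in range(1, max_iter + 1):
            i = rng.integers(len(y))                    # pick I uniformly at random from 1:n
            grad_i = 2 * (X[i] @ theta - y[i]) * X[i]   # gradient of the loss at that single point
            theta = theta - (a / t) * grad_i            # decaying step size a/t
        return theta

    rng = np.random.default_rng(42)
    X = rng.normal(size=(5_000, 3))
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=5_000)
    print(sgd(X, y, rng=rng))   # should land close to theta_true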

Stochastic gradient descent (2)

Pros and cons of stochastic gradient methods

More optimization algorithms

Why are there so many different optimization algorithms?

Summing up
