Optimization and Its Algorithms

36-465/665, Spring 2021

11 March 2021 (Lecture 11)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \]


Optimization basics

Local vs. global minima

“The” minimum: value vs. location

Finding the minimum: optimization algorithms

So how do we build an optimization algorithm anyway?

Optimizing by equation-solving

Pros and cons of the solve-the-equations approach

Go back to the calculus

Constant-step-size gradient descent

Constant-step-size gradient descent (2)
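As a minimal sketch of constant-step-size gradient descent, the Python below repeatedly moves against the gradient by a fixed step until the gradient is nearly zero. The function name `gradient_descent`, the step size, the tolerance, and the quadratic test objective are all illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def gradient_descent(grad, theta0, step=0.1, tol=1e-8, max_iter=10_000):
    """Constant-step-size gradient descent (illustrative sketch)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:  # gradient nearly zero: stop
            break
        theta = theta - step * g     # same step size at every iteration
    return theta

# Hypothetical objective: M(theta) = (theta_1 - 1)^2 + 2 * (theta_2 + 3)^2,
# whose unique minimum is at (1, -3).
def grad_M(theta):
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 3.0)])

theta_out = gradient_descent(grad_M, theta0=[0.0, 0.0])
```

The fixed step must be small relative to the curvature of the objective; for this quadratic, a step much larger than 0.1 would eventually make the iterates oscillate or diverge.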

Gradient descent is basic, but powerful

Estimation error vs. optimization error

Estimation error vs. optimization error (2)

\[ \text{risk} = \text{minimal risk} + \text{approximation error} + \text{estimation error} + \text{optimization error} \]

Don’t bother optimizing more precisely than the noise in the data will support

Beyond gradient descent: Newton’s method
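A rough sketch of Newton's method for minimization, assuming the gradient and the Hessian are both available as functions: each step solves a linear system with the Hessian rather than using a fixed step size. The names `newton_minimize`, `grad_M`, and `hess_M`, and the test objective, are hypothetical.

```python
import numpy as np

def newton_minimize(grad, hess, theta0, tol=1e-10, max_iter=50):
    """Newton's method for minimization (illustrative sketch)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        # Solve h * delta = g instead of forming the inverse Hessian.
        delta = np.linalg.solve(hess(theta), g)
        theta = theta - delta
    return theta

# Hypothetical objective: M(theta) = sum_j (exp(theta_j) - theta_j),
# which is convex with its minimum at theta = 0.
def grad_M(theta):
    return np.exp(theta) - 1.0

def hess_M(theta):
    return np.diag(np.exp(theta))

theta_out = newton_minimize(grad_M, hess_M, theta0=[1.0, -1.0])
```

Each iteration costs a Hessian evaluation plus a linear solve, which is the usual trade-off against needing far fewer iterations than gradient descent.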

Pros of Newton’s method

Cons of Newton’s method

Gradient methods with big data

\[ \EmpRisk(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\Loss(y_i, s(x_i; \theta))} \]

Gradient methods with big data (2)


A way out: sampling is an unbiased estimate
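One way to write out the unbiasedness claim, assuming \(I\) is drawn uniformly from \(1:n\), the data are held fixed, and differentiation is with respect to \(\theta\): the expected gradient at a single random data point equals the gradient of the empirical risk.

\[ \Expect{\nabla \Loss(y_I, s(x_I; \theta))} = \sum_{i=1}^{n}{\frac{1}{n} \nabla \Loss(y_i, s(x_i; \theta))} = \nabla \EmpRisk(\theta) \]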

Stochastic gradient descent

  1. Start with initial guess \(\optimand^{(0)}\), adjustment rate \(a\)
  2. While (not too tired) and (making adequate progress):
    1. At the \(t^{\mathrm{th}}\) iteration (\(t = 1, 2, \ldots\)), pick a random index \(I\) uniformly on \(1:n\)
    2. Set \(\optimand^{(t)} \leftarrow \optimand^{(t-1)} - \frac{a}{t}\nabla \EmpRisk_{I}(\optimand^{(t-1)})\)
  3. Return final \(\optimand\)
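The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: iterations are counted from \(t = 1\) so that \(a/t\) is defined, a fixed iteration budget stands in for "not too tired and making adequate progress", and the synthetic linear-regression data, the squared-error loss, and the names `sgd` and `grad_one` are hypothetical.

```python
import numpy as np

def sgd(grad_one, theta0, n, a=1.0, n_iter=20_000, seed=0):
    """Stochastic gradient descent with step size a/t (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, n_iter + 1):   # fixed budget replaces the "while" test
        i = rng.integers(n)          # random index, uniform on 0..n-1
        theta = theta - (a / t) * grad_one(theta, i)
    return theta

# Hypothetical synthetic data: y = x . beta + small noise.
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=(n, 2))
beta_true = np.array([2.0, -1.0])
y = x @ beta_true + 0.1 * rng.normal(size=n)

def grad_one(theta, i):
    # Gradient of the single-point loss 0.5 * (y_i - x_i . theta)^2.
    return -(y[i] - x[i] @ theta) * x[i]

theta_sgd = sgd(grad_one, theta0=np.zeros(2), n=n)
# Exact minimizer of the empirical risk, for comparison.
theta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
```

Each update touches one data point, so the per-iteration cost does not grow with \(n\); the price is a noisy path toward the minimizer, which the shrinking step size \(a/t\) damps over time.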

Stochastic gradient descent (2)

Pros and cons of stochastic gradient methods

More optimization algorithms

Why are there so many different optimization algorithms?

Summing up


