\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \]

Previously

We’ve been thinking about selecting strategies by minimizing the empirical risk \(\EmpRisk(s) = n^{-1}\sum_{i=1}^{n}{\Loss(Y_i, s(X_i))}\)
We’ve looked at properties of the empirical risk minimizer \[ \hat{s} \equiv \argmin_{s \in \ModelClass}{\EmpRisk(s)} \]
We looked last time at optimization algorithms, which actually give us approximations to \(\hat{s}\)
What if we want to do something other than minimizing empirical risk?

Think about ordinary least squares

\(Y=\) a real-number random variable
\(X=\) a vector with \(p\) dimensions
- Make one coordinate always \(1\) if we want an intercept
\(A=\) a one-number guess about \(Y\)
\(\Loss(y,a) = (y-a)^2\)
Strategies \(S=\) linear functions of \(X\), parameterized by coefficient vector \(b\) so \(s(x;b) = b \cdot x\)
Data \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), formed into the \(n\)-row matrices \(\mathbf{x}\) (\(n \times p\)) and \(\mathbf{y}\) (\(n \times 1\))
There’s an exact formula for coefficient vector \(\beta\) of the optimal strategy \(\OptimalModel\): \[ \beta = (\Var{X})^{-1} \Cov{X,Y} \]
There’s an exact formula for the ERM: \[ \hat{\beta} = (\mathbf{x}^T\mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \]
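A minimal NumPy sketch of the ERM formula, on made-up data (the sample size, coefficients and noise level are assumptions for illustration, not from the lecture). It solves the normal equations rather than explicitly inverting \(\mathbf{x}^T\mathbf{x}\), and cross-checks against NumPy's own least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
# Design matrix: first coordinate is always 1 (intercept), the rest are random
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])   # made-up coefficients
y = x @ beta_true + rng.normal(scale=0.1, size=n)

# Solve (x^T x) b = x^T y; same answer as (x^T x)^{-1} x^T y when invertible
beta_hat = np.linalg.solve(x.T @ x, x.T @ y)

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(x, y, rcond=None)
```

With this much data and this little noise, `beta_hat` should land close to `beta_true`, and the two solvers should agree to numerical precision.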

Think about ordinary least squares (2)

\[ \hat{\beta} = (\mathbf{x}^T\mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \]

Doesn’t work if \(\mathbf{x}^T\mathbf{x}\) can’t be inverted
If \(\mathbf{x}^T\mathbf{x}\) is close to being un-invertible, this becomes really unstable numerically
- Small changes in \(\mathbf{x}\) or \(\mathbf{y}\) lead to really big changes in \(\hat{\beta}\), and so in predictions
\(\mathbf{x}^T\mathbf{x}\) is symmetric and non-negative-definite so it has an eigen-decomposition, \[ \mathbf{x}^T \mathbf{x} = \mathbf{v}^T \mathbf{d} \mathbf{v} \]
- \(\mathbf{d} =\) diagonal matrix of eigenvalues, \(\mathbf{v} =\) orthogonal matrix whose rows are the normalized eigenvectors
If inversion is possible \[ (\mathbf{x}^T\mathbf{x})^{-1} = (\mathbf{v}^T \mathbf{d} \mathbf{v})^{-1} = (\mathbf{v})^{-1} \mathbf{d}^{-1} (\mathbf{v}^T)^{-1} = \mathbf{v}^T \mathbf{d}^{-1} \mathbf{v} \]
- \(\mathbf{v}\) is orthogonal iff \(\mathbf{v}^{T} = \mathbf{v}^{-1}\)
Any eigenvalues \(=0\) iff no inverse \((\mathbf{x}^T\mathbf{x})^{-1}\)
Any eigenvalues \(\approx 0\) implies huge reciprocal eigenvalues and instability
- A “small denominators” problem
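A rough NumPy illustration of the small-denominators problem, assuming made-up data in which the second column is almost a copy of the first. It builds \((\mathbf{x}^T\mathbf{x})^{-1}\) from the eigen-decomposition, using the same \(\mathbf{v}^T \mathbf{d} \mathbf{v}\) convention as above, and then refits after a tiny perturbation of \(\mathbf{y}\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)    # nearly a copy of x1
x = np.column_stack([x1, x2])

gram = x.T @ x
# eigh handles symmetric matrices: eigenvalues in ascending order,
# eigenvectors as the COLUMNS of w, so gram = w @ diag(evals) @ w.T
evals, w = np.linalg.eigh(gram)
v = w.T                                 # rows are eigenvectors: gram = v^T d v
gram_inv = v.T @ np.diag(1.0 / evals) @ v

y = x1 + rng.normal(scale=0.1, size=n)
y_perturbed = y + 1e-3 * rng.normal(size=n)   # tiny change to the response
beta_a = gram_inv @ x.T @ y
beta_b = gram_inv @ x.T @ y_perturbed

condition_number = evals[-1] / evals[0]       # huge when an eigenvalue is near 0
coef_change = np.linalg.norm(beta_b - beta_a)
```

Comparing `coef_change` with the size of the perturbation (about \(10^{-3}\) per observation) shows how much the reciprocal of the small eigenvalue amplifies noise.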

Thinking about ordinary least squares (3)

Geometrically, we can’t invert \(\mathbf{x}^T\mathbf{x}\) when the columns of \(\mathbf{x}\) are collinear
- One column is an exact linear function of one or more of the other columns, say \(x^{(i)} = \sum_{j\neq i}{a_j x^{(j)}}\)
- Implies \(0 = \sum_{j=1}^{p}{a_j x^{(j)}}\) with \(a_i = -1\)
- The vector \(a\) will be an eigenvector of \(\mathbf{x}^T\mathbf{x}\) with eigenvalue \(0\)
Near collinearity \(\Rightarrow\) small eigenvalues \(\Rightarrow\) instability
This can happen if we’re careless about variable choice
- Standard example: regress on an average and on the individual measurements going in to the average
- Or \(x^{(1)}\) is mass in kilograms, \(x^{(2)}\) is weight in pounds, so \(x^{(2)} = 2.2 x^{(1)}\), and we can add whatever \(b\) we like to \(\beta_1\) if we also subtract \(b/2.2\) from \(\beta_2\)
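The kilograms-and-pounds example can be checked numerically. This is a sketch with made-up masses; the coefficient values and the shift \(b\) are arbitrary. Since \(x^{(2)} = 2.2 x^{(1)}\), the vector \(a = (2.2, -1)\) satisfies \(\mathbf{x}a = 0\), and shifting \(\beta_1\) by \(b\) while shifting \(\beta_2\) by \(-b/2.2\) leaves every prediction unchanged:

```python
import numpy as np

kg = np.array([50.0, 60.0, 70.0, 80.0, 90.0])   # made-up masses
x = np.column_stack([kg, 2.2 * kg])             # second column = pounds

a = np.array([2.2, -1.0])                       # 2.2 * x1 - x2 = 0
gram = x.T @ x
evals = np.linalg.eigvalsh(gram)                # ascending order

beta = np.array([0.3, 0.1])                     # arbitrary coefficients
b = 5.0                                         # arbitrary shift
beta_shifted = beta + np.array([b, -b / 2.2])

same_predictions = np.allclose(x @ beta, x @ beta_shifted)
```

The smallest entry of `evals` is zero up to rounding error, which is exactly why \(\mathbf{x}^T\mathbf{x}\) cannot be inverted here.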

Thinking about ordinary least squares (4)

Collinearity is inevitable if we have a lot of variables
More geometry: 2 points define a line, 3 points define a plane, etc.
In general, \(d+1\) points (in general position) define a \(d\)-dimensional flat, i.e., an affine subspace
If \(n < p\), then the \(n\) data points lie in a flat of at most \(n-1\) dimensions within the \(p\)-dimensional space, so \(\mathbf{x}\) has rank at most \(n\)
\(\therefore\) If \(n < p\), then collinearity is guaranteed
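A quick numerical check of this, with made-up random data (the sizes \(n=3\), \(p=5\) are arbitrary): the rank of \(\mathbf{x}\) cannot exceed \(n\), so at least \(p-n\) eigenvalues of \(\mathbf{x}^T\mathbf{x}\) are zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 3, 5                          # fewer data points than variables
x = rng.normal(size=(n, p))

gram = x.T @ x                       # p x p, but built from only n rows
rank = np.linalg.matrix_rank(x)      # at most min(n, p) = n
evals = np.linalg.eigvalsh(gram)     # ascending order

# The p - n smallest eigenvalues, which should be zero up to rounding
smallest = evals[: p - n]
```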

Thinking about ordinary least squares (5)

Usually we want more information, not less!
With modern techniques it’s easy to have \(p > n\) even when \(n\) is big
“We know too much about each data point to fit a model” sounds absurd
Can we stabilize things somehow, and get rid of the small denominators?

Penalties

One approach is to add a penalty term: pick \(\lambda \geq 0\) and solve \[ \min_{b}{\hat{r}(b) + \lambda\Penalty(b)} \]
The penalty function \(\Penalty\) should be \(\geq 0\), and should, somehow, make the optimization problem more stable
- \(\lambda\) is called the penalty factor or the strength of the penalty
Two popular choices of penalties: the “\(L_1\)” penalty, \[ \Penalty(b) = \sum_{j=1}^{p}{|b_j|} \] and the “\(L_2\)” penalty, \[ \Penalty(b) = \sum_{j=1}^{p}{b_j^2} \]
- \(=\) squared (Euclidean) length of \(b\)
You can guess what the \(L_q\) penalty is
- There are also penalties not of this form

Penalties (2)

For now, we’ll use the \(L_2\) penalty just to be definite. So now we need to pick a \(\lambda\) and we’ll use \[ \hat{\beta}(\lambda) = \argmin_{b}{\hat{r}(b) + \lambda \sum_{j=1}^{p}{b_j^2}} \]
The first bit is just the mean squared error, so we want that to be small
The second bit is the penalty, which wants to make the coefficient vector small
\(\lambda\) controls the trade-off between empirical risk and the squared length of the coefficient vector
- Reducing the MSE by 1 unit is “worth” increasing the squared length of \(b\) by \(1/\lambda\) units
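To see the trade-off in action, here is a sketch that minimizes the \(L_2\)-penalized risk numerically for several values of \(\lambda\). The data are made up, and plain gradient descent with a fixed step is my choice of optimizer for illustration, not something prescribed here; the function names are hypothetical.

```python
import numpy as np

def penalized_risk(b, x, y, lam):
    """Mean squared error plus lam times the squared length of b."""
    resid = y - x @ b
    return np.mean(resid ** 2) + lam * np.sum(b ** 2)

def minimize_penalized(x, y, lam, n_steps=5000):
    """Plain gradient descent on the penalized risk (a sketch, not tuned)."""
    n, p = x.shape
    # Upper bound on the curvature of the objective; its reciprocal is a safe step
    curvature = 2.0 * (np.linalg.eigvalsh(x.T @ x / n)[-1] + lam)
    step = 1.0 / curvature
    b = np.zeros(p)
    for _ in range(n_steps):
        grad = -2.0 * x.T @ (y - x @ b) / n + 2.0 * lam * b
        b = b - step * grad
    return b

rng = np.random.default_rng(3)
x = rng.normal(size=(50, 2))
y = x @ np.array([3.0, -2.0]) + rng.normal(size=50)

lams = [0.0, 0.1, 1.0, 10.0]
fits = [minimize_penalized(x, y, lam) for lam in lams]
lengths = [np.linalg.norm(b) for b in fits]          # shrink as lam grows
mses = [penalized_risk(b, x, y, 0.0) for b in fits]  # grow as lam grows
```

As \(\lambda\) increases, the fitted coefficient vector gets shorter and the in-sample MSE gets worse, which is the trade-off that \(\lambda\) prices.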

Some pictures

First let’s draw what the empirical-risk surface \(\EmpRisk(b)\) looks like
- Some made-up 2D data; see .Rmd file for the details

Some pictures (2)

What does the \(L_2\) penalty surface \(\Penalty(b)\) look like?
- Marked the origin and the minimum of \(\EmpRisk\)

Some pictures (3)

Now what does the combined empirical-risk-and-penalty surface \(\EmpRisk(b) + \lambda \Penalty(b)\) look like?
- \(\lambda=1\) for simplicity

What does the penalty do?

It shrinks the estimated coefficient vector towards the origin, away from the empirical risk minimizer
The shrinkage is always towards the same point (the origin), whatever the data, so the penalized estimate is
- Less sensitive to the data
- More stable in the face of noise in the data
- Lower variance than the ERM
- More biased than the ERM (unless the optimal vector really is the origin)
We say that the penalty regularizes the optimization problem, and the estimate
Bigger \(\lambda \Rightarrow\) more regularization

What specifically does the \(L_2\) penalty do?

It’s not too hard to show (HW!) that \[ \hat{\beta}(\lambda) = (\mathbf{x}^T \mathbf{x} + (\text{spoiler}) \mathbf{I})^{-1} \mathbf{x}^T\mathbf{y} \]
- So we add something to the diagonal of \(\mathbf{x}^T\mathbf{x}\)
- Even if \(\mathbf{x}\) is collinear, this will break the collinearity when we come to calculate the coefficients
- Keeps the eigenvalues from getting too close to 0
This is called ridge regression or Tikhonov regularization
- Adding a “ridge” along the diagonal of the \(\mathbf{x}^T\mathbf{x}\) matrix
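To see why adding to the diagonal helps, take any constant \(k > 0\) (a stand-in for illustration, not the value the homework asks for) and use the eigen-decomposition \(\mathbf{x}^T\mathbf{x} = \mathbf{v}^T \mathbf{d} \mathbf{v}\) with \(\mathbf{v}^T\mathbf{v} = \mathbf{I}\): \[ \mathbf{x}^T\mathbf{x} + k\mathbf{I} = \mathbf{v}^T \mathbf{d} \mathbf{v} + k \mathbf{v}^T \mathbf{v} = \mathbf{v}^T (\mathbf{d} + k\mathbf{I}) \mathbf{v} \]
- The eigenvectors are unchanged, and each eigenvalue \(d_j \geq 0\) becomes \(d_j + k \geq k > 0\)
- So the new matrix is always invertible, and its reciprocal eigenvalues are at most \(1/k\)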

What about \(L_1\)?

What does the \(L_1\) penalty look like?

What about \(L_1\)? (2)

Combined empirical risk plus \(L_1\) penalty

The “corners” of \(L_1\) come through, and favor driving some coordinates of the coefficient vector to \(0\)
- \(L_1\) is a sparsity-promoting penalty, unlike \(L_2\)
Least squares plus \(L_1\) penalty is called the lasso
There’s no closed formula for the lasso, unlike ridge regression
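One special case does have a formula, and it shows where the sparsity comes from: with a single coefficient, minimizing \((z-b)^2 + \lambda|b|\) gives "soft-thresholding", while minimizing \((z-b)^2 + \lambda b^2\) gives proportional shrinkage. The sketch below is a one-dimensional illustration only, not a general lasso solver; the function names and the numbers are made up.

```python
def lasso_1d(z, lam):
    """Minimizer over b of (z - b)**2 + lam * abs(b): soft-threshold at lam / 2."""
    if z > lam / 2:
        return z - lam / 2
    if z < -lam / 2:
        return z + lam / 2
    return 0.0          # small inputs are set exactly to zero

def ridge_1d(z, lam):
    """Minimizer over b of (z - b)**2 + lam * b**2: shrink by a constant factor."""
    return z / (1.0 + lam)

lam = 1.0
zs = [-2.0, -0.3, 0.0, 0.4, 3.0]
lasso_fits = [lasso_1d(z, lam) for z in zs]   # [-1.5, 0.0, 0.0, 0.0, 2.5]
ridge_fits = [ridge_1d(z, lam) for z in zs]   # approx [-1.0, -0.15, 0.0, 0.2, 1.5]
```

The \(L_1\) version sets every input with \(|z| \leq \lambda/2\) exactly to zero; the \(L_2\) version only ever returns zero when the input is already zero.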

Penalties \(\Leftrightarrow\) Constraints

Another way to regularize is to add a constraint \[ \hat{\beta}(c) = \argmin_{b: \Penalty(b) \leq c}{\EmpRisk(b)} \]
The constraint is that \(\Penalty(b) \leq c\)
The constraint reduces the feasible set (as defined last time)
An \(L_2\) constraint would say: “Find us the coefficient vector with the smallest MSE, among all vectors whose length is \(\leq \sqrt{c}\)”
- Whereas ordinary least squares says: “Find us the coefficient vector with the smallest MSE, no matter how long it might be”

Constrained optimization in general

Get a bit more abstract for a moment and think about constrained optimization in general \[\begin{eqnarray} \optimand^* & = & \argmin_{\optimand \in \OptDomain}{\ObjFunc(\optimand)}\\ & \text{subject to} &\\ \Penalty(\optimand) & \leq & c \end{eqnarray}\]
How might we solve this?

Use the constraint equation \(\Penalty(\optimand) = c\) to eliminate a degree of freedom
- i.e., write one coordinate in \(\optimand\) as a function of the others and of \(c\)
- Do unconstrained optimization over the remaining degrees of freedom
- What about the \(\leq\) case?!?
Add a new variable and do unconstrained optimization over a larger problem

Lagrange multipliers

If we have one equality constraint, say \(\Penalty(\optimand) = c\), we’d add a Lagrange multiplier: \[ \min_{\optimand \in \OptDomain, \lambda \in \mathbb{R}}{\ObjFunc(\optimand) + \lambda(\Penalty(\optimand) - c)} = \min_{\optimand \in \OptDomain, \lambda \in \mathbb{R}}{\Lagrangian(\optimand, \lambda)} \]
Do unconstrained optimization over both \(\optimand\) and \(\lambda\); start with the derivatives \[\begin{eqnarray} \frac{\partial \Lagrangian}{\partial \lambda} & = & \Penalty(\optimand) - c\\ \frac{\partial \Lagrangian}{\partial \optimand} & = & \frac{\partial \ObjFunc}{\partial \optimand} + \lambda \frac{\partial \Penalty}{\partial \optimand} \end{eqnarray}\]
- Set derivatives to zero and solve for \(\optimand^*, \lambda^*\)
- One extra unknown but also one extra equation…

Lagrange multipliers (2)

Set to zero: \[\begin{eqnarray} \Penalty(\optimand^*) & = & c\\ \frac{\partial \ObjFunc}{\partial \optimand}(\optimand^*) & = & -\lambda^* \frac{\partial \Penalty}{\partial \optimand}(\optimand^*) \end{eqnarray}\]
First equation is the constraint again
- So \(\Lagrangian(\optimand^*, \lambda^*) = \ObjFunc(\optimand^*)\) always
Second equation involves both \(\optimand^*\) and \(\lambda^*\)
\(\optimand^*\) will not be the same as the unconstrained optimum \(\optimand\)
- Unless that optimum happened to satisfy the constraint already, in which case \(\lambda^* = 0\)
Solving this system might be easy or hard, but the solution does give the constrained optimum
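A worked example where the system is easy (the objective and the target point \(t\) are made up for illustration): take \(\ObjFunc(\optimand) = \sum_{j}{(\optimand_j - t_j)^2}\) for a fixed \(t \neq 0\), and \(\Penalty(\optimand) = \sum_{j}{\optimand_j^2}\), with \(c > 0\). The two equations become \[\begin{eqnarray} \sum_{j}{(\optimand^*_j)^2} & = & c\\ 2(\optimand^* - t) & = & -2\lambda^* \optimand^* \end{eqnarray}\]
- Second equation: \(\optimand^* = t/(1+\lambda^*)\), so the constrained optimum is a rescaled copy of the unconstrained optimum \(t\)
- Substituting into the first equation: \(\|t\|^2/(1+\lambda^*)^2 = c\), so \(1+\lambda^* = \pm\|t\|/\sqrt{c}\)
- The \(+\) root is the minimum: \(\lambda^* = \|t\|/\sqrt{c} - 1\) and \(\optimand^* = \sqrt{c}\, t/\|t\|\) (the \(-\) root is the maximum over the constraint set)
- If \(t\) already satisfies the constraint, \(\|t\|^2 = c\), we get \(\lambda^* = 0\) and \(\optimand^* = t\)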

Lagrange multipliers are prices

Changing the constraint level \(c\) changes \(\optimand^*\) and \(\ObjFunc(\optimand^*)\): \[\begin{eqnarray} \frac{\partial \ObjFunc(\optimand^*)}{\partial c} & = & \frac{\partial \Lagrangian(\optimand^*,\lambda^*)}{\partial c}\\ & = & \frac{\partial}{\partial c}\left( \ObjFunc(\optimand^*) + \lambda^*(\Penalty(\optimand^*) - c)\right)\\ & = & \left[\frac{\partial\ObjFunc}{\partial \optimand}(\optimand^*)+\lambda^*\frac{\partial \Penalty}{\partial \optimand}(\optimand^*) \right]\frac{\partial \optimand^*}{\partial c} + \left[\Penalty(\optimand^*)-c\right]\frac{\partial \lambda^*}{\partial c} - \lambda^*\\ & = & [0] \frac{\partial \optimand^*}{\partial c} + [0] \frac{\partial \lambda^*}{\partial c} - \lambda^*\\ & = & -\lambda^* \end{eqnarray}\]
\(\lambda^* =\) Rate at which the optimal value improves as the constraint is relaxed
\(\lambda^* =\) How much would you pay for a marginal change in the level of the constraint, your shadow price for that constraint
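A numerical check of the price interpretation, on a made-up problem: minimize \(\ObjFunc(\optimand) = \|\optimand - t\|^2\) subject to \(\|\optimand\|^2 = c\) in two dimensions. For this objective, solving the stationarity conditions by hand gives \(\lambda^* = \|t\|/\sqrt{c} - 1\). The sketch finds the constrained minimum by brute force (a grid over the angle, my choice for simplicity) and compares a finite-difference derivative in \(c\) with \(-\lambda^*\):

```python
import numpy as np

t = np.array([3.0, 4.0])     # unconstrained optimum of M; its length is 5

def constrained_min(c, n_grid=200001):
    """Smallest value of |theta - t|^2 over the circle |theta|^2 = c.

    The constraint eliminates one degree of freedom: every point on the
    circle is sqrt(c) * (cos(phi), sin(phi)), so we search over phi alone.
    """
    phi = np.linspace(0.0, 2.0 * np.pi, n_grid)
    theta = np.sqrt(c) * np.column_stack([np.cos(phi), np.sin(phi)])
    values = np.sum((theta - t) ** 2, axis=1)
    return values.min()

c, h = 4.0, 1e-3
# Central finite difference for d M(theta*) / d c
dM_dc = (constrained_min(c + h) - constrained_min(c - h)) / (2.0 * h)
# Multiplier from the hand calculation for this particular M
lam_star = np.linalg.norm(t) / np.sqrt(c) - 1.0      # 5 / 2 - 1 = 1.5
```

Up to discretization error, `dM_dc` should match `-lam_star`: loosening the constraint by a small amount \(\delta\) lowers the optimal value by about \(\lambda^* \delta\).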

Lagrange multipliers vs. penalties

Once we know \(\lambda^*\), we just do the optimization with an extra term: \[\begin{eqnarray} \optimand^* & = & \argmin_{\optimand \in \OptDomain}{\ObjFunc(\optimand) + \lambda^*(\Penalty(\optimand) - c)}\\ & = & \argmin_{\optimand \in \OptDomain}{\ObjFunc(\optimand) + \lambda^* \Penalty(\optimand)} \end{eqnarray}\]
This is a penalized optimization problem
Lagrange multipliers turn constrained optimization into penalized optimization
- The penalty factor \(\lambda\) corresponds to the constraint level \(c\)
- “A fine is a price”

Many constraints

For multiple equality constraints \(\Penalty_1(\optimand) = c_1\), \(\Penalty_2(\optimand) = c_2\), \(\ldots\) \(\Penalty_k(\optimand) = c_k\), we add \(k\) Lagrange multipliers: \[ \Lagrangian(\optimand, \lambda_1, \ldots \lambda_k) = \ObjFunc(\optimand) + \sum_{j=1}^{k}{\lambda_j (\Penalty_j(\optimand) - c_j)} \]
Each constraint equation gets recovered when we take the derivative w.r.t. that multiplier
Each multiplier tells us our shadow price for loosening each constraint
Equivalently: adding many penalty terms

Inequality constraints

What if the constraint is that \(\Penalty(\optimand) \leq c\)? (not \(=c\))
We add on the Lagrange multiplier anyway, as though it were an equality
Case 1: the global, unconstrained optimum obeys the constraint
- The constraint does not bind or bite
- We should get \(\lambda^* = 0\)
Case 2: the unconstrained optimum is outside the constrained feasible set
- The constraint binds, or is binding, or bites
- The constrained optimum \(\optimand^*\) is a point where \(\Penalty(\optimand^*) = c\)
- We’ll get \(\lambda^* \neq 0\)
- (Ignoring some subtleties, which form the “Karush-Kuhn-Tucker theorem”)
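Both cases show up in a small made-up example: minimize \(\sum_j{(\optimand_j - t_j)^2}\) subject to \(\sum_j{\optimand_j^2} \leq c\). The sketch below hard-codes the solution for this one objective (it is not a general solver, and the function name is hypothetical): if \(t\) is feasible the constraint is slack, otherwise the optimum is \(t\) rescaled onto the boundary.

```python
import math

def constrained_quadratic(t, c):
    """Minimize sum((theta_j - t_j)^2) subject to sum(theta_j^2) <= c.

    Returns (theta_star, lam_star) for this specific objective only.
    """
    norm_t = math.sqrt(sum(tj ** 2 for tj in t))
    if norm_t ** 2 <= c:
        # Case 1: the unconstrained optimum t is feasible, so the
        # constraint does not bind and the multiplier is zero
        return list(t), 0.0
    # Case 2: the constraint binds; rescale t onto the boundary
    scale = math.sqrt(c) / norm_t
    return [scale * tj for tj in t], 1.0 / scale - 1.0

slack = constrained_quadratic([0.6, 0.8], c=4.0)     # ([0.6, 0.8], 0.0)
binding = constrained_quadratic([3.0, 4.0], c=4.0)   # approx ([1.2, 1.6], 1.5)
```

In the binding case the returned point has squared length \(c\) and a strictly positive multiplier; in the slack case the point is just \(t\) and the multiplier is 0.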

Summing up on constraints and Lagrange multipliers

Equality constraints act like penalties
- Penalty factor \(\lambda\) \(\Leftrightarrow\) Lagrange multiplier enforcing the constraint
Loosening the constraints \(\Leftrightarrow\) weakening the penalties
Inequality constraints, like \(\Penalty(\optimand) \leq c\), get treated the same way
- Some multipliers/penalty factors might be 0 if those constraints don’t bite
Penalties and constraints are different ways of looking at the same thing

Mathematical programming

Optimization under constraints is called mathematical programming
- The name goes back to the 1930s and is older than “computer programming”
Linear programming: optimize a linear objective function under linear constraints
- Basically invented by Kantorovich for economic planning in the USSR in the 1930s, re-invented in the West for logistics and decision support in WWII (“operations research”), then adapted to corporate decision making, financial portfolio allocation, etc.
Convex programming: optimize a convex function under convex constraints
- A convex feasible set means: take any two points in the feasible set; every point on the line segment between them is also in the feasible set
- Includes linear programming as a special case
There are efficient (polynomial-time) algorithms for convex programming problems
- Finding an \(\epsilon\)-approximate optimum over \(p\) variables with \(k\) constraints takes time \(O((k+p)^{3/2} p^2 \log{1/\epsilon})\)

Mathematical programming (2)

The \(L_1\) and \(L_2\) constraints are convex
- \(L_1 =\) all points inside a diamond around the origin
- \(L_2 =\) all points inside a ball around the origin
\(\therefore\) We can use convex-programming algorithms to find the constrained/penalized optimum
- Not needed for ridge/\(L_2\) but very helpful for lasso/\(L_1\)
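Why the constraint sets are convex, spelled out: take any \(b\), \(b^{\prime}\) with \(\Penalty(b) \leq c\) and \(\Penalty(b^{\prime}) \leq c\), and any \(\alpha \in [0,1]\). For \(L_1\), the triangle inequality gives \[ \sum_{j=1}^{p}{|\alpha b_j + (1-\alpha) b^{\prime}_j|} \leq \alpha \sum_{j=1}^{p}{|b_j|} + (1-\alpha)\sum_{j=1}^{p}{|b^{\prime}_j|} \leq c \] and for \(L_2\), convexity of \(u \mapsto u^2\) gives \[ \sum_{j=1}^{p}{(\alpha b_j + (1-\alpha) b^{\prime}_j)^2} \leq \alpha \sum_{j=1}^{p}{b_j^2} + (1-\alpha)\sum_{j=1}^{p}{(b^{\prime}_j)^2} \leq c \]
- Either way, every point on the segment between \(b\) and \(b^{\prime}\) also satisfies the constraint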

What do constraints/penalties do to learning and risk?

Constraints reduce the feasible set
- From all of \(\OptDomain\) to \(\Penalty(\optimand) \leq c\)
For learning problems: constraints mean a smaller set of allowable strategies
- From all of \(\ModelClass\) to \(s \in \ModelClass: \Penalty(s) \leq c\)
- Call this sub-set \(\ModelClass_c\)
Smaller strategy space \(\Rightarrow\) higher best-in-class risk, \(\min_{s \in S_c}{\Risk(s)} \geq \min_{s \in S}{\Risk(s)}\)
Smaller strategy space \(\Rightarrow\) lower maximum deviation, \(\max_{s \in S_c}{|\Risk(s) - \EmpRisk(s)|} \leq \max_{s \in S}{|\Risk(s) - \EmpRisk(s)|}\)
Smaller strategy space \(\Rightarrow\) lower Rademacher complexity, growth function, and VC dimension
Constraints \(\Rightarrow\) more approximation error, less estimation error
Since penalties are equivalent to constraints, all of this applies to penalties as well
Since we only care about \((\text{approximation}) + (\text{estimation})\), regularization often helps

Summing up

Optimization problems are often “ill-posed”, “irregular”, unstable
- In learning: often (but not just) from high-dimensional data
We respond by regularizing: either adding a penalty to the objective function, or constraining the feasible set
- Constraints and penalties are equivalent via Lagrange multipliers
- Lagrange multiplier \(=\) price for loosening the constraint
Penalty form: just another objective function to minimize
Constraint form: special algorithms (often more computationally efficient)
In statistical learning, two of the most useful penalties/constraints are the \(L_1\) and \(L_2\) penalties on coefficient vectors
- \(L_2\) shrinks towards the origin
- \(L_1\) shrinks and favors sparsity (some coefficients exactly 0)
Regularization increases approximation error but reduces estimation error, so it is often a net advantage in learning

Backup: “Comrades, let’s optimize!”

Kantorovich invented linear programming to help solve economic planning problems in the USSR (Kantorovich 1965)
In the 1950s and especially 1960s, there were serious efforts to use mathematical programming, with computers, to do planning for the whole of the Soviet economy
This was the first time a lot of insanely talented, dedicated and ambitious people tried to use the power of computers, data, and optimization to disrupt / fix the world
A lot of people at the time, not just in the USSR, thought it would succeed
- Many western economists tried very hard to argue that markets were actually as good as optimization-based planning! (Dorfman, Samuelson, and Solow 1958)
Spoiler: optimization did not, in fact, lead to Communist utopia
Spufford (2010) is an incredibly good book about this, which I think every aspiring “data scientist” ought to be required to read

References

Kantorovich, L. V. 1965. The Best Use of Economic Resources. Cambridge, Massachusetts: Harvard University Press.

Dorfman, Robert, Paul A. Samuelson, and Robert M. Solow. 1958. Linear Programming and Economic Analysis. New York: McGraw-Hill.

Spufford, Francis. 2010. Red Plenty. London: Faber and Faber.

Regularizing Optimization with Penalties and Constraints
