\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \]

Previously

We have a class of models/strategies $\ModelClass$ and a loss function $\Loss$, and we want to learn a good (= low expected loss = low risk) strategy using data $(X_1, Y_1), \ldots, (X_n, Y_n)$
Maybe we use the model which minimizes the empirical risk $\hat{s} = \argmin_{s \in \ModelClass}{\EmpRisk(s)}$
- Or maybe we regularize by adding on a penalty, $\min_{s \in \ModelClass}{\EmpRisk(s) + \Penalty(s)}$
- Or maybe we constrain to strategies where $\Penalty(s) \leq c$
- (penalties $\Leftrightarrow$ constraints by Lagrange)
We want to know that $\Risk(\hat{s}) - \EmpRisk(\hat{s})$ isn’t too big, or that $\Risk(\hat{s}) - \Risk(\OptimalModel)$ isn’t too big
- Or similarly for the penalized/constrained estimates
How have we done this?

Capacity of a class of models or functions

In general, infinitely many $s \in \ModelClass$
Also in general, lots of them are very similar
Especially if we’re just considering a limited amount of data
- If we’re doing binary classification with $n$ data points, there are at most $2^n \ll \infty$ different ways of labeling the data, and often $\ll 2^n$ if we can only use functions from $\ModelClass$ to do the labeling
- Similarly for continuous functions if we only care when they are, say, $\epsilon$ apart
Capacity of $\ModelClass =$ how many distinguishable functions/strategies are there in $\ModelClass$, and how does that grow with $n$?
- Precise definitions tailored to various applications
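
To make the counting argument concrete, here is a minimal sketch, assuming a very simple model class that is not in the slides: one-dimensional threshold classifiers $s_t(x) = \Indicator{x \geq t}$. The function name `count_threshold_labelings` and the simulated data are illustrative assumptions. With $n$ distinct points, this class can only produce $n+1$ of the $2^n$ possible labelings.

```python
import random

def count_threshold_labelings(xs):
    """Count distinct 0/1 labelings of xs produced by s_t(x) = 1 if x >= t else 0."""
    pts = sorted(xs)
    # One threshold below every point, one between each pair of neighbours,
    # and one above every point: no other threshold gives a new labeling.
    cuts = [pts[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    cuts += [pts[-1] + 1.0]
    labelings = {tuple(int(x >= t) for x in pts) for t in cuts}
    return len(labelings)

rng = random.Random(0)  # fixed seed, purely for reproducibility
n = 10
xs = [rng.random() for _ in range(n)]  # hypothetical data points
n_labelings = count_threshold_labelings(xs)
print(n_labelings, "labelings out of", 2 ** n, "possible")
```

The count grows linearly in $n$ rather than exponentially, which is the sense in which this class has low capacity.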

Capacity and model complexity

All our measures of model complexity are really measures of capacity
Rademacher complexity $=$ high ability to (seem to) correlate with random noise; needs lots of very differently-shaped functions in $\ModelClass$

Growth function $=$ worst-case bound on number of distinguishable binary-valued functions; low growth implies low capacity under any distribution
- Covering number works similarly for continuous functions
VC dimension $=$ controls the growth function; low VC dimension implies low capacity under any distribution

What makes these measures of model complexity?

A complex model (or model class) is one which can do many different things: it has lots of functions/strategies which are all very different from each other
- Not talking about whether any particular function is complicated, just about the diversity of available functions
  - Take any arbitrarily hard-to-describe function $f$; if $\ModelClass = \left\{ f, -f\right\}$, then picking the right function should be easy and we can get very tight risk bounds
Complex model classes have higher capacity and looser generalization error bounds
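
The contrast between a two-function class and a rich class can be checked numerically. This is a rough Monte Carlo sketch, assuming the convention $\EmpRademacher = \Expect{\max_{s \in \ModelClass}{\frac{1}{n}\sum_{i}{\sigma_i s(x_i)}}}$ with independent random signs $\sigma_i$ (some texts put an absolute value inside the max). The function `empirical_rademacher`, the sample size, and the particular $\pm 1$ pattern standing in for $f$ are all illustrative assumptions.

```python
import itertools
import random

def empirical_rademacher(values, n_draws=500, seed=0):
    """Monte Carlo estimate of E_sigma[ max_s (1/n) sum_i sigma_i * s(x_i) ].

    values: list of tuples, each holding (s(x_1), ..., s(x_n)) for one function s.
    """
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]  # random noise signs
        best = max(sum(s * v for s, v in zip(sigma, vals)) for vals in values)
        total += best / n
    return total / n_draws

n = 8
f = (1, -1, -1, 1, 1, 1, -1, 1)           # stand-in for an arbitrary function's values
small_class = [f, tuple(-v for v in f)]    # the class {f, -f}
rich_class = list(itertools.product((-1, 1), repeat=n))  # every sign pattern on n points

small = empirical_rademacher(small_class)
rich = empirical_rademacher(rich_class)
print("{f, -f}:", small, " all sign patterns:", rich)
```

The rich class can match any noise sequence perfectly, so its estimate is 1, while $\left\{ f, -f\right\}$ can only correlate weakly with noise no matter how complicated $f$ itself is.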

Why would we ever want high complexity models?

Think back to our decomposition of the risk, applied to the ERM $\hat{s}$: \[ \Risk(\hat{s}) = \Risk_0 + (\Risk(\OptimalModel) - \Risk_0) + (\Risk(\hat{s}) - \Risk(\OptimalModel)) \]
- $\Risk_0 =$ true minimum risk (usually not 0)
- $\Risk(\OptimalModel) - \Risk_0 =$ approximation error (due to using $\ModelClass$)
- $\Risk(\hat{s}) - \Risk(\OptimalModel) =$ estimation error (due to not using $\OptimalModel$)
What happens (again) as we make $\ModelClass$ more complex?
$\Risk(\OptimalModel) = \min_{s \in \ModelClass}{\Risk(s)}$ so a more complex model class reduces approximation error
- $\min$ over a larger set must be $\leq$ $\min$ over a smaller set
$\Risk(\hat{s}) - \Risk(\OptimalModel)$ is controlled by $\max_{s \in \ModelClass}{|\EmpRisk(s) - \Risk(s)|}$
$\Rightarrow$ A more complex model class increases estimation error
- $\max$ over a larger set is $\geq$ $\max$ over a smaller set
- Rademacher complexity etc. directly relate to maximum deviation!
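
For reference, the usual argument for why the maximum deviation controls the estimation error can be written out as follows. This is a standard derivation sketched here under the assumption that $\hat{s}$ is the exact empirical risk minimizer over $\ModelClass$ and that $\OptimalModel \in \ModelClass$; it uses the macros defined at the top of these slides.

```latex
\begin{align*}
\Risk(\hat{s}) - \Risk(\OptimalModel)
  &= \left[\Risk(\hat{s}) - \EmpRisk(\hat{s})\right]
   + \left[\EmpRisk(\hat{s}) - \EmpRisk(\OptimalModel)\right]
   + \left[\EmpRisk(\OptimalModel) - \Risk(\OptimalModel)\right] \\
  &\leq \left[\Risk(\hat{s}) - \EmpRisk(\hat{s})\right]
   + 0
   + \left[\EmpRisk(\OptimalModel) - \Risk(\OptimalModel)\right] \\
  &\leq 2 \max_{s \in \ModelClass}{|\EmpRisk(s) - \Risk(s)|}
\end{align*}
```

The middle bracket is $\leq 0$ because $\hat{s}$ minimizes the empirical risk over $\ModelClass$; each of the other two brackets is at most the maximum deviation.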

What does regularization do?

Penalties are nicer to implement, but it’s easier to think about constraints
- Again, Lagrange tells us penalties and constraints are equivalent
A constraint “cuts down” the strategy space from $\ModelClass$ to $\left\{ s \in \ModelClass ~ : ~ \Penalty(s) \leq c\right\}$, for short $\ModelClass_{c}$
A non-trivial constraint means $\ModelClass_{c} \subset \ModelClass$
A non-trivial constraint means that $\ModelClass_{c}$ has higher approximation error than $\ModelClass$
- In regression terms: Regularization adds bias
  - Unless we’re very lucky and the best approximation satisfies the constraint…
A non-trivial constraint means that $\ModelClass_{c}$ has lower estimation error than $\ModelClass$
- In regression terms: Regularization removes variance
We only care about the sum of approximation $+$ estimation errors, so regularization can help
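
A small numerical sketch of the penalty/constraint correspondence, assuming the simplest possible setting, which is not from the slides: a one-coefficient, no-intercept linear model with squared-error loss and penalty $\lambda b^2$. The helper names `ridge_slope` and `empirical_risk` and the simulated data are illustrative assumptions. Each $\lambda$ implicitly selects a constraint level $c = |b_\lambda|$, and a bigger penalty means a tighter constraint and a worse fit to the training data.

```python
import random

def ridge_slope(xs, ys, lam):
    """Minimizer of (1/n) * sum (y - b*x)^2 + lam * b^2 (one coefficient, no intercept)."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

def empirical_risk(xs, ys, b):
    """Mean squared error of the slope b on the training data."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)  # fixed seed, purely for reproducibility
xs = [rng.gauss(0.0, 1.0) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # hypothetical true slope of 2

lams = [0.0, 0.1, 1.0, 10.0]
slopes = [ridge_slope(xs, ys, lam) for lam in lams]
risks = [empirical_risk(xs, ys, b) for b in slopes]
for lam, b, rk in zip(lams, slopes, risks):
    print(f"lambda={lam:5.1f}  |b|={abs(b):.3f}  empirical risk={rk:.3f}")
```

Here $|b_\lambda|$ shrinks and the empirical risk rises as $\lambda$ grows, which is the "cutting down" of the strategy space seen from the penalty side.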

How much regularization?

How much approximation error do we add? \[ \min_{s \in \ModelClass_{c}}{\Risk(s)} - \min_{s^{\prime} \in \ModelClass}{\Risk(s^{\prime})} \]
- Involves how much the constraint cuts down the model space, but does not change with $n$
- Roughly, think of this as being $O(\lambda)$ (the shadow price of enforcing the constraint)
How much estimation error do we remove? \[ \left(\max_{s^{\prime} \in \ModelClass}{|\Risk(s^{\prime}) - \EmpRisk(s^{\prime})|}\right) - \left(\max_{s \in \ModelClass_{c}}{|\Risk(s) - \EmpRisk(s)|}\right) \]
- Involves how much the constraint cuts down the model space, but also $n$

Why we should regularize less and less as $n$ grows

We know from our generalization error bounds that \[ \max_{s^{\prime} \in \ModelClass}{|\Risk(s^{\prime}) - \EmpRisk(s^{\prime})|} \leq O\left(\sqrt{\frac{\log{n}}{n}}\right) \]
- (Assuming $\ModelClass$ has shrinking Rademacher complexity etc.)
This will be even smaller for $\ModelClass_c$, but still at the same kind of rate
$\Rightarrow$ the reduction in estimation error from using $\ModelClass_c$ instead of $\ModelClass$ is at most $O\left(\sqrt{\frac{\log{n}}{n}}\right) \rightarrow 0$ as $n\rightarrow\infty$
On the other hand, the increase in approximation error is $O(1)$ as $n\rightarrow\infty$: it does not shrink with $n$
Using $\ModelClass_c$ reduces over-all risk for small $n$, but not for large enough $n$

“The method of sieves”

A sieve is a tool with a mesh which catches big things poured through it but lets smaller ones through
The method of sieves says: at each $n$, impose the constraint $\Penalty(s) \leq c_n$, with $c_n \rightarrow \infty$ “slowly enough”
- When $c_n$ is small, we’re using a coarse mesh and only big, crude features of the data get caught by the model
- When $c_n$ is big, we’re using a fine mesh and tiny, delicate features get caught by the model
“Slowly enough”: sum of approximation plus estimation errors should go to 0
- $c_n \rightarrow \infty$ faster: estimation error blows up
- $c_n \rightarrow \infty$ slower: approximation error is bigger than it needs to be

Let’s be a little concrete

Rademacher complexity for the class of linear models whose coefficient vectors satisfy $\|\beta\| \leq c$ is $\leq \frac{r c}{\sqrt{n}}$
- When $\Prob{\|X\| \leq r} = 1$
An $L_2$ constraint is exactly a constraint on $\|\beta\|$
We know from [Homework 7 Q2] that $\lambda=O(1/c)$ and approximation error is $O(\lambda)$
Estimation error goes like $2\Rademacher_n$
So we want \[ O(1/c_n) + O\left(\frac{r c_n}{\sqrt{n}}\right) \rightarrow 0 \]
Differentiate with respect to $c_n$ and set to zero to minimize: $-O\left(\frac{1}{c_n^2}\right) + O\left(\frac{r}{\sqrt{n}}\right) = 0$, so $c_n^2 = O(n^{1/2})$, i.e. $c_n=O(n^{1/4})$
- Makes both the approximation and the estimation contributions $O(n^{-1/4})$
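
The calculation above can be checked numerically. This is a minimal sketch, assuming every constant hidden in the $O(\cdot)$ terms equals 1, so the bound is exactly $a/c + rc/\sqrt{n}$ with $a = r = 1$; the names `total_error` and `best_constraint` are illustrative. Under that assumption the minimizer is $c^* = \sqrt{a/r}\, n^{1/4}$ and the minimum value is $2\sqrt{ar}\, n^{-1/4}$.

```python
import math

def total_error(c, n, r=1.0, a=1.0):
    """Approximation term a/c plus estimation term r*c/sqrt(n) (constants assumed)."""
    return a / c + r * c / math.sqrt(n)

def best_constraint(n, r=1.0, a=1.0):
    """Closed-form minimizer of total_error over c: sqrt(a/r) * n^(1/4)."""
    return math.sqrt(a / r) * n ** 0.25

grid = [0.01 * k for k in range(1, 20001)]  # candidate constraint levels, 0.01 to 200
grid_minimizers = {}
for n in (100, 10_000, 1_000_000):
    c_star = best_constraint(n)
    c_grid = min(grid, key=lambda c: total_error(c, n))  # brute-force check
    grid_minimizers[n] = c_grid
    print(f"n={n:>9}  c*={c_star:.3f}  grid minimizer={c_grid:.2f}  "
          f"bound at c*={total_error(c_star, n):.4f}")
```

The optimal constraint level grows with $n$ while the bound at the optimum shrinks like $n^{-1/4}$, slower than the $n^{-1/2}$ rate available for a fixed model class with no approximation error.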

Let’s be a little concrete (2)

Dashed lines: $n$-independent approximation error terms $\propto 1/c$; dotted lines: estimation error terms $\propto c/\sqrt{n}$; solid lines: sum of approximation and estimation error terms; blue: smaller $n$; green: larger $n$. Note the log scale on the horizontal axis to show details. Observe how the minimum of the solid curves ($=$ optimal level of the constraint) is at a larger $c$ for the larger sample size.

Summing up

A model class is complex when it contains many strategies which are all very different from each other (“high capacity”)
- Rademacher complexity, the growth function, etc., are ways to measure this
Complex model classes (tend to) have lower approximation error than simpler ones
Learning in complex model classes has larger estimation error than learning in simpler model classes
Regularization makes the model class less complex
- More regularization means more approximation error, less estimation error
As $n$ grows, we usually want less and less regularization
- Approximation error is $O(1)$ as $n$ grows, but estimation error is $O(1/\sqrt{n})$ (up to possible $\log{n}$ factors)
Sieves: relax the constraint on a specific schedule that sends both approximation error and estimation error $\rightarrow 0$
Next time: cross-validation instead of sieves

Backup: Further reading on sieves

The method of sieves, as an explicit technique, including the name, was invented by the late Ulf Grenander
- A very distinguished statistician / applied mathematician who also made important contributions to time series analysis (Grenander and Rosenblatt 1957), to pure probability (“probability on algebraic structures”), and to generative models of visual shape and their application to pattern recognition (Grenander and Manbeck 1993; Grenander, Chow, and Keenan 1991; Grenander 1996)
  - I strongly recommend reading Grenander (1996) as a complement to the kind of learning theory we’re doing in this course
The canonical citation for the method of sieves is Grenander (1981)
I have never read this book and never managed to track down a copy
- The CMU library supposedly has one but it’s listed as unavailable…
I am quite sure (at least) 90% of the citations to it are from people who also have never read it
In the statistics literature, what people actually read was Geman and Hwang (1982), which is short, reasonably clear, and accessible
- Geman and Grenander were both at Brown University…
Geer (2000) brings the method of sieves within the kind of learning theory we’re doing
Why oh why can’t we have a better academic publishing system: Grenander (1981) has been out of print almost since it was first published; it’s not even listed in the publisher’s catalog; used copies start at $250 online; but it’s been cited over 1100 times so there’d be a lot of interest if only it were available to read; but it can’t legally be made available to read because of the publisher (who is making no money from it, nor are Grenander’s heirs)

Backup: “Stochastic complexity”

If we’re using the log-probability loss, it can be useful to define the stochastic complexity of a model class $\ModelClass$ as follows: \[ \mathcal{C}(\ModelClass, n) = \log{\sum_{z_{1:n}}{\max_{s \in \ModelClass}{P(z_{1:n}; s)}}} \]
In words, we take each possible data set (of length $n$), and ask what’s the highest probability we can give that data set, using distributions from $\ModelClass$; we then sum over all possible data sets and take the log
If every data set can be given probability 1 by some distribution in $\ModelClass$, every term in the sum is 1, so the sum is $|\mathcal{Z}|^n$ and the stochastic complexity $\mathcal{C}(\ModelClass, n)$ is $n\log{|\mathcal{Z}|}$
- This is clearly the largest possible value
- Here $|\mathcal{Z}|$ is the number of possible values for a data point $z_i$
This is another measure of capacity, “tuned” to using probability distributions (and the log-probability loss)
The stochastic complexity turns out to control the possible over-fitting in terms of the log loss
If $\ModelClass$ is a $d$-parameter family that obeys some sensible regularity conditions, then $\mathcal{C}(\ModelClass, n) = \frac{d}{2}\log{\frac{n}{2\pi}} + O(1) + o(1)$, where the $O(1)$ term is the log of an integral, over the parameter space, of the square root of the determinant of the Fisher information (the expected Hessian of the negative log-likelihood)
On all of this, the best source is Grünwald (2007)
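
As a worked check of the definition and the asymptotic formula, here is a minimal sketch for one concrete model class, chosen as an illustration rather than taken from the slides: IID Bernoulli distributions on binary sequences, so $d = 1$ and $|\mathcal{Z}| = 2$. The function name is an assumption. For this family the $O(1)$ term works out to $\log{\pi}$, so the approximation is $\frac{1}{2}\log{\frac{n}{2\pi}} + \log{\pi}$.

```python
import math

def bernoulli_stochastic_complexity(n):
    """Exact log of the sum, over all binary sequences of length n, of the
    maximized Bernoulli likelihood. Sequences with k ones are grouped together:
    there are C(n, k) of them, each with maximized likelihood
    (k/n)^k * (1 - k/n)^(n - k)."""
    log_terms = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_max_lik = 0.0  # convention: 0 * log(0) = 0 at k = 0 and k = n
        if k > 0:
            log_max_lik += k * math.log(k / n)
        if k < n:
            log_max_lik += (n - k) * math.log((n - k) / n)
        log_terms.append(log_binom + log_max_lik)
    m = max(log_terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

for n in (1, 10, 100, 1000):
    exact = bernoulli_stochastic_complexity(n)
    approx = 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)
    print(f"n={n:>5}  exact={exact:.4f}  approx={approx:.4f}  "
          f"max possible={n * math.log(2):.4f}")
```

At $n = 1$ every data set can be given probability 1, so the exact value equals the maximum $n\log{|\mathcal{Z}|} = \log{2}$; for larger $n$ the complexity grows only logarithmically, far below that maximum.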

References

Geer, Sara A. van de. 2000. Empirical Processes in M-Estimation. Cambridge, England: Cambridge University Press.

Geman, Stuart, and Chii-Ruey Hwang. 1982. “Nonparametric Maximum Likelihood Estimation by the Method of Sieves.” Annals of Statistics 10:401–14. https://doi.org/10.1214/aos/1176345782.

Grenander, Ulf. 1981. Abstract Inference. New York: Wiley.

———. 1996. Elements of Pattern Theory. Baltimore, Maryland: Johns Hopkins University Press.

Grenander, Ulf, Y. Chow, and D. M. Keenan. 1991. Hands: A Pattern Theoretic Study of Biological Shapes. New York: Springer-Verlag.

Grenander, Ulf, and Kevin M. Manbeck. 1993. “A Stochastic Shape and Color Model for Defect Detection in Potatoes.” Journal of Computational and Graphical Statistics 2:131–51. https://doi.org/10.2307/1390696.

Grenander, Ulf, and Murray Rosenblatt. 1957. Statistical Analysis of Stationary Time Series. New York: Wiley.

Grünwald, Peter D. 2007. The Minimum Description Length Principle. Cambridge, Massachusetts: MIT Press.

Model Regularization and Model Complexity
