From Rademacher complexity to VC dimension

36-465/665, Spring 2021

4 March 2021 (Lecture 9)

Previously

Three ways forward from Rademacher complexity

  1. Theory: calculations involving the distribution and the models
  2. Data-dependence: looking at how the models actually perform on the data
  3. Distribution-independence: using properties of the models that hold for any distribution or data set

Theory

Suppose \(\mathbb{P}\left( \|X\| \leq r \right) = 1\), and we’re using linear models, so \(s(x) = x\cdot \beta\) with \(\| \beta \| \leq b\). Then \(\hat{\mathcal{R}}_n \leq \frac{rb}{\sqrt{n}}\) for every sample, so, taking expectations, the same bound holds for \(\mathcal{R}_n = \mathbb{E}\left[ \hat{\mathcal{R}}_n \right]\)
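
A minimal simulation sketch of this bound in R, using the fact that \(\sup_{\|\beta\| \leq b} \beta \cdot v = b \|v\|\), so that \(\hat{\mathcal{R}}_n = \frac{b}{n} \mathbb{E}_{\sigma}\left[ \left\| \sum_{i} \sigma_i x_i \right\| \right]\); the constants \(n\), \(p\), \(r\), \(b\), and the seed are arbitrary illustrative choices:

```r
# Monte Carlo estimate of the empirical Rademacher complexity of
# {x . beta : ||beta|| <= b}, compared to the bound r*b/sqrt(n)
set.seed(9)                        # arbitrary seed
n <- 500; p <- 3; r <- 2; b <- 1   # illustrative constants

# n points drawn uniformly from the ball of radius r in R^p
u <- matrix(rnorm(n * p), nrow = n)
x <- (r * runif(n)^(1/p)) * u / sqrt(rowSums(u^2))

# each draw: sup over ||beta|| <= b of (1/n) sum_i sigma_i (x_i . beta)
#          = (b/n) * || sum_i sigma_i x_i ||
rad.hat <- replicate(2000, {
  sigma <- sample(c(-1, 1), size = n, replace = TRUE)
  (b / n) * sqrt(sum(colSums(sigma * x)^2))
})
mean(rad.hat)    # Monte Carlo estimate of the empirical Rademacher complexity
r * b / sqrt(n)  # the theoretical bound; the estimate should come in below it
```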

Data-dependent bounds

Distribution-free bounds

Getting a distribution-free bound for classification

Growth function

The growth function upper-bounds the Rademacher complexity

It really matters whether the growth function stays exponential

Shattering and the VC dimension

Some examples of VC dimension

Some examples of VC dimension (2)

Finite VC dimension \(\Rightarrow\) distribution-free bounds

If \(\mathrm{VCdim}(S) = d < \infty\), then for \(n \geq d\), \[ \Pi_{S}(n) \leq \left( \frac{en}{d} \right)^d = O(n^d) \] while if \(\mathrm{VCdim}(S) = \infty\), then \(\Pi_{S}(n) = 2^n\) for all \(n\)
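
To get a concrete sense of that gap, here is a minimal sketch in R (with \(d = 5\) as an arbitrary illustrative choice) tabulating the combinatorial Sauer-Shelah bound \(\sum_{i=0}^{d} \binom{n}{i}\), the relaxation \(\left( \frac{en}{d} \right)^d\), and the \(2^n\) labelings that would be achievable with infinite VC dimension:

```r
# Polynomial vs. exponential growth of the number of achievable labelings
d <- 5                        # illustrative VC dimension
n <- c(5, 10, 20, 50, 100)    # sample sizes, all >= d
cbind(n,
      sauer.sum   = sapply(n, function(m) sum(choose(m, 0:d))),  # sum_{i=0}^{d} C(n, i)
      sauer.bound = (exp(1) * n / d)^d,                          # (en/d)^d
      labelings   = 2^n)                                         # unrestricted count
```

Even at \(n = 100\), \(\left( \frac{en}{d} \right)^d\) is only about \(5 \times 10^{8}\), while \(2^{100} \approx 1.3 \times 10^{30}\)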

Note 1: What makes this a “dimension”?

Note 2: VC dimension and falsifiability

VC dimension and uniform convergence

Beyond binary classifiers with 0-1 loss

Trade-offs

Between different bounds

Between different model classes

Summing up

References

(Popper’s book is actually from 1934, but R Markdown’s bibliography processor doesn’t understand how to handle translated works)

Anthony, Martin, and Peter L. Bartlett. 1999. Neural Network Learning: Theoretical Foundations. Cambridge, England: Cambridge University Press.

Lunde, Robert, and Cosma Rohilla Shalizi. 2017. “Bootstrapping Generalization Error Bounds for Time Series.” arXiv:1711.02834. https://arxiv.org/abs/1711.02834.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. Cambridge, Massachusetts: MIT Press.

Popper, Karl R. n.d. The Logic of Scientific Discovery. London: Hutchinson.

Vidyasagar, Mathukumalli. 2003. Learning and Generalization: With Applications to Neural Networks. 2nd ed. Berlin: Springer-Verlag.