Probably Approximately Correct Learning

36-465/665, Spring 2021

9 March 2021 (Lecture 10)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \]

Previously

Decision theory

Decision theory (2)

[Figure omitted: from Katsikopoulos et al. (2020), Figure 1.3 in Section 1.4]

Decision theory (3)

Empirical risk minimization and its bounds
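To fix notation (a sketch of the usual setup in the macros above; writing the loss as a function of a data point \(Z_i\) and a model \(s\) is an assumption about the lecture's conventions, not a quotation of them):

\[ \hat{s} \equiv \argmin_{s \in \ModelClass}{\EmpRisk(s)}, \qquad \EmpRisk(s) \equiv \frac{1}{n}\sum_{i=1}^{n}{\Loss(Z_i, s)} \]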

Why do we care about \(\Gamma_n\) again?

Bounding the estimation error
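A sketch of the basic argument, assuming \(\Gamma_n \equiv \sup_{s \in \ModelClass}{\left|\EmpRisk(s) - \Risk(s)\right|}\), the maximal deviation between empirical and true risk over the model class:

\[ \Risk(\hat{s}) - \Risk(\OptimalModel) = \left(\Risk(\hat{s}) - \EmpRisk(\hat{s})\right) + \left(\EmpRisk(\hat{s}) - \EmpRisk(\OptimalModel)\right) + \left(\EmpRisk(\OptimalModel) - \Risk(\OptimalModel)\right) \leq 2\Gamma_n \]

The middle term is \(\leq 0\) because \(\hat{s}\) minimizes the empirical risk; the two outer terms are each \(\leq \Gamma_n\) by the definition of \(\Gamma_n\).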

A high-probability bound on the estimation error
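One concrete route to such a bound (a sketch under extra assumptions: \(\ModelClass\) finite, loss bounded in \([0,1]\), IID samples) is Hoeffding's inequality plus a union bound over the model class:

\[ \Prob{\Gamma_n \geq \epsilon} \leq \sum_{s \in \ModelClass}{\Prob{\left|\EmpRisk(s) - \Risk(s)\right| \geq \epsilon}} \leq 2 |\ModelClass| \myexp{-2 n \epsilon^2} \]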

Low error with high confidence
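Continuing the finite-class sketch: set the right-hand side equal to \(\alpha\) and solve for \(n\). Then

\[ n \geq \frac{1}{2\epsilon^2} \log{\frac{2|\ModelClass|}{\alpha}} \]

samples guarantee \(\Prob{\Gamma_n \geq \epsilon} \leq \alpha\), and so, by the \(2\Gamma_n\) bound above, \(\Risk(\hat{s}) \leq \Risk(\OptimalModel) + 2\epsilon\) with probability at least \(1 - \alpha\).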

Risk consistency
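One standard formulation: empirical risk minimization is risk-consistent over \(\ModelClass\) when, for every \(\epsilon > 0\),

\[ \Prob{\Risk(\hat{s}_n) - \Risk(\OptimalModel) > \epsilon} \rightarrow 0 \text{ as } n \rightarrow \infty \]

i.e., the excess risk of \(\hat{s}_n\) converges to zero in probability.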

Probably Approximately Correct
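Valiant (1984)'s criterion, translated into this course's notation (the translation is a sketch, not a quotation): a learning method is probably approximately correct when, for every \(\epsilon, \alpha > 0\), there is a finite sample size \(n(\epsilon, \alpha)\) such that for all \(n \geq n(\epsilon, \alpha)\),

\[ \Prob{\Risk(\hat{s}_n) - \Risk(\OptimalModel) > \epsilon} \leq \alpha \]

“Approximately correct” means coming within \(\epsilon\) of the best attainable risk; “probably” means doing so with probability at least \(1 - \alpha\).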

CS vs. Statistics

Our next steps

Backup: Machine learning as we know it began around 1985

To begin with learning machines: an organized system may be said to be one which transforms a certain incoming message into an outgoing message, according to some principle of transformation. If this principle of transformation is subject to a certain criterion of merit of performance, and if the method of transformation is adjusted so as to tend to improve the performance of the system according to this criterion, the system is said to learn. (Wiener 1964, 14)

Backup: The statistical learning paradigm

Backup: Ways in which one statistical learning system can be better than another

  1. Need fewer data points to get the same \(\epsilon, \alpha\) (see the worked example after this list)
  2. Can get smaller \(\epsilon\) with the same \(\alpha\) or vice versa
  3. \(\Risk(\OptimalModel)\) can be smaller, or can be small for a wider range of situations
  4. Doing the optimization to find \(\hat{s}\) is easier
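A worked example of point 1, plugging hypothetical numbers into the finite-class bound sketched earlier: with \(|\ModelClass| = 1000\), \(\epsilon = 0.01\), and \(\alpha = 0.05\),

\[ n \geq \frac{1}{2 (0.01)^2} \log{\frac{2 \times 1000}{0.05}} = 5000 \log{40000} \approx 5.3 \times 10^4 \]

A learner that can work with a smaller effective model class, or that admits a tighter deviation bound, reaches the same \((\epsilon, \alpha)\) with fewer samples.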

References

Blackwell, David, and M. A. Girshick. 1954. Theory of Games and Statistical Decisions. New York: Wiley.

Katsikopoulos, Konstantinos V., Özgür Şimşek, Marcus Buckmann, and Gerd Gigerenzer. 2020. Classification in the Wild: The Science and Art of Transparent Decision Making. Cambridge, Massachusetts: MIT Press.

Valiant, Leslie G. 1984. “A Theory of the Learnable.” Communications of the Association for Computing Machinery 27:1134–42.

Vapnik, Vladimir N., and Alexey Y. Chervonenkis. 1971. “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” Theory of Probability and Its Applications 16:264–80.

Wiener, Norbert. 1964. God and Golem, Inc.: A Comment on Certain Points Where Cybernetics Impinges on Religion. Cambridge, Massachusetts: MIT Press.