Probably Approximately Correct Learning

36-465/665, Spring 2021

9 March 2021 (Lecture 10)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{r}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \]

Previously

Decision theory

Decision theory (2)

[Figure omitted: from Katsikopoulos et al. (2020), Figure 1.3 in Section 1.4]

Decision theory (3)

Empirical risk minimization and its bounds
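To fix notation (a sketch of the usual setup in the macros above; writing the loss as a function of a data point \(Z_i\) and a model \(s\) is an assumption about the lecture's conventions, not a quotation of them):

\[ \hat{s} \equiv \argmin_{s \in \ModelClass}{\EmpRisk(s)}, \qquad \EmpRisk(s) \equiv \frac{1}{n}\sum_{i=1}^{n}{\Loss(Z_i, s)} \]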

Why do we care about \(\Gamma_n\) again?

Bounding the estimation error
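A sketch of the basic argument, assuming \(\Gamma_n \equiv \sup_{s \in \ModelClass}{\left|\EmpRisk(s) - \Risk(s)\right|}\), the maximal deviation between empirical and true risk over the model class:

\[ \Risk(\hat{s}) - \Risk(\OptimalModel) = \left(\Risk(\hat{s}) - \EmpRisk(\hat{s})\right) + \left(\EmpRisk(\hat{s}) - \EmpRisk(\OptimalModel)\right) + \left(\EmpRisk(\OptimalModel) - \Risk(\OptimalModel)\right) \leq 2\Gamma_n \]

The middle term is \(\leq 0\) because \(\hat{s}\) minimizes the empirical risk; the two outer terms are each \(\leq \Gamma_n\) by the definition of \(\Gamma_n\).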

A high-probability bound on the estimation error
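One concrete route to such a bound (a sketch under extra assumptions: \(\ModelClass\) finite, loss bounded in \([0,1]\), IID samples) is Hoeffding's inequality plus a union bound over the model class:

\[ \Prob{\Gamma_n \geq \epsilon} \leq \sum_{s \in \ModelClass}{\Prob{\left|\EmpRisk(s) - \Risk(s)\right| \geq \epsilon}} \leq 2 |\ModelClass| \myexp{-2 n \epsilon^2} \]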

Low error with high confidence
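Continuing the finite-class sketch: set the right-hand side equal to \(\alpha\) and solve for \(n\). Then

\[ n \geq \frac{1}{2\epsilon^2} \log{\frac{2|\ModelClass|}{\alpha}} \]

samples guarantee \(\Prob{\Gamma_n \geq \epsilon} \leq \alpha\), and so, by the \(2\Gamma_n\) bound above, \(\Risk(\hat{s}) \leq \Risk(\OptimalModel) + 2\epsilon\) with probability at least \(1 - \alpha\).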

Risk consistency
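One standard formulation: empirical risk minimization is risk-consistent over \(\ModelClass\) when, for every \(\epsilon > 0\),

\[ \Prob{\Risk(\hat{s}_n) - \Risk(\OptimalModel) > \epsilon} \rightarrow 0 \text{ as } n \rightarrow \infty \]

i.e., the excess risk of \(\hat{s}_n\) converges to zero in probability.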

Probably Approximately Correct
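Valiant (1984)'s criterion, translated into this course's notation (the translation is a sketch, not a quotation): a learning method is probably approximately correct when, for every \(\epsilon, \alpha > 0\), there is a finite sample size \(n(\epsilon, \alpha)\) such that for all \(n \geq n(\epsilon, \alpha)\),

\[ \Prob{\Risk(\hat{s}_n) - \Risk(\OptimalModel) > \epsilon} \leq \alpha \]

“Approximately correct” means coming within \(\epsilon\) of the best attainable risk; “probably” means doing so with probability at least \(1 - \alpha\).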

CS vs. Statistics

Our next steps

Backup: Machine learning as we know it began around 1985

To begin with learning machines: an organized system may be said to be one which transforms a certain incoming message into an outgoing message, according to some principle of transformation. If this principle of transformation is subject to a certain criterion of merit of performance, and if the method of transformation is adjusted so as to tend to improve the performance of the system according to this criterion, the system is said to learn. (Wiener 1964, 14)

Backup: The statistical learning paradigm

Backup: Ways in which one statistical learning system can be better than another

  1. Need fewer data points to get the same \(\epsilon, \alpha\) (see the worked example after this list)
  2. Can get smaller \(\epsilon\) with the same \(\alpha\) or vice versa
  3. \(\Risk(\OptimalModel)\) can be smaller, or can be small for a wider range of situations
  4. Doing the optimization to find \(\hat{s}\) is easier
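A worked example of point 1, plugging hypothetical numbers into the finite-class bound sketched earlier: with \(|\ModelClass| = 1000\), \(\epsilon = 0.01\), and \(\alpha = 0.05\),

\[ n \geq \frac{1}{2 (0.01)^2} \log{\frac{2 \times 1000}{0.05}} = 5000 \log{40000} \approx 5.3 \times 10^4 \]

A learner that can work with a smaller effective model class, or that admits a tighter deviation bound, reaches the same \((\epsilon, \alpha)\) with fewer samples.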

References

Blackwell, David, and M. A. Girshick. 1954. Theory of Games and Statistical Decisions. New York: Wiley.

Katsikopoulos, Konstantinos V., Özgür Şimşek, Marcus Buckmann, and Gerd Gigerenzer. 2020. Classification in the Wild: The Science and Art of Transparent Decision Making. Cambridge, Massachusetts: MIT Press.

Valiant, Leslie G. 1984. “A Theory of the Learnable.” Communications of the Association for Computing Machinery 27:1134–42.

Vapnik, Vladimir N., and Alexey Y. Chervonenkis. 1971. “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” Theory of Probability and Its Applications 16:264–80.

Wiener, Norbert. 1964. God and Golem, Inc.: A Comment on Certain Points Where Cybernetics Impinges on Religion. Cambridge, Massachusetts: MIT Press.