Model Averaging

36-465/665, Spring 2021

1 April 2021 (Lecture 17)


Previously

Why are we picking one model class at all?

Model averaging

Why might averaging models help?

Model averaging uses diversity to lower risk

\[ (\mu-\overline{s})^2 = \frac{1}{q}\sum_{i=1}^{q}{(s_i - \mu)^2} - V \]
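Here \(\mu\) is the target value at a given input, \(s_1, \ldots, s_q\) are the individual models’ predictions there, \(\overline{s} = \frac{1}{q}\sum_{i=1}^{q}{s_i}\) is their simple average, and \(V = \frac{1}{q}\sum_{i=1}^{q}{(s_i - \overline{s})^2}\) is the diversity, i.e., the variance of the predictions around the ensemble. (The slide does not spell these readings out; they are the ones under which the identity holds, following Krogh and Vedelsby 1995.) The identity is just the usual variance-style expansion:

\[ \frac{1}{q}\sum_{i=1}^{q}{(s_i - \mu)^2} = \frac{1}{q}\sum_{i=1}^{q}{\left((s_i - \overline{s}) + (\overline{s} - \mu)\right)^2} = V + 2(\overline{s} - \mu)\,\frac{1}{q}\sum_{i=1}^{q}{(s_i - \overline{s})} + (\overline{s} - \mu)^2 = V + (\overline{s} - \mu)^2 \]

since the sum of deviations from the average in the cross term is identically zero; rearranging gives the identity displayed above.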

The math generalizes

Upshot of this math

\[ (\text{risk of ensemble}) = (\text{average individual risk}) - (\text{ensemble diversity}) \]
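As a quick numerical sanity check (not from the slides), the sketch below builds a toy ensemble in Python and confirms that the ensemble’s mean squared error over a test grid equals the average individual error minus the average diversity; the names and the synthetic data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensemble: q "models" predicting y(x) = sin(x), each with its own bias and noise.
q = 25
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)                                            # target mu at each x
preds = np.stack([y + rng.normal(0.0, 0.1) + rng.normal(0.0, 0.3, size=x.shape)
                  for _ in range(q)])                    # shape (q, n): individual predictions

ensemble = preds.mean(axis=0)                            # simple average, s-bar(x)

risk_ensemble = np.mean((ensemble - y) ** 2)             # risk of ensemble
avg_indiv_risk = np.mean((preds - y) ** 2)               # average individual risk
diversity = np.mean((preds - ensemble) ** 2)             # ensemble diversity, averaged over x

# The two printed numbers agree (up to floating-point rounding).
print(risk_ensemble, avg_indiv_risk - diversity)
```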

How do we get many diverse models?

How do we get many diverse models? (2)
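One standard answer, which the bagging example later in the lecture relies on, is Breiman’s (1996) recipe: refit the same learner to bootstrap resamples of the training set, so each copy sees slightly different data and makes somewhat different errors. A minimal sketch of that mechanism, assuming scikit-learn’s regression trees as the base learner (illustrative code, not the lecture’s own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_of_trees(X, y, q=100, seed=None, **tree_kwargs):
    """Fit q regression trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(q):
        idx = rng.integers(0, n, size=n)                 # draw n rows with replacement
        trees.append(DecisionTreeRegressor(**tree_kwargs).fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Ensemble prediction: the simple average of the individual trees' predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```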

What about model complexity?

\[ \overline{s}(x) = \sum_{i=1}^{q}{w_i s_i(x)} \]

(or similar for other kinds of weighted combination)
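In code, such a weighted combination is just a dot product of the weight vector with the stacked individual predictions; a minimal sketch (names are illustrative):

```python
import numpy as np

def weighted_ensemble(preds, w):
    """preds: array of shape (q, n), row i holding model i's predictions s_i(x);
    w: length-q weights (typically nonnegative and summing to 1).
    Returns the combined prediction sum_i w_i s_i(x) at each of the n points."""
    return np.asarray(w) @ np.asarray(preds)             # shape (n,)
```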

Why (sensible) model averaging doesn’t massively overfit

Real-data example using bagging of decision trees
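The lecture’s actual data set is not reproduced here; as a stand-in under the same idea, the sketch below compares a single fully grown regression tree with a bagged ensemble of trees on scikit-learn’s bundled diabetes data, where the ensemble typically shows noticeably lower held-out squared error. The data set and settings are stand-ins, not the slides’.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A single unpruned tree: low bias, high variance.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Bagging: average 200 trees, each fit to a bootstrap resample of the training set
# (BaggingRegressor's default base learner is a decision tree).
bag = BaggingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("single tree, test MSE: ", mean_squared_error(y_te, tree.predict(X_te)))
print("bagged trees, test MSE:", mean_squared_error(y_te, bag.predict(X_te)))
```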

Drawbacks to model averaging

Summing up

Backup: Sources / further reading

References

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24:123–40.

Domingos, Pedro. 1999. “The Role of Occam’s Razor in Knowledge Discovery.” Data Mining and Knowledge Discovery 3:409–25. http://www.cs.washington.edu/homes/pedrod/papers/dmkd99.pdf.

Hong, Lu, and Scott E. Page. 2004. “Groups of Diverse Problem Solvers Can Outperform Groups of High-Ability Problem Solvers.” Proceedings of the National Academy of Sciences 101:16385–9. http://www.cscs.umich.edu/~spage/pnas.pdf.

Kearns, Michael J., and Umesh V. Vazirani. 1994. An Introduction to Computational Learning Theory. Cambridge, Massachusetts: MIT Press.

Krogh, Anders, and Jesper Vedelsby. 1995. “Neural Network Ensembles, Cross Validation, and Active Learning.” In Advances in Neural Information Processing Systems 7 [NIPS 1994], edited by Gerald Tesauro, David Touretzky, and Todd Leen, 231–38. Cambridge, Massachusetts: MIT Press. https://papers.nips.cc/paper/1001-neural-network-ensembles-cross-validation-and-active-learning.

Page, Scott E. 2007. The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton, New Jersey: Princeton University Press. https://doi.org/10.2307/j.ctt7sp9c.

Schapire, Robert E., and Yoav Freund. 2012. Boosting: Foundations and Algorithms. Cambridge, Massachusetts: MIT Press.

Shalizi, Cosma Rohilla. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” Electronic Journal of Statistics 3:1039–74. https://doi.org/10.1214/09-EJS485.