Model Averaging

36-465/665, Spring 2021

1 April 2021 (Lecture 17)


Previously

Why are we picking one model class at all?

Model averaging

Why might averaging models help?

Model averaging uses diversity to lower risk

\[ (\mu-\overline{s})^2 = \frac{1}{q}\sum_{i=1}^{q}{(s_i - \mu)^2} - V \]
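Here \(\mu\) is the target value at a given input, \(s_1, \ldots, s_q\) are the individual models’ predictions there, \(\overline{s} = \frac{1}{q}\sum_{i=1}^{q}{s_i}\) is their simple average, and \(V = \frac{1}{q}\sum_{i=1}^{q}{(s_i - \overline{s})^2}\) is the diversity, i.e., the variance of the predictions around the ensemble. (The slide does not spell these readings out; they are the ones under which the identity holds, following Krogh and Vedelsby 1995.) The identity is just the usual variance-style expansion:

\[ \frac{1}{q}\sum_{i=1}^{q}{(s_i - \mu)^2} = \frac{1}{q}\sum_{i=1}^{q}{\left((s_i - \overline{s}) + (\overline{s} - \mu)\right)^2} = V + 2(\overline{s} - \mu)\,\frac{1}{q}\sum_{i=1}^{q}{(s_i - \overline{s})} + (\overline{s} - \mu)^2 = V + (\overline{s} - \mu)^2 \]

since the sum of deviations from the average in the cross term is identically zero; rearranging gives the identity displayed above.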

The math generalizes

Upshot of this math

\[ (\text{risk of ensemble}) = (\text{average individual risk}) - (\text{ensemble diversity}) \]
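As a quick numerical sanity check (not from the slides), the sketch below builds a toy ensemble in Python and confirms that the ensemble’s mean squared error over a test grid equals the average individual error minus the average diversity; the names and the synthetic data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensemble: q "models" predicting y(x) = sin(x), each with its own bias and noise.
q = 25
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)                                            # target mu at each x
preds = np.stack([y + rng.normal(0.0, 0.1) + rng.normal(0.0, 0.3, size=x.shape)
                  for _ in range(q)])                    # shape (q, n): individual predictions

ensemble = preds.mean(axis=0)                            # simple average, s-bar(x)

risk_ensemble = np.mean((ensemble - y) ** 2)             # risk of ensemble
avg_indiv_risk = np.mean((preds - y) ** 2)               # average individual risk
diversity = np.mean((preds - ensemble) ** 2)             # ensemble diversity, averaged over x

# The two printed numbers agree (up to floating-point rounding).
print(risk_ensemble, avg_indiv_risk - diversity)
```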

How do we get many diverse models?

How do we get many diverse models? (2)
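One standard answer, which the bagging example later in the lecture relies on, is Breiman’s (1996) recipe: refit the same learner to bootstrap resamples of the training set, so each copy sees slightly different data and makes somewhat different errors. A minimal sketch of that mechanism, assuming scikit-learn’s regression trees as the base learner (illustrative code, not the lecture’s own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_of_trees(X, y, q=100, seed=None, **tree_kwargs):
    """Fit q regression trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(q):
        idx = rng.integers(0, n, size=n)                 # draw n rows with replacement
        trees.append(DecisionTreeRegressor(**tree_kwargs).fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Ensemble prediction: the simple average of the individual trees' predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```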

What about model complexity?

\[ \overline{s}(x) = \sum_{i=1}^{q}{w_i s_i(x)} \]

(or similar for other kinds of weighted combination)
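In code, such a weighted combination is just a dot product of the weight vector with the stacked individual predictions; a minimal sketch (names are illustrative):

```python
import numpy as np

def weighted_ensemble(preds, w):
    """preds: array of shape (q, n), row i holding model i's predictions s_i(x);
    w: length-q weights (typically nonnegative and summing to 1).
    Returns the combined prediction sum_i w_i s_i(x) at each of the n points."""
    return np.asarray(w) @ np.asarray(preds)             # shape (n,)
```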

Why (sensible) model averaging doesn’t massively overfit

Real-data example using bagging of decision trees
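The lecture’s actual data set is not reproduced here; as a stand-in under the same idea, the sketch below compares a single fully grown regression tree with a bagged ensemble of trees on scikit-learn’s bundled diabetes data, where the ensemble typically shows noticeably lower held-out squared error. The data set and settings are stand-ins, not the slides’.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A single unpruned tree: low bias, high variance.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Bagging: average 200 trees, each fit to a bootstrap resample of the training set
# (BaggingRegressor's default base learner is a decision tree).
bag = BaggingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("single tree, test MSE: ", mean_squared_error(y_te, tree.predict(X_te)))
print("bagged trees, test MSE:", mean_squared_error(y_te, bag.predict(X_te)))
```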

Drawbacks to model averaging

Summing up

Backup: Sources / further reading

References

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24:123–40.

Domingos, Pedro. 1999. “The Role of Occam’s Razor in Knowledge Discovery.” Data Mining and Knowledge Discovery 3:409–25. http://www.cs.washington.edu/homes/pedrod/papers/dmkd99.pdf.

Hong, Lu, and Scott E. Page. 2004. “Groups of Diverse Problem Solvers Can Outperform Groups of High-Ability Problem Solvers.” Proceedings of the National Academy of Sciences 101:16385–9. http://www.cscs.umich.edu/~spage/pnas.pdf.

Kearns, Michael J., and Umesh V. Vazirani. 1994. An Introduction to Computational Learning Theory. Cambridge, Massachusetts: MIT Press.

Krogh, Anders, and Jesper Vedelsby. 1995. “Neural Network Ensembles, Cross Validation, and Active Learning.” In Advances in Neural Information Processing Systems 7 [NIPS 1994], edited by Gerald Tesauro, David Touretzky, and Todd Leen, 231–38. Cambridge, Massachusetts: MIT Press. https://papers.nips.cc/paper/1001-neural-network-ensembles-cross-validation-and-active-learning.

Page, Scott E. 2007. The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton, New Jersey: Princeton University Press. https://doi.org/10.2307/j.ctt7sp9c.

Schapire, Robert E., and Yoav Freund. 2012. Boosting: Foundations and Algorithms. Cambridge, Massachusetts: MIT Press.

Shalizi, Cosma Rohilla. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” Electronic Journal of Statistics 3:1039–74. https://doi.org/10.1214/09-EJS485.