\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]
Some simulated regression data (dots), plus a regression tree fit to the data by growing a big tree without pruning (solid line), and another regression tree fit with the default control settings (dotted line). Notice how the more stable, dotted-line tree misses the outlier (if it is an outlier, and not a genuine feature of the data-generating process), but also misses some of the apparent structure of the regression curve (if it is real structure, and not just a pattern read into the noise).
The bagging procedure is simplicity itself:

1. Draw a bootstrap resample of the data, i.e., sample \(n\) points from the original \(n\) data points, with replacement.
2. Fit a tree (or whatever base model we like) to the resample.
3. Repeat the first two steps many times, and make predictions by averaging the resampled trees' predictions (for regression; for classification, take a majority vote).
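Here is a minimal sketch of that procedure in R, assuming a data frame `df` with a response column `y` and a single predictor `x`; the column names, the number of resamples, and the use of the `rpart` package as the tree-grower are illustrative assumptions, not something fixed by these notes.

```r
## Minimal sketch of bagging regression trees.
## Assumes a data frame `df` with columns `y` (response) and `x` (predictor);
## `rpart` stands in for whatever tree-growing routine one prefers.
library(rpart)

bag_trees <- function(df, m = 100) {
  lapply(1:m, function(k) {
    boot <- df[sample(nrow(df), replace = TRUE), ]  # bootstrap resample
    rpart(y ~ x, data = boot)                       # grow a tree on the resample
  })
}

predict_bagged <- function(trees, newdata) {
  ## average the predictions of the individual trees
  rowMeans(sapply(trees, predict, newdata = newdata))
}
```

With the running-example data in `df`, something like `predict_bagged(bag_trees(df), data.frame(x = grid))` would trace out the kind of averaged curve shown in the figures below.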
Original data for the running example (top left) and three bootstrap resamplings; in each resampling, the full data set is shown in light grey (for comparison), and the coordinates are slightly “jittered”, so that a repeatedly-sampled point appears as multiple points very close to each other.
Tree fit to the full data (top left), plus the three trees fit to the three bootstrap resamplings from the previous figure.
Original data (top left), plus the same three resamplings, with the regression function estimated by the tree fit to each data set.
Full data (points), plus regression tree fit to them (black), plus the three trees fit to bootstrap resamplings (thin blue lines), plus the average of the three bootstrapped trees, i.e., the bagged model (thick blue line). Notice how the impact of the outlier is attenuated by bagging, but the main features of the unstable-but-sensitive big tree have been preserved, and the bagged curve shows more detail than just fitting a stable-but-insensitive tree (dotted black line) — for instance, bagging picks up that the curve rises for small values of \(x\).
Making predictions using a bag of trees is equivalent to making predictions using one giant tree with a huge number of leaves. (Can you prove this, and explain why if each tree has \(r\) leaves, a forest of \(m\) trees is equivalent to one tree with \(O(r^m)\) leaves?) And we know that giant trees should be really unstable, with really high variance. So it might seem that we haven’t gained anything, but we really do get better, more stable predictions from bagging. The trick is that we don’t get our predictions by growing just any giant tree; only giant trees that arise by averaging many smaller trees are allowable. This, however, suggests that we should be really careful about what we mean when we say things like “simpler models are usually better”, or even “simpler models are usually more stable” (Domingos 1999b, 1999a).
Variance of averaging \(m\) terms of equal variance \(\sigma^2=1\), each with correlation \(\rho\) with the others. Notice that the variance declines monotonically with \(m\), but, unless \(\rho = 0\), it asymptotes to a non-zero value.
Some practicalities:
Illustration of boosting. Top left: running-example data, plus a regression tree with only three leaves. Top right: residuals from the first model, plus a three-leaf tree fit to those residuals. Bottom left: residuals from the second model, plus a three-leaf tree fit to those residuals. Bottom right: original data again, plus the sum of the three estimated trees. In practice, one would use many more than three steps of boosting.
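For concreteness, here is a minimal R sketch of that residual-refitting loop, using the same hypothetical data frame `df` (columns `y` and `x`) as before; limiting the trees through `rpart.control(maxdepth = 2)` is a stand-in for “only three leaves”, and the shrinkage factor used in serious boosting implementations is omitted.

```r
## Minimal sketch of boosting small regression trees by repeatedly
## fitting to the residuals of the model built so far.
## Assumes columns `y` and `x` in `df`; small trees are (roughly) enforced
## via maxdepth rather than an exact leaf count.
library(rpart)

boost_trees <- function(df, steps = 3, depth = 2) {
  fits <- vector("list", steps)
  resid <- df$y                       # start with the raw response
  for (k in 1:steps) {
    work <- data.frame(x = df$x, y = resid)
    fits[[k]] <- rpart(y ~ x, data = work,
                       control = rpart.control(maxdepth = depth))
    resid <- resid - predict(fits[[k]], newdata = work)  # update residuals
  }
  fits
}

predict_boosted <- function(fits, newdata) {
  ## the boosted prediction is the sum, not the average, of the trees
  rowSums(sapply(fits, predict, newdata = newdata))
}
```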
Suppose we have \(m\) random variables \(T_1, \ldots, T_m\), each with variance \(\sigma^2\), and with \(\Cov{T_k, T_l} = \rho_{kl} \sigma^2\) for \(k \neq l\). The variance of their average is \[\begin{eqnarray} \Var{\frac{1}{m}\sum_{k=1}^{m}{T_k}} & = & \frac{\sum_{k=1}^{m}{\Var{T_k}}}{m^2} + \frac{1}{m^2}\sum_{k=1}^{m}{\sum_{l\neq k}{\Cov{T_k, T_l}}}\\ & = & \frac{\sigma^2}{m} + \frac{\sigma^2}{m^2}\sum_{k=1}^{m}{\sum_{l\neq k}{\rho_{kl}}}\\ & = & \sigma^2\left(\frac{1}{m} + \frac{m-1}{m}\overline{\rho}\right) \end{eqnarray}\] where \(\overline{\rho}\) is the average of the \(\rho_{kl}\) over all pairs \(k \neq l\). Since this is a variance, it must be \(\geq 0\), which implies \[\begin{eqnarray} \frac{1}{m} + \frac{m-1}{m}\overline{\rho} & \geq & 0\\ (m-1)\overline{\rho} & \geq & -1\\ \overline{\rho} & \geq & \frac{-1}{m-1} \end{eqnarray}\]
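As \(m \to \infty\), the first term vanishes and the variance tends to \(\sigma^2 \overline{\rho}\), which is the non-zero asymptote visible in the figure above. A one-line R function (the name and the particular values of \(m\) and \(\rho\) below are illustrative only) makes this easy to check numerically:

```r
## variance of the average of m terms, each with variance sigma2 and
## average pairwise correlation rho (the formula derived above)
var_of_average <- function(m, rho, sigma2 = 1) {
  sigma2 * (1/m + ((m - 1)/m) * rho)
}

var_of_average(m = c(1, 10, 100, 1000), rho = 0.25)
## approaches rho * sigma2 = 0.25, not 0, as m grows
```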
Notes/exercises:
Berk, Richard A. 2008. Statistical Learning from a Regression Perspective. New York: Springer-Verlag.
Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24:123–40.
———. 2001. “Random Forests.” Machine Learning 45:5–32. https://doi.org/10.1023/A:1010933404324.
Domingos, Pedro. 1999a. “Process-Oriented Estimation of Generalization Error.” In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 714–19. San Francisco: Morgan Kaufmann. http://www.cs.washington.edu/homes/pedrod/papers/ijcai99.pdf.
———. 1999b. “The Role of Occam’s Razor in Knowledge Discovery.” Data Mining and Knowledge Discovery 3:409–25. http://www.cs.washington.edu/homes/pedrod/papers/dmkd99.pdf.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second. Berlin: Springer. http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
Kallenberg, Olav. 2005. Probabilistic Symmetries and Invariance Principles. New York: Springer-Verlag.
Kearns, Michael J., and Umesh V. Vazirani. 1994. An Introduction to Computational Learning Theory. Cambridge, Massachusetts: MIT Press.
Krogh, Anders, and Jesper Vedelsby. 1995. “Neural Network Ensembles, Cross Validation, and Active Learning.” In Advances in Neural Information Processing Systems 7 [NIPS 1994], edited by Gerald Tesauro, David Touretzky, and Todd Leen, 231–38. Cambridge, Massachusetts: MIT Press. https://papers.nips.cc/paper/1001-neural-network-ensembles-cross-validation-and-active-learning.
Page, Scott E. 2007. The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton, New Jersey: Princeton University Press. https://doi.org/10.2307/j.ctt7sp9c.
Schapire, Robert E., and Yoav Freund. 2012. Boosting: Foundations and Algorithms. Cambridge, Massachusetts: MIT Press.