Bootstrap
36-462/662, Spring 2020
2 April 2020 (Lecture 21)
The Big Picture
\[
\newcommand{\Expect}[1]{\mathbf{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)}
\newcommand{\Probwrt}[2]{\mathrm{Pr}_{#2}\left( #1 \right)}
\]
- Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
- The bootstrap principle: approximate the sampling distribution by simulating from a good model of the data, and treating the simulated data just like the real data
- Sometimes we simulate from the model we’re estimating (model-based or “parametric” bootstrap)
- Sometimes we simulate by re-sampling the original data (resampling or “nonparametric” bootstrap)
- Stronger assumptions \(\Rightarrow\) less uncertainty if we’re right
Statistical Uncertainty
- Re-run the experiment (survey, census, …) and get different data
- \(\therefore\) everything we calculate from data (estimates, test statistics, \(p\)-values, policy recommendations, …) will change from run to run
- This variability is (the source of) statistical uncertainty
- Quantifying this = honesty about what we actually know
Measures of Uncertainty
- Standard error = standard deviation of an estimator
- (could equally well use median absolute deviation, etc.)
- \(p\)-value = Probability we’d see a signal this big if there were just noise
- Confidence region = All the parameter values we can’t reject at low error rates
- Either the true parameter is in the confidence region
- or we are very unlucky
- or our model is wrong
- etc., etc.
The Sampling Distribution Is the Source of All Knowledge
- Data \(X \sim P_X\) for some unknown true distribution \(P_X\)
- We calculate a statistic \(T = \tau(X)\) so it has distribution \(P_{T}\)
- If we knew \(P_{T}\), we could calculate
- \(\Var{T}\) (and so standard error)
- \(\Expect{T}\) (and so bias)
- quantiles (and so confidence intervals or \(p\)-values), etc.
The Difficulties
- Difficulty 1: Most of the time, \(P_{X}\) is very complicated
- Difficulty 2: Most of the time, \(\tau\) is a very complicated function
- \(\therefore\) We couldn’t solve for \(P_T\) even if we knew \(P_{X}\)
- Difficulty 3: Actually, we don’t know \(P_X\)
- Upshot: We really don’t know \(P_{T}\) and can’t use it to calculate anything
The Solutions
Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)
Modern (\(\approx 1975\)–): Use complex models and statistics, but simulate calculating the statistic on the model
The Monte Carlo Principle
- Generate a simulation \(\tilde{X}\) from \(P_X\)
- Set \(\tilde{T} = \tau(\tilde{X})\)
- Repeat many times
- Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T}\)
- (As a general method, invented by Enrico Fermi in the 1930s, spread through the Manhattan Project)
- Still needs \(P_X\)
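For instance, a minimal R sketch, pretending for illustration that \(P_X\) is known to be Exponential(1) and that \(\tau\) is the median:

```r
# Monte Carlo: draw from P_X, apply tau, repeat, look at the distribution
rPX <- function(n) { rexp(n, rate = 1) }   # stand-in for a known P_X
tau <- median                              # the statistic of interest
tilde.T <- replicate(1000, tau(rPX(100)))  # 1000 simulated values of T
sd(tilde.T)                                # approximate standard error
quantile(tilde.T, c(0.025, 0.975))         # approximate quantiles of P_T
```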
The Bootstrap Principle
- Find a good estimate \(\hat{P}\) for \(P_{X}\)
- Generate a simulation \(\tilde{X}\) from \(\hat{P}\), set \(\tilde{T} = \tau(\tilde{X})\)
- Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T}\)
- “Pull yourself up by your bootstraps”: use \(\hat{P}\) to get at uncertainty in itself
- Invented by Bradley Efron in the 1970s
- First step: find a good estimate \(\hat{P}\) for \(P_{X}\)
Model-based Bootstrap
If we are using a model, our best guess at \(P_{X}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters
The Model-based Bootstrap
- Get data \(X\), estimate \(\hat{\theta}\) from \(X\)
- Repeat \(b\) times:
- Simulate \(\tilde{X}\) from \(P_{X,\hat{\theta}}\) (simulate data of same size/“shape” as real data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use empirical distribution of \(\tilde{T}\) as \(P_{T}\)
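In code, the scheme needs just two interchangeable pieces, a simulator and an estimator; a minimal R sketch (the function names are illustrative):

```r
# Generic bootstrap machine: simulate a data set, re-run the estimate, repeat
rboot <- function(statistic, simulator, b) {
  replicate(b, statistic(simulator()))
}
# Model-based bootstrap: simulator() draws a data set of the same size/shape
# as the real one from the fitted model, and statistic() treats it exactly
# like the real data
```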
Resampling
- Difficulty: We might not have a trustworthy model
- This is usually the case in data mining
- Resource: We do have data, which tells us a lot about the distribution
- We usually have a lot of data to mine
- Solution: Resampling, treat the sample like a whole population
The Resampling (“Nonparametric”) Bootstrap
- Get data \(X\), containing \(n\) samples, find \(T=\tau(X)\)
- Repeat \(b\) times:
- Generate \(\tilde{X}\) by drawing \(n\) samples from \(X\) with replacement (resample the data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use empirical distribution of \(\tilde{T}\) as \(P_{T}\)
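The only new ingredient is the resampler; a sketch in R:

```r
# Draw as many observations as we have, with replacement
resample <- function(x) {
  sample(x, size = length(x), replace = TRUE)
}
# For a data frame, resample whole rows, not individual entries
resample.rows <- function(df) {
  df[resample(1:nrow(df)), ]
}
# Then reuse the generic machine: rboot(tau, function() resample(x), b)
```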
Bootstrapping Regressions
- A regression is a model for \(Y\) conditional on \(X\) \[
Y= \mu(X) + \epsilon, ~ \Expect{\epsilon|X} = 0
\]
- This is silent on the distribution of \(X\), so how do we simulate?
- Options, putting less and less trust in the model:
- Hold \(x_i\) fixed, set \(\tilde{y}_i = \hat{\mu}(x_i) + \tilde{\epsilon}_i\), with \(\tilde{\epsilon}_i\) drawn from the model’s estimated noise distribution (e.g., Gaussian or \(t\))
- Hold \(x_i\) fixed, set \(\tilde{y}_i = \hat{\mu}(x_i) + \tilde{\epsilon}_i\), with \(\tilde{\epsilon}_i\) resampled from the residuals
- Resample \((x_i, y_i)\) pairs (resample data-points or resample cases)
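Sketches of the three options in R, assuming a fitted model `fit <- lm(y ~ x, data = df)` (the names `fit`, `df`, `x`, `y` are illustrative):

```r
fit <- lm(y ~ x, data = df)  # illustrative fitted regression

# Option 1: keep the x's, add noise from the estimated parametric distribution
sim.parametric <- function() {
  data.frame(x = df$x,
             y = fitted(fit) + rnorm(nrow(df), 0, summary(fit)$sigma))
}
# Option 2: keep the x's, add noise resampled from the residuals
sim.residuals <- function() {
  data.frame(x = df$x,
             y = fitted(fit) + sample(residuals(fit), replace = TRUE))
}
# Option 3: resample whole cases (rows)
sim.cases <- function() {
  df[sample(1:nrow(df), replace = TRUE), ]
}
```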
Different Regression Bootstraps
- Resampling residuals works as long as the noise is IID
- Noise could be Gaussian…
- Or noise could be very non-Gaussian
- Noise does need to have the same distribution everywhere
- Dubious unless the regression model is right
- Resampling whole cases works as long as observations are IID
- noise needn’t be independent of \(X\)
- needn’t be Gaussian
- regression model needn’t be right
What About Classification?
- If the model predicts \(\Prob{Y=1|X=x}\), then we can use that to generate new \(Y\) values
- This is like simply simulating a regression model with its specified noise distribution
- We can also resample whole cases
- There isn’t such a good counterpart to resampling residuals
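A sketch for a binary classifier, assuming `fit` comes from `glm(..., family = binomial)`:

```r
# Keep the x's; draw new labels from the estimated conditional probabilities
p.hat <- predict(fit, type = "response")  # estimated Pr(Y = 1 | X = x_i)
y.tilde <- rbinom(length(p.hat), size = 1, prob = p.hat)
```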
Using the Bootstrap to Improve Prediction
- Bootstrap averaging, or bagging
- Draw a bootstrap sample and fit the model to it, getting say \(\tilde{M}\)
- Repeat \(b\) times to get \(\tilde{M}_1, \ldots \tilde{M}_b\)
- At a new point \(x_0\), predict the average of \(\tilde{M}_1(x_0), \ldots, \tilde{M}_b(x_0)\) (see the sketch after this list)
- Doesn’t help if the model is linear in the data
- Can help for nonlinear models (trees, nearest neighbors, kernel methods, neural networks…)
- Random forests: bagging for trees, with one twist
- Every time we’re considering a split, we look at a random subset of the features (sometimes \(\sqrt{p}\) of the features, sometimes just 1)
- Increases the diversity of the trees across the ensemble
- Doesn’t apply to techniques other than trees, because they don’t have a notion of “split on this variable at this step”
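A sketch of bagging in R, using `rpart` regression trees as the base learner (an arbitrary choice; any unstable learner works):

```r
library(rpart)  # recursive-partitioning trees

# Fit the same tree-growing procedure to b resampled data sets
bag.trees <- function(form, df, b = 100) {
  lapply(1:b, function(i) {
    rpart(form, data = df[sample(1:nrow(df), replace = TRUE), ])
  })
}
# Predict at new points by averaging across the ensemble
bagged.predict <- function(models, newdata) {
  rowMeans(sapply(models, predict, newdata = newdata))
}
```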
Bootstrap Estimates of Risk
- Obvious approach:
- Estimate the model on a bootstrap sample
- Then calculate the average loss on the full sample
- Repeat \(b\) times and average
- Drawback: every point in the training set was also in the testing set, so this is optimistically biased
- Better: evaluate the model on points not in its bootstrap sample
- Even better: for each point \(i\), calculate the average error of predicting \(i\) when the bootstrap sample didn’t include that point, and then average over data points (drop-one bootstrap; see the sketch below)
- Even more complicated approaches are possible (Hastie, Tibshirani, and Friedman 2009, sec. 7.11)
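A sketch of the drop-one idea in R; `estimator`, `predictor`, and `loss` are illustrative placeholders for fitting a model, predicting with it, and computing per-point losses:

```r
drop.one.boot <- function(df, estimator, predictor, loss, b = 200) {
  n <- nrow(df)
  losses <- matrix(NA, nrow = b, ncol = n)
  for (j in 1:b) {
    in.bag <- sample(1:n, size = n, replace = TRUE)
    fit <- estimator(df[in.bag, ])
    out <- setdiff(1:n, in.bag)  # points this bootstrap sample missed
    losses[j, out] <- loss(df[out, ], predictor(fit, df[out, ]))
  }
  # average each point's loss over the fits that never saw it, then average
  mean(colMeans(losses, na.rm = TRUE), na.rm = TRUE)
}
```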
Yet Another Bootstrap Scheme
(After Lunde (2018); Lunde and Shalizi (2017))
- Simulate, from \(\hat{P}\), a training set of size \(n\) and a test set of size \(N \gg n\)
- Find the average loss on the training set, the average loss on the test set, and the difference between them, \(\tilde{\gamma}\)
- Repeat that \(b\) times, getting \(\tilde{\gamma}_1, \ldots \tilde{\gamma}_b\)
- Find the \(\alpha\) percentile of the distribution of \(\tilde{\gamma}\), call this \(\tilde{\delta}_{\alpha}\)
- Fit the model to the full real data set, with empirical risk \(\hat{r}\)
- With high probability (\(\approx \alpha\)), the true risk \(\leq \hat{r} + \tilde{\delta}_{\alpha}\)
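A sketch of the scheme; `simulate`, `train`, and `avg.loss` are illustrative placeholders (e.g., `simulate` could resample rows of the real data):

```r
risk.bound.delta <- function(simulate, train, avg.loss, n, N, alpha, b = 200) {
  gamma.tilde <- replicate(b, {
    train.set <- simulate(n)
    test.set <- simulate(N)  # much larger than the training set
    fit <- train(train.set)
    avg.loss(fit, test.set) - avg.loss(fit, train.set)
  })
  quantile(gamma.tilde, alpha)  # delta.tilde_alpha
}
# Then: true risk <= empirical risk + delta.tilde_alpha, w.p. about alpha
```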
Sources of Error in Bootstrapping
- **Simulation**: Using only \(b\) bootstrap replicates
- Make this small by letting \(b\rightarrow\infty\)
- Costs computing time
- Diminishing returns: error is generally \(O(1/\sqrt{b})\)
- There are tricks for speeding up simulations, re-using parts of them, etc.
- **Approximation**: Using \(\hat{P}\) instead of \(P_{X}\)
- Make this small by careful statistical modeling
- **Estimation**: Only a finite number of samples
- Make this small by being careful about what we simulate (example at end of slides)
- For fixed \(n\), resampling generally shows more uncertainty than model-based bootstrap
- Resampling is less vulnerable to modeling mistakes
Summing Up
- We need to know the sampling distribution of our statistic \(T = \tau(X)\)
- If we could simulate \(\tilde{X}\) from the true distribution of \(X\), we’d use the distribution of \(\tau(\tilde{X})\) as the sampling distribution (“Monte Carlo”)
- We simulate from an estimate of the distribution of \(X\) instead (“bootstrap”)
- The simulation could be based on a model
- Or on re-sampling the observations
- Or some combination
Backup: Improving on the Crude Confidence Interval
The crude CI uses the distribution of \(\tilde{\theta}\) under \(\hat{\theta}\)
But really we want the distribution of \(\hat{\theta}\) under \(\theta_0\)
Mathematical observation: Generally speaking, the distribution of \(\tilde{\theta} - \hat{\theta}\) is closer to that of \(\hat{\theta}-\theta_0\) than the distribution of \(\tilde{\theta}\) is to that of \(\hat{\theta}\)
(“centering” or “pivoting”)
Backup: The Basic, Pivotal CI
Let \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) be the \(\alpha/2\) and \(1-\alpha/2\) quantiles of \(\tilde{\theta}\). Then
\[
\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0}
\end{eqnarray*}
\]
Basically: re-center the simulations around the empirical data
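In R, with `theta.hat` the estimate on the real data and `theta.tilde` the vector of bootstrap re-estimates (a sketch):

```r
alpha <- 0.05
q <- quantile(theta.tilde, c(alpha/2, 1 - alpha/2))
# Pivotal/basic CI: flip the quantiles around 2*theta.hat
c(lower = 2*theta.hat - q[2], upper = 2*theta.hat - q[1])
```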
Backup: Extended example of model-based bootstrapping, with cats
Example: Is Karakedi overweight?

Example (cont’d.)
## Sex Bwt Hwt
## F:47 Min. :2.000 Min. : 6.30
## M:97 1st Qu.:2.300 1st Qu.: 8.95
## Median :2.700 Median :10.10
## Mean :2.724 Mean :10.63
## 3rd Qu.:3.025 3rd Qu.:12.12
## Max. :3.900 Max. :20.50
## [1] 3.521869
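Output like the above comes from code along these lines, using the `cats` data from the MASS package; the last number is the 95th percentile of body weight under a fitted Gaussian (a sketch):

```r
library(MASS)  # cats: sex, body weight (kg), heart weight (g)
data(cats)
summary(cats)
# 95th percentile of body weight, if body weights are Gaussian
qnorm(0.95, mean = mean(cats$Bwt), sd = sd(cats$Bwt))
```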
Example (cont’d.)
Simulate from the fitted Gaussian; bundle up estimating the 95th percentile into a function
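A sketch (the function names are illustrative):

```r
# Simulate a sample of body weights from the fitted Gaussian
rcats.gaussian <- function() {
  rnorm(nrow(cats), mean = mean(cats$Bwt), sd = sd(cats$Bwt))
}
# Estimate the 95th percentile, Gaussian-style, on any sample
est.perc95.gaussian <- function(x) {
  qnorm(0.95, mean = mean(x), sd = sd(x))
}
```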
Example (cont’d.)
Simulate, plot the sampling distribution from the simulations
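A sketch, reusing the two functions above:

```r
tilde.percs <- replicate(1000, est.perc95.gaussian(rcats.gaussian()))
hist(tilde.percs, breaks = 50, probability = TRUE, main = "",
     xlab = "Estimated 95th percentile of body weight (kg)")
```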

Example (cont’d.)
Find standard error and a (crude) confidence interval
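A sketch, with `tilde.percs` as above:

```r
sd(tilde.percs)                         # bootstrap standard error
quantile(tilde.percs, c(0.025, 0.975))  # crude 95% confidence interval
```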
## [1] 0.06010144
## 2.5% 97.5%
## 3.399080 3.628943
Model Checking
As always, if the model isn’t right, relying on the model is asking for trouble
Is the Gaussian a good model for cats’ weights?
Example (re-cont’d.)
Compare histogram to fitted Gaussian density and to a smooth density estimate
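A sketch of such a comparison plot:

```r
hist(cats$Bwt, probability = TRUE, main = "", xlab = "Body weight (kg)")
curve(dnorm(x, mean = mean(cats$Bwt), sd = sd(cats$Bwt)), add = TRUE)
lines(density(cats$Bwt), lty = "dashed")  # smooth density estimate
```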

Example, Take 2
Model-free estimate of the 95th percentile = 95th percentile of the data
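In R:

```r
quantile(cats$Bwt, 0.95)
```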
## 95%
## 3.6
How precise is that?
Example, Take 2
Resampling, re-estimating, and finding the sampling distribution
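A sketch, reusing `resample()` from earlier:

```r
boot.percs <- replicate(1000, quantile(resample(cats$Bwt), 0.95))
hist(boot.percs, breaks = 50, probability = TRUE, main = "",
     xlab = "Re-estimated 95th percentile of body weight (kg)")
```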
Example, Take 2

Example, Take 2 (cont’d)
Standard error, bias, and (crude) CI:
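A sketch, with `boot.percs` as above:

```r
sd(boot.percs)                               # standard error
mean(boot.percs) - quantile(cats$Bwt, 0.95)  # bias estimate
quantile(boot.percs, c(0.025, 0.975))        # crude 95% CI
```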
## [1] 0.07919035
## [1] -0.020185
## 2.5% 97.5%
## 3.400 3.785
Cats’ Hearts
Now we regress heart weight (Hwt, in grams) on body weight (Bwt, in kilograms)
Cats’ Hearts (cont’d)
- Coefficients and standard confidence intervals:
## (Intercept) Bwt
## -0.3566624 4.0340627
## 2.5 % 97.5 %
## (Intercept) -1.725163 1.011838
## Bwt 3.539343 4.528782
- These assume IID Gaussian noise
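The fit behind these numbers (a sketch):

```r
cats.lm <- lm(Hwt ~ Bwt, data = cats)
coefficients(cats.lm)
confint(cats.lm)  # Gaussian-theory confidence intervals
```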
Cats’ Hearts (cont’d)
The residuals don’t look very Gaussian: 
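One way to see this (a sketch): a normal Q-Q plot of the residuals.

```r
qqnorm(residuals(cats.lm))
qqline(residuals(cats.lm))
```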
Cats’ Hearts (cont’d)
Resample residuals:
Re-estimate on new data:
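A sketch of both steps (function names illustrative; `resample()` as before):

```r
# Resample residuals: keep the x's, jitter fitted values with resampled noise
sim.cats.resids <- function() {
  data.frame(Bwt = cats$Bwt,
             Hwt = fitted(cats.lm) + resample(residuals(cats.lm)))
}
# Re-estimate: treat the simulated data exactly like the real data
coefs.cats.lm <- function(df) {
  coefficients(lm(Hwt ~ Bwt, data = df))
}
```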
Cats’ Hearts (cont’d)
Re-sample to get CIs:
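A sketch:

```r
tilde.coefs <- replicate(1000, coefs.cats.lm(sim.cats.resids()))
t(apply(tilde.coefs, 1, quantile, probs = c(0.025, 0.975)))
```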
## 2.5% 97.5%
## (Intercept) -1.709555 0.9892904
## Bwt 3.572479 4.5158109
Cats’ Hearts (cont’d)
Try resampling whole rows:
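A sketch, reusing `coefs.cats.lm()`:

```r
# Resample whole rows (cases) of the data frame, then re-fit
sim.cats.cases <- function() {
  cats[sample(1:nrow(cats), replace = TRUE), ]
}
case.coefs <- replicate(1000, coefs.cats.lm(sim.cats.cases()))
t(apply(case.coefs, 1, quantile, probs = c(0.025, 0.975)))
```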
## 2.5% 97.5%
## (Intercept) -1.866811 1.219944
## Bwt 3.455298 4.598331
Exercise
- We now have three sets of confidence intervals
- “Conventional”/Gaussian:
## 2.5 % 97.5 %
## (Intercept) -1.725163 1.011838
## Bwt 3.539343 4.528782
- By resampling residuals:
## 2.5% 97.5%
## (Intercept) -1.709555 0.9892904
## Bwt 3.572479 4.5158109
- By resampling rows/cases:
## 2.5% 97.5%
## (Intercept) -1.866811 1.219944
## Bwt 3.455298 4.598331
- Why do the intervals keep getting wider?
- Which of these is most reliable?
References
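Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd ed. New York: Springer.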
Lunde, Robert. 2018. “Bootstrapping and Sample Splitting Under Weak Dependence.” PhD thesis, Carnegie Mellon University.
Lunde, Robert, and Cosma Rohilla Shalizi. 2017. “Bootstrapping Generalization Error Bounds for Time Series.” arxiv:1711.02834. https://arxiv.org/abs/1711.02834.