Bootstrap
36-402, Section A
5 February 2019
The Big Picture
\[
\newcommand{\Expect}[1]{\mathbf{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)}
\newcommand{\Probwrt}[2]{\mathrm{Pr}_{#2}\left( #1 \right)}
\]
- Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
- The bootstrap principle: approximate the sampling distribution by simulating from a good model of the data, and treating the simulated data just like the real data
- Sometimes we simulate from the model we’re estimating (model-based or “parametric” bootstrap)
- Sometimes we simulate by re-sampling the original data (resampling or “nonparametric” bootstrap)
- Stronger assumptions \(\Rightarrow\) less uncertainty if we’re right
Statistical Uncertainty
- Re-run the experiment (survey, census, …) and get different data
- \(\therefore\) everything we calculate from data (estimates, test statistics, \(p\)-values, policy recommendations, …) will change from run to run
- This variability is (the source of) statistical uncertainty
- Quantifying this = honesty about what we actually know
Measures of Uncertainty
- Standard error = standard deviation of an estimator
- (could equally well use median absolute deviation, etc.)
- \(p\)-value = Probability we’d see a signal this big if there were just noise
- Confidence region = All the parameter values we can’t reject at low error rates
- Either the true parameter is in the confidence region
- or we are very unlucky
- or our model is wrong
- etc., etc.
The Sampling Distribution Is the Source of All Knowledge
- Data \(X \sim P_X\) for some unknown true distribution \(P_X\)
- We calculate a statistic \(T = \tau(X)\) so it has distribution \(P_{T}\)
- If we knew \(P_{T}\), we could calculate
- \(\Var{T}\) (and so standard error)
- \(\Expect{T}\) (and so bias)
- quantiles (and so confidence intervals or \(p\)-values), etc.
The Difficulties
- Difficulty 1: Most of the time, \(P_{X}\) is very complicated
- Difficulty 2: Most of the time, \(\tau\) is a very complicated function
- \(\therefore\) We couldn’t solve for \(P_T\)
- Difficulty 3: Actually, we don’t know \(P_X\)
- Upshot: We really don’t know \(P_{T}\) and can’t use it to calculate anything
The Solutions
Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)
Modern (\(\approx 1975\)–): Use complex models and statistics, but simulate calculating the statistic under the model
The Monte Carlo Principle
- Generate a simulation \(\tilde{X}\) from \(P_X\)
- Set \(\tilde{T} = \tau(\tilde{X})\)
- Repeat many times
- Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T}\)
- (As a general method, invented by Enrico Fermi in the 1930s, spread through the Manhattan Project)
- Still needs \(P_X\)
- Works in HW 3 because we’re testing a fixed model
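A minimal sketch of the Monte Carlo principle in R, assuming (purely for illustration) that \(P_X\) is known to be a standard Gaussian, \(n = 100\), and the statistic is the median:

```r
# Monte Carlo: simulate from the *known* P_X, apply the statistic, repeat
set.seed(402)                                # for reproducibility
tau <- function(x) { median(x) }             # the statistic T = tau(X)
T.tilde <- replicate(1000, tau(rnorm(100)))  # 1000 draws of T under P_X
sd(T.tilde)  # approximates the standard error of the median (about 0.125)
```

The key point is that `rnorm` plays the role of \(P_X\); the bootstrap replaces it with a distribution estimated from data.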
The Bootstrap Principle
- Find a good estimate \(\hat{P}\) for \(P_{X}\)
- Generate a simulation \(\tilde{X}\) from \(\hat{P}\), set \(\tilde{T} = \tau(\tilde{X})\)
- Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T}\)
- “Pull yourself up by your bootstraps”: use \(\hat{P}\) to get at uncertainty in itself
- Invented by Bradley Efron in the 1970s
- First step: find a good estimate \(\hat{P}\) for \(P_{X}\)
Model-based Bootstrap
If we are using a model, our best guess at \(P_{X}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters
The Model-based Bootstrap
- Get data \(X\), estimate \(\hat{\theta}\) from \(X\)
- Repeat \(b\) times:
- Simulate \(\tilde{X}\) from \(P_{X,\hat{\theta}}\) (simulate data of same size/“shape” as real data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use empirical distribution of \(\tilde{T}\) as \(P_{T}\)
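In R, the loop above might be sketched as follows; the Gaussian model and 95th-percentile statistic anticipate the cat example below, but the data here are synthetic stand-ins:

```r
set.seed(36)
x <- rnorm(144, mean = 2.7, sd = 0.5)    # stand-in for the real data
theta.hat <- c(m = mean(x), s = sd(x))   # estimate parameters from X
tau <- function(x) { qnorm(0.95, mean = mean(x), sd = sd(x)) }
T.tilde <- replicate(1000, {
  x.tilde <- rnorm(length(x), theta.hat["m"], theta.hat["s"])  # simulate
  tau(x.tilde)                  # treat simulated data like real data
})
sd(T.tilde)                     # bootstrap standard error
```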
Example: Is Karakedi overweight?
Example (cont’d.)
## Sex Bwt Hwt
## F:47 Min. :2.000 Min. : 6.30
## M:97 1st Qu.:2.300 1st Qu.: 8.95
## Median :2.700 Median :10.10
## Mean :2.724 Mean :10.63
## 3rd Qu.:3.025 3rd Qu.:12.12
## Max. :3.900 Max. :20.50
## [1] 3.521869
Example (cont’d.)
Simulate from fitted Gaussian; bundle up estimating 95th percentile into a function
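The stripped code chunk presumably looked something like the following; the function names are guesses, but the data are `cats` from the MASS package (which ships with R), and the point estimate matches the `3.521869` printed above:

```r
library(MASS)   # provides the cats data frame (Bwt = body weight, kg)
data(cats)
# Estimate the 95th percentile under a fitted Gaussian, bundled as a function
q95.gaussian <- function(x) {
  qnorm(0.95, mean = mean(x), sd = sd(x))
}
q95.gaussian(cats$Bwt)          # the point estimate (about 3.52)
# One simulation from the fitted Gaussian, same size as the real data
sim.bwt <- function() {
  rnorm(nrow(cats), mean = mean(cats$Bwt), sd = sd(cats$Bwt))
}
```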
Example (cont’d.)
Simulate, plot the sampling distribution from the simulations
Example (cont’d.)
Find standard error and a (crude) confidence interval
## [1] 0.0626791
## 2.5% 97.5%
## 3.408126 3.648369
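The summaries above can be recovered from the simulated replicates; with a different random seed the numbers will wobble slightly around those printed:

```r
library(MASS); data(cats)
set.seed(95)
q95.gaussian <- function(x) { qnorm(0.95, mean = mean(x), sd = sd(x)) }
T.tilde <- replicate(1000,
  q95.gaussian(rnorm(nrow(cats), mean(cats$Bwt), sd(cats$Bwt))))
sd(T.tilde)                          # standard error (about 0.06)
quantile(T.tilde, c(0.025, 0.975))   # crude 95% CI (about 3.41 to 3.65)
```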
Model Checking
As always, if the model isn’t right, relying on the model is asking for trouble
Is the Gaussian a good model for cats’ weights?
Example (re-cont’d.)
Compare histogram to fitted Gaussian density and to a smooth density estimate
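One way to draw the comparison in base R graphics (a sketch, not necessarily the original chunk):

```r
library(MASS); data(cats)
# Histogram of body weights, on the density scale
hist(cats$Bwt, probability = TRUE, xlab = "Body weight (kg)", main = "")
# Fitted Gaussian density
curve(dnorm(x, mean = mean(cats$Bwt), sd = sd(cats$Bwt)),
      add = TRUE, col = "blue")
# Smooth (kernel) density estimate
lines(density(cats$Bwt), lty = "dashed")
```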
Resampling
Difficulty: We might not have a trustworthy model
Resource: We do have data, which tells us a lot about the distribution
Solution: Resampling — treat the sample as if it were the whole population
The Resampling (“Nonparametric”) Bootstrap
- Get data \(X\), containing \(n\) samples, find \(T=\tau(X)\)
- Repeat \(b\) times:
- Generate \(\tilde{X}\) by drawing \(n\) samples from \(X\) with replacement (resample the data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use empirical distribution of \(\tilde{T}\) as \(P_{T}\)
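The whole recipe fits in a couple of lines of R; `resample` and `rboot` are hypothetical helper names:

```r
# Draw n points from x with replacement (resample the data)
resample <- function(x) { sample(x, size = length(x), replace = TRUE) }
# b bootstrap replicates of the statistic tau on data x
rboot <- function(tau, b, x) { replicate(b, tau(resample(x))) }
```

For instance, `rboot(median, 1000, x)` returns 1000 resampled medians, whose empirical distribution stands in for \(P_T\).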
Example, Take 2
Model-free estimate of the 95th percentile = 95th percentile of the data
## 95%
## 3.6
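The model-free point estimate is just the sample quantile:

```r
library(MASS); data(cats)
quantile(cats$Bwt, 0.95)   # 95th percentile of the observed body weights
```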
How precise is that?
Example, Take 2
Resampling, re-estimating, and finding sampling distribution
Example, Take 2
Example, Take 2 (cont’d)
standard error, bias, CI
## [1] 0.07920881
## [1] -0.023945
## 2.5% 97.5%
## 3.400 3.785
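Putting the pieces together for the resampling version; with a different seed the numbers will differ slightly from those printed above:

```r
library(MASS); data(cats)
set.seed(144)
resample <- function(x) { sample(x, length(x), replace = TRUE) }
T.hat <- quantile(cats$Bwt, 0.95)       # estimate on the real data
T.tilde <- replicate(1000, quantile(resample(cats$Bwt), 0.95))
sd(T.tilde)                             # standard error (about 0.08)
mean(T.tilde) - T.hat                   # bias (slightly negative here)
quantile(T.tilde, c(0.025, 0.975))      # crude 95% CI (about 3.40 to 3.79)
```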
Bootstrapping Regressions
- A regression is a model for \(Y\) conditional on \(X\) \[
Y= \mu(X) + \epsilon, ~ \Expect{\epsilon|X} = 0
\]
- This is silent on the distribution of \(X\), so how do we simulate?
- Options, putting less and less trust in the model:
- Hold \(x_i\) fixed, set \(\tilde{y}_i = \hat{\mu}(x_i) + \tilde{\epsilon}_i\) from model’s estimated noise distribution (e.g., Gaussian or \(t\))
- Hold \(x_i\) fixed, set \(\tilde{y}_i = \hat{\mu}(x_i) + \tilde{\epsilon}_i\), noise resampled from the residuals
- Resample \((x_i, y_i)\) pairs (resample data-points or resample cases)
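The middle option (resampling residuals) might be sketched like this on synthetic data, with deliberately non-Gaussian noise; none of this is the lecture's own code:

```r
set.seed(402)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + (rexp(n) - 1)     # noise is IID but skewed, mean zero
fit <- lm(y ~ x)
coef.tilde <- replicate(1000, {
  # hold x fixed; new responses = fitted values + resampled residuals
  y.tilde <- fitted(fit) + sample(residuals(fit), n, replace = TRUE)
  coef(lm(y.tilde ~ x))            # re-estimate on the fake data
})
apply(coef.tilde, 1, quantile, probs = c(0.025, 0.975))  # coefficient CIs
```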
Different Regression Bootstraps
- Resampling residuals works as long as the noise is IID
- Noise could be Gaussian…
- Or noise could be very non-Gaussian
- Noise does need to have the same distribution everywhere
- Dubious unless the regression model is right
- Resampling whole cases works as long as observations are IID
- noise needn’t be independent of \(X\)
- needn’t be Gaussian
- regression model needn’t be right
Cats’ Hearts (cont’d)
Cats’ Hearts (cont’d)
- Coefficients and standard confidence intervals:
## (Intercept) Bwt
## -0.3566624 4.0340627
## 2.5 % 97.5 %
## (Intercept) -1.725163 1.011838
## Bwt 3.539343 4.528782
- These assume IID Gaussian noise
Cats’ Hearts (cont’d)
The residuals don’t look very Gaussian:
Cats’ Hearts (cont’d)
Resample residuals:
Re-estimate on new data:
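A plausible reconstruction of the two stripped chunks (function names are guesses, but the fitted coefficients match those printed earlier):

```r
library(MASS); data(cats)
cats.lm <- lm(Hwt ~ Bwt, data = cats)
# Resample residuals: new responses = fitted values + shuffled residuals
sim.resids <- function() {
  fitted(cats.lm) + sample(residuals(cats.lm), nrow(cats), replace = TRUE)
}
# Re-estimate on the new data
coefs.from.resids <- function() {
  coef(lm(sim.resids() ~ cats$Bwt))
}
```

Replicating `coefs.from.resids()` many times and taking quantiles of the replicates gives the residual-resampling CIs on the next slide.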
Cats’ Hearts (cont’d)
Re-sample to get CIs:
## 2.5% 97.5%
## (Intercept) -1.716803 0.9534706
## Bwt 3.569200 4.5458607
Cats’ Hearts (cont’d)
Try resampling whole rows:
## 2.5% 97.5%
## (Intercept) -2.052301 1.196093
## Bwt 3.410005 4.693054
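Case resampling might look like the following sketch (again, function names are hypothetical); the resulting CIs should be close to those printed above, up to simulation noise:

```r
library(MASS); data(cats)
set.seed(97)
# Resample whole rows (cases) of the data frame, then re-fit
coefs.on.rows <- function() {
  rows <- sample(1:nrow(cats), nrow(cats), replace = TRUE)
  coef(lm(Hwt ~ Bwt, data = cats[rows, ]))
}
coef.tilde <- replicate(1000, coefs.on.rows())
apply(coef.tilde, 1, quantile, probs = c(0.025, 0.975))
```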
Exercise
- We now have three sets of confidence intervals
- “Conventional”/Gaussian:
## 2.5 % 97.5 %
## (Intercept) -1.725163 1.011838
## Bwt 3.539343 4.528782
- By resampling residuals:
## 2.5% 97.5%
## (Intercept) -1.716803 0.9534706
## Bwt 3.569200 4.5458607
- By resampling rows/cases:
## 2.5% 97.5%
## (Intercept) -2.052301 1.196093
## Bwt 3.410005 4.693054
- Why do the intervals keep getting wider?
- Which of these is most reliable?
Sources of Error in Bootstrapping
- Simulation Using only \(b\) bootstrap replicates
- Make this small by letting \(b\rightarrow\infty\)
- Costs computing time
- Diminishing returns: error is generally \(O(1/\sqrt{b})\)
- There are tricks for speeding up simulations, re-using parts of them, etc.
- Approximation Using \(\hat{P}\) instead of \(P_{X}\)
- Make this small by careful statistical modeling
- Estimation Only a finite number of samples
- Make this small by being careful about what we simulate (example at end of slides)
- For fixed \(n\), resampling generally shows more uncertainty than model-based bootstrap
- Resampling is less vulnerable to modeling mistakes
Summing Up
- We need to know the sampling distribution of our statistic \(T = \tau(X)\)
- If we could simulate \(\tilde{X}\) from the true distribution of \(X\), we’d use the distribution of \(\tau(\tilde{X})\) as the sampling distribution (“Monte Carlo”)
- We simulate from an estimate of the distribution of \(X\) instead (“bootstrap”)
- The simulation could be based on a model
- Or on re-sampling the observations
- Or some combination
Backup: Improving on the Crude Confidence Interval
The crude CI uses the distribution of \(\tilde{\theta}\) under \(\hat{\theta}\)
But really we want the distribution of \(\hat{\theta}\) under \(\theta\)
Mathematical observation: Generally speaking, the distribution of \(\tilde{\theta} - \hat{\theta}\) is closer to that of \(\hat{\theta}-\theta_0\) than the distribution of \(\tilde{\theta}\) is to that of \(\hat{\theta}\)
(“centering” or “pivoting”)
Backup: The Basic, Pivotal CI
Let \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) be the corresponding quantiles of \(\tilde{\theta}\); then
\[
\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0}
\end{eqnarray*}
\]
Basically: re-center the simulations around the empirical data
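As a function, the derivation above amounts to a couple of lines (a sketch; `theta.tilde` holds the bootstrap replicates, `theta.hat` the estimate on the real data):

```r
# Basic ("pivotal") bootstrap CI: re-center the simulated quantiles
# around the empirical estimate, flipping the upper and lower ends
pivotal.CI <- function(theta.tilde, theta.hat, alpha = 0.05) {
  q <- unname(quantile(theta.tilde, c(alpha / 2, 1 - alpha / 2)))
  c(lower = 2 * theta.hat - q[2], upper = 2 * theta.hat - q[1])
}
```

When the replicates are symmetric around `theta.hat`, this agrees with the crude percentile interval; when they are skewed, the two intervals differ.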