Bootstrap (with feeling this time)

9 February 2017

The Big Picture

\[ \newcommand{\Expect}[1]{\mathbf{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)} \newcommand{\Probwrt}[2]{\mathrm{Pr}_{#2}\left( #1 \right)} \]

Knowing the sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
The bootstrap principle: approximate the sampling distribution by simulating from a good model of the data, and treating the simulated data just like the real data
Sometimes we simulate from the model we're estimating (model-based or "parametric" bootstrap)
Sometimes we simulate by re-sampling the original data (resampling or "nonparametric" bootstrap)
Stronger assumptions \(\Rightarrow\) less uncertainty if we're right

Statistical Uncertainty

Re-run the experiment (survey, census, …) and get different data

\(\therefore\) everything we calculate from data (estimates, test statistics, policies, …) will change from trial to trial as well

This variability is (the source of) statistical uncertainty

Quantifying this is a way to be honest about what we actually know

Measures of Uncertainty

Standard error = standard deviation of an estimator (could equally well use median absolute deviation, etc.)

\(p\)-value = Probability we'd see a signal this big if there was just noise

Confidence region = All the parameter values we can't reject at low error rates:

Either the true parameter is in the confidence region
or we are very unlucky
or our model is wrong

etc., etc.

The Sampling Distribution Is the Source of All Knowledge

Data \(X \sim P_{X,\theta_0}\), for some true \(\theta_0\)

We calculate a statistic \(T = \tau(X)\) so it has distribution \(P_{T,\theta_0}\)

If we knew \(P_{T,\theta_0}\), we could calculate

\(\Var{T}\) (and so standard error)
\(\Expect{T}\) (and so bias)
quantiles (and so confidence intervals or \(p\)-values), etc.

The Problems

Problem 1: Most of the time, \(P_{X,\theta_0}\) is very complicated

Problem 2: Most of the time, \(\tau\) is a very complicated function

Problem 3: We certainly don't know \(\theta_0\)

Upshot: We don't know \(P_{T,\theta_0}\) and can't use it to calculate anything

The Solutions

Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)

Modern (\(\approx 1975\)–): Use complex models and statistics, but simulate calculating the statistic on the model

The Bootstrap Principle

Find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
Generate a simulation \(\tilde{X}\) from \(\hat{P}\), set \(\tilde{T} = \tau(\tilde{X})\)
Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T,\theta_0}\)

Refinements:

improving the initial estimate \(\hat{P}\)
reducing the number of simulations or speeding them up
transforming \(\tau\) so the final approximation is more stable

First step: find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)

Model-based Bootstrap

If we are using a model, our best guess at \(P_{X,\theta_0}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters

The Model-based Bootstrap

Get data \(X\), estimate \(\hat{\theta}\) from \(X\)
Repeat \(b\) times:
- Simulate \(\tilde{X}\) from \(P_{X,\hat{\theta}}\) (simulate data of same size/"shape" as real data)
- Calculate \(\tilde{T} = \tau(\tilde{X}\)) (treat simulated data the same as real data)
Use empirical distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)

Example: Is Karakedi overweight?

Example (cont'd.)

library(MASS); data(cats); summary(cats)

##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50

(q95.gaussian <- qnorm(0.95,mean=mean(cats$Bwt),sd=sd(cats$Bwt)))

## [1] 3.521869

Example (cont'd.)

Simulate from fitted Gaussian; bundle up estimating 95th percentile into a function

rcats.gaussian <- function() {
  rnorm(n=nrow(cats),
        mean=mean(cats$Bwt),
        sd=sd(cats$Bwt))
}

est.q95.gaussian <- function(x) {
  m <- mean(x)
  s <- sd(x)
  return(qnorm(0.95,mean=m,sd=s))
}

Example (cont'd.)

Simulate, plot the sampling distribution from the simulations

sampling.dist.gaussian <- replicate(1000, est.q95.gaussian(rcats.gaussian()))
plot(hist(sampling.dist.gaussian,breaks=50, plot=FALSE), freq=FALSE)
lines(density(sampling.dist.gaussian),lwd=2)
abline(v=q95.gaussian,lty="dashed",lwd=4)

Example (cont'd.)

Find standard error and a crude confidence interval

sd(sampling.dist.gaussian)

## [1] 0.05847899

quantile(sampling.dist.gaussian,c(0.025,0.975))

##     2.5%    97.5% 
## 3.408218 3.632092

Model Checking

As always, if the model isn't right, relying on the model is asking for trouble

Is the Gaussian a good model for cats' weights?

Example (re-cont'd.)

Compare histogram to fitted Gaussian density and to a smooth density estimate

Resampling

Problem: Suppose we don't have a trust-worthy model

Resource: We do have data, which tells us a lot about the distribution

Solution: Resampling, treat the sample like a whole population

The Resampling ("Nonparameteric") Bootstrap

Get data \(X\), containing \(n\) samples, find \(T=\tau(X)\)
Repeat \(b\) times:
- Generate \(\tilde{X}\) by drawing \(n\) samples from \(X\) with replacement (resample the data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
Use empirical distribution of \(\tilde{T}\) as \(P_{T,\theta}\)

Example, Take 2

Model-free estimate of the 95th percentile is the 95th percentile of the data

(q95.np <- quantile(cats$Bwt,0.95))

## 95% 
## 3.6

How precise is that?

Example, Take 2

Resampling, re-estimating, and finding sampling distribution

resample <- function(x) {
  sample(x,size=length(x),replace=TRUE)
}

est.q95.np <- function(x) { quantile(x,0.95) }

Example, Take 2

sampling.dist.np <- replicate(1000, est.q95.np(resample(cats$Bwt)))
plot(density(sampling.dist.np), main="", xlab="Body weight (kg)")
abline(v=q95.np,lty=2)

Concrete Example, Take 2

standard error, bias, CI

sd(sampling.dist.np)

## [1] 0.07573982

mean(sampling.dist.np - q95.np)

## [1] -0.01858

quantile(sampling.dist.np,c(0.025,0.975))

##     2.5%    97.5% 
## 3.482875 3.785000

Bootstrapping Regressions

A regression is a model for \(Y\) conditional on \(X\) \[ Y= m(X) + \mathrm{noise} \] Silent about distribution of \(X\), so how do we simulate?

Options, putting less and less trust in the model:

Hold \(x_i\) fixed, set \(\tilde{y}_i = \hat{m}(x_i) + \mathrm{noise}\) from model's estimated noise distribution (e.g., Gaussian or \(t\))
Hold \(x_i\) fixed, set \(\tilde{y}_i = \hat{m}(x_i) + \mathrm{noise}\), noise resampled from the residuals
Resample \((x_i, y_i)\) pairs (resample data-points or resample cases)

Cats' Hearts

cats has weights for cats' hearts, as well as bodies

(Much cuter than any photo of real cats' hearts)

How does heart weight relate to body weight?

(Useful if Kara's vet wants to know how much heart medicine to prescribe)

Cats' Hearts

plot(Hwt~Bwt, data=cats, xlab="Body weight (kg)", ylab="Heart weight (g)")
cats.lm <- lm(Hwt ~ Bwt, data=cats)
abline(cats.lm)

Cats' Hearts

Coefficients and "standard" confidence intervals:

coefficients(cats.lm)

## (Intercept)         Bwt 
##  -0.3566624   4.0340627

confint(cats.lm)

##                 2.5 %   97.5 %
## (Intercept) -1.725163 1.011838
## Bwt          3.539343 4.528782

Cats' Hearts

The residuals don't look very Gaussian:

Cats' Hearts

Resample residuals:

sim.cats.resids <- function() {
  new.cats <- cats
  noise <- resample(residuals(cats.lm))
  new.cats$Hwt <- fitted(cats.lm) + noise
  return(new.cats)
}

Re-estimate on new data:

coefs.cats.lm <- function(df) {
  fit <- lm(Hwt~Bwt,data=df)
  return(coefficients(fit))
}

Cats' Hearts

Re-sample to get CIs:

cats.lm.samp.dist.resids <- replicate(1000,
   coefs.cats.lm(sim.cats.resids()))
apply(cats.lm.samp.dist.resids,1,quantile,c(0.025,0.975))

##       (Intercept)      Bwt
## 2.5%    -1.696977 3.561274
## 97.5%    0.964596 4.522391

Cats' Hearts

Try resampling whole rows:

resample.data.frame <- function(df) {
  return(df[resample(1:nrow(df)),])
}

cats.lm.samp.dist.cases <- replicate(1000,
  coefs.cats.lm(resample.data.frame(cats)))
apply(cats.lm.samp.dist.cases,1,quantile,c(0.025,0.975))

##       (Intercept)      Bwt
## 2.5%    -1.967652 3.410881
## 97.5%    1.262976 4.655187

Why Do This?

Resampling residuals works as long as the noise is IID
- Noise could be Gaussian…
- Or it could be very non-Gaussian
Resampling whole cases works as long as observations are IID
- noise needn't be independent of \(X\)
- needn't be Gaussian
- linear model needn't be right

Sources of Error in Bootstrapping

Simulation Using only \(b\) bootstrap replicates
- Make this small by letting \(b\rightarrow\infty\)
- Costs computing time
- Diminishing returns: error is generally \(O(1/\sqrt{b})\)
Approximation Using \(\hat{P}\) instead of \(P_{X,\theta_0}\)
- Make this small by careful statistical modeling
Estimation Only a finite number of samples
- Make this small by being careful about what we simulate (example next slide)

Generally: for fixed \(n\), nonparametric boostrap shows more uncertainty than parametric bootstraps, but is less at risk to modeling mistakes (yet another bias-variance tradeoff)

Improving on the Crude Confidence Interval

Crude CI use distribution of \(\tilde{\theta}\) under \(\hat{\theta}\)

But really we want the distribution of \(\hat{\theta}\) under \(\theta\)

Mathematical observation: Generally speaking, (distributions of) \(\tilde{\theta} - \hat{\theta}\) is closer to \(\hat{\theta}-\theta_0\) than \(\tilde{\theta}\) is to \(\hat{\theta}\)

("centering" or "pivoting")

The Basic, Pivotal CI

quantiles of \(\tilde{\theta}\) = \(q_{\alpha/2}, q_{1-\alpha/2}\)

\[ \begin{eqnarray*} 1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\ & = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\ & \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\ & = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\ & = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0} \end{eqnarray*} \]

Basically: re-center the simulations around the empirical data