Simulation for Inference I — The Bootstrap
36-467/667
22 October 2020 (Lecture 16)
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]}
\newcommand{\Probwrt}[2]{\mathbb{P}_{#2}\left( #1 \right)}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\]
In our last episode
- General strategy for simulating
- The Monte Carlo method: replace complicated probability calculations with repeated simulation
- To find \(\Expect{f(Y)}\), do \(m\) simulations \(Y_1, Y_2, \ldots Y_m\)
- Then \(m^{-1}\sum_{1}^{m}{f(Y_i)} \approx \Expect{f(Y)}\)
- Also works for variances, probabilities, etc.
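The Monte Carlo recipe above fits in a couple of lines of R. As a minimal sketch, take \(Y\) standard Gaussian and \(f(y)=y^2\) (stand-in choices, so the true value \(\Expect{Y^2}=1\) is known):

```r
# Monte Carlo estimate of E[f(Y)], with Y ~ N(0,1) and f(y) = y^2 as stand-ins
m <- 1e5
y <- rnorm(m)             # m independent simulations of Y
mc.estimate <- mean(y^2)  # m^{-1} sum_i f(Y_i), approximates E[f(Y)] = 1
```

The error in `mc.estimate` shrinks like \(O(1/\sqrt{m})\), as we'll see again at the end of the lecture.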
Agenda
- The sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
- The bootstrap principle: approximate the sampling distribution by simulating from a good model, and treating the simulation run just like the data
- Sometimes we simulate from the model we’re estimating (model-based or “parametric” bootstrap)
- Sometimes we simulate by re-sampling the original data (resampling or “nonparametric” bootstrap)
- Stronger assumptions \(\Rightarrow\) less uncertainty if we’re right
Statistical Uncertainty
- Re-run the experiment (survey, census, …) and get different data
- \(\therefore\) everything we calculate from data (estimates, test statistics, policies, …) will change from trial to trial as well
- This variability is (the source of) statistical uncertainty
- Quantifying this is a way to be honest about what we actually know
Measures of Uncertainty
- Standard error = standard deviation of an estimator
- (could use median absolute deviation instead, etc.)
- \(p\)-value = Probability we’d see a signal this big if there were just noise
- Confidence region = All the parameter values we can’t reject at low error rates:
- Either the true parameter is in the confidence region
- or we are very unlucky
- (Something happened whose probability was very low under all parameter values)
- or our model is wrong
The Sampling Distribution Is the Source of All Knowledge
- Data \(X \sim P_{X,\theta_0}\), for some true \(\theta_0\)
- We calculate a statistic \(T = \tau(X)\) so it has distribution \(P_{T,\theta_0}\)
- If we knew \(P_{T,\theta_0}\), we could calculate
- \(\Var{T}\) (and so standard error)
- \(\Expect{T}\) (and so bias)
- quantiles (and so confidence intervals or \(p\)-values), etc.
The Problems
- Most of the time, \(P_{X,\theta_0}\) is very complicated
- Most of the time, \(\tau\) is a very complicated function
- We certainly don’t know \(\theta_0\)
- Upshot: We don’t know \(P_{T,\theta_0}\) and can’t use it to calculate anything
The Solutions
- Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)
- Modern (\(\approx 1975\)–): Use complex models and statistics, but simulate calculating the statistic on the model
The Bootstrap Principle
- Find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
- Generate a simulation \(\tilde{X}\) from \(\hat{P}\); set \(\tilde{T} = \tau(\tilde{X})\); repeat
- Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T,\theta_0}\)
- First step: find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
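The three steps of the principle fit in one generic function. In this sketch, `simulate.from.P.hat` and `tau` are placeholder names (not from the lecture), filled in below with trivial stand-ins (resampling a fixed vector, taking its mean):

```r
# Generic bootstrap skeleton: simulate from P-hat, apply tau, repeat m times
bootstrap <- function(simulate.from.P.hat, tau, m=1000) {
  replicate(m, tau(simulate.from.P.hat()))
}

# Stand-in example: P-hat = resample a fixed sample, tau = the mean
x <- c(1, 4, 2, 8, 5)
tilde.T <- bootstrap(function() sample(x, replace=TRUE), mean)
# tilde.T is now 1000 simulated values of the statistic
```

The model-based and resampling bootstraps below differ only in what gets plugged in for `simulate.from.P.hat`.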
Model-based Bootstrap
If we are using a model, our best guess at \(P_{X,\theta_0}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters
The Model-based Bootstrap
- Get data \(X\), estimate \(\hat{\theta}\) from \(X\)
- Repeat \(m\) times:
- Simulate \(\tilde{X}\) from \(P_{X,\hat{\theta}}\) (simulate data of same size/“shape” as real data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use the simulated distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)
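The loop above, concretely, for a Gaussian AR(1) (a sketch, not the lecture's code; `estimate.ar1` and `simulate.ar1` are names chosen here for illustration):

```r
# Model-based bootstrap for a Gaussian AR(1):
# estimate phi by lag-1 least squares, simulate from the fit, re-estimate
estimate.ar1 <- function(x) { coef(lm(x[-1] ~ x[-length(x)]))[2] }
simulate.ar1 <- function(n, phi) {
  as.vector(arima.sim(model=list(ar=phi), n=n))
}

x <- simulate.ar1(n=200, phi=0.6)   # pretend this is the real data
phi.hat <- estimate.ar1(x)          # step 1: estimate from the data
tilde.phi <- replicate(500,         # steps 2-3: simulate and re-estimate
                       estimate.ar1(simulate.ar1(n=length(x), phi=phi.hat)))
se.phi <- sd(tilde.phi)             # step 4: bootstrap standard error
```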
Example: lynxes in Canada
- Annual time series of the number of lynxes trapped for fur in the Mackenzie River area of Canada:
Fitting an AR model to the example data
- This oscillates, so an AR(1) isn’t appropriate — try an AR(2)
## , , 1
##
## [,1]
## [1,] 1.152423
## [2,] -0.606229
## [1] 710.1056
so the estimated model is \[
X(t) = 710.1056 + 1.152423 X(t-1) - 0.606229 X(t-2) + \epsilon(t)
\]
Simulating from the fitted model
- What do some time series from this look like?
- (We wrote ar2.sim last time)
lynx.sims <- replicate(5, ar2.sim(n=length(lynx),
                                  a=lynx.ar2$x.intercept,
                                  b=lynx.ar2$ar,
                                  sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
                                  x.start=lynx[1:2]))
plot(lynx, xlab="Year", ylab="Lynxes caught", lwd=2)
for (i in 1:ncol(lynx.sims)) {
lines(1821:1934, lynx.sims[,i], col="grey", lty=i)
}
abline(h=0, col="blue", lwd=0.5)
Uncertainty in the coefficients
- Write a function which estimates the coefficients
Check that this gives the right answer on the data, and a different answer on a simulation:
## a b1 b2
## 710.105589 1.152423 -0.606229
## a b1 b2
## 648.795789 1.182618 -0.592997
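The estimating function itself isn't shown on these slides; a sketch of what it might look like (the name `ar2.estimator` is chosen here, and least squares stands in for whatever fitting routine the lecture used):

```r
# Estimate AR(2) coefficients by regressing X(t) on X(t-1) and X(t-2)
ar2.estimator <- function(x) {
  n <- length(x)
  fit <- lm(x[3:n] ~ x[2:(n-1)] + x[1:(n-2)])
  est <- coef(fit)
  names(est) <- c("a", "b1", "b2")  # match the slides' naming
  return(est)
}

est <- ar2.estimator(as.vector(lynx))  # should roughly match the output above
```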
Uncertainty in the coefficients
Now do lots of simulations, and apply this function to each simulation:
## [1] 3 1000
## [,1] [,2] [,3] [,4] [,5]
## a 619.2030763 728.1824888 933.0930750 608.3708893 709.5767510
## b1 1.1859839 1.1591091 1.1406471 1.2823124 1.1297578
## b2 -0.5645652 -0.6534608 -0.6916214 -0.6258483 -0.5969703
Uncertainty in the coefficients
Standard errors \(\approx\) standard deviation across simulations:
## a b1 b2
## 127.16843433 0.07891514 0.07488047
95% confidence intervals \(\approx\) quantiles across simulations:
## a b1 b2
## 2.5% 499.1899 0.9811642 -0.7387225
## 97.5% 989.8220 1.2836256 -0.4517181
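Given the \(3 \times m\) matrix of estimates (one column per simulation), both summaries are one `apply()` each. The matrix name `coefs.sims` is an assumption; a Gaussian stand-in replaces the real bootstrap output here so the sketch is self-contained:

```r
# Stand-in for the real 3 x 1000 matrix of bootstrapped coefficient estimates
coefs.sims <- matrix(rnorm(3*1000, mean=c(710, 1.15, -0.6),
                           sd=c(127, 0.08, 0.075)), nrow=3)
rownames(coefs.sims) <- c("a", "b1", "b2")

ses <- apply(coefs.sims, 1, sd)                               # standard errors
cis <- apply(coefs.sims, 1, quantile, probs=c(0.025, 0.975))  # 95% CIs
```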
For Gaussian AR\((p)\) models…
- … the bootstrapping isn’t needed, because for them the asymptotic theory we worked out before is exact
- Basically: Homework 7, only with even more book-keeping
- But we can use this bootstrap approach for any generative time-series model
- E.g., make \(X(t)\) have a Poisson distribution rather than a Gaussian
- No negative lynx counts, only whole numbers of lynxes…
- All we have to do is swap out the simulation
Prediction
- You can apply these ideas when the “statistic” is \(X(t)\) for a fixed \(t\)
- Or when it’s \(X(t) | X(t-1) = x\) for a fixed \(x\)
- Since the model tells us how to calculate that!
- Expected values across simulations = best guess for the point prediction
- Variances and intervals incorporate parameter uncertainty
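A sketch of prediction by simulation: on each run, draw the AR(2) coefficients from (an approximation to) their bootstrap distribution, then simulate one step ahead. All the numbers below (coefficient moments, starting values, innovation sd) are illustrative, not the real lynx fits:

```r
# One-step-ahead prediction incorporating parameter + innovation uncertainty
one.step <- function() {
  # draw coefficients from a Gaussian approximation to their bootstrap dist.
  a <- rnorm(1, 710, 127); b1 <- rnorm(1, 1.15, 0.08); b2 <- rnorm(1, -0.61, 0.075)
  x.prev <- 3000; x.prev2 <- 2500   # illustrative last two observations
  a + b1*x.prev + b2*x.prev2 + rnorm(1, sd=1100)  # made-up innovation sd
}
preds <- replicate(1000, one.step())
point.pred <- mean(preds)                     # point prediction
pred.int <- quantile(preds, c(0.025, 0.975))  # prediction interval
```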
Resampling
- Problem: Suppose we don’t have a trustworthy model
- Resource: We do have data, which tells us a lot about the distribution
- Solution: Resampling, treat the sample like a whole population
The Resampling (“Nonparametric”) Bootstrap
- Get data \(X\), containing \(n\) samples, find \(T=\tau(X)\)
- Fix a block length \(k\) and divide \(X\) into (overlapping) blocks of length \(k\)
- So first block is \((X(1), \ldots X(k))\), second block is \((X(2), \ldots X(k+1))\), down to block \(n-k+1\), which is \((X(n-k+1), X(n-k+2), \ldots X(n))\)
- Repeat \(m\) times:
- Generate \(\tilde{X}\) by drawing \(n/k\) blocks from \(X\) with replacement (resample the data)
- Concatenate the blocks to make \(\tilde{X}\)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use the simulated distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)
Some useful code for block resampling
# Convert a time series into a data frame of lagged values
# Inputs: a time series (ts)
# maximum lag to use (order)
# whether older values go on the right or the left (right.older)
# Output: a data frame with (order+1) columns, named lag0, lag1, ... , and
# length(ts)-order rows
design.matrix.from.ts <- function(ts,order,right.older=TRUE) {
n <- length(ts)
x <- ts[(order+1):n]
for (lag in 1:order) {
if (right.older) {
x <- cbind(x,ts[(order+1-lag):(n-lag)])
} else {
x <- cbind(ts[(order+1-lag):(n-lag)],x)
}
}
lag.names <- c("lag0",paste("lag",1:order,sep=""))
if (right.older) {
colnames(x) <- lag.names
} else {
colnames(x) <- rev(lag.names)
}
return(as.data.frame(x))
}
Example: Back to the lynxes
## [1] 269 321 585 871 1475 2821
## lag2 lag1 lag0
## 1 269 321 585
## 2 321 585 871
## 3 585 871 1475
## 4 871 1475 2821
## 5 1475 2821 3928
## 6 2821 3928 5943
Example: Back to the lynxes
# Simple block bootstrap
# Inputs: time series (ts), block length, length of output
# Output: one resampled time series
# Presumes: ts is a univariate time series
rblockboot <- function(ts,block.length,len.out=length(ts)) {
# chop up ts into blocks
the.blocks <- as.matrix(design.matrix.from.ts(ts,block.length-1,
right.older=FALSE))
# look carefully at design.matrix.from.ts to see why we need the -1
# How many blocks is that?
blocks.in.ts <- nrow(the.blocks)
# Sanity-check
stopifnot(blocks.in.ts == length(ts) - block.length+1)
# How many blocks will we need (round up)?
blocks.needed <- ceiling(len.out/block.length)
# Sample blocks with replacement
picked.blocks <- sample(1:blocks.in.ts,size=blocks.needed,replace=TRUE)
# put the blocks in the randomly-selected order
x <- the.blocks[picked.blocks,]
# convert from a matrix to a vector and return
# need transpose because R goes from vectors to matrices and back column by
# column, not row by row
x.vec <- as.vector(t(x))
# Silently discard unneeded extra observations at the end
return(x.vec[1:len.out])
}
## [1] 269 321 585 871 1475 2821
## [1] 3409 1824 409 674 81 80
Resampling the lynxes
- We can do everything the same as before, just changing how we simulate
## a b1 b2
## 209.95314712 0.09850266 0.07904893
- 95% confidence intervals:
## a b1 b2
## 2.5% 633.6568 0.3762097 -0.36009610
## 97.5% 1453.3612 0.7614678 -0.05949003
Why should this work?
- Preserves the distribution within blocks
- Independence across blocks
- So \(\tilde{X} \approx\) stationary, with the same short-range distribution as \(X\) but no longer-range dependence
- Should work well if \(X\) is stationary without long-range dependence
- Want to let \(k\) grow with \(n\), but still have a growing number of blocks, so \(n/k_n \rightarrow \infty\), or \(k=o(n)\)
- Some (complicated) theory suggests \(k \propto n^{1/3}\)
- Politis and White (2004) provides an automatic procedure for picking \(k\)
- Estimates bias (from too-short blocks) and variance (from too-few long blocks) and balances
- R implementation at http://www.math.ucsd.edu/~politis/SOFT/PPW/ppw.R
- …which suggests a block-length of 3 for the lynxes
What about non-stationary time series?
- Model-based: OK if the model is non-stationary
- Otherwise:
- Estimate the trend
- Bootstrap the deviations from the trend (by model or resampling)
- Add these back to the trend
- \(\tilde{X} =\) trend + bootstrapped deviations
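A sketch of those steps, using a loess smoother for the trend and a simple block resampler for the deviations (the lecture's `rblockboot` would do as well; the function name `bootstrap.around.trend` is an assumption):

```r
# Detrend, block-resample the deviations, and add the trend back
bootstrap.around.trend <- function(x, block.length=3) {
  t.idx <- seq_along(x)
  trend <- fitted(loess(x ~ t.idx))  # estimated trend
  devs <- x - trend                  # deviations from the trend
  # resample blocks of deviations with replacement
  n.blocks <- ceiling(length(x)/block.length)
  starts <- sample(1:(length(x)-block.length+1), n.blocks, replace=TRUE)
  resampled <- unlist(lapply(starts, function(s) devs[s:(s+block.length-1)]))
  trend + resampled[1:length(x)]     # X-tilde = trend + bootstrapped deviations
}

x.tilde <- bootstrap.around.trend(as.vector(lynx))
```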
What about spatial data?
- Model-based: simulate the model
- Resampling: use spatial blocks (e.g., rectangles)
- Levina and Bickel (2006) suggest that the best blocks aren’t rectangles but rectangles with a notch cut out of one corner
Sources of Error in Bootstrapping
- Simulation error: using only \(m\) bootstrap replicates
- Make this small by letting \(m\rightarrow\infty\)
- Costs computing time
- Diminishing returns: error is generally \(O(1/\sqrt{m})\)
- Approximation error: using \(\hat{P}\) instead of \(P_{X,\theta_0}\)
- Make this small by careful statistical modeling
- Estimation error: only a finite number of samples \(n\)
- Make this small by being careful about what we simulate (example in backup slides)
- Generally: for fixed \(n\), the resampling bootstrap shows more uncertainty than the model-based bootstrap, but is less at risk from modeling mistakes (yet another bias-variance tradeoff)
Summary
- Bootstrapping is applying the Monte Carlo idea to the sampling distribution
- Find a good approximation to the data-generating distribution
- Simulate it many times
- Treat each simulation run just like the original data
- Distribution over simulations \(\approx\) sampling distribution
- Choice of how to simulate:
- Based on a model
- By resampling blocks
- Many, many refinements, but this is the core of it
Backup: Some General Ways of Improving the Bootstrap
- Improving the initial estimate \(\hat{P}\)
- Model-based vs. resampling is a bias/variance trade-off here
- Reducing the number of simulations or speeding them up
- E.g., if we’re estimating a mean, we’d like to get pairs of negatively correlated samples; there are tricks for doing this (“antithetic sampling”)
- Transforming \(\tau\) so the final approximation is more stable
- The distribution of \(\tilde{\theta}\) approaches that of \(\hat{\theta}\) as \(n\) grows
- But often the distribution of \(\tilde{\theta}-\hat{\theta}\) approaches that of \(\hat{\theta}-\theta_0\) more quickly (“pivoting”; see below)
- And the distribution of \(\frac{\tilde{\theta}-\hat{\theta}}{\sqrt{\Var{\tilde{\theta}}}}\) is an even better approximation to the distribution of \(\frac{\hat{\theta}-\theta_0}{\sqrt{\Var{\hat{\theta}}}}\) (“studentizing”)
Backup: The “surrogate data” method
- You think that \(X\) comes from some interesting class of processes \(\mathcal{H}_1\)
- Your skeptical friend thinks maybe it’s just some boring process in \(\mathcal{H}_0\)
- Fit a model to \(X\) using only the boring processes in \(\mathcal{H}_0\), say \(\hat{Q}\)
- Now simulate \(\hat{Q}\) to get many “surrogate data” sets \(\tilde{X}\)
- Look at the distribution of \(\tau(\tilde{X})\) over your simulations
- If \(\tau(X)\) is really unlikely under that simulation, it’s evidence against \(\mathcal{H}_0\)
- When would it be evidence for \(\mathcal{H}_1\)?
- This is a test of the hypothesis \(\mathcal{H}_0\) against the alternative \(\mathcal{H}_1\)
- The term “surrogate data” comes out of physics (Theiler et al. 1992)
- testing the null of “this is a linear stochastic process” vs. “this is a deterministic, but chaotic, nonlinear dynamical system”
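A sketch of the surrogate-data test with stand-ins: here \(\mathcal{H}_0\) is i.i.d. Gaussian noise (a deliberately boring null), the statistic \(\tau\) is the lag-1 autocorrelation, and the "interesting" data really come from an AR(1), so the test should reject:

```r
# Surrogate-data test: simulate the fitted boring model, compare tau(X)
# to the distribution of tau over the surrogates
tau <- function(x) { cor(x[-1], x[-length(x)]) }   # lag-1 autocorrelation

x <- as.vector(arima.sim(model=list(ar=0.7), n=200))  # actually dependent
# Fit H0 (i.i.d. Gaussian) to x, then draw 1000 surrogate data sets from it
surrogates <- replicate(1000, rnorm(length(x), mean=mean(x), sd=sd(x)))
tau.null <- apply(surrogates, 2, tau)              # tau on each surrogate
p.value <- mean(abs(tau.null) >= abs(tau(x)))      # two-sided p-value
```

Since the data are strongly autocorrelated and the surrogates are not, `p.value` comes out essentially zero, evidence against \(\mathcal{H}_0\).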
Backup: Incorporating uncertainty in the initial conditions
- Simulating a dynamical model needs an initial value \(X(0)\) (or \(X(1)\), or maybe \(X(1), X(2)\), etc.)
- What if this isn’t known precisely?
- Easy case: it follows a probability distribution
- Then draw \(X(0)\) from that distribution independently on each run
- Aggregate across runs as usual
- Less easy: \(X(0)\) within known limits but no distribution otherwise
- My advice: draw \(X(0)\) uniformly from inside the region
- Rationale 1: Maximum entropy distribution compatible with constraints
- Rationale 2: If the process is rapidly mixing, memory of the distribution of \(X(0)\) decays quickly anyway
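A sketch of the advice above for an AR(1), with made-up limits on \(X(0)\): draw the initial condition uniformly on each run, then aggregate across runs as usual.

```r
# Simulate an AR(1) with X(0) drawn uniformly within known (made-up) limits
simulate.with.uncertain.start <- function(n, phi=0.6, x0.limits=c(-2, 2)) {
  x <- numeric(n)
  x[1] <- runif(1, x0.limits[1], x0.limits[2])  # random initial condition
  for (t in 2:n) { x[t] <- phi*x[t-1] + rnorm(1) }
  x
}

runs <- replicate(200, simulate.with.uncertain.start(n=50))
path.means <- rowMeans(runs)  # aggregate across runs, e.g. pointwise means
```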
Backup: Improving on the Crude Confidence Interval
- Crude CIs use the distribution of \(\tilde{\theta}\) under \(\hat{\theta}\)
- But really we want the distribution of \(\hat{\theta}\) under \(\theta_0\)
- Mathematical observation: generally, the distribution of \(\tilde{\theta} - \hat{\theta}\) is closer to that of \(\hat{\theta}-\theta_0\) than the distribution of \(\tilde{\theta}\) is to that of \(\hat{\theta}\)
- (“centering” or “pivoting”)
Backup: The Basic, Pivotal CI
Let \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) be the \(\alpha/2\) and \(1-\alpha/2\) quantiles of \(\tilde{\theta}\):
\[
\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0}
\end{eqnarray*}
\]
Basically: re-center the simulations around the empirical data
## a b1 b2
## -33.149997 1.543377 -1.152968
## a b1 b2
## 786.554385 1.928635 -0.852362
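The algebra above in code, as a sketch (`pivotal.ci` is a name chosen here, not the lecture's): the interval endpoints are \(2\hat{\theta}\) minus the upper and lower bootstrap quantiles.

```r
# Basic (pivotal) bootstrap CI: 2*theta.hat - (upper, lower quantiles)
pivotal.ci <- function(tilde.thetas, theta.hat, alpha=0.05) {
  q <- quantile(tilde.thetas, c(alpha/2, 1-alpha/2))
  c(lower = 2*theta.hat - q[[2]], upper = 2*theta.hat - q[[1]])
}

# Usage with stand-in bootstrap draws centered on theta.hat = 5
tilde <- rnorm(1000, mean=5)
ci <- pivotal.ci(tilde, theta.hat=5)
```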
Backup: Further reading
- Bootstrap begins with Efron (1979) (drawing on an older idea from the 1940s called the “jackknife”)
- Block bootstrap for time series comes from Künsch (1989)
- Bühlmann (2002) is a good review on time series bootstraps
- Davison and Hinkley (1997) is the best over-all textbook on the bootstrap
- Lahiri (2003) is very theoretical
References
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge, England: Cambridge University Press.
Lahiri, S. N. 2003. Resampling Methods for Dependent Data. New York: Springer-Verlag.
Theiler, James, Bryan Galdrikian, André Longtin, Stephen Eubank, and J. Doyne Farmer. 1992. “Using Surrogate Data to Detect Nonlinearity in Time Series.” In Nonlinear Modeling and Forecasting: Proceedings of the Workshop on Nonlinear Modeling and Forecasting Held September, 1990 in Santa Fe, New Mexico, edited by Martin Casdagli and Stephen Eubank. Reading, Massachusetts: Addison-Wesley.