Simulation for Inference I — The Bootstrap
36-467/667
22 October 2020 (Lecture 16)
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]}
\newcommand{\Probwrt}[2]{\mathbb{P}_{#2}\left( #1 \right)}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\]
In our last episode
- General strategy for simulating
- The Monte Carlo method: replace complicated probability calculations with repeated simulation
- To find \(\Expect{f(Y)}\), do \(m\) simulations \(Y_1, Y_2, \ldots Y_m\)
- Then \(m^{-1}\sum_{1}^{m}{f(Y_i)} \approx \Expect{f(Y)}\)
- Also works for variances, probabilities, etc.
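The Monte Carlo recipe above fits in a couple of lines of R. As a minimal sketch, take \(Y\) standard Gaussian and \(f(y)=y^2\) (stand-in choices, so the true value \(\Expect{Y^2}=1\) is known):

```r
# Monte Carlo estimate of E[f(Y)], with Y ~ N(0,1) and f(y) = y^2 as stand-ins
m <- 1e5
y <- rnorm(m)             # m independent simulations of Y
mc.estimate <- mean(y^2)  # m^{-1} sum_i f(Y_i), approximates E[f(Y)] = 1
```

The error in `mc.estimate` shrinks like \(O(1/\sqrt{m})\), as we'll see again at the end of the lecture.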
Agenda
- The sampling distribution of a statistic tells us about statistical uncertainty (standard errors, biases, confidence sets)
- The bootstrap principle: approximate the sampling distribution by simulating from a good model, and treating the simulation run just like the data
- Sometimes we simulate from the model we’re estimating (model-based or “parametric” bootstrap)
- Sometimes we simulate by re-sampling the original data (resampling or “nonparametric” bootstrap)
- Stronger assumptions \(\Rightarrow\) less uncertainty if we’re right
Statistical Uncertainty
- Re-run the experiment (survey, census, …) and get different data
- \(\therefore\) everything we calculate from data (estimates, test statistics, policies, …) will change from trial to trial as well
- This variability is (the source of) statistical uncertainty
- Quantifying this is a way to be honest about what we actually know
Measures of Uncertainty
- Standard error = standard deviation of an estimator
- (could use median absolute deviation instead, etc.)
- \(p\)-value = Probability we’d see a signal this big if there were just noise
- Confidence region = All the parameter values we can’t reject at low error rates:
- Either the true parameter is in the confidence region
- or we are very unlucky
- (Something happened whose probability was very low under all parameter values)
- or our model is wrong
The Sampling Distribution Is the Source of All Knowledge
- Data \(X \sim P_{X,\theta_0}\), for some true \(\theta_0\)
- We calculate a statistic \(T = \tau(X)\) so it has distribution \(P_{T,\theta_0}\)
- If we knew \(P_{T,\theta_0}\), we could calculate
- \(\Var{T}\) (and so standard error)
- \(\Expect{T}\) (and so bias)
- quantiles (and so confidence intervals or \(p\)-values), etc.
The Problems
- Most of the time, \(P_{X,\theta_0}\) is very complicated
- Most of the time, \(\tau\) is a very complicated function
- We certainly don’t know \(\theta_0\)
- Upshot: We don’t know \(P_{T,\theta_0}\) and can’t use it to calculate anything
The Solutions
- Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)
- Modern (\(\approx 1975\)–): Use complex models and statistics, but simulate calculating the statistic on the model
The Bootstrap Principle
- Find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
- Generate a simulation \(\tilde{X}\) from \(\hat{P}\); set \(\tilde{T} = \tau(\tilde{X})\); repeat
- Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T,\theta_0}\)
- First step: find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
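The three steps of the principle fit in one generic function. In this sketch, `simulate.from.P.hat` and `tau` are placeholder names (not from the lecture), filled in below with trivial stand-ins (resampling a fixed vector, taking its mean):

```r
# Generic bootstrap skeleton: simulate from P-hat, apply tau, repeat m times
bootstrap <- function(simulate.from.P.hat, tau, m=1000) {
  replicate(m, tau(simulate.from.P.hat()))
}

# Stand-in example: P-hat = resample a fixed sample, tau = the mean
x <- c(1, 4, 2, 8, 5)
tilde.T <- bootstrap(function() sample(x, replace=TRUE), mean)
# tilde.T is now 1000 simulated values of the statistic
```

The model-based and resampling bootstraps below differ only in what gets plugged in for `simulate.from.P.hat`.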
Model-based Bootstrap
If we are using a model, our best guess at \(P_{X,\theta_0}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters
The Model-based Bootstrap
- Get data \(X\), estimate \(\hat{\theta}\) from \(X\)
- Repeat \(m\) times:
- Simulate \(\tilde{X}\) from \(P_{X,\hat{\theta}}\) (simulate data of same size/“shape” as real data)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use the simulated distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)
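The loop above, concretely, for a Gaussian AR(1) (a sketch, not the lecture's code; `estimate.ar1` and `simulate.ar1` are names chosen here for illustration):

```r
# Model-based bootstrap for a Gaussian AR(1):
# estimate phi by lag-1 least squares, simulate from the fit, re-estimate
estimate.ar1 <- function(x) { coef(lm(x[-1] ~ x[-length(x)]))[2] }
simulate.ar1 <- function(n, phi) {
  as.vector(arima.sim(model=list(ar=phi), n=n))
}

x <- simulate.ar1(n=200, phi=0.6)   # pretend this is the real data
phi.hat <- estimate.ar1(x)          # step 1: estimate from the data
tilde.phi <- replicate(500,         # steps 2-3: simulate and re-estimate
                       estimate.ar1(simulate.ar1(n=length(x), phi=phi.hat)))
se.phi <- sd(tilde.phi)             # step 4: bootstrap standard error
```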
Example: lynxes in Canada
- Annual time series of the number of lynxes trapped for fur in the Mackenzie River area of Canada:
Fitting an AR model to the example data
- This oscillates, so an AR(1) isn’t appropriate — try an AR(2)
## , , 1
##
## [,1]
## [1,] 1.152423
## [2,] -0.606229
## [1] 710.1056
so the estimated model is \[
X(t) = 710.1056 + 1.152423 X(t-1) - 0.606229 X(t-2) + \epsilon(t)
\]
Simulating from the fitted model
- What do some time series from this look like?
- (We wrote ar2.sim last time)
lynx.sims <- replicate(5, ar2.sim(n=length(lynx),
                                  a=lynx.ar2$x.intercept,
                                  b=lynx.ar2$ar,
                                  sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
                                  x.start=lynx[1:2]))
plot(lynx, xlab="Year", ylab="Lynxes caught", lwd=2)
for (i in 1:ncol(lynx.sims)) {
lines(1821:1934, lynx.sims[,i], col="grey", lty=i)
}
abline(h=0, col="blue", lwd=0.5)
Uncertainty in the coefficients
- Write a function which estimates the coefficients
Check that this gives the right answer on the data, and a different answer on a simulation:
## a b1 b2
## 710.105589 1.152423 -0.606229
## a b1 b2
## 648.795789 1.182618 -0.592997
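The estimating function itself isn't shown on these slides; a sketch of what it might look like (the name `ar2.estimator` is chosen here, and least squares stands in for whatever fitting routine the lecture used):

```r
# Estimate AR(2) coefficients by regressing X(t) on X(t-1) and X(t-2)
ar2.estimator <- function(x) {
  n <- length(x)
  fit <- lm(x[3:n] ~ x[2:(n-1)] + x[1:(n-2)])
  est <- coef(fit)
  names(est) <- c("a", "b1", "b2")  # match the slides' naming
  return(est)
}

est <- ar2.estimator(as.vector(lynx))  # should roughly match the output above
```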
Uncertainty in the coefficients
Now do lots of simulations, and apply this function to each simulation:
## [1] 3 1000
## [,1] [,2] [,3] [,4] [,5]
## a 619.2030763 728.1824888 933.0930750 608.3708893 709.5767510
## b1 1.1859839 1.1591091 1.1406471 1.2823124 1.1297578
## b2 -0.5645652 -0.6534608 -0.6916214 -0.6258483 -0.5969703
Uncertainty in the coefficients
Standard errors \(\approx\) standard deviation across simulations:
## a b1 b2
## 127.16843433 0.07891514 0.07488047
95% confidence intervals \(\approx\) quantiles across simulations:
## a b1 b2
## 2.5% 499.1899 0.9811642 -0.7387225
## 97.5% 989.8220 1.2836256 -0.4517181
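Given the \(3 \times m\) matrix of estimates (one column per simulation), both summaries are one `apply()` each. The matrix name `coefs.sims` is an assumption; a Gaussian stand-in replaces the real bootstrap output here so the sketch is self-contained:

```r
# Stand-in for the real 3 x 1000 matrix of bootstrapped coefficient estimates
coefs.sims <- matrix(rnorm(3*1000, mean=c(710, 1.15, -0.6),
                           sd=c(127, 0.08, 0.075)), nrow=3)
rownames(coefs.sims) <- c("a", "b1", "b2")

ses <- apply(coefs.sims, 1, sd)                               # standard errors
cis <- apply(coefs.sims, 1, quantile, probs=c(0.025, 0.975))  # 95% CIs
```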
For Gaussian AR\((p)\) models…
- … the bootstrapping isn’t needed, because for them the asymptotic theory we worked out before is exact
- Basically: Homework 7, only with even more book-keeping
- But we can use this bootstrap approach for any generative time-series model
- E.g., make \(X(t)\) have a Poisson distribution rather than a Gaussian
- No negative lynx counts, only whole numbers of lynxes…
- All we have to do is swap out the simulation
Prediction
- You can apply these ideas when the “statistic” is \(X(t)\) for a fixed \(t\)
- Or when it’s \(X(t) | X(t-1) = x\) for a fixed \(x\)
- Since the model tells us how to calculate that!
- Expected values across simulations = best guess for the point prediction
- Variances and intervals incorporate parameter uncertainty
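A sketch of prediction by simulation: on each run, draw the AR(2) coefficients from (an approximation to) their bootstrap distribution, then simulate one step ahead. All the numbers below (coefficient moments, starting values, innovation sd) are illustrative, not the real lynx fits:

```r
# One-step-ahead prediction incorporating parameter + innovation uncertainty
one.step <- function() {
  # draw coefficients from a Gaussian approximation to their bootstrap dist.
  a <- rnorm(1, 710, 127); b1 <- rnorm(1, 1.15, 0.08); b2 <- rnorm(1, -0.61, 0.075)
  x.prev <- 3000; x.prev2 <- 2500   # illustrative last two observations
  a + b1*x.prev + b2*x.prev2 + rnorm(1, sd=1100)  # made-up innovation sd
}
preds <- replicate(1000, one.step())
point.pred <- mean(preds)                     # point prediction
pred.int <- quantile(preds, c(0.025, 0.975))  # prediction interval
```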
Resampling
- Problem: Suppose we don’t have a trustworthy model
- Resource: We do have data, which tells us a lot about the distribution
- Solution: Resampling, treat the sample like a whole population
The Resampling (“Nonparametric”) Bootstrap
- Get data \(X\), containing \(n\) samples, find \(T=\tau(X)\)
- Fix a block length \(k\) and divide \(X\) into (overlapping) blocks of length \(k\)
- So first block is \((X(1), \ldots X(k))\), second block is \((X(2), \ldots X(k+1))\), down to block \(n-k+1\), which is \((X(n-k+1), X(n-k+2), \ldots X(n))\)
- Repeat \(m\) times:
- Generate \(\tilde{X}\) by drawing \(n/k\) blocks from \(X\) with replacement (resample the data)
- Concatenate the blocks to make \(\tilde{X}\)
- Calculate \(\tilde{T} = \tau(\tilde{X})\) (treat simulated data the same as real data)
- Use the simulated distribution of \(\tilde{T}\) as \(P_{T,\theta_0}\)
Some useful code for block resampling
# Convert a time series into a data frame of lagged values
# Inputs: a time series (ts)
# maximum lag to use (order)
# whether older values go on the right or the left (right.older)
# Output: a data frame with (order+1) columns, named lag0, lag1, ... , and
# length(ts)-order rows
design.matrix.from.ts <- function(ts,order,right.older=TRUE) {
n <- length(ts)
x <- ts[(order+1):n]
for (lag in 1:order) {
if (right.older) {
x <- cbind(x,ts[(order+1-lag):(n-lag)])
} else {
x <- cbind(ts[(order+1-lag):(n-lag)],x)
}
}
lag.names <- c("lag0",paste("lag",1:order,sep=""))
if (right.older) {
colnames(x) <- lag.names
} else {
colnames(x) <- rev(lag.names)
}
return(as.data.frame(x))
}
Example: Back to the lynxes
## [1] 269 321 585 871 1475 2821
## lag2 lag1 lag0
## 1 269 321 585
## 2 321 585 871
## 3 585 871 1475
## 4 871 1475 2821
## 5 1475 2821 3928
## 6 2821 3928 5943
Example: Back to the lynxes
# Simple block bootstrap
# Inputs: time series (ts), block length, length of output
# Output: one resampled time series
# Presumes: ts is a univariate time series
rblockboot <- function(ts,block.length,len.out=length(ts)) {
# chop up ts into blocks
the.blocks <- as.matrix(design.matrix.from.ts(ts,block.length-1,
right.older=FALSE))
# look carefully at design.matrix.from.ts to see why we need the -1
# How many blocks is that?
blocks.in.ts <- nrow(the.blocks)
# Sanity-check
stopifnot(blocks.in.ts == length(ts) - block.length+1)
# How many blocks will we need (round up)?
blocks.needed <- ceiling(len.out/block.length)
# Sample blocks with replacement
picked.blocks <- sample(1:blocks.in.ts,size=blocks.needed,replace=TRUE)
# put the blocks in the randomly-selected order
x <- the.blocks[picked.blocks,]
# convert from a matrix to a vector and return
# need transpose because R goes from vectors to matrices and back column by
# column, not row by row
x.vec <- as.vector(t(x))
# Silently discard unneeded extra observations at the end
return(x.vec[1:len.out])
}
## [1] 269 321 585 871 1475 2821
## [1] 3409 1824 409 674 81 80
Resampling the lynxes
- We can do everything the same as before, just changing how we simulate
## a b1 b2
## 209.95314712 0.09850266 0.07904893
- 95% confidence intervals:
## a b1 b2
## 2.5% 633.6568 0.3762097 -0.36009610
## 97.5% 1453.3612 0.7614678 -0.05949003
Why should this work?
- Preserves the distribution within blocks
- Independence across blocks
- So \(\tilde{X} \approx\) stationary, with the same short-range distribution as \(X\) but no longer-range dependence
- Should work well if \(X\) is stationary without long-range dependence
- Want to let \(k\) grow with \(n\), but still have a growing number of blocks, so \(n/k_n \rightarrow \infty\), or \(k=o(n)\)
- Some (complicated) theory suggests \(k \propto n^{1/3}\)
- Politis and White (2004) provides an automatic procedure for picking \(k\)
- Estimates bias (from too-short blocks) and variance (from too-few long blocks) and balances
- R implementation at http://www.math.ucsd.edu/~politis/SOFT/PPW/ppw.R
- …which suggests a block-length of 3 for the lynxes
What about non-stationary time series?
- Model-based: OK if the model is non-stationary
- Otherwise:
- Estimate the trend
- Bootstrap the deviations from the trend (by model or resampling)
- Add these back to the trend
- \(\tilde{X} =\) trend + bootstrapped deviations
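A sketch of those steps, using a loess smoother for the trend and a simple block resampler for the deviations (the lecture's `rblockboot` would do as well; the function name `bootstrap.around.trend` is an assumption):

```r
# Detrend, block-resample the deviations, and add the trend back
bootstrap.around.trend <- function(x, block.length=3) {
  t.idx <- seq_along(x)
  trend <- fitted(loess(x ~ t.idx))  # estimated trend
  devs <- x - trend                  # deviations from the trend
  # resample blocks of deviations with replacement
  n.blocks <- ceiling(length(x)/block.length)
  starts <- sample(1:(length(x)-block.length+1), n.blocks, replace=TRUE)
  resampled <- unlist(lapply(starts, function(s) devs[s:(s+block.length-1)]))
  trend + resampled[1:length(x)]     # X-tilde = trend + bootstrapped deviations
}

x.tilde <- bootstrap.around.trend(as.vector(lynx))
```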
What about spatial data?
- Model-based: simulate the model
- Resampling: use spatial blocks (e.g., rectangles)
- Levina and Bickel (2006) suggest that the best blocks aren’t rectangles but rectangles with a notch cut out of one corner
Sources of Error in Bootstrapping
- Simulation error: using only \(m\) bootstrap replicates
- Make this small by letting \(m\rightarrow\infty\)
- Costs computing time
- Diminishing returns: error is generally \(O(1/\sqrt{m})\)
- Approximation error: using \(\hat{P}\) instead of \(P_{X,\theta_0}\)
- Make this small by careful statistical modeling
- Estimation error: only a finite number of samples \(n\)
- Make this small by being careful about what we simulate (example in backup slides)
- Generally: for fixed \(n\), the resampling bootstrap shows more uncertainty than the model-based bootstrap, but is less at risk from modeling mistakes (yet another bias-variance tradeoff)
Summary
- Bootstrapping is applying the Monte Carlo idea to the sampling distribution
- Find a good approximation to the data-generating distribution
- Simulate it many times
- Treat each simulation run just like the original data
- Distribution over simulations \(\approx\) sampling distribution
- Choice of how to simulate:
- Based on a model
- By resampling blocks
- Many, many refinements, but this is the core of it
Backup: Some General Ways of Improving the Bootstrap
- Improving the initial estimate \(\hat{P}\)
- Model-based vs. resampling is a bias/variance trade-off here
- Reducing the number of simulations or speeding them up
- E.g., if we’re estimating a mean, we’d like to get pairs of negatively correlated samples; there are tricks for doing this (“antithetic sampling”)
- Transforming \(\tau\) so the final approximation is more stable
- The distribution of \(\tilde{\theta}\) approaches that of \(\hat{\theta}\) as \(n\) grows
- But often the distribution of \(\tilde{\theta}-\hat{\theta}\) approaches that of \(\hat{\theta}-\theta_0\) more quickly (“pivoting”; see below)
- And the distribution of \(\frac{\tilde{\theta}-\hat{\theta}}{\sqrt{\Var{\tilde{\theta}}}}\) is an even better approximation to the distribution of \(\frac{\hat{\theta}-\theta_0}{\sqrt{\Var{\hat{\theta}}}}\) (“studentizing”)
Backup: The “surrogate data” method
- You think that \(X\) comes from some interesting class of processes \(\mathcal{H}_1\)
- Your skeptical friend thinks maybe it’s just some boring process in \(\mathcal{H}_0\)
- Fit a model to \(X\) using only the boring processes in \(\mathcal{H}_0\), say \(\hat{Q}\)
- Now simulate \(\hat{Q}\) to get many “surrogate data” sets \(\tilde{X}\)
- Look at the distribution of \(\tau(\tilde{X})\) over your simulations
- If \(\tau(X)\) is really unlikely under that simulation, it’s evidence against \(\mathcal{H}_0\)
- When would it be evidence for \(\mathcal{H}_1\)?
- This is a test of the hypothesis \(\mathcal{H}_0\) against the alternative \(\mathcal{H}_1\)
- The term “surrogate data” comes out of physics (Theiler et al. 1992)
- testing the null of “this is a linear stochastic process” vs. “this is a deterministic, but chaotic, nonlinear dynamical system”
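A sketch of the surrogate-data test with stand-ins: here \(\mathcal{H}_0\) is i.i.d. Gaussian noise (a deliberately boring null), the statistic \(\tau\) is the lag-1 autocorrelation, and the "interesting" data really come from an AR(1), so the test should reject:

```r
# Surrogate-data test: simulate the fitted boring model, compare tau(X)
# to the distribution of tau over the surrogates
tau <- function(x) { cor(x[-1], x[-length(x)]) }   # lag-1 autocorrelation

x <- as.vector(arima.sim(model=list(ar=0.7), n=200))  # actually dependent
# Fit H0 (i.i.d. Gaussian) to x, then draw 1000 surrogate data sets from it
surrogates <- replicate(1000, rnorm(length(x), mean=mean(x), sd=sd(x)))
tau.null <- apply(surrogates, 2, tau)              # tau on each surrogate
p.value <- mean(abs(tau.null) >= abs(tau(x)))      # two-sided p-value
```

Since the data are strongly autocorrelated and the surrogates are not, `p.value` comes out essentially zero, evidence against \(\mathcal{H}_0\).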
Backup: Incorporating uncertainty in the initial conditions
- Simulating a dynamical model needs an initial value \(X(0)\) (or \(X(1)\), or maybe \(X(1), X(2)\), etc.)
- What if this isn’t known precisely?
- Easy case: it follows a probability distribution
- Then draw \(X(0)\) from that distribution independently on each run
- Aggregate across runs as usual
- Less easy: \(X(0)\) within known limits but no distribution otherwise
- My advice: draw \(X(0)\) uniformly from inside the region
- Rationale 1: Maximum entropy distribution compatible with constraints
- Rationale 2: If the process is rapidly mixing, memory of the distribution of \(X(0)\) decays quickly anyway
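A sketch of the advice above for an AR(1), with made-up limits on \(X(0)\): draw the initial condition uniformly on each run, then aggregate across runs as usual.

```r
# Simulate an AR(1) with X(0) drawn uniformly within known (made-up) limits
simulate.with.uncertain.start <- function(n, phi=0.6, x0.limits=c(-2, 2)) {
  x <- numeric(n)
  x[1] <- runif(1, x0.limits[1], x0.limits[2])  # random initial condition
  for (t in 2:n) { x[t] <- phi*x[t-1] + rnorm(1) }
  x
}

runs <- replicate(200, simulate.with.uncertain.start(n=50))
path.means <- rowMeans(runs)  # aggregate across runs, e.g. pointwise means
```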
Backup: Improving on the Crude Confidence Interval
- Crude CIs use the distribution of \(\tilde{\theta}\) under \(\hat{\theta}\)
- But really we want the distribution of \(\hat{\theta}\) under \(\theta_0\)
- Mathematical observation: generally, the distribution of \(\tilde{\theta} - \hat{\theta}\) is closer to that of \(\hat{\theta}-\theta_0\) than the distribution of \(\tilde{\theta}\) is to that of \(\hat{\theta}\)
- (“centering” or “pivoting”)
Backup: The Basic, Pivotal CI
Let \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) be the \(\alpha/2\) and \(1-\alpha/2\) quantiles of \(\tilde{\theta}\):
\[
\begin{eqnarray*}
1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\
& = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\
& \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\
& = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\
& = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0}
\end{eqnarray*}
\]
Basically: re-center the simulations around the empirical data
## a b1 b2
## -33.149997 1.543377 -1.152968
## a b1 b2
## 786.554385 1.928635 -0.852362
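The algebra above in code, as a sketch (`pivotal.ci` is a name chosen here, not the lecture's): the interval endpoints are \(2\hat{\theta}\) minus the upper and lower bootstrap quantiles.

```r
# Basic (pivotal) bootstrap CI: 2*theta.hat - (upper, lower quantiles)
pivotal.ci <- function(tilde.thetas, theta.hat, alpha=0.05) {
  q <- quantile(tilde.thetas, c(alpha/2, 1-alpha/2))
  c(lower = 2*theta.hat - q[[2]], upper = 2*theta.hat - q[[1]])
}

# Usage with stand-in bootstrap draws centered on theta.hat = 5
tilde <- rnorm(1000, mean=5)
ci <- pivotal.ci(tilde, theta.hat=5)
```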
Backup: Further reading
- Bootstrap begins with Efron (1979) (drawing on an older idea from the 1940s called the “jackknife”)
- Block bootstrap for time series comes from Künsch (1989)
- Bühlmann (2002) is a good review on time series bootstraps
- Davison and Hinkley (1997) is the best over-all textbook on the bootstrap
- Lahiri (2003) is very theoretical
References
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge, England: Cambridge University Press.
Lahiri, S. N. 2003. Resampling Methods for Dependent Data. New York: Springer-Verlag.
Theiler, James, Bryan Galdrikian, André Longtin, Stephen Eubank, and J. Doyne Farmer. 1992. “Using Surrogate Data to Detect Nonlinearity in Time Series.” In Nonlinear Modeling and Forecasting: Proceedings of the Workshop on Nonlinear Modeling and Forecasting Held September, 1990 in Santa Fe, New Mexico, edited by Martin Casdagli and Stephen Eubank. Reading, Massachusetts: Addison-Wesley.