Simulation for Inference I — The Bootstrap

36-467/667

22 October 2020 (Lecture 16)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Probwrt}[2]{\mathbb{P}_{#2}\left( #1 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]

In our last episode

Agenda

Statistical Uncertainty

Measures of Uncertainty

  1. Either the true parameter is in the confidence region
  2. or we are very unlucky
    • (Something happened whose probability was very low under all parameter values)
  3. or our model is wrong

The Sampling Distribution Is the Source of All Knowledge

The Problems

  1. Most of the time, \(P_{X,\theta_0}\) is very complicated
  2. Most of the time, \(\tau\) is a very complicated function
  3. We certainly don’t know \(\theta_0\)

The Solutions

The Bootstrap Principle

  1. Find a good estimate \(\hat{P}\) for \(P_{X,\theta_0}\)
  2. Generate a simulation \(\tilde{X}\) from \(\hat{P}\); set \(\tilde{T} = \tau(\tilde{X})\); repeat
  3. Use the simulated distribution of the \(\tilde{T}\) to approximate \(P_{T,\theta_0}\)
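In code, the whole principle fits in a few lines. A minimal sketch, where `statistic` stands for \(\tau\) and `simulator` for one draw from \(\hat{P}\) (both names are placeholders, not from the slides):

# Generic bootstrap skeleton
# Inputs: function computing the statistic tau from a data set (statistic)
#   function returning one simulated data set from P-hat (simulator)
#   number of bootstrap replicates (B)
# Output: the B simulated statistics (a vector, or a matrix with B columns
#   if the statistic is vector-valued)
rboot <- function(statistic, simulator, B) {
  replicate(B, statistic(simulator()))
}

Everything that follows instantiates this skeleton with different choices of simulator.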

Model-based Bootstrap

If we are using a model, our best guess at \(P_{X,\theta_0}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters

The Model-based Bootstrap

Example: lynxes in Canada

data(lynx)
plot(lynx, xlab="Year", ylab="Lynxes", lwd=2)

Fitting an AR model to the example data

lynx.ar2 <- ar.ols(lynx, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
lynx.ar2$ar
## , , 1
## 
##           [,1]
## [1,]  1.152423
## [2,] -0.606229
lynx.ar2$x.intercept
## [1] 710.1056

so the estimated model is \[ X(t) = 710.1056 + 1.152423 X(t-1) - 0.606229 X(t-2) + \epsilon(t) \]

Simulating from the fitted model
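The code below relies on a helper function ar2.sim that isn't printed in these slides. A minimal sketch consistent with how it is called below (the argument names match those calls; the real version may differ in details):

# Simulate an AR(2) process with Gaussian innovations
# (a sketch of the helper the slides rely on, not the slides' own code)
# Inputs: length of the simulated series (n)
#   intercept (a)
#   vector of the two AR coefficients (b)
#   standard deviation of the innovations (sd.innovation)
#   two initial values, oldest first (x.start)
# Output: vector of n simulated values, beginning with x.start
ar2.sim <- function(n, a, b, sd.innovation, x.start) {
  x <- numeric(n)
  x[1:2] <- x.start
  for (t in 3:n) {
    x[t] <- a + b[1]*x[t-1] + b[2]*x[t-2] + rnorm(1, sd=sd.innovation)
  }
  return(x)
}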

lynx.sims <- replicate(5, ar2.sim(n=length(lynx),
                                   a=lynx.ar2$x.intercept,
                                   b=lynx.ar2$ar,
                                   sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
                                   x.start=lynx[1:2]))
plot(lynx, xlab="Year", ylab="Lynxes caught", lwd=2)
for (i in 1:ncol(lynx.sims)) {
    lines(1821:1934, lynx.sims[,i], col="grey", lty=i)
}
abline(h=0, col="blue", lwd=0.5)

Uncertainty in the coefficients

# Estimate the coefficients of an AR(2) model
# Inputs: a time series (x)
# Outputs: vector containing intercept and 2 AR slopes
ar2.estimator <- function(x) {
    # Treat the input exactly the same way we treated the real data
    fit <- ar.ols(x, order.max=2, aic=FALSE, demean=FALSE, intercept=TRUE)
    return(c(a=fit$x.intercept, b1=fit$ar[1], b2=fit$ar[2]))
}

Check that this gives the right answer on the data, and a different answer on a simulation:

ar2.estimator(lynx)
##          a         b1         b2 
## 710.105589   1.152423  -0.606229
ar2.estimator(lynx.sims[,1])
##          a         b1         b2 
## 648.795789   1.182618  -0.592997

Uncertainty in the coefficients

Now do lots of simulations, and apply this function to each simulation:

lynx.coef.sims <- replicate(1000, ar2.estimator( ar2.sim(n=length(lynx),
                                   a=lynx.ar2$x.intercept,
                                   b=lynx.ar2$ar,
                                   sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
                                   x.start=lynx[1:2])))
dim(lynx.coef.sims)
## [1]    3 1000
lynx.coef.sims[1:3, 1:5]
##           [,1]        [,2]        [,3]        [,4]        [,5]
## a  619.2030763 728.1824888 933.0930750 608.3708893 709.5767510
## b1   1.1859839   1.1591091   1.1406471   1.2823124   1.1297578
## b2  -0.5645652  -0.6534608  -0.6916214  -0.6258483  -0.5969703

Uncertainty in the coefficients

Standard errors \(\approx\) standard deviation across simulations:

apply(lynx.coef.sims, 1, sd)
##            a           b1           b2 
## 127.16843433   0.07891514   0.07488047

95% confidence intervals \(\approx\) quantiles across simulations:

apply(lynx.coef.sims, 1, quantile, prob=c(0.025, 0.975))
##              a        b1         b2
## 2.5%  499.1899 0.9811642 -0.7387225
## 97.5% 989.8220 1.2836256 -0.4517181

For Gaussian AR\((p)\) models…
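For Gaussian AR\((p)\) models, asymptotic theory gives closed-form standard errors for the coefficient estimates, which we can check the bootstrap against; ar.ols reports them as the asy.se.coef component of the fit:

# Asymptotic-theory standard errors of the coefficient estimates, for
# comparison with the bootstrap standard errors above
lynx.ar2$asy.se.coef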

Prediction
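One use of the fitted model is prediction: simulate many continuations of the observed series and read prediction intervals off the quantiles at each horizon. A sketch reusing ar2.sim; the horizon h=10 and the details here are a reconstruction for illustration, not the slides' own code:

# Simulate 10 steps past the end of the data, starting each trajectory from
# the last two observations, then drop the two starting values
h <- 10
lynx.paths <- replicate(1000,
    ar2.sim(n=h+2, a=lynx.ar2$x.intercept, b=lynx.ar2$ar,
            sd.innovation=sd(lynx.ar2$resid, na.rm=TRUE),
            x.start=tail(lynx, 2))[-(1:2)])
# 95% prediction limits: 2 rows (lower, upper) by h columns (horizons)
apply(lynx.paths, 1, quantile, prob=c(0.025, 0.975))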

Resampling

The Resampling (“Nonparametric”) Bootstrap

Some useful code for block resampling

# Convert a time series into a data frame of lagged values
# Inputs: a time series (ts)
  # maximum lag to use (order)
  # whether older values go on the right or the left (right.older)
# Output: a data frame with (order+1) columns, named lag0, lag1, ... , and
  # length(ts)-order rows
design.matrix.from.ts <- function(ts,order,right.older=TRUE) {
  n <- length(ts)
  x <- ts[(order+1):n]
  for (lag in 1:order) {
    if (right.older) {
      x <- cbind(x,ts[(order+1-lag):(n-lag)])
    } else {
      x <- cbind(ts[(order+1-lag):(n-lag)],x)
    }
  }
  lag.names <- c("lag0",paste("lag",1:order,sep=""))
  if (right.older) {
    colnames(x) <- lag.names
  } else {
    colnames(x) <- rev(lag.names)
  }
  return(as.data.frame(x))
}

Example: Back to the lynxes

head(lynx)
## [1]  269  321  585  871 1475 2821
head(design.matrix.from.ts(lynx,2, right.older=FALSE))
##   lag2 lag1 lag0
## 1  269  321  585
## 2  321  585  871
## 3  585  871 1475
## 4  871 1475 2821
## 5 1475 2821 3928
## 6 2821 3928 5943

Example: Back to the lynxes

# Simple block bootstrap
# Inputs: time series (ts), block length, length of output
# Output: one resampled time series
# Presumes: ts is a univariate time series
rblockboot <- function(ts,block.length,len.out=length(ts)) {
  # chop up ts into blocks
  the.blocks <- as.matrix(design.matrix.from.ts(ts,block.length-1,
    right.older=FALSE))
    # look carefully at design.matrix.from.ts to see why we need the -1
  # How many blocks is that?
  blocks.in.ts <- nrow(the.blocks)
  # Sanity-check
  stopifnot(blocks.in.ts == length(ts) - block.length+1)
  # How many blocks will we need (round up)?
  blocks.needed <- ceiling(len.out/block.length)
  # Sample blocks with replacement
  picked.blocks <- sample(1:blocks.in.ts,size=blocks.needed,replace=TRUE)
  # put the blocks in the randomly-selected order
  x <- the.blocks[picked.blocks,]
  # convert from a matrix to a vector and return
    # need transpose because R goes from vectors to matrices and back column by
    # column, not row by row
  x.vec <- as.vector(t(x))
    # Discard unneeded extra observations at the end silently
  return(x.vec[1:len.out])
}
head(lynx)
## [1]  269  321  585  871 1475 2821
head(rblockboot(lynx, 3))
## [1] 3409 1824  409  674   81   80

Resampling the lynxes

lynx.coef.sims.resamp <- replicate(1000, ar2.estimator(rblockboot(lynx, 3)))
apply(lynx.coef.sims.resamp, 1, sd)
##            a           b1           b2 
## 209.95314712   0.09850266   0.07904893
apply(lynx.coef.sims.resamp, 1, quantile, prob=c(0.025, 0.975))
##               a        b1          b2
## 2.5%   633.6568 0.3762097 -0.36009610
## 97.5% 1453.3612 0.7614678 -0.05949003

Why should this work?

What about non-stationary time series?

What about spatial data?

Sources of Error in Bootstrapping

Summary

Backup: Some General Ways of Improving the Bootstrap

Backup: The “surrogate data” method
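Following Theiler et al. (1992): simulate “surrogate” series from a model embodying the null hypothesis, compute a discriminating statistic on the data and on each surrogate, and reject the null if the data's statistic falls in the tails of the surrogate distribution. A generic sketch; my.stat and simulator are placeholders, not anything from the slides:

# Generic surrogate-data test
# Inputs: the observed series (ts), discriminating statistic (my.stat),
#   function simulating one series under the null hypothesis (simulator),
#   number of surrogates (B)
# Output: two-sided p-value for the null hypothesis
surrogate.test <- function(ts, my.stat, simulator, B=1000) {
  t.data <- my.stat(ts)                         # statistic on the real data
  t.sims <- replicate(B, my.stat(simulator()))  # statistic on each surrogate
  mean(abs(t.sims - mean(t.sims)) >= abs(t.data - mean(t.sims)))
}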

Backup: Incorporating uncertainty in the initial conditions

Backup: Improving on the Crude Confidence Interval

Backup: The Basic, Pivotal CI

Let \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\) be the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the bootstrap re-estimates \(\tilde{\theta}\). Then

\[ \begin{eqnarray*} 1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\ & = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\ & \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\ & = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\ & = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0} \end{eqnarray*} \]

Basically: re-center the simulations around the empirical data

2*ar2.estimator(lynx)-apply(lynx.coef.sims.resamp, 1, quantile, prob=0.975)
##          a         b1         b2 
## -33.149997   1.543377  -1.152968
2*ar2.estimator(lynx)-apply(lynx.coef.sims.resamp, 1, quantile, prob=0.025)
##          a         b1         b2 
## 786.554385   1.928635  -0.852362
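Wrapped up as a convenience function (a sketch, not from the slides):

# Basic/pivotal bootstrap CI: re-center the simulated quantiles around the
# estimate from the real data
# Inputs: estimate on the data (t.hat), simulated estimates with one column
#   per simulation (t.sims), confidence level (level)
# Output: matrix with lower and upper limits for each coordinate
pivotal.CI <- function(t.hat, t.sims, level=0.95) {
  alpha <- 1 - level
  q <- apply(t.sims, 1, quantile, prob=c(1-alpha/2, alpha/2))
  rbind(lower = 2*t.hat - q[1,], upper = 2*t.hat - q[2,])
}
pivotal.CI(ar2.estimator(lynx), lynx.coef.sims.resamp)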

Backup: Further reading

References

Bühlmann, Peter. 2002. “Bootstraps for Time Series.” Statistical Science 17:52–72. https://doi.org/10.1214/ss/1023798998.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge, England: Cambridge University Press.

Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.” Annals of Statistics 7:1–26. https://doi.org/10.1214/aos/1176344552.

Künsch, Hans R. 1989. “The Jackknife and the Bootstrap for General Stationary Observations.” Annals of Statistics 17:1217–41. https://doi.org/10.1214/aos/1176347265.

Lahiri, S. N. 2003. Resampling Methods for Dependent Data. New York: Springer-Verlag.

Levina, Elizaveta, and Peter J. Bickel. 2006. “Texture Synthesis and Nonparametric Resampling of Random Fields.” Annals of Statistics 34:1751–73. http://projecteuclid.org/euclid.aos/1162567632.

Politis, Dimitris N., and Halbert White. 2004. “Automatic Block-Length Selection for the Dependent Bootstrap.” Econometric Reviews 23:53–70. https://doi.org/10.1081/ETC-120028836.

Theiler, James, Bryan Galdrikian, André Longtin, Stephen Eubank, and J. Doyne Farmer. 1992. “Using Surrogate Data to Detect Nonlinearity in Time Series.” In Nonlinear Modeling and Forecasting: Proceedings of the Workshop on Nonlinear Modeling and Forecasting Held September, 1990 in Santa Fe, New Mexico, edited by Martin Casdagli and Stephen Eubank. Reading, Massachusetts: Addison-Wesley.