36-402, Section A
5 February 2019
\[ \newcommand{\Expect}[1]{\mathbf{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)} \newcommand{\Probwrt}[2]{\mathrm{Pr}_{#2}\left( #1 \right)} \]
Classically (\(\approx 1900\)–\(\approx 1975\)): Restrict the model and the statistic until you can calculate the sampling distribution, at least for very large \(n\)
Modern (\(\approx 1975\)–): Use complex models and statistics, but simulate calculating the statistic on the model
If we are using a model, our best guess at \(P_{X}\) is \(P_{X,\hat{\theta}}\), with our best estimate \(\hat{\theta}\) of the parameters
## Sex Bwt Hwt
## F:47 Min. :2.000 Min. : 6.30
## M:97 1st Qu.:2.300 1st Qu.: 8.95
## Median :2.700 Median :10.10
## Mean :2.724 Mean :10.63
## 3rd Qu.:3.025 3rd Qu.:12.12
## Max. :3.900 Max. :20.50
## [1] 3.521869
Simulate from fitted Gaussian; bundle up estimating 95th percentile into a function
Simulate, plot the sampling distribution from the simulations
sampling.dist.gaussian <- replicate(1000, est.q95.gaussian(rcats.gaussian()))
plot(hist(sampling.dist.gaussian,breaks=50, plot=FALSE), freq=FALSE)
lines(density(sampling.dist.gaussian),lwd=2)
abline(v=q95.gaussian,lty="dashed",lwd=4)
Find standard error and a (crude) confidence interval
## [1] 0.0626791
## 2.5% 97.5%
## 3.408126 3.648369
As always, if the model isn’t right, relying on the model is asking for trouble
Is the Gaussian a good model for cats’ weights?
Compare histogram to fitted Gaussian density and to a smooth density estimate
Difficulty: We might not have a trust-worthy model
Resource: We do have data, which tells us a lot about the distribution
Solution: Resampling, treat the sample like a whole population
Model-free estimate of the 95th percentile = 95th percentile of the data
## 95%
## 3.6
How precise is that?
Resampling, re-estimating, and finding sampling distribution
sampling.dist.np <- replicate(1000, est.q95.np(resample(cats$Bwt)))
plot(density(sampling.dist.np), main="", xlab="Body weight (kg)")
abline(v=q95.np,lty=2)
standard error, bias, CI
## [1] 0.07920881
## [1] -0.023945
## 2.5% 97.5%
## 3.400 3.785
cats
has weights for cats’ hearts, as well as bodies
(Much cuter than any photo of real cats’ hearts)
How does heart weight relate to body weight?
(Useful if Kara’s vet wants to know how much heart medicine to prescribe)
plot(Hwt~Bwt, data=cats, xlab="Body weight (kg)", ylab="Heart weight (g)")
cats.lm <- lm(Hwt ~ Bwt, data=cats)
abline(cats.lm)
## (Intercept) Bwt
## -0.3566624 4.0340627
## 2.5 % 97.5 %
## (Intercept) -1.725163 1.011838
## Bwt 3.539343 4.528782
The residuals don’t look very Gaussian:
Resample residuals:
sim.cats.resids <- function() {
new.cats <- cats
noise <- resample(residuals(cats.lm))
new.cats$Hwt <- fitted(cats.lm) + noise
return(new.cats)
}
Re-estimate on new data:
Re-sample to get CIs:
cats.lm.samp.dist.resids <- replicate(1000,
coefs.cats.lm(sim.cats.resids()))
t(apply(cats.lm.samp.dist.resids,1,quantile,c(0.025,0.975)))
## 2.5% 97.5%
## (Intercept) -1.716803 0.9534706
## Bwt 3.569200 4.5458607
Try resampling whole rows:
cats.lm.samp.dist.cases <- replicate(1000,
coefs.cats.lm(resample.data.frame(cats)))
t(apply(cats.lm.samp.dist.cases,1,quantile,c(0.025,0.975)))
## 2.5% 97.5%
## (Intercept) -2.052301 1.196093
## Bwt 3.410005 4.693054
## 2.5 % 97.5 %
## (Intercept) -1.725163 1.011838
## Bwt 3.539343 4.528782
## 2.5% 97.5%
## (Intercept) -1.716803 0.9534706
## Bwt 3.569200 4.5458607
## 2.5% 97.5%
## (Intercept) -2.052301 1.196093
## Bwt 3.410005 4.693054
Crude CI use distribution of \(\tilde{\theta}\) under \(\hat{\theta}\)
But really we want the distribution of \(\hat{\theta}\) under \(\theta\)
Mathematical observation: Generally speaking, (distributions of) \(\tilde{\theta} - \hat{\theta}\) is closer to \(\hat{\theta}-\theta_0\) than \(\tilde{\theta}\) is to \(\hat{\theta}\)
(“centering” or “pivoting”)
quantiles of \(\tilde{\theta}\) = \(q_{\alpha/2}, q_{1-\alpha/2}\)
\[ \begin{eqnarray*} 1-\alpha & = & \Probwrt{q_{\alpha/2} \leq \tilde{\theta} \leq q_{1-\alpha/2}}{\hat{\theta}} \\ & = & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \tilde{\theta} - \hat{\theta} \leq q_{1-\alpha/2} - \hat{\theta}}{\hat{\theta}} \\ & \approx & \Probwrt{q_{\alpha/2} - \hat{\theta} \leq \hat{\theta} - \theta_0 \leq q_{1-\alpha/2} - \hat{\theta}}{\theta_0}\\ & = & \Probwrt{q_{\alpha/2} - 2\hat{\theta} \leq -\theta_0 \leq q_{1-\alpha/2} - 2\hat{\theta}}{\theta_0}\\ & = & \Probwrt{2\hat{\theta} - q_{1-\alpha/2} \leq \theta_0 \leq 2\hat{\theta}-q_{\alpha/2}}{\theta_0} \end{eqnarray*} \]
Basically: re-center the simulations around the empirical data