Statistical Computing, 36-350
Wednesday October 21, 2015
Basic debugging tricks:
Better success through design!
Our two goals: detecting errors when they are present, and avoiding false alarms when the code is actually fine
Both are important (don’t forget about the second!)
Statistical hypothesis testing trades off the risk of false alarm (size) against the probability of detection (power); this balances type I and type II errors
Software testing: no false alarms allowed. This reduces our power to detect errors; code can pass all our tests and still be wrong
But! We can direct the power to detect certain errors, including pinpointing where the error lies, if we test small pieces
When we have a version of the code which we are confident gets some cases right, we should keep it around (under a separate name)
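A minimal sketch of the idea (the names est.old and est.new are hypothetical, for illustration only, not from these notes):
# Hypothetical names, for illustration only:
est.old = function(x) { sum(x)/length(x) }   # version we trust on simple cases
est.new = function(x) { mean(x) }            # rewrite under test
x = c(1, 2, 3, 10)
stopifnot(all.equal(est.old(x), est.new(x))) # silent if they agree, error if not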
Step 1: think about the big picture
Step 2: divide-and-conquer mentality
Step 3: assuming you’ve written each part, how would you put them together?
# Not actual code
big.job = function(lots.of.arguments) {
  intermediate.result = first.step(some.of.the.args)
  final.result = second.step(intermediate.result, rest.of.the.args)
  return(final.result)
}
Step 4+: you don’t actually have a working program yet, but you have a good setup
Even if we didn’t start in top-down design mode, once we have some code that (more or less) works, rewrite it to emphasize commonalities:
Refactoring tends to make code look more like the result of top-down design. This is no accident!
mean.jackknife = function(vec) {
  n = length(vec)
  jackknife.ests = vector(length=n)
  for (i in 1:n) {
    # re-estimate the mean with case i held out
    jackknife.ests[i] = mean(vec[-i])
  }
  variance.of.ests = var(jackknife.ests)
  jackknife.var = ((n-1)^2/n)*variance.of.ests
  jackknife.stderr = sqrt(jackknife.var)
  return(jackknife.stderr)
}
some.normals = rnorm(100,mean=7,sd=5)
mean(some.normals)
## [1] 6.237172
se.formula = sd(some.normals)/sqrt(length(some.normals))
se.jackknife = mean.jackknife(some.normals)
max(abs(se.formula-se.jackknife))
## [1] 3.330669e-16
Recall our friend the method of moments estimator: for a gamma distribution with shape a and scale s, the mean is as and the variance is as^2, so a = mean^2/variance and s = variance/mean:
gamma.est = function(the.data) {
  m = mean(the.data)
  v = var(the.data)
  a = m^2/v  # shape estimate
  s = v/m    # scale estimate
  return(c(a=a, s=s))
}
gamma.jackknife = function(vec) {
  n = length(vec)
  jackknife.ests = matrix(NA, nrow=2, ncol=n)
  rownames(jackknife.ests) = c("a","s")
  for (i in 1:n) {
    # re-fit the gamma with case i held out
    fit = gamma.est(vec[-i])
    jackknife.ests["a",i] = fit["a"]
    jackknife.ests["s",i] = fit["s"]
  }
  variance.of.ests = apply(jackknife.ests, 1, var)
  jackknife.vars = ((n-1)^2/n)*variance.of.ests
  jackknife.stderrs = sqrt(jackknife.vars)
  return(jackknife.stderrs)
}
data("cats",package="MASS")
gamma.est(cats$Hwt)
## a s
## 19.0653121 0.5575862
gamma.jackknife(cats$Hwt)
## a s
## 2.74062279 0.07829436
jackknife.lm = function(df, formula, p) {
  n = nrow(df)
  jackknife.ests = matrix(0, nrow=p, ncol=n)
  for (i in 1:n) {
    # re-fit the regression with row i held out
    new.coefs = lm(as.formula(formula), data=df[-i,])$coefficients
    jackknife.ests[,i] = new.coefs
  }
  variance.of.ests = apply(jackknife.ests, 1, var)
  jackknife.var = ((n-1)^2/n)*variance.of.ests
  jackknife.stderr = sqrt(jackknife.var)
  return(jackknife.stderr)
}
cats.lm = lm(Hwt~Bwt,data=cats)
coefficients(cats.lm)
## (Intercept) Bwt
## -0.3566624 4.0340627
# "Official" standard errors
sqrt(diag(vcov(cats.lm)))
## (Intercept) Bwt
## 0.6922770 0.2502615
jackknife.lm(df=cats,formula="Hwt~Bwt",p=2)
## [1] 0.8314142 0.3166847
All our jackknife functions follow the same pattern:
figure out the size of the data
for each case
  repeat some estimation, after omitting that case
  collect all estimates in a vector as you go
take variances across cases
scale up variances
take the square roots
Let’s define a function for the common operation of omitting one case
omit.case = function(dat, i) {
  dims = dim(dat)
  if (is.null(dims) || (length(dims)==1)) {
    return(dat[-i])    # vector: drop element i
  } else {
    return(dat[-i,])   # matrix or data frame: drop row i
  }
}
(Exercise: modify so it also handles higher-dimensional arrays; one possible sketch follows)
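One possible sketch, as an assumption of mine rather than the notes’ solution: TRUE as an index selects everything along a margin, so we can build the equivalent of dat[-i, , , ...] for any number of dimensions. The name omit.case.general is hypothetical.
omit.case.general = function(dat, i) {
  dims = dim(dat)
  if (is.null(dims) || (length(dims)==1)) {
    return(dat[-i])
  }
  # -i along the first margin, TRUE (select all) along every other margin
  indices = c(list(-i), rep(list(TRUE), length(dims)-1))
  return(do.call(`[`, c(list(dat), indices, list(drop=FALSE))))
}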
Let’s write a function for the general jackknife workflow
jackknife = function(estimator, dat) {
  n = ifelse(is.null(dim(dat)), length(dat), nrow(dat))
  # estimate with case i held out
  omit.and.est = function(i) { estimator(omit.case(dat, i)) }
  # force a matrix shape even when the estimator returns a scalar
  jackknife.ests = matrix(sapply(1:n, omit.and.est), ncol=n)
  variance.of.ests = apply(jackknife.ests, 1, var)
  jackknife.var = ((n-1)^2/n)*variance.of.ests
  jackknife.stderr = sqrt(jackknife.var)
  return(jackknife.stderr)
}
Could allow other arguments to estimator, spin off finding n as its own function, etc.
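A sketch of the first suggestion (the name jackknife.dots is hypothetical): pass ... through to the estimator.
jackknife.dots = function(estimator, dat, ...) {
  n = ifelse(is.null(dim(dat)), length(dat), nrow(dat))
  omit.and.est = function(i) { estimator(omit.case(dat, i), ...) }
  jackknife.ests = matrix(sapply(1:n, omit.and.est), ncol=n)
  variance.of.ests = apply(jackknife.ests, 1, var)
  return(sqrt(((n-1)^2/n)*variance.of.ests))
}
# e.g., jackknife standard error of a 10%-trimmed mean:
# jackknife.dots(estimator=mean, dat=some.normals, trim=0.1)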
jackknife(estimator=mean,dat=some.normals)
## [1] 0.5764657
max(abs(jackknife(estimator=mean,dat=some.normals) -
mean.jackknife(some.normals)))
## [1] 0
max(abs(jackknife(estimator=gamma.est,dat=cats$Hwt) -
gamma.jackknife(cats$Hwt)))
## [1] 0
est.coefs = function(dat) {
  # regression coefficients, re-estimated on whatever data we're given
  return(lm(Hwt~Bwt, data=dat)$coefficients)
}
est.coefs(cats)
## (Intercept) Bwt
## -0.3566624 4.0340627
max(abs(est.coefs(cats) - coefficients(cats.lm)))
## [1] 0
jackknife(estimator=est.coefs,dat=cats)
## [1] 0.8314142 0.3166847
max(abs((jackknife(estimator=est.coefs,dat=cats) -
jackknife.lm(df=cats,formula="Hwt~Bwt",p=2))))
## [1] 0
Note: we’ve been cross-checking our new code against our old code the whole time, to make sure it works
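Those cross-checks could themselves be collected into a test function we re-run after every change. A sketch, using the objects defined above (the name test.jackknife is hypothetical):
test.jackknife = function() {
  stopifnot(all.equal(jackknife(estimator=mean, dat=some.normals),
                      mean.jackknife(some.normals)))
  # check.attributes=FALSE because gamma.jackknife returns a named vector
  # while jackknife() returns an unnamed one
  stopifnot(all.equal(jackknife(estimator=gamma.est, dat=cats$Hwt),
                      gamma.jackknife(cats$Hwt), check.attributes=FALSE))
  stopifnot(all.equal(jackknife(estimator=est.coefs, dat=cats),
                      jackknife.lm(df=cats, formula="Hwt~Bwt", p=2)))
  return(TRUE)
}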