36-402, Spring 2025
11 February 2025
\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \]
Plot it, plus the true regression curve
plot(y~x, data=gframe, xlab="x", ylab="y", pch=16, cex=0.5)
curve(log(1+x), col="grey", add=TRUE, lwd=4)
Fit the linear model
##
## Call:
## lm(formula = y ~ x, data = gframe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52 -0.10 0.01 0.11 0.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.172 0.018 9.5 <2e-16
## x 0.445 0.011 42.0 <2e-16
##
## Residual standard error: 0.17 on 298 degrees of freedom
## Multiple R-squared: 0.86, Adjusted R-squared: 0.86
## F-statistic: 1.8e+03 on 1 and 298 DF, p-value: <2e-16
(N.B.: everything is crazily significant, but the linear model is wrong.)
plot(x, yg, xlab="x", ylab="y")
curve(log(1+x), col="grey", add=TRUE, lwd=4)
abline(glinfit, lwd=4)
legend("bottomright", legend=c("Truth", "Linear"),
col=c("grey", "black"), lwd=4)
MSE of linear model, in-sample: 0.0271.
We’ll need to do that a lot, so make it a function:
ms.residuals()
Fit the non-parametric alternative; use splines as implemented in gam()
:
Add the fitted values from the spline to the plot:
plot(x,yg,xlab="x", ylab="y")
curve(log(1+x), col="grey", add=TRUE, lwd=4)
abline(glinfit, lwd=4)
lines(x, fitted(gnpr), col="blue", lwd=4)
legend("bottomright", legend=c("Truth", "Linear", "Spline"),
col=c("grey", "black", "blue"), lwd=4)
Calculate the difference in MSEs:
## [1] 0.004832842
Simulate from the parametric model, assuming Gaussian noise
sim.lm <- function(linfit, test.x) {
n <- length(test.x)
sim.frame <- data.frame(x=test.x)
sigma <- sqrt(mse.residuals(linfit)) # There are other ways to get sigma
y.sim <- predict(linfit, newdata=sim.frame)
y.sim <- y.sim + rnorm(n,0,sigma)
sim.frame <- data.frame(sim.frame, y=y.sim)
return(sim.frame)
}
sim.lm()
sim.lm()
is working?Calculate difference in MSEs between parametric and nonparametric models on a data frame:
calc.D <- function(df) {
MSE.p <- mse.residuals(lm(y~x, data=df))
MSE.np <- mse.residuals(gam(y~s(x), data=df))
return(MSE.p - MSE.np)
}
Calculate the MSE difference on one simulation run:
## [1] 0.0003025062
Calculate the MSE difference on 200 simulation runs, so we get samples from the null-hypothesis distribution:
How often does the simulation produce gaps bigger than what we really saw?
## [1] 0
Plot histogram of the sampling distribution, and the observed value:
hist(null.samples.D,n=31,xlim=c(min(null.samples.D),1.2*D.hat),probability=TRUE,
main="Distribution of D under the null model",
xlab="D")
abline(v=D.hat, col="red")
Deliberately left uncommented so that you can explore
y2 <- 0.2+0.5*x + rnorm(length(x),0,0.15)
# Why don't I have to make up x values again?
y2.frame <- data.frame(x=x,y=y2)
plot(x,y2,xlab="x",ylab="y",pch=16)
abline(0.2,0.5,col="grey",lwd=4)
y2.fit <- lm(y~x,data=y2.frame)
null.samples.D.y2 <- replicate(200,calc.D(sim.lm(y2.fit,x)))
(D.hat2 <- calc.D(y2.frame))
## [1] 2.964261e-13
hist(null.samples.D.y2,n=31,probability=TRUE, main="Distribution of D when the linear model is right", xlab="D")
abline(v=D.hat2, col="red")
## [1] 0.84
\[ Y = \TrueRegFunc(X) + \epsilon, ~ \Expect{\epsilon|X} = 0 \]
par(mfrow=c(1,2))
plot(gam(residuals(glinfit) ~ s(x)),residuals=TRUE,
lwd=4,ylab="Residuals of the linear model", main="Reality nonlinear")
abline(h=0,col="grey")
plot(gam(residuals(y2.fit) ~ s(x)), residuals=TRUE,
lwd=4, ylab="Residuals of the linear model", main="Reality linear")
abline(h=0,col="grey")
summary.lm()
tests the linearity of the modelgam()
s(xj, xk)
fits a 2D (“thin plate”) spline, good if xj
and xk
are on the same scalete(xj, xk)
fits a “tensor-product” spline, good if xj
and xk
are measured on different scalesxj
and xk
, look for patterns