Testing a parametric regression model

36-402, Spring 2025

11 February 2025

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \]

The general problem

You have a parametric regression model for your data
- e.g., linear with such-and-such variables
You are worried that it might be misspecified, that the true \(\TrueRegFunc(x)\) isn’t in the model
Now that we know nonparametric regression, we can test this

First case: linear specification is wrong

# Make up sample data
x <- sort(runif(300, 0, 3)) # Sorting simplifies plotting later
# Impose true regression function $\log{(x+1)} +$ Gaussian noise:
yg <- log(x+1) + rnorm(length(x), 0, 0.15)
# Bind into a data frame:
gframe <- data.frame(x=x, y=yg)

What this looks like

Plot it, plus the true regression curve

plot(y~x, data=gframe, xlab="x", ylab="y", pch=16, cex=0.5)
curve(log(1+x), col="grey", add=TRUE, lwd=4)

What linear fit do we get?

Fit the linear model

glinfit <- lm(y~x, data=gframe)
print(summary(glinfit), signif.stars=FALSE, digits=2)

## 
## Call:
## lm(formula = y ~ x, data = gframe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -0.52  -0.10   0.01   0.11   0.47 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.172      0.018     9.5   <2e-16
## x              0.445      0.011    42.0   <2e-16
## 
## Residual standard error: 0.17 on 298 degrees of freedom
## Multiple R-squared:  0.86,   Adjusted R-squared:  0.86 
## F-statistic: 1.8e+03 on 1 and 298 DF,  p-value: <2e-16

(N.B.: everything is crazily significant, but the linear model is wrong.)

What the mis-specified linear fit looks like

plot(x, yg, xlab="x", ylab="y")
curve(log(1+x), col="grey", add=TRUE, lwd=4)
abline(glinfit, lwd=4)
legend("bottomright", legend=c("Truth", "Linear"),
       col=c("grey", "black"), lwd=4)

MSE of linear model, in-sample: 0.0271.

We’ll need to do that a lot, so make it a function:

mse.residuals <- function(model) { mean(residuals(model)^2) }

OFFLINE: Write comments giving the inputs and outputs of ms.residuals()
OFFLINE: Check that the function works (how?)

Fit the non-parametric alternative; use splines as implemented in gam():

library(mgcv)

gnpr <- gam(y~s(x), data=gframe)

Add the fitted values from the spline to the plot:

plot(x,yg,xlab="x", ylab="y")
curve(log(1+x), col="grey", add=TRUE, lwd=4)
abline(glinfit, lwd=4)
lines(x, fitted(gnpr), col="blue", lwd=4)
legend("bottomright", legend=c("Truth", "Linear", "Spline"),
       col=c("grey", "black", "blue"), lwd=4)

Calculate the difference in MSEs:

(D.hat <- mse.residuals(glinfit) - mse.residuals(gnpr))

## [1] 0.004832842

OFFLINE: Why do we believe this code works?

How do we test the linear model?

\(H_0\): \(\TrueRegFunc(x)\) is a linear function vs. \(H_A\): \(\TrueRegFunc(x)\) is not linear
Test statistic: Difference in MSEs \(=D\)
- If the linear model is right, it will predict at least as well as a non-parametric model
- If the linear model is wrong, eventually the non-parametric model will beat it
Need the sampling distribution of \(D\) under the null hypothesis
- Need to simulate the linear model
- And need to calculate \(D\) over and over

Simulate the parametric model

Simulate from the parametric model, assuming Gaussian noise

sim.lm <- function(linfit, test.x) {
  n <- length(test.x)
  sim.frame <- data.frame(x=test.x)
  sigma <- sqrt(mse.residuals(linfit))  # There are other ways to get sigma
  y.sim <- predict(linfit, newdata=sim.frame)
  y.sim <- y.sim + rnorm(n,0,sigma)
  sim.frame <- data.frame(sim.frame, y=y.sim)
  return(sim.frame)
}

OFFLINE: Write comments describing input and output of sim.lm()
OFFLINE: How can we check that sim.lm() is working?
OFFLINE: How can we switch to non-Gaussian noise?

Calculate the test statistic

Calculate difference in MSEs between parametric and nonparametric models on a data frame:

calc.D <- function(df) {
  MSE.p <- mse.residuals(lm(y~x, data=df))
  MSE.np <- mse.residuals(gam(y~s(x), data=df))
  return(MSE.p - MSE.np)
}

OFFLINE: Comments once again (with feeling this time)
OFFLINE: Check that this function works properly (how?)

Put the pieces together

Calculate the MSE difference on one simulation run:

calc.D(sim.lm(glinfit, x))

## [1] 0.0003025062

Calculate the MSE difference on 200 simulation runs, so we get samples from the null-hypothesis distribution:

null.samples.D <- replicate(200, calc.D(sim.lm(glinfit, x)))

How often does the simulation produce gaps bigger than what we really saw?

mean(null.samples.D >= D.hat)

## [1] 0

Look at the distribution

Plot histogram of the sampling distribution, and the observed value:

hist(null.samples.D,n=31,xlim=c(min(null.samples.D),1.2*D.hat),probability=TRUE,
     main="Distribution of D under the null model",
     xlab="D")
abline(v=D.hat, col="red")

Second case: linear model is properly specified

Deliberately left uncommented so that you can explore

y2 <- 0.2+0.5*x + rnorm(length(x),0,0.15)
  # Why don't I have to make up x values again?
y2.frame <- data.frame(x=x,y=y2)
plot(x,y2,xlab="x",ylab="y",pch=16)
abline(0.2,0.5,col="grey",lwd=4)

y2.fit <- lm(y~x,data=y2.frame)
null.samples.D.y2 <- replicate(200,calc.D(sim.lm(y2.fit,x)))
(D.hat2 <- calc.D(y2.frame))

## [1] 2.964261e-13

hist(null.samples.D.y2,n=31,probability=TRUE, main="Distribution of D when the linear model is right", xlab="D")
abline(v=D.hat2, col="red")

mean(null.samples.D.y2 >= D.hat2)

## [1] 0.84

Extensions/lessons

Don’t need to guess at specific nonlinearities; let the smoother figure it out
Extends to testing a linear model with multiple predictors easily
- Choice of whether the alternative then is an additive model vs. a general nonparametric regression
- Additive: more power to detect additive nonlinearities, can easily miss nonlinear interactions
- General model: sensitive to more mis-specifications but less power over-all (curse of dimensionality again)
Nothing special about linear models — the same idea works for testing any parametric regression model

Another approach, based on residuals

\[ Y = \TrueRegFunc(X) + \epsilon, ~ \Expect{\epsilon|X} = 0 \]

\(\Rightarrow\) if we’ve got the right model, \(\Expect{Y-\TrueRegFunc(X)|X=x} = 0\) for all \(x\)
\(\Rightarrow\) smooth the residuals and see if we get (close to) 0

par(mfrow=c(1,2))
plot(gam(residuals(glinfit) ~ s(x)),residuals=TRUE,
  lwd=4,ylab="Residuals of the linear model", main="Reality nonlinear")
abline(h=0,col="grey")
plot(gam(residuals(y2.fit) ~ s(x)), residuals=TRUE,
     lwd=4, ylab="Residuals of the linear model", main="Reality linear")
abline(h=0,col="grey")

Smoothing residuals, cont’d

Fit a parametric model \(\EstRegFunc\)
Calculate residuals, \(e_i \equiv y_i - \EstRegFunc(x_i)\)
Smooth \(e_i\) against \(x_i\) to get a function \(r\)
\(r(x) \approx \Expect{Y- \EstRegFunc(X)|X=x}\)
We’d need to calculate how far the smoothed-residuals curve is from 0
- Once common choice is the \(L_2\) norm, \(\|r\|^2_2 \equiv \int{r^2(x) dx}\), or \(\int{r^2(x) f(x) dx}\) to weight by the pdf \(f(x)\)
- Could approximate the weighted version by \(\frac{1}{n}\sum_{i=1}^{n}{r^2(x_i)}\) (why?)
OFFLINE: Write (and comment!) code to do this
Then simulate to get a null distribution for that test statistic

Take-aways

If we’re using parametric models, we should test them
- Nothing in summary.lm() tests the linearity of the model
We can test parametric models by comparing to non-parametric alternatives
Parametric model should predict at least as well, if it is right
Otherwise, the non-parametric model should predict better
Alternatively, do non-parametric smoothing of the residuals
Sampling distributions are hard, so simulate (under the null)

Backup: Interactions in `gam()`

Can extend the additive model to include interaction terms: \[ Y = \alpha + \sum_{j=1}^{p}{f_j(x_j)} + \sum_{j=1}^{p}{\sum_{k=j+1}^{p}{f_{jk}(x_j, x_k)}} + \epsilon \]
Estimate \(f_{jk}\) by taking partial residuals and doing a 2D smooth against \(x_j\) and \(x_k\)
How do we do this in R?
- s(xj, xk) fits a 2D (“thin plate”) spline, good if xj and xk are on the same scale
- te(xj, xk) fits a “tensor-product” spline, good if xj and xk are measured on different scales
When do you want interactions?
- When it makes sense (location interacts latitude and longitude)
- Diagonstic: scatter-plot residuals vs. xj and xk, look for patterns
Reading: textbook, section 8.5

Backup: Convergence when we add interactions to a GAM

Convergence rate for a \(k\)-dimensional smoother is \(O(n^{-4/(4+k)})\)
MSE of true regression function \(\TrueRegFunc\) is \(\sigma^2\)
For the additive model (w/ no interactions), \[ MSE = \sigma^2 + (\text{approximation bias of best additive model})^2 + O(pn^{-4/5}) \]
For the model with all two-way interactions, \[ MSE = \sigma^2 + (\text{approximation bias of best two-way model})^2 + O(p^2 n^{-4/6}) \]
- There are \({p \choose 2} = O(p^2)\) possible interactions, each converging at rate \(O(n^{-4/6})\), and these dominate the \(O(n^{-4/5})\) of the additive terms
For the model with all \(k\)-way interactions, \[ MSE = \sigma^2 + (\text{approximation bias})^2 + O\left({p \choose k} n^{-4/(4+k)}\right) \]