Logistic Regression to Generalized Additive Models

36-402, Sec. A, Spring 2019

7 March 2019

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]} \]

A cat picture

The logistic curve

\[ \frac{e^u}{1+e^u} = \frac{1}{1+e^{-u}} \]
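In R, this curve is the built-in plogis() function; a quick sanity check, on an arbitrary grid of values, that the two forms agree:

u <- seq(from = -5, to = 5, by = 0.5)
all.equal(exp(u)/(1 + exp(u)), 1/(1 + exp(-u)))  # the two forms of the curve agree
all.equal(plogis(u), 1/(1 + exp(-u)))  # plogis() is exactly this curve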

The logistic regression model

\[\begin{eqnarray} p(x) & = & \frac{e^{\beta_0 + \beta\cdot x}}{1+e^{\beta_0 + \beta\cdot x}}\\ g(p) & \equiv & \log{\frac{p}{1-p}} = \beta_0 + \beta\cdot x \end{eqnarray}\]
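The link function \(g\) is qlogis() in R. As a sketch of what the model asserts, we can simulate from it with made-up coefficients and check that glm() roughly recovers them (all numbers here are arbitrary, for illustration only):

n <- 10000
x <- rnorm(n)
beta.0 <- -1
beta.1 <- 2  # arbitrary 'true' coefficients
p <- plogis(beta.0 + beta.1 * x)  # p(x) under the model
y <- rbinom(n, size = 1, prob = p)  # Bernoulli responses
coefficients(glm(y ~ x, family = "binomial"))  # should come out near (-1, 2)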

Why the log-odds ratio, of all things?

Interpretation

Example

ch <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/19/exams/1/ch.csv")
ch <- ch[, -1]  # First column is just an index
ch <- na.omit(ch)  # Not recommended for the exam, but it simplifies a few steps here
ch.logistic <- glm(start ~ exports + fractionalization * dominance, data = ch, 
    family = "binomial")
coefficients(ch.logistic)
##                 (Intercept)                     exports 
##               -3.0397185828                0.2824995978 
##           fractionalization                   dominance 
##                0.0001504523                0.5834949415 
## fractionalization:dominance 
##               -0.0002603135
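Since these coefficients live on the log-odds scale, exponentiating them gives multiplicative effects on the odds: for instance, a one-unit increase in exports multiplies the odds that start = 1 by roughly exp(0.28) ≈ 1.3, holding the other terms fixed.

exp(coefficients(ch.logistic))  # multiplicative changes in the odds of start = 1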

Have another cat picture

Residuals

Residuals: Response

plot(ch$exports, residuals(ch.logistic, type = "response"), xlab = "Exports", 
    ylab = "Residuals", main = "Response residuals")
abline(h = 0, col = "grey")
lines(smooth.spline(x = ch$exports, y = residuals(ch.logistic, type = "response")))

Residuals: Pearson

plot(ch$exports, residuals(ch.logistic, type = "pearson"), xlab = "Exports", 
    ylab = "Residuals", main = "Pearson residuals")
abline(h = 0, col = "grey")
lines(smooth.spline(x = ch$exports, y = residuals(ch.logistic, type = "pearson")))

Squared Residuals: Pearson

plot(ch$exports, residuals(ch.logistic, type = "pearson")^2, xlab = "Exports", 
    ylab = "Squared residuals", main = "Squared Pearson residuals")
abline(h = 1, col = "grey")
lines(smooth.spline(x = ch$exports, y = residuals(ch.logistic, type = "pearson")^2))

Classification

mean(ifelse(fitted(ch.logistic) < 0.5, 0, 1) != ch$start)
## [1] 0.06686047

Is this good or bad?

mean(ch$start)
## [1] 0.06686047

EXERCISE: Why does the model have the same error rate as the constant?

Calibration

Checking calibration

frequency.vs.probability <- function(p.lower, p.upper = p.lower + 0.005, model, 
    events) {
    fitted.probs <- fitted(model)
    indices <- (fitted.probs >= p.lower) & (fitted.probs < p.upper)
    matching.probs <- fitted.probs[indices]
    ave.prob <- mean(matching.probs)
    frequency <- mean(events[indices])
    # 'Law of total variance': Var[Y]=E[Var[Y|X]] + Var[E[Y|X]]
    total.var <- mean(matching.probs * (1 - matching.probs)) + var(matching.probs)
    se <- sqrt(total.var/sum(indices))
    return(c(frequency = frequency, ave.prob = ave.prob, se = se))
}

(Can you add comments?)

Now apply the function a bunch of times (why these numbers?)

f.vs.p <- sapply(seq(from = 0.04, to = 0.12, by = 0.005), frequency.vs.probability, 
    model = ch.logistic, events = ch$start)

This is “turned on its side” relative to a data frame, so let’s fix that. (Can you find a more elegant way to do this?)

f.vs.p <- data.frame(frequency = f.vs.p["frequency", ], ave.prob = f.vs.p["ave.prob", 
    ], se = f.vs.p["se", ])
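One more compact alternative, applied directly to the original sapply() output rather than to the data frame we just built: transpose the matrix and convert it in a single step.

as.data.frame(t(sapply(seq(from = 0.04, to = 0.12, by = 0.005), frequency.vs.probability, 
    model = ch.logistic, events = ch$start)))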

plot(frequency ~ ave.prob, data = f.vs.p, xlim = c(0, 1), ylim = c(0, 1), xlab = "Predicted probabilities", 
    ylab = "Observed frequencies")
rug(fitted(ch.logistic), col = "grey")
abline(0, 1, col = "grey")
segments(x0 = f.vs.p$ave.prob, y0 = f.vs.p$ave.prob - 1.96 * f.vs.p$se, y1 = f.vs.p$ave.prob + 
    1.96 * f.vs.p$se)

Have another cat photo

How do we maximize the log-likelihood?

How do we maximize (cont’d)?

Iterative weighted least squares / Fisher scoring

  1. Start with a guess about \(\beta\)
  2. Calculate \(p(x_i)\), \(g^{\prime}(p(x_i))\) for each \(i\)
  3. Create \[\begin{eqnarray} z_i & = & g(p(x_i)) + (y_i - p(x_i))g^{\prime}(p(x_i))\\ w_i & = & p(x_i)(1-p(x_i)) \left(g^{\prime}(p(x_i))\right)^2 \end{eqnarray}\]
  4. Minimize over \((b_0, b)\) to get the new \((\beta_0, \beta)\): \[ \sum_{i=1}^{n}{\frac{(z_i - (b_0 + b \cdot x_i))^2}{w_i}} \]
  5. Go back to (2), using the new \((\beta_0, \beta)\), until the predictions \(p(x_i)\) stop changing
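The steps above translate almost line for line into R. A minimal sketch, assuming a numeric predictor matrix x and a 0/1 response vector y (glm() with family = "binomial" does the same job, with better numerics and safeguards):

iwls.logistic <- function(x, y, tol = 1e-08, max.iter = 100) {
    x <- cbind(1, as.matrix(x))  # prepend an intercept column
    beta <- rep(0, ncol(x))  # step 1: initial guess
    p.old <- rep(Inf, nrow(x))
    for (iter in 1:max.iter) {
        eta <- drop(x %*% beta)  # linear predictor
        p <- plogis(eta)  # step 2: p(x_i)
        if (max(abs(p - p.old)) < tol) {
            break  # step 5: the predictions have stopped changing
        }
        g.prime <- 1/(p * (1 - p))  # step 2: g'(p(x_i)) for the logit link
        z <- eta + (y - p) * g.prime  # step 3: working response z_i
        w <- p * (1 - p) * g.prime^2  # step 3: w_i = Var(z_i | x_i)
        beta <- lm.wfit(x, z, w = 1/w)$coefficients  # step 4: weighted least squares, weights 1/w_i
        p.old <- p
    }
    return(beta)
}

For instance, iwls.logistic(ch$exports, ch$start) should closely match the coefficients of glm(start ~ exports, data = ch, family = "binomial").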

Why does IWLS work?

Generalized additive model

\[ g(p) = \alpha + \sum_{j=1}^{p}{f_j(x_j)} \]

Example

library(mgcv)
ch.gam.1 <- gam(start ~ s(exports) + s(fractionalization) + s(fractionalization, 
    by = dominance), data = ch, family = "binomial")
plot(ch.gam.1)

A better model

ch.gam.2 <- gam(start ~ s(peace) + s(lnpop), data = ch, family = "binomial")
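As with the first model, mgcv’s plot() method shows the estimated partial response functions for each smooth term; pages = 1 puts them on a single page.

plot(ch.gam.2, pages = 1)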

Generalize!

\[\begin{eqnarray} \epsilon(x) & \equiv & \Expect{Y|X=x}\\ \eta(x) & \equiv & \beta_0 + x \cdot \beta ~ (\text{or an additive model or whatever})\\ \eta(x) & = & g(\epsilon(x))\\ Z & \equiv & g(\epsilon(x)) + (Y-\epsilon(x)) g^{\prime}(\epsilon(x))\\ & = & \eta(x) + (Y-\epsilon(x)) g^{\prime}(\epsilon(x))\\ \Expect{Z|X=x} & = & \eta(x)\\ \Var{Z|X=x} & = & \left( g^{\prime}(\epsilon(x)) \right)^2 \Var{Y|X=x} \end{eqnarray}\]

We can do IWLS to recover the \(\eta\) function, without transforming the response
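For instance, with the log link \(g(\epsilon) = \log{\epsilon}\) (as in the Poisson regressions below), \(g^{\prime}(\epsilon) = 1/\epsilon\), so \[\begin{eqnarray} Z & = & \log{\epsilon(x)} + \frac{Y-\epsilon(x)}{\epsilon(x)}\\ \Var{Z|X=x} & = & \frac{\Var{Y|X=x}}{\epsilon(x)^2} \end{eqnarray}\] and if \(Y|X=x\) is Poisson, \(\Var{Y|X=x} = \epsilon(x)\), so \(\Var{Z|X=x} = 1/\epsilon(x)\).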

Where does \(g()\) come from?

Example: Poisson regression

Example: Poisson regression

library(gamair)
data(chicago)
chicago.gam <- gam(death ~ s(tmpd) + s(so2median) + s(o3median) + s(pm10median), 
    data = chicago, family = "poisson")

Example: Poisson regression

plot(chicago.gam)

Example: Poisson regression

plot(death ~ time, data = chicago)
lines(chicago$time, predict(chicago.gam, newdata = chicago, type = "response", 
    na.action = na.pass), type = "l", col = "red")

Example: Poisson regression

plot(residuals(chicago.gam, type = "pearson"))