Logistic Regression to Generalized Additive Models

36-402, Sec. A, Spring 2019

7 March 2019

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left[ #1 \right]} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]} \]

A cat picture

The logistic curve

\[ \frac{e^u}{1+e^u} = \frac{1}{1+e^{-u}} \]
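In R, this curve is the built-in plogis() function; a quick sanity check, on an arbitrary grid of values, that the two forms agree:

u <- seq(from = -5, to = 5, by = 0.5)
all.equal(exp(u)/(1 + exp(u)), 1/(1 + exp(-u)))  # the two forms of the curve agree
all.equal(plogis(u), 1/(1 + exp(-u)))  # plogis() is exactly this curve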

The logistic regression model

\[\begin{eqnarray} p(x) & = & \frac{e^{\beta_0 + \beta\cdot x}}{1+e^{\beta_0 + \beta\cdot x}}\\ g(p) & \equiv & \log{\frac{p}{1-p}} = \beta_0 + \beta\cdot x \end{eqnarray}\]
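The link function \(g\) is qlogis() in R. As a sketch of what the model asserts, we can simulate from it with made-up coefficients and check that glm() roughly recovers them (all numbers here are arbitrary, for illustration only):

n <- 10000
x <- rnorm(n)
beta.0 <- -1
beta.1 <- 2  # arbitrary 'true' coefficients
p <- plogis(beta.0 + beta.1 * x)  # p(x) under the model
y <- rbinom(n, size = 1, prob = p)  # Bernoulli responses
coefficients(glm(y ~ x, family = "binomial"))  # should come out near (-1, 2)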

Why the log-odds ratio, of all things?

Interpretation

Example

ch <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/19/exams/1/ch.csv")
ch <- ch[, -1]  # First column is just an index
ch <- na.omit(ch)  # Not recommended for the exam, but it simplifies a few steps here
ch.logistic <- glm(start ~ exports + fractionalization * dominance, data = ch, 
    family = "binomial")
coefficients(ch.logistic)
##                 (Intercept)                     exports 
##               -3.0397185828                0.2824995978 
##           fractionalization                   dominance 
##                0.0001504523                0.5834949415 
## fractionalization:dominance 
##               -0.0002603135
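Since these coefficients live on the log-odds scale, exponentiating them gives multiplicative effects on the odds: for instance, a one-unit increase in exports multiplies the odds that start = 1 by roughly exp(0.28) ≈ 1.3, holding the other terms fixed.

exp(coefficients(ch.logistic))  # multiplicative changes in the odds of start = 1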

Have another cat picture

Residuals

Residuals: Response

plot(ch$exports, residuals(ch.logistic, type = "response"), xlab = "Exports", 
    ylab = "Residuals", main = "Response residuals")
abline(h = 0, col = "grey")
lines(smooth.spline(x = ch$exports, y = residuals(ch.logistic, type = "response")))

Residuals: Pearson

plot(ch$exports, residuals(ch.logistic, type = "pearson"), xlab = "Exports", 
    ylab = "Residuals", main = "Pearson residuals")
abline(h = 0, col = "grey")
lines(smooth.spline(x = ch$exports, y = residuals(ch.logistic, type = "pearson")))

Squared Residuals: Pearson

plot(ch$exports, residuals(ch.logistic, type = "pearson")^2, xlab = "Exports", 
    ylab = "Squared residuals", main = "Squared Pearson residuals")
abline(h = 1, col = "grey")
lines(smooth.spline(x = ch$exports, y = residuals(ch.logistic, type = "pearson")^2))

Classification

mean(ifelse(fitted(ch.logistic) < 0.5, 0, 1) != ch$start)
## [1] 0.06686047

Is this good or bad?

mean(ch$start)
## [1] 0.06686047

EXERCISE: Why does the model have the same error rate as the constant?

Calibration

Checking calibration

frequency.vs.probability <- function(p.lower, p.upper = p.lower + 0.005, model, 
    events) {
    fitted.probs <- fitted(model)
    indices <- (fitted.probs >= p.lower) & (fitted.probs < p.upper)
    matching.probs <- fitted.probs[indices]
    ave.prob <- mean(matching.probs)
    frequency <- mean(events[indices])
    # 'Law of total variance': Var[Y]=E[Var[Y|X]] + Var[E[Y|X]]
    total.var <- mean(matching.probs * (1 - matching.probs)) + var(matching.probs)
    se <- sqrt(total.var/sum(indices))
    return(c(frequency = frequency, ave.prob = ave.prob, se = se))
}

(Can you add comments?)

Now apply the function a bunch of times (why these numbers?)

f.vs.p <- sapply(seq(from = 0.04, to = 0.12, by = 0.005), frequency.vs.probability, 
    model = ch.logistic, events = ch$start)

This is “turned on its side” relative to a data frame, so let’s fix that. (Can you find a more elegant way to do this?)

f.vs.p <- data.frame(frequency = f.vs.p["frequency", ], ave.prob = f.vs.p["ave.prob", 
    ], se = f.vs.p["se", ])
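One more compact alternative, applied directly to the original sapply() output rather than to the data frame we just built: transpose the matrix and convert it in a single step.

as.data.frame(t(sapply(seq(from = 0.04, to = 0.12, by = 0.005), frequency.vs.probability, 
    model = ch.logistic, events = ch$start)))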

plot(frequency ~ ave.prob, data = f.vs.p, xlim = c(0, 1), ylim = c(0, 1), xlab = "Predicted probabilities", 
    ylab = "Observed frequencies")
rug(fitted(ch.logistic), col = "grey")
abline(0, 1, col = "grey")
segments(x0 = f.vs.p$ave.prob, y0 = f.vs.p$ave.prob - 1.96 * f.vs.p$se, y1 = f.vs.p$ave.prob + 
    1.96 * f.vs.p$se)

Have another cat photo

How do we maximize the log-likelihood?

How do we maximize (cont’d)?

Iterative weighted least squares / Fisher scoring

  1. Start with a guess about \(\beta\)
  2. Calculate \(p(x_i)\), \(g^{\prime}(p(x_i))\) for each \(i\)
  3. Create \[\begin{eqnarray} z_i & = & g(p(x_i)) + (y_i - p(x_i))g^{\prime}(p(x_i))\\ w_i & = & p(x_i)(1-p(x_i)) \left(g^{\prime}(p(x_i))\right)^2 \end{eqnarray}\]
  4. Minimize over \((b_0, b)\) to get the new \((\beta_0, \beta)\): \[ \sum_{i=1}^{n}{\frac{(z_i - (b_0 + b \cdot x_i))^2}{w_i}} \]
  5. Go back to (2), using the new \((\beta_0, \beta)\), until the predictions \(p(x_i)\) stop changing
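The steps above translate almost line for line into R. A minimal sketch, assuming a numeric predictor matrix x and a 0/1 response vector y (glm() with family = "binomial" does the same job, with better numerics and safeguards):

iwls.logistic <- function(x, y, tol = 1e-08, max.iter = 100) {
    x <- cbind(1, as.matrix(x))  # prepend an intercept column
    beta <- rep(0, ncol(x))  # step 1: initial guess
    p.old <- rep(Inf, nrow(x))
    for (iter in 1:max.iter) {
        eta <- drop(x %*% beta)  # linear predictor
        p <- plogis(eta)  # step 2: p(x_i)
        if (max(abs(p - p.old)) < tol) {
            break  # step 5: the predictions have stopped changing
        }
        g.prime <- 1/(p * (1 - p))  # step 2: g'(p(x_i)) for the logit link
        z <- eta + (y - p) * g.prime  # step 3: working response z_i
        w <- p * (1 - p) * g.prime^2  # step 3: w_i = Var(z_i | x_i)
        beta <- lm.wfit(x, z, w = 1/w)$coefficients  # step 4: weighted least squares, weights 1/w_i
        p.old <- p
    }
    return(beta)
}

For instance, iwls.logistic(ch$exports, ch$start) should closely match the coefficients of glm(start ~ exports, data = ch, family = "binomial").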

Why does IWLS work?

Generalized additive model

\[ g(p) = \alpha + \sum_{j=1}^{p}{f_j(x_j)} \]

Example

library(mgcv)
ch.gam.1 <- gam(start ~ s(exports) + s(fractionalization) + s(fractionalization, 
    by = dominance), data = ch, family = "binomial")
plot(ch.gam.1)

A better model

ch.gam.2 <- gam(start ~ s(peace) + s(lnpop), data = ch, family = "binomial")
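As with the first model, mgcv’s plot() method shows the estimated partial response functions for each smooth term; pages = 1 puts them on a single page.

plot(ch.gam.2, pages = 1)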

Generalize!

\[\begin{eqnarray} \epsilon(x) & \equiv & \Expect{Y|X=x}\\ \eta(x) & \equiv & \beta_0 + x \cdot \beta ~ (\text{or an additive model or whatever})\\ \eta(x) & = & g(\epsilon(x))\\ Z & \equiv & g(\epsilon(x)) + (Y-\epsilon(x)) g^{\prime}(\epsilon(x))\\ & = & \eta(x) + (Y-\epsilon(x)) g^{\prime}(\epsilon(x))\\ \Expect{Z|X=x} & = & \eta(x)\\ \Var{Z|X=x} & = & \left( g^{\prime}(\epsilon(x)) \right)^2 \Var{Y|X=x} \end{eqnarray}\]

We can do IWLS to recover the \(\eta\) function, without transforming the response
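For instance, with the log link \(g(\epsilon) = \log{\epsilon}\) (as in the Poisson regressions below), \(g^{\prime}(\epsilon) = 1/\epsilon\), so \[\begin{eqnarray} Z & = & \log{\epsilon(x)} + \frac{Y-\epsilon(x)}{\epsilon(x)}\\ \Var{Z|X=x} & = & \frac{\Var{Y|X=x}}{\epsilon(x)^2} \end{eqnarray}\] and if \(Y|X=x\) is Poisson, \(\Var{Y|X=x} = \epsilon(x)\), so \(\Var{Z|X=x} = 1/\epsilon(x)\).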

Where does \(g()\) come from?

Example: Poisson regression

Example: Poisson regression

library(gamair)
data(chicago)
chicago.gam <- gam(death ~ s(tmpd) + s(so2median) + s(o3median) + s(pm10median), 
    data = chicago, family = "poisson")

Example: Poisson regression

plot(chicago.gam)

Example: Poisson regression

plot(death ~ time, data = chicago)
lines(chicago$time, predict(chicago.gam, newdata = chicago, type = "response", 
    na.action = na.pass), type = "l", col = "red")

Example: Poisson regression

plot(residuals(chicago.gam, type = "pearson"))