Linear regression models, as we’ve said, are useful and ubiquitous. But there’s a lot else out there. What else?
Today we’ll quickly visit logistic regression and generalized additive models. In some ways they are similar to linear regression; in others, they’re quite different. You’ll learn a lot more about them in 36-402 (Advanced Methods for Data Analysis).
Suppose we are given a response \(Y\) and predictors \(X_1,\ldots,X_p\), where \(Y \in \{0,1\}\) is a binary outcome. In a logistic regression model, we posit the relationship:
\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]
(where \(Y|X\) is shorthand for \(Y|X_1,\ldots,X_p\)). The goal is to estimate the parameters, also called coefficients, \(\beta_0,\beta_1,\ldots,\beta_p\).
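As a quick aside: inverting the log odds gives the predicted probability \(\mathbb{P}(Y=1|X)\) directly. A minimal sketch in R (the coefficient and predictor values here are made up, purely for illustration):

beta = c(-12.5, 1.5, 3.0)   # Made-up coefficients: intercept, slope for X1, slope for X2
x = c(1, 2.0, 3.5)          # Made-up predictor values, with a leading 1 for the intercept
eta = sum(beta * x)         # Linear predictor, i.e., the log odds
p = exp(eta) / (1 + exp(eta))  # P(Y=1|X); equivalently, plogis(eta)
p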
Recall our data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning):
pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pros.dat")
dim(pros.df)
## [1] 97 9
head(pros.df)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
glm()
We can use glm() to fit a logistic regression model. The arguments are very similar to lm(). The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, …, Xp are the predictors. These refer to column names of variables in a data frame, which we pass in through the data argument. We must also specify family="binomial" to get logistic regression.
E.g., for the prostate data, suppose we add a column lpsa.high to our data frame pros.df, as the indicator of whether the lpsa variable is larger than log(10) (equivalently, whether the PSA score is larger than 10). To regress the binary response variable lpsa.high onto the predictor variables lcavol and lweight:
pros.df$lpsa.high = as.numeric(pros.df$lpsa > log(10)) # New binary outcome
table(pros.df$lpsa.high) # There are 56 men with high PSA scores
##
## 0 1
## 41 56
pros.glm = glm(lpsa.high ~ lcavol + lweight, data=pros.df, family="binomial")
class(pros.glm) # Really, a specialized list
## [1] "glm" "lm"
pros.glm # It has a special way of printing
##
## Call: glm(formula = lpsa.high ~ lcavol + lweight, family = "binomial",
## data = pros.df)
##
## Coefficients:
## (Intercept) lcavol lweight
## -12.551 1.520 3.034
##
## Degrees of Freedom: 96 Total (i.e. Null); 94 Residual
## Null Deviance: 132.1
## Residual Deviance: 75.44 AIC: 81.44
For retrieving coefficients, fitted values, and residuals, and for summarizing, plotting, and making predictions, the utility functions coef(), fitted(), residuals(), summary(), plot(), predict() work pretty much just as with lm(). E.g.,
coef(pros.glm) # Logistic regression coefficients
## (Intercept) lcavol lweight
## -12.551478 1.520006 3.034018
summary(pros.glm) # Special way of summarizing
##
## Call:
## glm(formula = lpsa.high ~ lcavol + lweight, family = "binomial",
## data = pros.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7957 -0.6413 0.2192 0.5608 2.2209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.5515 3.2422 -3.871 0.000108 ***
## lcavol 1.5200 0.3604 4.218 2.47e-05 ***
## lweight 3.0340 0.8615 3.522 0.000429 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.142 on 96 degrees of freedom
## Residual deviance: 75.436 on 94 degrees of freedom
## AIC: 81.436
##
## Number of Fisher Scoring iterations: 5
p.hat = fitted(pros.glm) # These are probabilities! Not binary outcomes
y.hat = round(p.hat) # This is one way we'd compute fitted outcomes
table(y.hat, y.true=pros.df$lpsa.high) # This is a 2 x 2 "confusion matrix"
## y.true
## y.hat 0 1
## 0 33 5
## 1 8 51
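The classification accuracy implied by this table is \((33+51)/97 \approx 0.87\). We can also use predict() to compute predicted probabilities at new predictor values; a minimal sketch (the lcavol and lweight values below are made up, just for illustration):

mean(y.hat == pros.df$lpsa.high) # Proportion of correct classifications, (33+51)/97
predict(pros.glm, newdata=data.frame(lcavol=1.5, lweight=3.5),
        type="response") # type="response" returns a probability, not a log odds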
How do you interpret a logistic regression coefficient? Note, from our logistic model:
\[ \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p) \]
So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, multiplies the odds by \(e^{\beta_j}\). E.g.,
coef(pros.glm)
## (Intercept) lcavol lweight
## -12.551478 1.520006 3.034018
So, increasing lcavol (log cancer volume) by one unit, while holding lweight (log cancer weight) fixed, multiplies the odds of lpsa.high (a high PSA score, over 10) by \(\approx e^{1.52} \approx 4.57\).
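Equivalently, exponentiating all the coefficients gives these odds multipliers directly:

exp(coef(pros.glm)) # Multiplicative change in the odds, per unit increase in each predictor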
We can easily create a binary variable “on-the-fly” by using the I() function inside a call to glm() (the same would work with lm()):
pros.glm = glm(I(lpsa > log(10)) ~ lcavol + lweight, data=pros.df,
family="binomial")
summary(pros.glm) # Same as before
##
## Call:
## glm(formula = I(lpsa > log(10)) ~ lcavol + lweight, family = "binomial",
## data = pros.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7957 -0.6413 0.2192 0.5608 2.2209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.5515 3.2422 -3.871 0.000108 ***
## lcavol 1.5200 0.3604 4.218 2.47e-05 ***
## lweight 3.0340 0.8615 3.522 0.000429 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.142 on 96 degrees of freedom
## Residual deviance: 75.436 on 94 degrees of freedom
## AIC: 81.436
##
## Number of Fisher Scoring iterations: 5
Generalized additive models allow us to do something that is like linear regression or logistic regression, but with a more flexible way of modeling the effects of predictors (rather than limiting their effects to be linear). For a continuous response \(Y\), our model is:
\[ \mathbb{E}(Y|X) = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]
and the goal is to estimate \(\beta_0,f_1,\ldots,f_p\). For a binary response \(Y\), our model is:
\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]
and the goal is again to estimate \(\beta_0,f_1,\ldots,f_p\).
gam()
We can use the gam() function, from the gam package, to fit a generalized additive model. The arguments are similar to glm() (and to lm()), with a key distinction: the formula is now of the form Y ~ s(X1) + X2 + ... + s(Xp), where Y is the response and X1, …, Xp are the predictors. The notation s() is used around a predictor name to denote that we want to model this as a smooth (nonlinear) effect; without this notation, we simply model it as a linear effect.
So, e.g., to fit the model
\[ \mathbb{E}(\mathrm{lpsa}\,|\,\mathrm{lcavol},\mathrm{lweight}) = \beta_0 + f_1(\mathrm{lcavol}) + \beta_2 \mathrm{lweight} \]
we use:
library(gam, quiet=TRUE)
pros.gam = gam(lpsa ~ s(lcavol) + lweight, data=pros.df)
class(pros.gam) # Again, a specialized list
## [1] "gam" "glm" "lm"
pros.gam # It has a special way of printing
## Call:
## gam(formula = lpsa ~ s(lcavol) + lweight, data = pros.df)
##
## Degrees of Freedom: 96 total; 91.00006 Residual
## Residual Deviance: 49.40595
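As an aside, the logistic (binary response) additive model described above can be fit in the same way, by passing family="binomial"; a minimal sketch, reusing the lpsa.high variable we created earlier:

pros.gam.bin = gam(lpsa.high ~ s(lcavol) + lweight, data=pros.df,
                   family="binomial") # Smooth effect for lcavol, linear effect for lweight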
Most of our utility functions work just as before. E.g.,
summary(pros.gam)
##
## Call: gam(formula = lpsa ~ s(lcavol) + lweight, data = pros.df)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6343202 -0.4601627 0.0004965 0.5121757 1.8516801
##
## (Dispersion Parameter for gaussian family taken to be 0.5429)
##
## Null Deviance: 127.9177 on 96 degrees of freedom
## Residual Deviance: 49.406 on 91.0001 degrees of freedom
## AIC: 223.8339
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(lcavol) 1 69.003 69.003 127.095 < 2.2e-16 ***
## lweight 1 7.181 7.181 13.227 0.0004571 ***
## Residuals 91 49.406 0.543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(lcavol) 3 1.4344 0.2379
## lweight
But now plot(), instead of producing a bunch of diagnostic plots, shows us the effects that were fit to each predictor (nonlinear or linear, depending on whether or not we used s()):
plot(pros.gam)
We can see that, even though we allowed lcavol to have a nonlinear effect, this didn’t seem to matter much, as its estimated effect came out to be pretty close to linear anyway!