The linear model is arguably the most widely used statistical model, and it has a place in nearly every application domain of statistics
Given response \(Y\) and predictors \(X_1,\ldots,X_p\), in a linear regression model, we posit:
\[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon, \quad \text{where $\epsilon \sim N(0,\sigma^2)$} \]
The goal is to estimate the parameters, also called coefficients, \(\beta_0,\beta_1,\ldots,\beta_p\)
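As a quick aside (a sketch, not spelled out in these notes): the standard estimates, which the lm() function below computes, are the least squares estimates, chosen to minimize the sum of squared errors over the observed data \((x_{i1},\ldots,x_{ip},y_i)\), \(i=1,\ldots,n\):
\[ \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p = \mathop{\mathrm{argmin}}_{\beta_0,\beta_1,\ldots,\beta_p} \, \sum_{i=1}^n \big( y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip} \big)^2 \]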
Recall the data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning):
pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pros.dat")
dim(pros.df)
## [1] 97 9
head(pros.df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
lm()
We can use lm() to fit a linear regression model. The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, …, Xp are the predictors. These refer to column names of variables in a data frame that we pass in through the data argument
E.g., for the prostate data, to regress the response variable lpsa onto the predictor variables lcavol and lweight:
pros.lm = lm(lpsa ~ lcavol + lweight, data=pros.df)
class(pros.lm) # Really, a specialized list
## [1] "lm"
names(pros.lm) # Here are its components
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
pros.lm # It has a special way of printing
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros.df)
##
## Coefficients:
## (Intercept) lcavol lweight
## -0.8134 0.6515 0.6647
Linear models in R come with a bunch of utility functions, such as coef(), fitted(), residuals(), summary(), plot(), and predict(), for retrieving coefficients, fitted values, residuals, producing summaries, producing diagnostic plots, and making predictions, respectively
These tasks can also be done manually, by extracting the components of the returned object from lm() and manipulating them appropriately. But this is discouraged, because the utility functions work the same way across many types of fitted model objects, such as those returned by glm(), gam(), and many others
coef()
So, what were the regression coefficients that we estimated? Use the coef() function to retrieve them:
pros.coef = coef(pros.lm) # Vector of 3 coefficients
pros.coef
## (Intercept) lcavol lweight
## -0.8134373 0.6515421 0.6647215
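For comparison, here is the manual route mentioned above, pulling the same numbers directly out of the list returned by lm() (a quick sketch; coef() remains the recommended way):
pros.lm$coefficients # Same three numbers, extracted by hand from the lm object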
What does a linear regression coefficient mean, i.e., how do you interpret it? Note, from our linear model:
\[ \mathbb{E}(Y|X_1,\ldots,X_p) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]
So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, increases the expected response by \(\beta_j\)
E.g., increasing lcavol (log cancer volume) by one unit (on the log scale), while holding lweight (log cancer weight) fixed, increases the expected value of lpsa (log PSA score) by \(\approx 0.65\)
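A quick numerical check of this interpretation (a sketch; the particular values lcavol = 2 versus 3 and lweight = 3.5 are made up for illustration):
# Expected lpsa at (lcavol=3, lweight=3.5) minus at (lcavol=2, lweight=3.5);
# the difference equals pros.coef["lcavol"], about 0.65
sum(pros.coef * c(1, 3, 3.5)) - sum(pros.coef * c(1, 2, 3.5))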
fitted()
What does our model predict for the log PSA scores of the 97 men in question? And how do these compare to the actual log PSA scores? Use the fitted() function, then plot the actual values versus the fitted ones:
pros.fits = fitted(pros.lm) # Vector of 97 fitted values
plot(pros.fits, pros.df$lpsa, main="Actual versus fitted values",
xlab="Fitted values", ylab="Log PSA values")
abline(a=0, b=1, lty=2, col="red")
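The fitted values can also be reproduced by hand from the coefficients, which makes for a useful sanity check (a sketch; the names X and pros.fits.manual are made up here):
X = cbind(1, pros.df$lcavol, pros.df$lweight) # Design matrix: intercept, lcavol, lweight
pros.fits.manual = as.numeric(X %*% pros.coef) # Fitted values computed manually
max(abs(pros.fits.manual - pros.fits)) # Should be essentially zero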
summary()
The function summary() gives us a nice summary of the linear model we fit:
summary(pros.lm) # Special way of summarizing
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61051 -0.44135 -0.04666 0.53542 1.90424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.81344 0.65309 -1.246 0.216033
## lcavol 0.65154 0.06693 9.734 6.75e-16 ***
## lweight 0.66472 0.18414 3.610 0.000494 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7419 on 94 degrees of freedom
## Multiple R-squared: 0.5955, Adjusted R-squared: 0.5869
## F-statistic: 69.19 on 2 and 94 DF, p-value: < 2.2e-16
This tells us the estimated coefficients along with their standard errors, t-statistics, and p-values, as well as the residual standard error, the (adjusted) R-squared, and the overall F-test for the regression
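Two of these quantities can be reproduced by hand using the residuals() function, which we have not used yet (a sketch; the name pros.res is made up here):
pros.res = residuals(pros.lm) # Vector of 97 residuals
sqrt(sum(pros.res^2) / df.residual(pros.lm)) # Residual standard error, about 0.7419
1 - sum(pros.res^2) / sum((pros.df$lpsa - mean(pros.df$lpsa))^2) # Multiple R-squared, about 0.5955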
plot()
We can use the plot() function to run a series of diagnostic tests for our regression:
plot(pros.lm) # Special way of plotting
The results are pretty good
There is a science (and an art?) to interpreting these; you’ll learn a lot more in the Modern Regression 36-401 course
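A common trick (a small sketch, not from the original notes) is to set up a 2 x 2 plotting grid first, so that all four diagnostic plots appear at once:
par(mfrow=c(2,2)) # Arrange plots in a 2 x 2 grid
plot(pros.lm) # All four diagnostic plots at once
par(mfrow=c(1,1)) # Reset the plotting layout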
predict()
Suppose we had a new observation from a man whose log cancer volume was 4.1, and log cancer weight was 4.5. What would our linear model estimate his log PSA score to be? Use predict():
pros.new = data.frame(lcavol=4.1, lweight=4.5) # Must set up a new data frame
pros.pred = predict(pros.lm, newdata=pros.new) # Now call predict with new df
pros.pred
## 1
## 4.849132
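predict() can also give interval estimates around this prediction, via its interval argument (a sketch, using the same pros.new data frame as above):
predict(pros.lm, newdata=pros.new, interval="confidence") # 95% confidence interval for the mean response
predict(pros.lm, newdata=pros.new, interval="prediction") # 95% prediction interval for a new observation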
We’ll learn much more about making/evaluating predictions in our week of mini-lectures on “Statistical Prediction”
Here are some handy shortcuts for fitting linear models with lm() (there are also many others):
No intercept (no \(\beta_0\) in the mathematical model): use 0 + on the right-hand side of the formula, as in:
summary(lm(lpsa ~ 0 + lcavol + lweight, data=pros.df))
##
## Call:
## lm(formula = lpsa ~ 0 + lcavol + lweight, data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63394 -0.51181 0.00925 0.49705 1.90715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lcavol 0.66394 0.06638 10.00 <2e-16 ***
## lweight 0.43894 0.03249 13.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7441 on 95 degrees of freedom
## Multiple R-squared: 0.9273, Adjusted R-squared: 0.9258
## F-statistic: 606.1 on 2 and 95 DF, p-value: < 2.2e-16
Include all predictors: use . on the right-hand side of the formula, as in:
summary(lm(lpsa ~ ., data=pros.df))
##
## Call:
## lm(formula = lpsa ~ ., data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
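A closely related shortcut (not shown in the output above; a sketch using standard formula syntax): predictors can be removed from the . expansion with -, as in:
summary(lm(lpsa ~ . - gleason - pgg45, data=pros.df)) # All predictors except gleason and pgg45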
Include interaction terms: use : between two predictors of interest, to include the interaction between them as a predictor, as in:
summary(lm(lpsa ~ lcavol + lweight + lcavol:svi, data=pros.df))
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + lcavol:svi, data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77358 -0.47304 0.00016 0.41458 1.52657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.75666 0.62507 -1.211 0.229142
## lcavol 0.51193 0.07816 6.550 3.15e-09 ***
## lweight 0.66234 0.17617 3.760 0.000297 ***
## lcavol:svi 0.25406 0.08156 3.115 0.002445 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7098 on 93 degrees of freedom
## Multiple R-squared: 0.6337, Adjusted R-squared: 0.6219
## F-statistic: 53.64 on 3 and 93 DF, p-value: < 2.2e-16
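Relatedly (a sketch, again using standard formula syntax rather than anything shown above): the * operator expands to the main effects plus their interaction, so lcavol*svi is shorthand for lcavol + svi + lcavol:svi, as in:
summary(lm(lpsa ~ lweight + lcavol*svi, data=pros.df)) # Same as lweight + lcavol + svi + lcavol:svi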