The linear model is arguably the most widely used statistical model, and it has a place in nearly every application domain of statistics
Given response \(Y\) and predictors \(X_1,\ldots,X_p\), in a linear regression model, we posit:
\[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon, \quad \text{where $\epsilon \sim N(0,\sigma^2)$} \]
The goal is to estimate the parameters, also called coefficients, \(\beta_0,\beta_1,\ldots,\beta_p\)
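As a quick aside (a sketch, not spelled out in these notes): the standard estimates, which the lm() function below computes, are the least squares estimates, chosen to minimize the sum of squared errors over the observed data \((x_{i1},\ldots,x_{ip},y_i)\), \(i=1,\ldots,n\):
\[ \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p = \mathop{\mathrm{argmin}}_{\beta_0,\beta_1,\ldots,\beta_p} \, \sum_{i=1}^n \big( y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip} \big)^2 \]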
Recall the data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning):
pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pros.dat")
dim(pros.df)
## [1] 97 9
head(pros.df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
lm()
We can use lm() to fit a linear regression model. The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, …, Xp are the predictors. These refer to column names of variables in a data frame that we pass in through the data argument
E.g., for the prostate data, to regress the response variable lpsa onto the predictor variables lcavol and lweight:
pros.lm = lm(lpsa ~ lcavol + lweight, data=pros.df)
class(pros.lm) # Really, a specialized list
## [1] "lm"
names(pros.lm) # Here are its components
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
pros.lm # It has a special way of printing
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros.df)
##
## Coefficients:
## (Intercept) lcavol lweight
## -0.8134 0.6515 0.6647
Linear models in R come with a bunch of utility functions, such as coef(), fitted(), residuals(), summary(), plot(), and predict(), for retrieving coefficients, fitted values, residuals, producing summaries, producing diagnostic plots, and making predictions, respectively
These tasks can also be done manually, by extracting the components of the returned object from lm() and manipulating them appropriately. But this is discouraged, because the utility functions work the same way across many types of fitted model objects, such as those returned by glm(), gam(), and many others
coef()
So, what were the regression coefficients that we estimated? Use the coef() function to retrieve them:
pros.coef = coef(pros.lm) # Vector of 3 coefficients
pros.coef
## (Intercept) lcavol lweight
## -0.8134373 0.6515421 0.6647215
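For comparison, here is the manual route mentioned above, pulling the same numbers directly out of the list returned by lm() (a quick sketch; coef() remains the recommended way):
pros.lm$coefficients # Same three numbers, extracted by hand from the lm object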
What does a linear regression coefficient mean, i.e., how do you interpret it? Note, from our linear model:
\[ \mathbb{E}(Y|X_1,\ldots,X_p) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]
So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, increases the expected response by \(\beta_j\)
E.g., increasing lcavol (log cancer volume) by one unit (on the log scale), while holding lweight (log cancer weight) fixed, increases the expected value of lpsa (log PSA score) by \(\approx 0.65\)
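A quick numerical check of this interpretation (a sketch; the particular values lcavol = 2 versus 3 and lweight = 3.5 are made up for illustration):
# Expected lpsa at (lcavol=3, lweight=3.5) minus at (lcavol=2, lweight=3.5);
# the difference equals pros.coef["lcavol"], about 0.65
sum(pros.coef * c(1, 3, 3.5)) - sum(pros.coef * c(1, 2, 3.5))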
fitted()
What does our model predict for the log PSA scores of the 97 men in question? And how do these compare to the actual log PSA scores? Use the fitted() function, then plot the actual values versus the fitted ones:
pros.fits = fitted(pros.lm) # Vector of 97 fitted values
plot(pros.fits, pros.df$lpsa, main="Actual versus fitted values",
xlab="Fitted values", ylab="Log PSA values")
abline(a=0, b=1, lty=2, col="red")
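The fitted values can also be reproduced by hand from the coefficients, which makes for a useful sanity check (a sketch; the names X and pros.fits.manual are made up here):
X = cbind(1, pros.df$lcavol, pros.df$lweight) # Design matrix: intercept, lcavol, lweight
pros.fits.manual = as.numeric(X %*% pros.coef) # Fitted values computed manually
max(abs(pros.fits.manual - pros.fits)) # Should be essentially zero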
summary()
The function summary() gives us a nice summary of the linear model we fit:
summary(pros.lm) # Special way of summarizing
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61051 -0.44135 -0.04666 0.53542 1.90424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.81344 0.65309 -1.246 0.216033
## lcavol 0.65154 0.06693 9.734 6.75e-16 ***
## lweight 0.66472 0.18414 3.610 0.000494 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7419 on 94 degrees of freedom
## Multiple R-squared: 0.5955, Adjusted R-squared: 0.5869
## F-statistic: 69.19 on 2 and 94 DF, p-value: < 2.2e-16
This tells us the estimated coefficients along with their standard errors, t-statistics, and p-values, as well as the residual standard error, the (adjusted) R-squared, and the overall F-test for the regression
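Two of these quantities can be reproduced by hand using the residuals() function, which we have not used yet (a sketch; the name pros.res is made up here):
pros.res = residuals(pros.lm) # Vector of 97 residuals
sqrt(sum(pros.res^2) / df.residual(pros.lm)) # Residual standard error, about 0.7419
1 - sum(pros.res^2) / sum((pros.df$lpsa - mean(pros.df$lpsa))^2) # Multiple R-squared, about 0.5955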
plot()
We can use the plot() function to run a series of diagnostic tests for our regression:
plot(pros.lm) # Special way of plotting
The results are pretty good
There is a science (and an art?) to interpreting these; you’ll learn a lot more in the Modern Regression 36-401 course
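A common trick (a small sketch, not from the original notes) is to set up a 2 x 2 plotting grid first, so that all four diagnostic plots appear at once:
par(mfrow=c(2,2)) # Arrange plots in a 2 x 2 grid
plot(pros.lm) # All four diagnostic plots at once
par(mfrow=c(1,1)) # Reset the plotting layout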
predict()
Suppose we had a new observation from a man whose log cancer volume was 4.1, and log cancer weight was 4.5. What would our linear model estimate his log PSA score to be? Use predict():
pros.new = data.frame(lcavol=4.1, lweight=4.5) # Must set up a new data frame
pros.pred = predict(pros.lm, newdata=pros.new) # Now call predict with new df
pros.pred
## 1
## 4.849132
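predict() can also give interval estimates around this prediction, via its interval argument (a sketch, using the same pros.new data frame as above):
predict(pros.lm, newdata=pros.new, interval="confidence") # 95% confidence interval for the mean response
predict(pros.lm, newdata=pros.new, interval="prediction") # 95% prediction interval for a new observation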
We’ll learn much more about making/evaluating predictions in our week of mini-lectures on “Statistical Prediction”
Here are some handy shortcuts for fitting linear models with lm() (there are also many others):
No intercept (no \(\beta_0\) in the mathematical model): use 0 + on the right-hand side of the formula, as in:
summary(lm(lpsa ~ 0 + lcavol + lweight, data=pros.df))
##
## Call:
## lm(formula = lpsa ~ 0 + lcavol + lweight, data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63394 -0.51181 0.00925 0.49705 1.90715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lcavol 0.66394 0.06638 10.00 <2e-16 ***
## lweight 0.43894 0.03249 13.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7441 on 95 degrees of freedom
## Multiple R-squared: 0.9273, Adjusted R-squared: 0.9258
## F-statistic: 606.1 on 2 and 95 DF, p-value: < 2.2e-16
Include all predictors: use . on the right-hand side of the formula, as in:
summary(lm(lpsa ~ ., data=pros.df))
##
## Call:
## lm(formula = lpsa ~ ., data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
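A closely related shortcut (not shown in the output above; a sketch using standard formula syntax): predictors can be removed from the . expansion with -, as in:
summary(lm(lpsa ~ . - gleason - pgg45, data=pros.df)) # All predictors except gleason and pgg45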
Include interaction terms: use : between two predictors of interest, to include the interaction between them as a predictor, as in:
summary(lm(lpsa ~ lcavol + lweight + lcavol:svi, data=pros.df))
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + lcavol:svi, data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77358 -0.47304 0.00016 0.41458 1.52657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.75666 0.62507 -1.211 0.229142
## lcavol 0.51193 0.07816 6.550 3.15e-09 ***
## lweight 0.66234 0.17617 3.760 0.000297 ***
## lcavol:svi 0.25406 0.08156 3.115 0.002445 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7098 on 93 degrees of freedom
## Multiple R-squared: 0.6337, Adjusted R-squared: 0.6219
## F-statistic: 53.64 on 3 and 93 DF, p-value: < 2.2e-16
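Relatedly (a sketch, again using standard formula syntax rather than anything shown above): the * operator expands to the main effects plus their interaction, so lcavol*svi is shorthand for lcavol + svi + lcavol:svi, as in:
summary(lm(lpsa ~ lweight + lcavol*svi, data=pros.df)) # Same as lweight + lcavol + svi + lcavol:svi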