Linear regression modeling

The linear model is arguably the most widely used statistical model; it has a place in nearly every application domain of statistics

Given response \(Y\) and predictors \(X_1,\ldots,X_p\), in a linear regression model, we posit:

\[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon, \quad \text{where $\epsilon \sim N(0,\sigma^2)$} \]

Our goal is to estimate the parameters, also called coefficients, \(\beta_0,\beta_1,\ldots,\beta_p\)
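
To make the model concrete, here is a small sketch (with made-up coefficients and simulated data, not from any real study) that generates observations according to this model:

# Simulate n = 100 observations from Y = 1 + 2*X1 - 0.5*X2 + epsilon
set.seed(0)
n = 100
x1 = rnorm(n); x2 = rnorm(n)              # Two predictors
y = 1 + 2*x1 - 0.5*x2 + rnorm(n, sd=0.5)  # Response, with epsilon ~ N(0, 0.5^2)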

Prostate cancer data, revisited

Recall the data set from 97 men who have prostate cancer (from the book The Elements of Statistical Learning):

pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pros.dat")
dim(pros.df)
## [1] 97  9
head(pros.df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

Fitting a linear regression model with lm()

We can use lm() to fit a linear regression model. The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, …, Xp are the predictors. These refer to column names of variables in a data frame, which we pass in through the data argument

E.g., for the prostate data, to regress the response variable lpsa onto the predictor variables lcavol and lweight:

pros.lm = lm(lpsa ~ lcavol + lweight, data=pros.df)
class(pros.lm) # Really, a specialized list
## [1] "lm"
names(pros.lm) # Here are its components
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
pros.lm # It has a special way of printing
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros.df)
## 
## Coefficients:
## (Intercept)       lcavol      lweight  
##     -0.8134       0.6515       0.6647

Utility functions

Linear models in R come with a number of utility functions, such as coef(), fitted(), residuals(), summary(), plot(), and predict(), for retrieving coefficients, fitted values, and residuals, producing summaries, producing diagnostic plots, and making predictions, respectively

These tasks can also be done manually, by extracting the components of the object returned by lm() and manipulating them appropriately. But this is discouraged, because the utility functions are simpler to use and less error-prone, as sketched below.
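
For example, both lines below (a quick sketch for illustration) return the same vector of estimated coefficients from the pros.lm object fit above, but the second, using coef(), is the preferred style:

pros.lm$coefficients # Manual extraction from the list: works, but discouraged
coef(pros.lm)        # Preferred: use the coef() utility function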

Retrieving estimated coefficients with coef()

So, what were the regression coefficients that we estimated? Use the coef() function to retrieve them:

pros.coef = coef(pros.lm) # Vector of 3 coefficients
pros.coef
## (Intercept)      lcavol     lweight 
##  -0.8134373   0.6515421   0.6647215

What does a linear regression coefficient mean, i.e., how do you interpret it? Note, from our linear model:

\[ \mathbb{E}(Y|X_1,\ldots,X_p) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]

So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, increases the expected response by \(\beta_j\)

E.g., increasing lcavol (log cancer volume) by one unit, while holding lweight (log cancer weight) fixed, increases the expected value of lpsa (log PSA score) by \(\approx 0.65\)
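
As a quick check (a sketch with hypothetical predictor values, not rows from the data), the difference in expected lpsa between two men who differ by one unit in lcavol but share the same lweight is exactly the lcavol coefficient:

x.a = c(1, 2.0, 3.5) # Hypothetical (Intercept, lcavol, lweight) values
x.b = c(1, 3.0, 3.5) # Same lweight, but lcavol increased by one unit
sum(x.b * pros.coef) - sum(x.a * pros.coef) # Equals the lcavol coefficient, about 0.65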

Retrieving fitted values with fitted()

What does our model predict for the log PSA scores of the 97 men in question? And how do these compare to the actual log PSA scores? Use the fitted() function, then plot the actual values versus the fitted ones:

pros.fits = fitted(pros.lm) # Vector of 97 fitted values
plot(pros.fits, pros.df$lpsa, main="Actual versus fitted values",
     xlab="Fitted values", ylab="Log PSA values") 
abline(a=0, b=1, lty=2, col="red") 
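
As a sanity check (a small sketch, not part of the original analysis), the fitted values agree with what we get by multiplying the design matrix by the coefficient vector ourselves:

X = model.matrix(pros.lm)         # 97 x 3 design matrix: intercept, lcavol, lweight
fits.manual = as.numeric(X %*% pros.coef)
max(abs(fits.manual - pros.fits)) # Essentially zero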

Displaying an overview with summary()

The function summary() gives us a nice summary of the linear model we fit:

summary(pros.lm) # Special way of summarizing
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61051 -0.44135 -0.04666  0.53542  1.90424 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.81344    0.65309  -1.246 0.216033    
## lcavol       0.65154    0.06693   9.734 6.75e-16 ***
## lweight      0.66472    0.18414   3.610 0.000494 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7419 on 94 degrees of freedom
## Multiple R-squared:  0.5955, Adjusted R-squared:  0.5869 
## F-statistic: 69.19 on 2 and 94 DF,  p-value: < 2.2e-16

This tells us, among other things: the estimated coefficients along with their standard errors, t statistics, and p-values; the residual standard error; and the multiple and adjusted R-squared.
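
The summary is itself an object whose components can be extracted directly; for example (a small sketch using component names from summary.lm):

pros.sum = summary(pros.lm)
pros.sum$coefficients # Matrix of estimates, standard errors, t statistics, p-values
pros.sum$r.squared    # Multiple R-squared
pros.sum$sigma        # Residual standard error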

Running diagnostics with plot()

We can use the plot() function to produce a series of diagnostic plots for our regression:

plot(pros.lm) # Special way of plotting
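
By default, these diagnostic plots are shown one at a time; a common trick (a sketch, not part of the original code) is to arrange all four in a 2 x 2 grid:

par(mfrow=c(2, 2)) # Arrange the 4 diagnostic plots in a 2 x 2 grid
plot(pros.lm)
par(mfrow=c(1, 1)) # Reset the plotting layout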

The results here are pretty good.

There is a science (and an art?) to interpreting these; you’ll learn a lot more in the Modern Regression 36-401 course

Making predictions with predict()

Suppose we had a new observation from a man whose log cancer volume was 4.1, and log cancer weight was 4.5. What would our linear model estimate his log PSA score would be? Use predict():

pros.new = data.frame(lcavol=4.1, lweight=4.5) # Must set up a new data frame
pros.pred = predict(pros.lm, newdata=pros.new) # Now call predict with new df
pros.pred
##        1 
## 4.849132
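
The predict() function can also return interval estimates; for instance (a sketch using standard arguments of predict.lm, not shown above), we can ask for a 95% prediction interval around this point prediction:

predict(pros.lm, newdata=pros.new, interval="prediction") # Fit, lower, upper bounds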

We’ll learn much more about making/evaluating predictions in our week of mini-lectures on “Statistical Prediction”

Some handy shortcuts

Here are some handy shortcuts for fitting linear models with lm() (there are also many others):
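
For example (a sketch of common formula shortcuts; the original list may differ):

lm(lpsa ~ ., data=pros.df)                    # Regress lpsa on all other variables
lm(lpsa ~ lcavol + lweight - 1, data=pros.df) # Drop the intercept term
lm(lpsa ~ lcavol * lweight, data=pros.df)     # Main effects plus their interaction
lm(lpsa ~ log(age) + lweight, data=pros.df)   # Transform a predictor inside the formula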