Last week: Reading in, reordering, and merging data

Part I

Exploratory data analysis

Why fit statistical (regression) models?

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You’re interested in the relationship that governs them

So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called a regression model for \(Y\) given \(X_1,\ldots,X_p\). Goal is to estimate the parameters. Why?

Prostate cancer data

Recall the data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). The measured variables:

pros.df = 
  read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/pros.dat")
dim(pros.df)
## [1] 97  9

head(pros.df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

Some example questions we might be interested in:

Exploratory data analysis

Before pursuing a specific model, it’s generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:

Distributions of prostate cancer variables

colnames(pros.df) # These are the variables
## [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason"
## [8] "pgg45"   "lpsa"
par(mfrow=c(3,3), mar=c(4,4,2,0.5)) # Setup grid, margins
for (j in 1:ncol(pros.df)) {
  hist(pros.df[,j], xlab=colnames(pros.df)[j],
       main=paste("Histogram of", colnames(pros.df)[j]),
       col="lightblue", breaks=20)
}


What did we learn? A bunch of things! E.g.,

After asking our doctor friends some questions, we learn:

Correlations between prostate cancer variables

pros.cor = cor(pros.df)
round(pros.cor,3) 
##         lcavol lweight   age   lbph    svi    lcp gleason pgg45  lpsa
## lcavol   1.000   0.281 0.225  0.027  0.539  0.675   0.432 0.434 0.734
## lweight  0.281   1.000 0.348  0.442  0.155  0.165   0.057 0.107 0.433
## age      0.225   0.348 1.000  0.350  0.118  0.128   0.269 0.276 0.170
## lbph     0.027   0.442 0.350  1.000 -0.086 -0.007   0.078 0.078 0.180
## svi      0.539   0.155 0.118 -0.086  1.000  0.673   0.320 0.458 0.566
## lcp      0.675   0.165 0.128 -0.007  0.673  1.000   0.515 0.632 0.549
## gleason  0.432   0.057 0.269  0.078  0.320  0.515   1.000 0.752 0.369
## pgg45    0.434   0.107 0.276  0.078  0.458  0.632   0.752 1.000 0.422
## lpsa     0.734   0.433 0.170  0.180  0.566  0.549   0.369 0.422 1.000

Some strong correlations! Let’s find the biggest (in absolute value):

pros.cor[lower.tri(pros.cor,diag=TRUE)] = 0 # Why only upper tri part?
pros.cor.sorted = sort(abs(pros.cor),decreasing=T)
pros.cor.sorted[1]
## [1] 0.7519045
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]), 
                        dim(pros.cor)) # Note: arrayInd() is useful
colnames(pros.df)[vars.big.cor] 
## [1] "gleason" "pgg45"

This is not surprising, given what we know about pgg45 and gleason; essentially this is saying: if their Gleason score is high now, then they likely had a bad history of Gleason scores


Let’s find the second biggest correlation (in absolute value):

pros.cor.sorted[2]
## [1] 0.7344603
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]), 
                        dim(pros.cor))
colnames(pros.df)[vars.big.cor] 
## [1] "lcavol" "lpsa"

This is more interesting! If we wanted to predict lpsa from the other variables, then it seems like we should at least include lcavol as a predictor

Visualizing relationships among variables, with pairs()

Can easily look at multiple scatter plots at once, using the pairs() function. The first argument is written like a formula, with no response variable. We’ll hold off on saying more about formulas until we get to lm(), shortly

pairs(~ lpsa + lcavol + lweight + lcp, data=pros.df)

Inspecting relationships over a subset of the observations

As we’ve seen, the lcp variable takes a bunch of really low values that don’t appear to have strong relationships with the other variables. Let’s get rid of these observations and see what the relationships look like

pros.df.subset = pros.df[pros.df$lcp > min(pros.df$lcp),]
nrow(pros.df.subset) # Beware, we've lost nearly half of our data! 
## [1] 52
pairs(~ lpsa + lcavol + lweight + lcp, data=pros.df.subset)

Testing means between two different groups

Recall that svi, the presence of seminal vesicle invasion, is binary:

table(pros.df$svi)
## 
##  0  1 
## 76 21

From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1476128/:

“When the pathologist’s report following radical prostatectomy describes seminal vesicle invasion (SVI) … prostate cancer in the areolar connective tissue around the seminal vesicles and outside the prostate … generally the outlook for the patient is poor.”

Does seminal vesicle invasion relate to the volume of cancer? Weight of cancer?


Let’s do some plotting first:

pros.df$svi = factor(pros.df$svi) 
par(mfrow=c(1,2))
plot(pros.df$svi, pros.df$lcavol, main="lcavol versus svi",
     xlab="SVI (0=no, 1=yes)", ylab="Log cancer volume")
plot(pros.df$svi, pros.df$lweight, main="lweight versus svi",
     xlab="SVI (0=no, 1=yes)", ylab="Log cancer weight")

Visually, lcavol looks like it differs a lot between the two SVI groups, but lweight perhaps does not


Now let’s try simple two-sample t-tests:

t.test(pros.df$lcavol[pros.df$svi==0],
       pros.df$lcavol[pros.df$svi==1])
## 
##  Welch Two Sample t-test
## 
## data:  pros.df$lcavol[pros.df$svi == 0] and pros.df$lcavol[pros.df$svi == 1]
## t = -8.0351, df = 51.172, p-value = 1.251e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.917326 -1.150810
## sample estimates:
## mean of x mean of y 
##  1.017892  2.551959
t.test(pros.df$lweight[pros.df$svi==0],
       pros.df$lweight[pros.df$svi==1])
## 
##  Welch Two Sample t-test
## 
## data:  pros.df$lweight[pros.df$svi == 0] and pros.df$lweight[pros.df$svi == 1]
## t = -1.8266, df = 42.949, p-value = 0.07472
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.33833495  0.01674335
## sample estimates:
## mean of x mean of y 
##  3.594131  3.754927

Confirms what we saw visually
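
As an aside, t.test() also accepts a formula interface, which gives an equivalent (and arguably tidier) way to make the same two comparisons; a quick sketch:

t.test(lcavol ~ svi, data=pros.df)  # Same as the first test above
t.test(lweight ~ svi, data=pros.df) # Same as the second test above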

Part II

Linear models

Linear regression modeling

The linear model is arguably the most widely used statistical model, and it has a place in nearly every application domain of statistics

Given response \(Y\) and predictors \(X_1,\ldots,X_p\), in a linear regression model, we posit:

\[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon, \quad \text{where $\epsilon \sim N(0,\sigma^2)$} \]

Goal is to estimate parameters, also called coefficients \(\beta_0,\beta_1,\ldots,\beta_p\)

Fitting a linear regression model with lm()

We can use lm() to fit a linear regression model. The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, …, Xp are the predictors. These refer to column names of variables in a data frame that we pass in through the data argument

E.g., for the prostate data, to regress the response variable lpsa (log PSA score) onto the predictor variables lcavol (log cancer volume) and lweight (log cancer weight):

pros.lm = lm(lpsa ~ lcavol + lweight, data=pros.df)
class(pros.lm) # Really, a specialized list
## [1] "lm"
names(pros.lm) # Here are its components
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
pros.lm # It has a special way of printing
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros.df)
## 
## Coefficients:
## (Intercept)       lcavol      lweight  
##     -0.8134       0.6515       0.6647

Utility functions

Linear models in R come with a bunch of utility functions, such as coef(), fitted(), residuals(), summary(), plot(), predict(), for retrieving coefficients, fitted values, residuals, producing summaries, producing diagnostic plots, making predictions, respectively

These tasks can also be done manually, by extracting the components of the object returned by lm() and manipulating them appropriately. But this is discouraged, because:
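
For comparison, here is a minimal sketch of the manual route versus the accessor, using the pros.lm object fitted above (both give the same numbers, but the accessor is the stable, documented interface):

pros.lm$coefficients # Manual: reach directly into the list returned by lm()
coef(pros.lm)        # Preferred: use the accessor function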

Retrieving estimated coefficients with coef()

So, what were the regression coefficients that we estimated? Use the coef() function, to retrieve them:

pros.coef = coef(pros.lm) # Vector of 3 coefficients
pros.coef
## (Intercept)      lcavol     lweight 
##  -0.8134373   0.6515421   0.6647215

What does a linear regression coefficient mean, i.e., how do you interpret it? Note, from our linear model:

\[ \mathbb{E}(Y|X_1,\ldots,X_p) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]

So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, increases the expected response by \(\beta_j\)

E.g., increasing lcavol (log cancer volume) by one unit, while holding lweight (log cancer weight) fixed, increases the expected value of lpsa (log PSA score) by \(\approx 0.65\)
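
To check this concretely: the difference in predictions between two hypothetical men who differ by one unit in lcavol but have the same lweight equals the lcavol coefficient (the lcavol and lweight values below are made up, for illustration only):

df.check = data.frame(lcavol=c(2, 3), lweight=c(3.6, 3.6)) # Differ by 1 in lcavol
diff(predict(pros.lm, newdata=df.check)) # Should equal coef(pros.lm)["lcavol"]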

Retrieving fitted values with fitted()

What does our model predict for the log PSA scores of the 97 men in question? And how do these compare to the actual log PSA scores? Use the fitted() function, then plot the actual values versus the fitted ones:

pros.fits = fitted(pros.lm) # Vector of 97 fitted values
plot(pros.fits, pros.df$lpsa, main="Actual versus fitted values",
     xlab="Fitted values", ylab="Log PSA values") 
abline(a=0, b=1, lty=2, col="red") 

Displaying an overview with summary()

The function summary() gives us a nice summary of the linear model we fit:

summary(pros.lm) # Special way of summarizing
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61051 -0.44135 -0.04666  0.53542  1.90424 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.81344    0.65309  -1.246 0.216033    
## lcavol       0.65154    0.06693   9.734 6.75e-16 ***
## lweight      0.66472    0.18414   3.610 0.000494 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7419 on 94 degrees of freedom
## Multiple R-squared:  0.5955, Adjusted R-squared:  0.5869 
## F-statistic: 69.19 on 2 and 94 DF,  p-value: < 2.2e-16

This tells us:
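
Beyond reading the printed output, summary() returns an object whose pieces can be extracted programmatically; a small sketch (these component names are standard for summaries of "lm" objects):

summary(pros.lm)$coefficients # Matrix of estimates, std. errors, t statistics, p-values
summary(pros.lm)$r.squared    # Proportion of variance explained
summary(pros.lm)$sigma        # Residual standard error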

Running diagnostics with plot()

We can use the plot() function to run a series of diagnostic tests for our regression:

plot(pros.lm) # Special way of plotting

The results are pretty good:

There is a science (and an art?) to interpreting these; you’ll learn a lot more in the Modern Regression 36-401 course
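
If we only want particular diagnostic plots, plot() for "lm" objects takes a which argument; a quick sketch:

par(mfrow=c(1,2))
plot(pros.lm, which=1) # Residuals versus fitted values
plot(pros.lm, which=2) # Normal Q-Q plot of the standardized residuals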

Making predictions with predict()

Suppose we had a new observation from a man whose log cancer volume was 4.1, and log cancer weight was 4.5. What would our linear model estimate his log PSA score would be? Use predict():

pros.new = data.frame(lcavol=4.1, lweight=4.5) # Must set up a new data frame
pros.pred = predict(pros.lm, newdata=pros.new) # Now call predict with new df
pros.pred
##        1 
## 4.849132

We’ll learn much more about making/evaluating statistical predictions later in the course
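
predict() for "lm" objects can also report uncertainty; a small sketch, reusing the pros.new data frame from above:

predict(pros.lm, newdata=pros.new, se.fit=TRUE)           # Fit plus its standard error
predict(pros.lm, newdata=pros.new, interval="prediction") # 95% prediction interval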

Some handy shortcuts

Here are some handy shortcuts, for fitting linear models with lm() (there are also many others):
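
A few commonly used ones, as a sketch (this is standard R formula syntax, not necessarily the exact list from the slide):

lm(lpsa ~ ., data=pros.df)                    # Regress lpsa on all other columns
lm(lpsa ~ . - age, data=pros.df)              # ... on all other columns except age
lm(lpsa ~ lcavol + lweight - 1, data=pros.df) # Omit the intercept
lm(lpsa ~ lcavol * lweight, data=pros.df)     # Main effects plus their interaction
lm(lpsa ~ poly(lcavol, 2) + lweight, data=pros.df) # Quadratic polynomial in lcavol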

Part III

Beyond linear models

What other kinds of models are there?

Linear regression models, as we’ve said, are useful and ubiquitous. But there’s a lot else out there. What else?

Today we’ll quickly visit logistic regression and generalized additive models. In some ways, they are similar to linear regression; in others, they’re quite different, and you’ll learn a lot more about them in the Advanced Methods for Data Analysis 36-402 course (or the Data Mining 36-462 course)

Logistic regression modeling

Given a binary response \(Y \in \{0,1\}\) and predictors \(X_1,\ldots,X_p\), in a logistic regression model we posit the relationship:

\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]

(where \(Y|X\) is shorthand for \(Y|X_1,\ldots,X_p\)). Goal is to estimate parameters, also called coefficients \(\beta_0,\beta_1,\ldots,\beta_p\)
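
Equivalently, solving for the probability itself (just a rearrangement of the display above), the model says:

\[ \mathbb{P}(Y=1|X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)} \]

so a fitted logistic regression gives us estimated probabilities, not just log odds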

Fitting a logistic regression model with glm()

We can use glm() to fit a logistic regression model. The arguments are very similar to lm()

The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, …, Xp are the predictors. These refer to column names of variables in a data frame that we pass in through the data argument. We must also specify family="binomial" to get logistic regression

E.g., for the prostate data, suppose we add a column lpsa.high to our data frame pros.df, as the indicator of whether the lpsa variable is larger than log(10) (equivalently, whether the PSA score is larger than 10). To regress the binary response variable lpsa.high onto the predictor variables lcavol and lweight:

pros.df$lpsa.high = as.numeric(pros.df$lpsa > log(10)) # New binary outcome
table(pros.df$lpsa.high) # There are 56 men with high PSA scores
## 
##  0  1 
## 41 56
pros.glm = glm(lpsa.high ~ lcavol + lweight, data=pros.df, family="binomial")
class(pros.glm) # Really, a specialized list
## [1] "glm" "lm"
pros.glm # It has a special way of printing
## 
## Call:  glm(formula = lpsa.high ~ lcavol + lweight, family = "binomial", 
##     data = pros.df)
## 
## Coefficients:
## (Intercept)       lcavol      lweight  
##     -12.551        1.520        3.034  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  94 Residual
## Null Deviance:       132.1 
## Residual Deviance: 75.44     AIC: 81.44

Utility functions work as before

For retrieving coefficients, fitted values, residuals, summarizing, plotting, making predictions, the utility functions coef(), fitted(), residuals(), summary(), plot(), predict() work pretty much just as with lm(). E.g.,

coef(pros.glm) # Logistic regression coefficients 
## (Intercept)      lcavol     lweight 
##  -12.551478    1.520006    3.034018
summary(pros.glm) # Special way of summarizing
## 
## Call:
## glm(formula = lpsa.high ~ lcavol + lweight, family = "binomial", 
##     data = pros.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7957  -0.6413   0.2192   0.5608   2.2209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.5515     3.2422  -3.871 0.000108 ***
## lcavol        1.5200     0.3604   4.218 2.47e-05 ***
## lweight       3.0340     0.8615   3.522 0.000429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.142  on 96  degrees of freedom
## Residual deviance:  75.436  on 94  degrees of freedom
## AIC: 81.436
## 
## Number of Fisher Scoring iterations: 5
p.hat = fitted(pros.glm) # These are probabilities! Not binary outcomes
y.hat = round(p.hat) # This is one way we'd compute fitted outcomes
table(y.hat, y.true=pros.df$lpsa.high) # This is a 2 x 2 "confusion matrix"
##      y.true
## y.hat  0  1
##     0 33  5
##     1  8 51
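
From the confusion matrix we can also compute the overall proportion of correct classifications, and predict() with type="response" returns an estimated probability for a new observation (the lcavol and lweight values below are hypothetical, chosen only for illustration):

mean(y.hat == pros.df$lpsa.high) # Proportion of men classified correctly
predict(pros.glm, newdata=data.frame(lcavol=2.5, lweight=3.6),
        type="response") # Estimated probability of a high PSA score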

What does a logistic regression coefficient mean?

How do you interpret a logistic regression coefficient? Note, from our logistic model:

\[ \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p) \]

So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, multiplies the odds by \(e^{\beta_j}\). E.g.,

coef(pros.glm)  
## (Intercept)      lcavol     lweight 
##  -12.551478    1.520006    3.034018

So, increasing lcavol (log cancer volume) by one unit, while holding lweight (log cancer weight) fixed, multiplies the odds of lpsa.high (high PSA score, over 10) by \(\approx e^{1.52} \approx 4.57\)
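
A quick way to get all of these multiplicative effects at once is to exponentiate the coefficient vector (a sketch):

exp(coef(pros.glm)) # Odds multipliers per one unit increase; lcavol's entry is the 4.57 above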

Creating a binary variable “on-the-fly”

We can easily create a binary variable “on-the-fly” by using the I() function inside a call to glm():

pros.glm = glm(I(lpsa > log(10)) ~ lcavol + lweight, data=pros.df, 
               family="binomial")
summary(pros.glm) # Same as before
## 
## Call:
## glm(formula = I(lpsa > log(10)) ~ lcavol + lweight, family = "binomial", 
##     data = pros.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7957  -0.6413   0.2192   0.5608   2.2209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.5515     3.2422  -3.871 0.000108 ***
## lcavol        1.5200     0.3604   4.218 2.47e-05 ***
## lweight       3.0340     0.8615   3.522 0.000429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.142  on 96  degrees of freedom
## Residual deviance:  75.436  on 94  degrees of freedom
## AIC: 81.436
## 
## Number of Fisher Scoring iterations: 5

Generalized additive modeling

Generalized additive models allow us to do something that is like linear regression or logistic regression, but with a more flexible way of modeling the effects of predictors (rather than limiting their effects to be linear). For a continuous response \(Y\), our model is:

\[ \mathbb{E}(Y|X) = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]

and the goal is to estimate \(\beta_0,f_1,\ldots,f_p\). For a binary response \(Y\), our model is:

\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]

and the goal is again to estimate \(\beta_0,f_1,\ldots,f_p\)

Fitting a generalized additive model with gam()

We can use the gam() function, from the gam package, to fit a generalized additive model. The arguments are similar to glm() (and to lm()), with a key distinction

The formula is now of the form Y ~ s(X1) + X2 + ... + s(Xp), where Y is the response and X1, …, Xp are the predictors. The notation s() is used around a predictor name to denote that we want to model this as a smooth effect (nonlinear); without this notation, we simply model it as a linear effect

So, e.g., to fit the model

\[ \mathbb{E}(\mathrm{lpsa}\,|\,\mathrm{lcavol},\mathrm{lweight}) = \beta_0 + f_1(\mathrm{lcavol}) + \beta_2 \mathrm{lweight} \]

we use:

library(gam, quiet=TRUE)
## Loaded gam 1.15
pros.gam = gam(lpsa ~ s(lcavol) + lweight, data=pros.df)
class(pros.gam) # Again, a specialized list
## [1] "Gam" "glm" "lm"
pros.gam # It has a special way of printing
## Call:
## gam(formula = lpsa ~ s(lcavol) + lweight, data = pros.df)
## 
## Degrees of Freedom: 96 total; 91.00006 Residual
## Residual Deviance: 49.40595

Most utility functions work as before

Most of our utility functions work just as before. E.g.,

summary(pros.gam)
## 
## Call: gam(formula = lpsa ~ s(lcavol) + lweight, data = pros.df)
## Deviance Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.6343202 -0.4601627  0.0004965  0.5121757  1.8516801 
## 
## (Dispersion Parameter for gaussian family taken to be 0.5429)
## 
##     Null Deviance: 127.9177 on 96 degrees of freedom
## Residual Deviance: 49.406 on 91.0001 degrees of freedom
## AIC: 223.8339 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## s(lcavol)  1 69.003  69.003 127.095 < 2.2e-16 ***
## lweight    1  7.181   7.181  13.227 0.0004571 ***
## Residuals 91 49.406   0.543                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)
## (Intercept)                      
## s(lcavol)         3 1.4344 0.2379
## lweight

But now, plot(), instead of producing a bunch of diagnostic plots, shows us the effects that were fit to each predictor (nonlinear or linear, depending on whether or not we used s()):

plot(pros.gam)

We can see that, even though we allowed lcavol to have a nonlinear effect, this didn’t seem to matter much, as its estimated effect came out to be pretty close to linear anyway!
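
And, as with lm() and glm(), predict() works on the fitted object; a small sketch, with hypothetical lcavol and lweight values inside the observed range:

predict(pros.gam, newdata=data.frame(lcavol=2.5, lweight=3.6)) # Predicted lpsa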

Summary