You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You’re interested in the relationship that governs them.
So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called a regression model for \(Y\) given \(X_1,\ldots,X_p\). The goal is to estimate the parameters. Why?
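As a concrete sketch (simulated data, not from the notes): if we posit the simple parametric model \(Y|X \sim N(\theta_0 + \theta_1 X, \sigma^2)\), then estimating \(\theta\) lets us both describe the relationship and predict \(Y\) at new \(X\) values:

```r
# A minimal sketch on simulated data: posit Y | X ~ N(theta0 + theta1*X, sigma^2),
# then estimate theta = (theta0, theta1) by least squares
set.seed(1)
n = 100
x = rnorm(n)
y = 1.5 + 0.7*x + rnorm(n, sd=0.5)  # true theta = (1.5, 0.7)
coef(lm(y ~ x))                     # estimates should land near (1.5, 0.7)
```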
Data set from 97 men who have prostate cancer (from the book The Elements of Statistical Learning). The measured variables:

- lpsa: log PSA score
- lcavol: log cancer volume
- lweight: log prostate weight
- age: age of patient
- lbph: log of the amount of benign prostatic hyperplasia
- svi: seminal vesicle invasion
- lcp: log of capsular penetration
- gleason: Gleason score
- pgg45: percent of Gleason scores 4 or 5

pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pros.dat")
dim(pros.df)
## [1] 97 9
head(pros.df)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
Some example questions we might be interested in:

- What is the relationship between lcavol and lweight?
- Between svi and lcavol, lweight?
- Can we predict lpsa from the other variables?
- Can we predict whether lpsa is high or low, from the other variables?

Before pursuing a specific model, it’s generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:
colnames(pros.df) # These are the variables
## [1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason"
## [8] "pgg45" "lpsa"
par(mfrow=c(3,3), mar=c(4,4,2,0.5)) # Setup grid, margins
for (j in 1:ncol(pros.df)) {
hist(pros.df[,j], xlab=colnames(pros.df)[j],
main=paste("Histogram of", colnames(pros.df)[j]),
col="lightblue", breaks=20)
}
What did we learn? A bunch of things! E.g.,

- svi, the presence of seminal vesicle invasion, is binary
- lcp, the log amount of capsular penetration, is very skewed: a bunch of men with little (or none?), then a big spread; why is this?
- gleason, the Gleason score, takes integer values of 6 and larger; how does it relate to pgg45, the percentage of Gleason scores 4 or 5?
- lpsa, the log PSA score, is close-ish to normally distributed

After asking our doctor friends some questions, we learn:

- the smallest lcp values are all equal to min(pros.df$lcp) (\(\approx \log{0.25}\)); these look like the men with little or no capsular penetration
- pgg45 measures the percentage of 4 or 5 Gleason scores that were recorded over their visit history before their final current Gleason score, stored in gleason; a higher Gleason score is worse, so pgg45 tells us something about the severity of their cancer in the past

pros.cor = cor(pros.df)
round(pros.cor,3)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## lcavol 1.000 0.281 0.225 0.027 0.539 0.675 0.432 0.434 0.734
## lweight 0.281 1.000 0.348 0.442 0.155 0.165 0.057 0.107 0.433
## age 0.225 0.348 1.000 0.350 0.118 0.128 0.269 0.276 0.170
## lbph 0.027 0.442 0.350 1.000 -0.086 -0.007 0.078 0.078 0.180
## svi 0.539 0.155 0.118 -0.086 1.000 0.673 0.320 0.458 0.566
## lcp 0.675 0.165 0.128 -0.007 0.673 1.000 0.515 0.632 0.549
## gleason 0.432 0.057 0.269 0.078 0.320 0.515 1.000 0.752 0.369
## pgg45 0.434 0.107 0.276 0.078 0.458 0.632 0.752 1.000 0.422
## lpsa 0.734 0.433 0.170 0.180 0.566 0.549 0.369 0.422 1.000
Some strong correlations! Let’s find the biggest (in absolute value):
pros.cor[lower.tri(pros.cor,diag=TRUE)] = 0 # Why only upper tri part?
pros.cor.sorted = sort(abs(pros.cor),decreasing=T)
pros.cor.sorted[1]
## [1] 0.7519045
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]),
dim(pros.cor)) # Note: arrayInd() is useful
colnames(pros.df)[vars.big.cor]
## [1] "gleason" "pgg45"
This is not surprising, given what we know about pgg45 and gleason; essentially this is saying: if their Gleason score is high now, then they likely had a bad history of Gleason scores.
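(An aside, not from the notes: base R’s which() can return array positions directly via arr.ind=TRUE, doing the job of which() plus arrayInd() in one call. A small stand-alone example:)

```r
# which(..., arr.ind=TRUE) returns (row, column) positions directly,
# an alternative to combining which() with arrayInd()
m = matrix(c(1, 0.3, 0.3, 1), 2, 2,
           dimnames=list(c("a","b"), c("a","b")))
idx = which(m == 0.3, arr.ind=TRUE)
idx  # one row per match: positions (2,1) and (1,2)
```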
Let’s find the second biggest correlation (in absolute value):
pros.cor.sorted[2]
## [1] 0.7344603
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]),
dim(pros.cor))
colnames(pros.df)[vars.big.cor]
## [1] "lcavol" "lpsa"
This is more interesting! If we wanted to predict lpsa from the other variables, then it seems like we should at least include lcavol as a predictor.
We can easily look at multiple scatter plots at once, using the pairs() function. The first argument is written like a formula, with no response variable. We’ll hold off on describing more about formulas until we learn lm().
pairs(~ lpsa + lcavol + lweight + lcp, data=pros.df)
As we’ve seen, lcp takes a bunch of really low values that don’t appear to have strong relationships with the other variables. Let’s get rid of them and see what the relationships/correlations look like.
pros.df.subset = pros.df[pros.df$lcp > min(pros.df$lcp),]
nrow(pros.df.subset) # Beware, we've lost nearly half of our data!
## [1] 52
pairs(~ lpsa + lcavol + lweight + lcp, data=pros.df.subset)
cor(pros.df.subset[,c("lpsa","lcavol","lweight","lcp")])
## lpsa lcavol lweight lcp
## lpsa 1.0000000 0.6241707 0.2434663 0.5296747
## lcavol 0.6241707 1.0000000 0.2209352 0.8049728
## lweight 0.2434663 0.2209352 1.0000000 0.1134501
## lcp 0.5296747 0.8049728 0.1134501 1.0000000
Recall that svi, the presence of seminal vesicle invasion, is binary:
table(pros.df$svi)
##
## 0 1
## 76 21
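(A small aside, not from the notes: prop.table() converts counts to proportions. Sketched here with the counts hard-coded from the table above, so it runs without the data file:)

```r
# prop.table() divides each count by the total, giving proportions
svi.counts = c("0"=76, "1"=21)    # counts from table(pros.df$svi) above
round(prop.table(svi.counts), 3)  # about 78% without SVI, 22% with
```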
From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1476128/:
“When the pathologist’s report following radical prostatectomy describes seminal vesicle invasion (SVI) … prostate cancer in the areolar connective tissue around the seminal vesicles and outside the prostate … generally the outlook for the patient is poor.”
Does seminal vesicle invasion relate to the volume of cancer? Weight of cancer? Let’s do some plotting first:
pros.df$svi = factor(pros.df$svi)
par(mfrow=c(1,2))
plot(pros.df$svi, pros.df$lcavol, main="lcavol versus svi",
xlab="SVI (0=no, 1=yes)", ylab="Log cancer volume")
plot(pros.df$svi, pros.df$lweight, main="lweight versus svi",
xlab="SVI (0=no, 1=yes)", ylab="Log cancer weight")
Visually, lcavol looks like it has a big difference, but lweight perhaps does not. Let’s try simple two-sample t-tests:
t.test(pros.df$lcavol[pros.df$svi==0],
pros.df$lcavol[pros.df$svi==1])
##
## Welch Two Sample t-test
##
## data: pros.df$lcavol[pros.df$svi == 0] and pros.df$lcavol[pros.df$svi == 1]
## t = -8.0351, df = 51.172, p-value = 1.251e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.917326 -1.150810
## sample estimates:
## mean of x mean of y
## 1.017892 2.551959
t.test(pros.df$lweight[pros.df$svi==0],
pros.df$lweight[pros.df$svi==1])
##
## Welch Two Sample t-test
##
## data: pros.df$lweight[pros.df$svi == 0] and pros.df$lweight[pros.df$svi == 1]
## t = -1.8266, df = 42.949, p-value = 0.07472
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.33833495 0.01674335
## sample estimates:
## mean of x mean of y
## 3.594131 3.754927
Confirms what we saw visually
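As an aside (not in the notes), t.test() also accepts a formula, which reads more like the regression formulas coming up; sketched below on simulated stand-in data, since pros.df requires the download above:

```r
# t.test()'s formula interface: response ~ group, where the group has 2 levels.
# Stand-in data mimicking the svi/lcavol comparison (not the real pros.df)
set.seed(1)
toy = data.frame(svi    = factor(rep(0:1, c(76, 21))),
                 lcavol = c(rnorm(76, mean=1.0), rnorm(21, mean=2.6)))
t.test(lcavol ~ svi, data=toy)  # equivalent to splitting lcavol by svi
```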
Say we want to fit a simple linear regression model of lpsa (log PSA score) on lcavol (log cancer volume). How? Use the lm() function, with the first argument being a regression formula:
pros.lm = lm(lpsa ~ lcavol, data=pros.df)
coef(pros.lm) # These are the coefficients
## (Intercept) lcavol
## 1.5072975 0.7193204
summary(pros.lm) # This is a summary
##
## Call:
## lm(formula = lpsa ~ lcavol, data = pros.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67624 -0.41648 0.09859 0.50709 1.89672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.50730 0.12194 12.36 <2e-16 ***
## lcavol 0.71932 0.06819 10.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7875 on 95 degrees of freedom
## Multiple R-squared: 0.5394, Adjusted R-squared: 0.5346
## F-statistic: 111.3 on 1 and 95 DF, p-value: < 2.2e-16
# Here's a plot of the relationship with the regression line
plot(pros.df$lcavol, pros.df$lpsa,
xlab="Log cancer volume", ylab="Log PSA score")
abline(a=coef(pros.lm)[1], b=coef(pros.lm)[2], col="red")
To regress lpsa onto (say) lcavol and lweight, we simply change the formula to lpsa ~ lcavol + lweight. We’ll see much more in the next mini-lecture.
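A minimal sketch of what that looks like, on simulated stand-in data (with the real data you would run lm(lpsa ~ lcavol + lweight, data=pros.df)):

```r
# Multiple regression: adding a predictor just extends the formula with "+".
# Simulated stand-in for pros.df, so this runs without the download
set.seed(1)
df = data.frame(lcavol = rnorm(97), lweight = rnorm(97))
df$lpsa = 1.5 + 0.6*df$lcavol + 0.4*df$lweight + rnorm(97, sd=0.7)
fit = lm(lpsa ~ lcavol + lweight, data=df)
coef(fit)  # now three coefficients: intercept, lcavol, lweight
```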