Lecture 7: Getting Data

Statistical Computing, 36-350

Monday October 5, 2015

In previous episodes

Outline

Reading data from R

gmp = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F15/lectures/gmp.dat")
gmp$pop = round(gmp$gmp/gmp$pcgmp)   # Population = total GMP / per-capita GMP
save(gmp, file="gmp.RData")          # Save the object to an R data file
rm(gmp)                              # Remove it from the workspace
exists("gmp")
## [1] FALSE
load(file="gmp.RData")               # Load it right back
colnames(gmp)
## [1] "MSA"   "gmp"   "pcgmp" "pop"
data(cats, package="MASS")   # Load a data set that ships with the MASS package
summary(cats)
##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50

Non-R data tables

Example: prostate cancer data

Data set from 97 men who have prostate cancer

pros = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
nrow(pros)
## [1] 97
head(pros)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train
## 1  TRUE
## 2  TRUE
## 3  TRUE
## 4  TRUE
## 5  TRUE
## 6  TRUE

Writing data tables

Example: prostate cancer data

Let’s create a new fake variable, and write out a new table

pros2 = pros                                      # Make a fresh copy 
pros2$bodywt = round(runif(nrow(pros2),150,300))  # Add fake body weights
head(pros2)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train bodywt
## 1  TRUE    282
## 2  TRUE    228
## 3  TRUE    205
## 4  TRUE    190
## 5  TRUE    208
## 6  TRUE    266
#setwd("path_to_dir")                                  # Set working dir (optional)
write.table(pros2, file="prostate2.txt", quote=FALSE)  # Now go look at it!
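As a quick sanity check (assuming the file landed in the current working directory), we can read the table right back in:

pros2.check = read.table("prostate2.txt")
head(pros2.check, 3)   # Should match the first rows of pros2 above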

Less friendly data formats

Spreadsheets considered harmful

Spreadsheets, if you have to
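A common pattern, as a sketch (the file name here is made up): export the sheet from the spreadsheet program as CSV, then read it with read.csv(), which is just read.table() with comma-separated defaults

my.data = read.csv("my_data.csv")   # Same as read.table() with sep=",", header=TRUE
head(my.data)                       # Always eyeball what you just read in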

Let’s do a bit of statistical modeling

Once we have data loaded into R, we can do fun and interesting statistics with it. Now we’ll discuss linear modeling, perhaps the most fundamental tool in applied statistics. You’ll learn a lot more in future classes (especially 36-401); here’s an intro.

We posit a relationship \[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \mathrm{noise} \] between a response variable \(Y\) and predictor variables \(X_1, \ldots, X_p\). The goal is to estimate the coefficients \(\beta_0, \beta_1, \ldots, \beta_p\)
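(One step this intro skips, for reference: the standard estimates, and the ones lm() computes below, are the least squares coefficients, i.e., the values of \(\beta_0, \beta_1, \ldots, \beta_p\) that minimize the sum of squared errors \[ \sum_{i=1}^n \big( Y_i - \beta_0 - \beta_1 X_{i1} - \ldots - \beta_p X_{ip} \big)^2 \] over the \(n\) observations)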

For example, we might think that the log PSA (prostate-specific antigen) of the men can be linearly predicted from the other variables in the data frame (Stamey et al. 1989)

Let’s just look at the data a bit before pursuing the linear model

colnames(pros) # The first 8 columns are the predictor variables
##  [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason"
##  [8] "pgg45"   "lpsa"    "train"
#par(ask=T)    # Have R ask us before it makes each plot, uncomment for demo
for (j in 1:8) {
  hist(pros[,j], xlab=colnames(pros)[j],
       main=paste("Histogram of", colnames(pros)[j]),
       col="lightblue", breaks=20)
}

par(ask=FALSE)     # Back to normal
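A variant, if you’d rather see all 8 histograms at once (not in the lecture, just standard base graphics): set up a grid of panels with par(mfrow=...) before looping

par(mfrow=c(2,4))   # 2 x 4 grid: the next 8 plots fill the panels in order
for (j in 1:8) {
  hist(pros[,j], xlab=colnames(pros)[j],
       main=paste("Histogram of", colnames(pros)[j]),
       col="lightblue", breaks=20)
}
par(mfrow=c(1,1))   # Back to one panel per figure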

What did we learn? A bunch of things!

(At this point, perhaps we’re confused, so we go to our doctor friend; we find out that pgg45 is the percentage of Gleason scores received before the current one, stored in gleason, that were either 4 or 5; a higher Gleason score is worse, so pgg45 tells us something about each man’s history …)

Now let’s look at some correlations between variables

pros.cor = cor(pros[,1:8])   # Correlation matrix of the 8 predictors
round(pros.cor, 3)
##         lcavol lweight   age   lbph    svi    lcp gleason pgg45
## lcavol   1.000   0.281 0.225  0.027  0.539  0.675   0.432 0.434
## lweight  0.281   1.000 0.348  0.442  0.155  0.165   0.057 0.107
## age      0.225   0.348 1.000  0.350  0.118  0.128   0.269 0.276
## lbph     0.027   0.442 0.350  1.000 -0.086 -0.007   0.078 0.078
## svi      0.539   0.155 0.118 -0.086  1.000  0.673   0.320 0.458
## lcp      0.675   0.165 0.128 -0.007  0.673  1.000   0.515 0.632
## gleason  0.432   0.057 0.269  0.078  0.320  0.515   1.000 0.752
## pgg45    0.434   0.107 0.276  0.078  0.458  0.632   0.752 1.000

Some strong-ish correlations here. Let’s find the biggest one

pros.cor[upper.tri(pros.cor, diag=TRUE)] = 0   # Why keep only the lower tri part?
pros.cor.sorted = sort(abs(pros.cor), decreasing=TRUE)
pros.cor.sorted[1]
## [1] 0.7519045
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]), 
                        dim(pros.cor))   # arrayInd() turns a linear index into (row, col)
colnames(pros)[vars.big.cor]
## [1] "pgg45"   "gleason"

This is not surprising, given what we know about pgg45 and gleason; essentially, it tells us the following: if a man’s Gleason score is high now, then he likely had a bad history of Gleason scores

Let’s look at the second biggest correlation, since the first one was kind of trivial

pros.cor.sorted[2]
## [1] 0.6753105
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]), 
                        dim(pros.cor))
colnames(pros)[vars.big.cor] 
## [1] "lcp"    "lcavol"

(Exercise: write yourself a function to extract the \(k\)th biggest absolute correlation)
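Here’s one possible solution, as a minimal sketch (the function name is ours, not from the lecture, and ties among correlations are ignored):

kth.biggest.cor = function(cor.mat, k) {
  cor.mat[upper.tri(cor.mat, diag=TRUE)] = 0        # Keep only the lower triangle
  cor.sorted = sort(abs(cor.mat), decreasing=TRUE)  # Sort absolute correlations
  ind = arrayInd(which(abs(cor.mat) == cor.sorted[k]), dim(cor.mat))
  list(value=cor.sorted[k], vars=colnames(cor.mat)[ind])
}
kth.biggest.cor(cor(pros[,1:8]), 2)   # Should recover lcp and lcavol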

This is more interesting! Seems like the log amount of capsular penetration, lcp, and the log cancer volume, lcavol, are pretty closely related

Now let’s plot these two variables

plot(pros[,vars.big.cor])

Seems like there are a bunch of men whose capsular penetration is really small (we knew this already) and whose cancer volumes are pretty much evenly spread out; but then, as the amount of capsular penetration increases, it becomes highly related to the volume of the cancer

(Should go back to our doctor friend! Ask if this makes sense, ask why, etc. He tells us that when the recorded capsular penetration value is that small, it basically means that it couldn’t be accurately measured, which is why we see this spread in the cancer volume)

So now we want to fit a linear model of the log PSA score on these \(p=8\) predictor variables. How? Use the lm() function:

pros.lm = lm(lpsa ~ lcavol + lweight + age + 
               lbph + svi + lcp + gleason + pgg45, 
             data=pros)
summary(pros.lm)
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp + 
##     gleason + pgg45, data = pros)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76644 -0.35510 -0.00328  0.38087  1.55770 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.181561   1.320568   0.137  0.89096    
## lcavol       0.564341   0.087833   6.425 6.55e-09 ***
## lweight      0.622020   0.200897   3.096  0.00263 ** 
## age         -0.021248   0.011084  -1.917  0.05848 .  
## lbph         0.096713   0.057913   1.670  0.09848 .  
## svi          0.761673   0.241176   3.158  0.00218 ** 
## lcp         -0.106051   0.089868  -1.180  0.24115    
## gleason      0.049228   0.155341   0.317  0.75207    
## pgg45        0.004458   0.004365   1.021  0.31000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6328 
## F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16

Can see the estimated coefficients, and their standard errors. More next time …
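Two quick asides, not from the slides but standard R: the coefficients and standard errors can be pulled out programmatically, and the long formula above can be abbreviated using the dot shorthand, which stands for all other columns in the data frame

coef(pros.lm)                               # Named vector of estimated coefficients
summary(pros.lm)$coefficients[,2]           # Standard errors (2nd column of the table)
pros.lm2 = lm(lpsa ~ . - train, data=pros)  # Same model: all columns except train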

Summary