In previous episodes

Seen functions to load data in passing
Learned about string manipulation and regexp

Outline

Getting data into and out of the system when it’s already in R format
Import and export when the data is already very structured and machine-readable
Dealing with less structured data
Intro to regression modeling

Reading data from R

You can load and save R objects
- R has its own format for this, which is shared across operating systems
- It’s an open, documented format if you really want to pry into it
save(thing, file="filename") saves the object thing in a file called filename (conventional extension: RData)
save(thing1, thing2, thing3, file="filename") saves 3 objects
load("name") loads the object or objects stored in the file called filename, with their old variable names

gmp = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F15/lectures/gmp.dat")
gmp$pop = round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.RData")
rm(gmp)
exists("gmp")

## [1] FALSE

load(file="gmp.RData")
colnames(gmp)

## [1] "MSA"   "gmp"   "pcgmp" "pop"

We can load or save more than one object at once; this is how RStudio will load your whole workspace when you’re starting, and offer to save it when you’re done
Many packages come with saved data objects; there’s the convenience function data() to load them

data(cats,package="MASS")
summary(cats)

##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50

Non-R data tables

Tables full of data, just not in the R file format
Main function: read.table()
- Presumes space-separated fields, one line per row
- Main argument is the file name or URL
- Returns a data frame
- Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file…
read.csv() is a short-cut to set the options for reading comma-separated value (CSV) files

Example: prostate cancer data

Data set from 97 men who have prostate cancer

pros = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
nrow(pros)

## [1] 97

head(pros)

##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train
## 1  TRUE
## 2  TRUE
## 3  TRUE
## 4  TRUE
## 5  TRUE
## 6  TRUE

Writing data tables

Counterpart functions write.table(), write.csv() write a data frame into a file
Drawback: takes a lot more disk space than what you get from load or save
Advantage: can communicate with other programs, and you can even edit it manually

Example: prostate cancer data

Let’s create a new fake variable, and print out a new table

pros2 = pros                                      # Make a fresh copy 
pros2$bodywt = round(runif(nrow(pros2),150,300))  # Add fake body weights
head(pros2)

##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train bodywt
## 1  TRUE    179
## 2  TRUE    265
## 3  TRUE    276
## 4  TRUE    281
## 5  TRUE    277
## 6  TRUE    162

#setwd("path_to_dir")                             # Set working dir (optional)
write.table(pros2, file="prostate2.txt", quote=F) # Now go look at it!

Less friendly data formats

We learned last week how to mine text files for data, using regular expressions (an extremely useful skill!)
The foreign package on CRAN has tools for reading data files from lots of non-R statistical software
Spreadsheets are special

Spreadsheets considered harmful

Spreadsheets look like they should be data frames
Real spreadsheets are full of ugly irregularities
- Values or formulas?
- Headers, footers, side-comments, notes
- Columns change meaning half-way down
Ought-to-be-notorious source of errors in both industry and academia

Spreadsheets, if you have to

Save the spreadsheet as a CSV; read.csv()
Save the spreadsheet as a CSV; edit in a text editor; read.csv()
Use read_excel() from the readxl package
You may still need to do a lot of frustrating hacking to get things to work

Let’s do a bit of statistical modeling

Once we have data loaded into R, we can do fun and interesting statistics with it. Now we’ll discuss linear modeling, perhaps the most fundamental tool in applied statistics. You’ll learn a lot more in future classes (especially 36-401). Here’s an intro: we posit a relationship \[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \mathrm{noise} \] between a response variable \(Y\) and predictor variables \(X_1, \ldots X_p\). Goal is to estimate the coefficients \(\beta_0, \beta_1, \ldots \beta_p\)

For example, we might think that the log PSA (prostate-specific antigen) of the men can be linearly predicted from the other variables in the data frame (Stamey et al. 1989)

lpsa: log PSA score
lcavol: log cancer volume
lweight: log prostate weight
age: age of patient
lbph: log of the amount of benign prostatic hyperplasia
svi: seminal vesicle invasion
lcp: log of capsular penetration
gleason: Gleason score
pgg45: percent of Gleason scores 4 or 5

Let’s just look at the data a bit before pursuing the linear model

colnames(pros) # The first 8 columns are the predictor variables

##  [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason"
##  [8] "pgg45"   "lpsa"    "train"

#par(ask=T)    # Have R ask us before it makes each plot, uncomment for demo
for (j in 1:8) {
  hist(pros[,j], xlab=colnames(pros)[j],
       main=paste("Histogram of",colnames(pros[j])),
       col="lightblue",breaks=20)
}

par(ask=F)     # Back to normal

What did we learn? A bunch of things!

lcp, the log amount of capsular penetration, is very skewed; a bunch of men with little (or none?), then a big spread
svi, the presence of seminal vesicle invasion, is binary
gleason, seems to take only values larger than 6; so how does it relate to pgg45, the percentage of Gleason scores 4 or 5?

(At this point, we’re confused perhaps and go to our doctor friend; we find out that pgg45 is the percentage of Gleason scores they received before their final current Gleason score, stored in gleason, that were either 4 or 5; higher Gleason score is worse, so pgg45 tells us something about their histories …)

Now let’s look at some correlations between variables

pros.cor = cor(pros[,1:8])
round(pros.cor,3)

##         lcavol lweight   age   lbph    svi    lcp gleason pgg45
## lcavol   1.000   0.281 0.225  0.027  0.539  0.675   0.432 0.434
## lweight  0.281   1.000 0.348  0.442  0.155  0.165   0.057 0.107
## age      0.225   0.348 1.000  0.350  0.118  0.128   0.269 0.276
## lbph     0.027   0.442 0.350  1.000 -0.086 -0.007   0.078 0.078
## svi      0.539   0.155 0.118 -0.086  1.000  0.673   0.320 0.458
## lcp      0.675   0.165 0.128 -0.007  0.673  1.000   0.515 0.632
## gleason  0.432   0.057 0.269  0.078  0.320  0.515   1.000 0.752
## pgg45    0.434   0.107 0.276  0.078  0.458  0.632   0.752 1.000

Some strong-ish correlations here. Let’s find the biggest one

pros.cor[upper.tri(pros.cor,diag=T)] = 0 # Why the lower tri part?
pros.cor.sorted = sort(abs(pros.cor),decreasing=T)
pros.cor.sorted[1]

## [1] 0.7519045

vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]), 
                        dim(pros.cor))
colnames(pros)[vars.big.cor]

## [1] "pgg45"   "gleason"

This is not surprising, given what we know about pgg45 and gleason; essentially what this is telling us is the following: if their Gleason score is high now,then they likely had a bad history of Gleason scores

Let’s look at the second biggest correlation, since the first one was kind of trivial

pros.cor.sorted[2]

## [1] 0.6753105

vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]), 
                        dim(pros.cor))
colnames(pros)[vars.big.cor]

## [1] "lcp"    "lcavol"

(Exercise: write yourself a function to extract the \(k\)th biggest absolute correlation)

This is more interesting! Seems like the log amount of capsular penetration lcp, and the log cancer volume lcavol are pretty related

Now let’s plot these two variables

plot(pros[,vars.big.cor])

Seems like there were a bunch of men whose capsular penetration was really small (we knew this already) and their cancer volumes are pretty much evenly distributed; but then, as the amount of capsular penetration increases, it becomes highly related to the volume of the cancer

(Should go back to our doctor friend! Ask if this make sense, ask why, etc. He tells us that when the recorded capsular penetration value is that small, it basically means that it couldn’t be accurately measured, which is why we see this spread in the cancer volume)

So now we want to form a linear model of the log PSA score on these \(p=8\) predictor variables. How? Use the lm() function:

pros.lm = lm(lpsa ~ lcavol + lweight + age + 
               lbph + svi + lcp + gleason + pgg45, 
             data=pros)
summary(pros.lm)

## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp + 
##     gleason + pgg45, data = pros)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76644 -0.35510 -0.00328  0.38087  1.55770 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.181561   1.320568   0.137  0.89096    
## lcavol       0.564341   0.087833   6.425 6.55e-09 ***
## lweight      0.622020   0.200897   3.096  0.00263 ** 
## age         -0.021248   0.011084  -1.917  0.05848 .  
## lbph         0.096713   0.057913   1.670  0.09848 .  
## svi          0.761673   0.241176   3.158  0.00218 ** 
## lcp         -0.106051   0.089868  -1.180  0.24115    
## gleason      0.049228   0.155341   0.317  0.75207    
## pgg45        0.004458   0.004365   1.021  0.31000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6328 
## F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16

Can see the estimated coefficients, and their standard errors. More next time …

Summary

Loading and saving R objects is very easy
Reading and writing data tables is pretty easy
Dealing with other formats are harder, but not impossible
Once we have data loaded, we can do a lot of cool things with it

Lecture 7: Getting Data

Statistical Computing, 36-350

Monday October 5, 2015

In previous episodes

Outline

Reading data from R

Non-R data tables

Example: prostate cancer data

Writing data tables

Example: prostate cancer data

Less friendly data formats

Spreadsheets considered harmful

Spreadsheets, if you have to

Let’s do a bit of statistical modeling

Summary