save(thing, file="filename")
saves the object thing
in a file called filename
(conventional extension: RData
)save(thing1, thing2, thing3, file="filename")
saves 3 objectsload("name")
loads the object or objects stored in the file called filename
, with their old variable namesgmp = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F15/lectures/gmp.dat")
gmp$pop = round(gmp$gmp/gmp$pcgmp)
save(gmp,file="gmp.RData")
rm(gmp)
exists("gmp")
## [1] FALSE
load(file="gmp.RData")
colnames(gmp)
## [1] "MSA" "gmp" "pcgmp" "pop"
data()
to load themdata(cats,package="MASS")
summary(cats)
## Sex Bwt Hwt
## F:47 Min. :2.000 Min. : 6.30
## M:97 1st Qu.:2.300 1st Qu.: 8.95
## Median :2.700 Median :10.10
## Mean :2.724 Mean :10.63
## 3rd Qu.:3.025 3rd Qu.:12.12
## Max. :3.900 Max. :20.50
read.table()
read.csv()
is a short-cut to set the options for reading comma-separated value (CSV) filesData set from 97 men who have prostate cancer
pros = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data")
nrow(pros)
## [1] 97
head(pros)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
## train
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
write.table()
, write.csv()
write a data frame into a fileload
or save
Let’s create a new fake variable, and print out a new table
pros2 = pros # Make a fresh copy
pros2$bodywt = round(runif(nrow(pros2),150,300)) # Add fake body weights
head(pros2)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
## train bodywt
## 1 TRUE 179
## 2 TRUE 265
## 3 TRUE 276
## 4 TRUE 281
## 5 TRUE 277
## 6 TRUE 162
#setwd("path_to_dir") # Set working dir (optional)
write.table(pros2, file="prostate2.txt", quote=F) # Now go look at it!
foreign
package on CRAN has tools for reading data files from lots of non-R statistical softwareread.csv()
read.csv()
read_excel()
from the readxl
packageOnce we have data loaded into R, we can do fun and interesting statistics with it. Now we’ll discuss linear modeling, perhaps the most fundamental tool in applied statistics. You’ll learn a lot more in future classes (especially 36-401). Here’s an intro: we posit a relationship \[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \mathrm{noise} \] between a response variable \(Y\) and predictor variables \(X_1, \ldots X_p\). Goal is to estimate the coefficients \(\beta_0, \beta_1, \ldots \beta_p\)
For example, we might think that the log PSA (prostate-specific antigen) of the men can be linearly predicted from the other variables in the data frame (Stamey et al. 1989)
lpsa
: log PSA scorelcavol
: log cancer volumelweight
: log prostate weightage
: age of patientlbph
: log of the amount of benign prostatic hyperplasiasvi
: seminal vesicle invasionlcp
: log of capsular penetrationgleason
: Gleason scorepgg45
: percent of Gleason scores 4 or 5Let’s just look at the data a bit before pursuing the linear model
colnames(pros) # The first 8 columns are the predictor variables
## [1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason"
## [8] "pgg45" "lpsa" "train"
#par(ask=T) # Have R ask us before it makes each plot, uncomment for demo
for (j in 1:8) {
hist(pros[,j], xlab=colnames(pros)[j],
main=paste("Histogram of",colnames(pros[j])),
col="lightblue",breaks=20)
}
par(ask=F) # Back to normal
What did we learn? A bunch of things!
lcp
, the log amount of capsular penetration, is very skewed; a bunch of men with little (or none?), then a big spreadsvi
, the presence of seminal vesicle invasion, is binarygleason
, seems to take only values larger than 6; so how does it relate to pgg45
, the percentage of Gleason scores 4 or 5?(At this point, we’re confused perhaps and go to our doctor friend; we find out that pgg45
is the percentage of Gleason scores they received before their final current Gleason score, stored in gleason
, that were either 4 or 5; higher Gleason score is worse, so pgg45
tells us something about their histories …)
Now let’s look at some correlations between variables
pros.cor = cor(pros[,1:8])
round(pros.cor,3)
## lcavol lweight age lbph svi lcp gleason pgg45
## lcavol 1.000 0.281 0.225 0.027 0.539 0.675 0.432 0.434
## lweight 0.281 1.000 0.348 0.442 0.155 0.165 0.057 0.107
## age 0.225 0.348 1.000 0.350 0.118 0.128 0.269 0.276
## lbph 0.027 0.442 0.350 1.000 -0.086 -0.007 0.078 0.078
## svi 0.539 0.155 0.118 -0.086 1.000 0.673 0.320 0.458
## lcp 0.675 0.165 0.128 -0.007 0.673 1.000 0.515 0.632
## gleason 0.432 0.057 0.269 0.078 0.320 0.515 1.000 0.752
## pgg45 0.434 0.107 0.276 0.078 0.458 0.632 0.752 1.000
Some strong-ish correlations here. Let’s find the biggest one
pros.cor[upper.tri(pros.cor,diag=T)] = 0 # Why the lower tri part?
pros.cor.sorted = sort(abs(pros.cor),decreasing=T)
pros.cor.sorted[1]
## [1] 0.7519045
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]),
dim(pros.cor))
colnames(pros)[vars.big.cor]
## [1] "pgg45" "gleason"
This is not surprising, given what we know about pgg45
and gleason
; essentially what this is telling us is the following: if their Gleason score is high now,then they likely had a bad history of Gleason scores
Let’s look at the second biggest correlation, since the first one was kind of trivial
pros.cor.sorted[2]
## [1] 0.6753105
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]),
dim(pros.cor))
colnames(pros)[vars.big.cor]
## [1] "lcp" "lcavol"
(Exercise: write yourself a function to extract the \(k\)th biggest absolute correlation)
This is more interesting! Seems like the log amount of capsular penetration lcp
, and the log cancer volume lcavol
are pretty related
Now let’s plot these two variables
plot(pros[,vars.big.cor])
Seems like there were a bunch of men whose capsular penetration was really small (we knew this already) and their cancer volumes are pretty much evenly distributed; but then, as the amount of capsular penetration increases, it becomes highly related to the volume of the cancer
(Should go back to our doctor friend! Ask if this make sense, ask why, etc. He tells us that when the recorded capsular penetration value is that small, it basically means that it couldn’t be accurately measured, which is why we see this spread in the cancer volume)
So now we want to form a linear model of the log PSA score on these \(p=8\) predictor variables. How? Use the lm()
function:
pros.lm = lm(lpsa ~ lcavol + lweight + age +
lbph + svi + lcp + gleason + pgg45,
data=pros)
summary(pros.lm)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
## gleason + pgg45, data = pros)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
Can see the estimated coefficients, and their standard errors. More next time …