We’ve already covered a lot of material this semester. Let’s consider what we’ll do with R in a typical application:
Read in some data – maybe something nicely formatted, maybe something we have to parse.
Identify interesting subsets of the data (sometimes)
Fit a model to the data
And/Or do something over and over again
Make our results presentable ot others (usually with a plot)
You’re asked to do all of these things on the midterm. We’ll take a short look at each of them now.
Sometimes we have nicely formatted data, in which case we can use read.table()
or read.csv()
. These give us a data frame.
pros = read.table('http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data')
class(pros)
## [1] "data.frame"
Let’s compare the top of the file and the resulting data frame:
From the file:
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
1 -0.579818495 2.769459 50 -1.38629436 0 -1.38629436 6 0 -0.4307829 T
2 -0.994252273 3.319626 58 -1.38629436 0 -1.38629436 6 0 -0.1625189 T
3 -0.510825624 2.691243 74 -1.38629436 0 -1.38629436 7 20 -0.1625189 T
4 -1.203972804 3.282789 58 -1.38629436 0 -1.38629436 6 0 -0.1625189 T
5 0.751416089 3.432373 62 -1.38629436 0 -1.38629436 6 0 0.3715636 T
Our data frame:
head(pros,5)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## train
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
We looked at a trickier example in Homework 4. One entry from the file looked like this:
<tr><td colspan="1" rowspan="1" class="date"><div style="display:block;" class="skedStartDateSite">Wed Oct 7, 2015</div><div style="display:none;" class="skedStartDateLocal">Wed Oct 7, 2015</div></td><td colspan="1" rowspan="1" class="team"><!-- Away --><a onclick="loadTeamSpotlight(jQuery(this));" rel="NYR" shape="rect" href="javascript:void(0);"><img title="New York Rangers" alt="New York Rangers" class="team-logo " src="http://cdn.nhle.com/nhl/images/logos/teams/rangers_logo.svgz?v=8.47"/></a><div class="teamName"><a onclick="loadTeamSpotlight(jQuery(this));" rel="NYR" shape="rect" href="javascript:void(0);">NY Rangers</a></div></td><td colspan="1" rowspan="1" class="team"><!-- Home --><a onclick="loadTeamSpotlight(jQuery(this));" rel="CHI" shape="rect" href="javascript:void(0);"><img title="Chicago Blackhawks" alt="Chicago Blackhawks" class="team-logo " src="http://cdn.nhle.com/nhl/images/logos/teams/blackhawks_logo.svgz?v=8.47"/></a><div class="teamName"><a onclick="loadTeamSpotlight(jQuery(this));" rel="CHI" shape="rect" href="javascript:void(0);">Chicago</a></div></td><td colspan="1" rowspan="1" class="time"><!-- Time --><div style="display:block;" class="skedStartTimeEST">8:00 PM ET</div><div style="display:none;" class="skedStartTimeLocal">7:00 PM CT</div></td><td colspan="1" rowspan="1" class="tvInfo"><!-- Network -->
NBCSN</td><td colspan="1" rowspan="1" class="skedLinks"><!-- Button Links --><a class="btn" shape="rect" href="http://www.ticketmaster.com/Chicago-Blackhawks-tickets/artist/805913?intcmp=tm204060&wt.mc_id=NHL_LEAGUE_TIX_LIST_PR_CHI"><span>TICKETS›</span></a><a class="btn" shape="rect" href="http://www.ticketsnow.com/NHL/chicago-blackhawks-tickets/?intcmp=tm204030&wt.mc_id=NHL_LEAGUE_TIX_LIST_EX_CHI"><span>BUY/SELL›</span></a></td></tr>
We read it in as a raw text file:
nhl1516 = readLines('http://www.stat.cmu.edu/~ryantibs/statcomp-F15/homework/NHL1516.html')
length(nhl1516)
## [1] 3083
As a brief example, let’s remember how we extracted the date. The relevant part of each entry looked like this:
<div style="display:block;" class="skedStartDateSite">Wed Oct 7, 2015</div><div style="display:none;" class="skedStartDateLocal">Wed Oct 7, 2015</div>
We used regular expressions to isolate the parts we were interested in. What does .*
do? Why do we match all the way out to skedStartDateLocal
?
datematch = regexpr("DateSite\">.*</div>.*skedStartDateLocal",nhl1516)
dates = regmatches(x=nhl1516,m=datematch)
head(dates)
## [1] "DateSite\">Wed Oct 7, 2015</div><div style=\"display:none;\" class=\"skedStartDateLocal"
## [2] "DateSite\">Wed Oct 7, 2015</div><div style=\"display:none;\" class=\"skedStartDateLocal"
## [3] "DateSite\">Wed Oct 7, 2015</div><div style=\"display:none;\" class=\"skedStartDateLocal"
## [4] "DateSite\">Wed Oct 7, 2015</div><div style=\"display:none;\" class=\"skedStartDateLocal"
## [5] "DateSite\">Thu Oct 8, 2015</div><div style=\"display:none;\" class=\"skedStartDateLocal"
## [6] "DateSite\">Thu Oct 8, 2015</div><div style=\"display:none;\" class=\"skedStartDateLocal"
We can then pull out just the part of the string we want:
date = substr(dates,start=11,stop=25)
head(date)
## [1] "Wed Oct 7, 2015" "Wed Oct 7, 2015" "Wed Oct 7, 2015" "Wed Oct 7, 2015"
## [5] "Thu Oct 8, 2015" "Thu Oct 8, 2015"
It’s straightforward to generate vectors of TRUE
and FALSE
in R. We can then use those vectors to extract interesting portions of the data.
For example, going back to the prostate data set, suppose we wanted to compare young vs. old patients:
young_patients = (pros$age <= 55)
old_patients = (pros$age > 55)
head(young_patients)
## [1] TRUE FALSE FALSE FALSE FALSE TRUE
We could then compare the younger and older patients:
mean( pros$lweight[young_patients] )
## [1] 3.349398
mean( pros$lweight[old_patients] )
## [1] 3.661074
We can also look directly at the data frame, restricted to those values:
head(pros[young_patients,])
## lcavol lweight age lbph svi lcp gleason pgg45
## 1 -0.579818495 2.769459 50 -1.386294 0 -1.386294 6 0
## 6 -1.049822124 3.228826 50 -1.386294 0 -1.386294 6 0
## 9 -0.776528789 3.539509 47 -1.386294 0 -1.386294 6 0
## 19 -0.562118918 3.267666 41 -1.386294 0 -1.386294 6 0
## 34 0.009950331 3.267666 54 -1.386294 0 -1.386294 6 0
## 49 1.745715531 3.498022 43 -1.386294 0 -1.386294 6 0
## lpsa train
## 1 -0.4307829 TRUE
## 6 0.7654678 TRUE
## 9 1.0473190 FALSE
## 19 1.5581446 TRUE
## 34 2.0215476 FALSE
## 49 2.5915164 FALSE
Why do we need the comma?
Let’s revisit an example from the lecture. Using the prostate cancer data, we had a proposed model: \[\mathrm{lpsa} = a \mathrm{lcavol} + b \mathrm{lweight}\]
We used lm
to fit this model to the data:
pros.lm.1 = lm(lpsa ~ lcavol + lweight, data=pros)
summary(pros.lm.1)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight, data = pros)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61051 -0.44135 -0.04666 0.53542 1.90424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.81344 0.65309 -1.246 0.216033
## lcavol 0.65154 0.06693 9.734 6.75e-16 ***
## lweight 0.66472 0.18414 3.610 0.000494 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7419 on 94 degrees of freedom
## Multiple R-squared: 0.5955, Adjusted R-squared: 0.5869
## F-statistic: 69.19 on 2 and 94 DF, p-value: < 2.2e-16
We specified the model with lpsa ~ lcavol + lweight
.
We could extract the coefficients:
coef(pros.lm.1)
## (Intercept) lcavol lweight
## -0.8134373 0.6515421 0.6647215
We could also extract the fitted values for each data point:
head(fitted(pros.lm.1))
## 1 2 3 4 5 6
## 0.6497055 0.7453923 0.6426654 0.5842642 1.9577140 0.6488295
This is one of the reasons we use computers – they’re good at doing the same thing over and over again without making mistakes or getting bored. In R, we have three main tools:
for
and while
loops
Functions from the apply
family
Defining our own functions
for
loops execute a set of commands a fixed number of times, while
loops execute a set of commands until a condition is satisfied. Let’s look at a simple for loop:
for(i in 1:ncol(pros)){
print( colnames(pros)[i] )
print( mean(pros[,i]) )
}
## [1] "lcavol"
## [1] 1.35001
## [1] "lweight"
## [1] 3.628943
## [1] "age"
## [1] 63.86598
## [1] "lbph"
## [1] 0.1003556
## [1] "svi"
## [1] 0.2164948
## [1] "lcp"
## [1] -0.1793656
## [1] "gleason"
## [1] 6.752577
## [1] "pgg45"
## [1] 24.38144
## [1] "lpsa"
## [1] 2.478387
## [1] "train"
## [1] 0.6907216
In this case, we could have accomplished a similar result using lapply
:
lapply(pros,mean)
## $lcavol
## [1] 1.35001
##
## $lweight
## [1] 3.628943
##
## $age
## [1] 63.86598
##
## $lbph
## [1] 0.1003556
##
## $svi
## [1] 0.2164948
##
## $lcp
## [1] -0.1793656
##
## $gleason
## [1] 6.752577
##
## $pgg45
## [1] 24.38144
##
## $lpsa
## [1] 2.478387
##
## $train
## [1] 0.6907216
Functions are a tool to break big problems into a series of small problems. As an added bonus, you may find that the solutions to the small problems are something you can use over and over again.
Basic syntax:
first_col_summary = function(df){
this.name = colnames(df)[1]
this.mean = mean(df[,1])
this.sd = sd(df[,1])
return( list(name=this.name,mean = this.mean,sd=this.sd) )
}
first_col_summary(pros)
## $name
## [1] "lcavol"
##
## $mean
## [1] 1.35001
##
## $sd
## [1] 1.178625
If you create a variable within the function, it won’t “survive” outside the function.
numberchanger = function(){
the_best_number = 4
return(the_best_number)
}
numberchanger()
## [1] 4
#the_best_number
If a variable outside the function and inside the function have the same name, the function will ignore the “outside” copy of the variable. It won’t modify it, and will work with its own copy.
the_best_number = 2
numberchanger = function(){
the_best_number = 4
return(the_best_number)
}
numberchanger()
## [1] 4
the_best_number
## [1] 2
If a variable isn’t defined within the function, but is defined outside, the function will refer to the outside value:
the_best_number = 2
numberchanger = function(){
return(the_best_number)
}
numberchanger()
## [1] 2
the_best_number
## [1] 2
But remember, if I try to modify the_best_number
from inside my function, R will instead create a local copy of the_best_number
. Generally speaking, getting variable values this way instead of as arguments is a poor programming practice.
Usually, we ploy x,y
:
plot(pros$lpsa,pros$lcavol)
When plotting columns from the same data frame, we can instead plot y~x
:
plot(lpsa~lcavol,data=pros)
Label everything:
plot(lpsa~lcavol,data=pros,xlab='log cancer volume',ylab='log PSA score',main='Nice Plot!')
Use color to make your plots more informative:
plot(lpsa~lcavol,data=pros[young_patients,],xlab='log cancer volume',ylab='log PSA score',main='Nice Plot!',col='red')
points(pros$lcavol[old_patients],pros$lpsa[old_patients],col='blue')
And don’t forget how to plot things side-by-side by setting mfrow
:
par(mfrow=c(1,2))
plot(lpsa~lcavol,data=pros,xlab='log cancer volume',ylab='log PSA score',main='Super')
plot(lpsa~lweight,data=pros,xlab='log cancer weight',ylab='log PSA score',main='Sweet!')