Statistical Computing, 36-350
Monday November 9, 2015
We’ve used some tools for iterating over objects in R without for()
loops:
subset()
: retrieve part of the data according to some conditionapply()
: takes a matrix and a margin, applies a functionsapply()
or lapply()
: takes a list (or vector), applies a functionc()
or rbind()
or cbind()
: concatenate these objects in a known patternToday we will learn a workflow that can be summmarized in three general steps:
Sounds simple? It is, but it’s powerful when combined with data structures (see Hadoop/MapReduce for how this makes you billions)
Our previous functions work well with well-composed data: apply()
on matrices, sapply()
or lapply()
on lists and vectors. What about ragged data—where the dimensions of each object aren’t necessarily the same?
For this: start with data frames, though we’ll be going beyond this eventually
Politics and labor action: does having a friendlier government make labor action more or less likely?
Compiled by Bruce Western, Sociology Dept., Harvard.
strikes = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F15/lectures/strikes.csv")
head(strikes)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
“Does having a friendlier government make labor action more or less likely?”
becomes
“Is there a relationship between a country’s ruling party alignment (left versus right) and the volume of strikes?”
Lots of ways to approach this problem: simplest is to split it by country.
subset()
, split()
, tapply()
Take Italy:
italy.strikes = subset(strikes, country=="Italy")
Or, if you prefer,
italy.strikes = strikes[strikes$country=="Italy",]
head(italy.strikes)
## country year strike.volume unemployment inflation left.parliament
## 311 Italy 1951 437 8.8 14.3 37.5
## 312 Italy 1952 337 9.5 1.9 37.5
## 313 Italy 1953 545 10.0 1.4 40.2
## 314 Italy 1954 493 8.7 2.4 40.2
## 315 Italy 1955 511 7.5 2.3 40.2
## 316 Italy 1956 372 9.3 3.4 40.2
## centralization density
## 311 0.2513799 NA
## 312 0.2489860 NA
## 313 0.2482739 NA
## 314 0.2466577 NA
## 315 0.2540366 NA
## 316 0.2457069 NA
italy.fit = lm(strike.volume ~ left.parliament, data=italy.strikes)
plot(italy.strikes$left.parliament, italy.strikes$strike.volume,
main="Italy strike volume versus left-wing alignment",
xlab="Strike volume", ylab="Left-wing alignment")
abline(italy.fit, col=2)
Tedious and dangerous to do this repeatedly—typos abound. How can we do this in an easier way?
First: we need subsets for every country. split()
does this nicely:
strikes.split = split(strikes, strikes$country)
class(strikes.split)
## [1] "list"
names(strikes.split)
## [1] "Australia" "Austria" "Belgium" "Canada" "Denmark"
## [6] "Finland" "France" "Germany" "Ireland" "Italy"
## [11] "Japan" "Netherlands" "New.Zealand" "Norway" "Sweden"
## [16] "Switzerland" "UK" "USA"
Now, let’s generalize our function. We want the linear model coefficients:
my.strike.lm = function (country.df) {
return(lm(strike.volume ~ left.parliament, data=country.df)$coefficients)
}
my.strike.lm(subset(strikes, country=="Italy"))
## (Intercept) left.parliament
## -738.74531 40.29109
for()
loop …strike.coefs = NULL
my.countries = c("France", "Italy", "USA")
for (this.country in my.countries) {
strike.coefs = cbind(strike.coefs,
my.strike.lm (subset(strikes, country==this.country)))
}
colnames(strike.coefs) = my.countries
strike.coefs
## France Italy USA
## (Intercept) 202.4261408 -738.74531 111.440651
## left.parliament -0.4255319 40.29109 5.918647
strike.coefs = lapply (strikes.split[1:3], my.strike.lm)
strike.coefs
## $Australia
## (Intercept) left.parliament
## 414.7712254 -0.8638052
##
## $Austria
## (Intercept) left.parliament
## 423.077279 -8.210886
##
## $Belgium
## (Intercept) left.parliament
## -56.926780 8.447463
do.call(cbind, strike.coefs)
## Australia Austria Belgium
## (Intercept) 414.7712254 423.077279 -56.926780
## left.parliament -0.8638052 -8.210886 8.447463
Or, in one step:
strike.coefs = sapply(strikes.split[1:3], my.strike.lm)
strike.coefs
## Australia Austria Belgium
## (Intercept) 414.7712254 423.077279 -56.926780
## left.parliament -0.8638052 -8.210886 8.447463
coefs = sapply(strikes.split, my.strike.lm)
coefs
## Australia Austria Belgium Canada Denmark
## (Intercept) 414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament -0.8638052 -8.210886 8.447463 17.6766 34.34477
## Finland France Germany Ireland Italy
## (Intercept) 108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament 12.8422 -0.4255319 -1.312305 55.46721 40.29109
## Japan Netherlands New.Zealand Norway Sweden
## (Intercept) 964.73750 -32.627678 721.3464 -458.22397 513.16704
## left.parliament -24.07595 1.694387 -10.0106 10.46523 -8.62072
## Switzerland UK USA
## (Intercept) -5.1988836 936.10154 111.440651
## left.parliament 0.3203399 -13.42792 5.918647
plot(coefs[2,],xaxt="n",xlab="",ylab="Regression coefficient",
main="Countrywise labor ativity by left-wing score")
axis(side=1,at=seq(along=colnames(coefs)),labels=colnames(coefs),
las=2,cex.axis=0.5)
abline(h=0,col="grey")
my.strike.lm.better = function(country.df) {
return(lm(strike.volume ~ left.parliament + unemployment + inflation,
data=country.df)$coefficients)
}
coefs2 = sapply(strikes.split, my.strike.lm.better)
coefs2[,1:4]
## Australia Austria Belgium Canada
## (Intercept) 157.9191118 600.6777769 -243.4822938 167.07123
## left.parliament 0.5658674 -11.2441604 12.4516118 13.43864
## unemployment -1.1181489 -10.9216990 0.3578217 -48.17903
## inflation 30.4666061 -0.5923972 10.2673539 27.21807
plot(coefs[2,],xaxt="n",xlab="",ylab="Regression coefficient",
main="Countrywise labor ativity by left-wing score")
axis(side=1,at=seq(along=colnames(coefs)),labels=colnames(coefs),
las=2,cex.axis=0.5)
points(coefs2[2,], col="red")
abline(h=0,col="grey")
apply()
like functions that will make your life even easier, and these are also extra helpful for large data sets (next time)Space, Right Arrow or swipe left to move to next slide, click help below for more details