- A Quick Detour And Reminder of What R Can Do
- Why We "Hate" For Loops, Part 75
3 November 2014
We've used some tools for iterating over objects in R without for
loops:
my.object[pick.these.rows,]
: retrieve part of the data according to some conditionThese are tools that, when put together, give us three general steps:
Sounds simple? It is, but it's powerful when combined with data structures (see Hadoop/MapReduce for how this makes you billions)
Our previous functions work well with well-composed data: apply
on matrices, sapply/lapply
on lists and vectors. What about ragged data – where the dimensions of each object aren't necessarily the same?
For this: Start with data frames, though we'll be going beyond this eventually.
Politics and Labor Action: Does having a friendlier government make labor action more or less likely?
March on Washington, 1963; Wisconsin Protests, 2011
Compiled by Bruce Western, Sociology, Harvard.
strikes <- read.csv("strikes.csv")
head(strikes)
## country year strike.volume unemployment inflation left.parliament ## 1 Australia 1951 296 1.3 19.8 43.0 ## 2 Australia 1952 397 2.2 17.2 43.0 ## 3 Australia 1953 360 2.5 4.3 43.0 ## 4 Australia 1954 3 1.7 0.7 47.0 ## 5 Australia 1955 326 1.4 2.0 38.5 ## 6 Australia 1956 352 1.8 6.3 38.5 ## centralization density ## 1 0.3748588 NA ## 2 0.3751829 NA ## 3 0.3745076 NA ## 4 0.3710170 NA ## 5 0.3752675 NA ## 6 0.3716072 NA
Does having a friendlier government make labor action more or less likely?
becomes
Is there a relationship between a country's ruling party alignment (left vs. right) and the volume of strikes?
Lots of ways to approach this problem: simplest is to split it by country.
Take Italy:
italy.strikes <- subset(strikes, country == "Italy")
Or, if you prefer,
italy.strikes <- strikes[strikes$country == "Italy",] head(italy.strikes)
## country year strike.volume unemployment inflation left.parliament ## 311 Italy 1951 437 8.8 14.3 37.5 ## 312 Italy 1952 337 9.5 1.9 37.5 ## 313 Italy 1953 545 10.0 1.4 40.2 ## 314 Italy 1954 493 8.7 2.4 40.2 ## 315 Italy 1955 511 7.5 2.3 40.2 ## 316 Italy 1956 372 9.3 3.4 40.2 ## centralization density ## 311 0.2513799 NA ## 312 0.2489860 NA ## 313 0.2482739 NA ## 314 0.2466577 NA ## 315 0.2540366 NA ## 316 0.2457069 NA
italy.fit <- lm(strike.volume ~ left.parliament, data=italy.strikes) plot(strike.volume ~ left.parliament, data=italy.strikes, main="Italy Strike Volume Versus Left-Wing Alignment") abline(italy.fit, col=2)
Tedious and dangerous to do this repeatedly – typos abound. How can we do this in an easier way?
First: We need subsets for every country. split
does this nicely:
strikes.split <- split (strikes, strikes$country) names(strikes.split)
## [1] "Australia" "Austria" "Belgium" "Canada" "Denmark" ## [6] "Finland" "France" "Germany" "Ireland" "Italy" ## [11] "Japan" "Netherlands" "New.Zealand" "Norway" "Sweden" ## [16] "Switzerland" "UK" "USA"
Now, let's generalize our function. We want the linear model coefficients:
my.strike.lm <- function (country.df) { lm(strike.volume ~ left.parliament, data=country.df)$coefficients } my.strike.lm (subset(strikes, country == "Italy"))
## (Intercept) left.parliament ## -738.74531 40.29109
Could for
-loop it:
strike.coefs <- NULL my.countries <- c("France", "Italy", "USA") for (this.country in my.countries) { strike.coefs <- cbind(strike.coefs, my.strike.lm (subset(strikes, country == this.country))) } colnames(strike.coefs) <- my.countries strike.coefs
## France Italy USA ## (Intercept) 202.4261408 -738.74531 111.440651 ## left.parliament -0.4255319 40.29109 5.918647
strike.coefs <- lapply (strikes.split[1:3], my.strike.lm) strike.coefs
## $Australia ## (Intercept) left.parliament ## 414.7712254 -0.8638052 ## ## $Austria ## (Intercept) left.parliament ## 423.077279 -8.210886 ## ## $Belgium ## (Intercept) left.parliament ## -56.926780 8.447463
Combine step:
do.call(cbind, strike.coefs)
## Australia Austria Belgium ## (Intercept) 414.7712254 423.077279 -56.926780 ## left.parliament -0.8638052 -8.210886 8.447463
Or, one step:
strike.coefs <- sapply (strikes.split[1:3], my.strike.lm) strike.coefs
## Australia Austria Belgium ## (Intercept) 414.7712254 423.077279 -56.926780 ## left.parliament -0.8638052 -8.210886 8.447463
We can do it in one line!
coefs <- simplify2array(by (strikes, strikes$country, my.strike.lm)) coefs
## Australia Austria Belgium Canada Denmark ## (Intercept) 414.7712254 423.077279 -56.926780 -227.8218 -1399.35735 ## left.parliament -0.8638052 -8.210886 8.447463 17.6766 34.34477 ## Finland France Germany Ireland Italy ## (Intercept) 108.2245 202.4261408 95.657134 -94.78661 -738.74531 ## left.parliament 12.8422 -0.4255319 -1.312305 55.46721 40.29109 ## Japan Netherlands New.Zealand Norway Sweden ## (Intercept) 964.73750 -32.627678 721.3464 -458.22397 513.16704 ## left.parliament -24.07595 1.694387 -10.0106 10.46523 -8.62072 ## Switzerland UK USA ## (Intercept) -5.1988836 936.10154 111.440651 ## left.parliament 0.3203399 -13.42792 5.918647
plot(coefs[2,],xaxt="n",xlab="",ylab="Regression coefficient", main="Countrywise Labor Activity By Left-Wing Score") axis(side=1,at=seq(along=colnames(coefs)),labels=colnames(coefs), las=2,cex.axis=0.5) abline(h=0,col="grey")
my.strike.lm.better <- function (country.df) { lm(strike.volume ~ left.parliament + unemployment + inflation, data=country.df)$coefficients } coefs2 <- simplify2array(by (strikes, strikes$country, my.strike.lm.better)) coefs2[,1:4]
## Australia Austria Belgium Canada ## (Intercept) 157.9191118 600.6777769 -243.4822938 167.07123 ## left.parliament 0.5658674 -11.2441604 12.4516118 13.43864 ## unemployment -1.1181489 -10.9216990 0.3578217 -48.17903 ## inflation 30.4666061 -0.5923972 10.2673539 27.21807
plot(coefs[2,],xaxt="n",xlab="",ylab="Regression coefficient", main="Countrywise Labor Activity By Left-Wing Score") axis(side=1,at=seq(along=colnames(coefs)),labels=colnames(coefs), las=2,cex.axis=0.5) points (coefs2[2,], col="red"); abline(h=0,col="grey")
by()
might have unexpected side effects when we get to higher dimensions. Consider this real example:
print(load ("all-team-seasons-20132014.RData"))
## [1] "teamtable"
head (teamtable)
## FAC_WIN FAC_LOSE PENL_TAKEN PENL_DRAWN HIT HIT_TAKEN CF CA FF FA SF ## MTL 4 1 2 1 3 6 12 5 10 4 4 ## MTL1 3 2 0 0 0 3 11 9 7 5 6 ## MTL2 6 7 2 2 9 7 12 13 11 9 10 ## MTL3 5 2 2 1 5 6 21 12 16 8 9 ## MTL4 2 1 3 0 0 0 2 1 2 1 2 ## MTL5 3 2 2 2 0 1 3 1 1 1 1 ## SA GF GA Off Neu Def TOI team opponent season gcode scorediffcat ## MTL 3 1 0 2 3 0 632.0 MTL TOR 20132014 20001 1 ## MTL1 3 0 1 1 3 1 459.5 MTL TOR 20132014 20001 2 ## MTL2 8 1 0 3 4 6 641.0 MTL TOR 20132014 20001 3 ## MTL3 5 0 1 3 2 2 641.0 MTL TOR 20132014 20001 4 ## MTL4 1 0 0 2 1 0 91.0 MTL TOR 20132014 20001 2 ## MTL5 1 0 1 4 1 0 190.5 MTL TOR 20132014 20001 3 ## gamestate home ## MTL 1 1 ## MTL1 1 1 ## MTL2 1 1 ## MTL3 1 1 ## MTL4 2 1 ## MTL5 2 1
my.agg <- aggregate (teamtable[,2:7], list(teamtable$gcode), sum) my.agg.2 <- aggregate (teamtable[,1:8], list(teamtable$gcode, teamtable$team), sum) my.agg.3 <- by (teamtable[,1:8], list(teamtable$team, teamtable$gcode), colSums) split.teamtable <- split (teamtable, list(teamtable$team, teamtable$gcode))