Lecture 20: The Split-Apply-Combine Paradigm

3 November 2014

Agenda

A Quick Detour And Reminder of What R Can Do
Why We "Hate" For Loops, Part 75

Refresher

We've used some tools for iterating over objects in R without for loops:

subset/my.object[pick.these.rows,]: retrieve part of the data according to some condition
apply: takes a matrix and a margin, applies a function
sapply/lapply: takes a list (or vector), applies a function
c/rbind/cbind: concatenate these objects in a known pattern

General strategy

These are tools that, when put together, give us three general steps:

Split whatever data object we have into meaningful chunks
Apply the function of interest to this division
Combine the results into a new object.

Sounds simple? It is, but it's powerful when combined with data structures (see Hadoop/MapReduce for how this makes you billions)

Why Is This Useful Without The Billions?

This reinforces the pattern/function approach: What you want to do versus how you want to do it
If the full data set is big, and we've already done the splitting, this makes it tractable on smaller machines (see again: Hadoop, our well-meaning intentions for the midterm project)

An Important Application: Ragged Data

Our previous functions work well with well-composed data: apply on matrices, sapply/lapply on lists and vectors. What about ragged data – where the dimensions of each object aren't necessarily the same?

For this: Start with data frames, though we'll be going beyond this eventually.

A Sociologial Application

Politics and Labor Action: Does having a friendlier government make labor action more or less likely?

March on Washington, 1963; Wisconsin Protests, 2011

Data: Political Economy of Strikes

Compiled by Bruce Western, Sociology, Harvard.

Data frame of 8 columns: country, year, days on strike per 1000 workers, unemployment, inflation, left-wing share of gov’t, centralization of unions, union density
625 observations from 18 countries, 1951–1985
Note that since 18 × 35 = 630 > 625, some years missing from some countries

strikes <- read.csv("strikes.csv")

Data: Political Economy of Strikes

head(strikes)

##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

Research Question

Does having a friendlier government make labor action more or less likely?

becomes

Is there a relationship between a country's ruling party alignment (left vs. right) and the volume of strikes?

Lots of ways to approach this problem: simplest is to split it by country.

Split functions: subset, split, by,

Take Italy:

italy.strikes <- subset(strikes, country == "Italy")

Or, if you prefer,

italy.strikes <- strikes[strikes$country == "Italy",]
head(italy.strikes)

##     country year strike.volume unemployment inflation left.parliament
## 311   Italy 1951           437          8.8      14.3            37.5
## 312   Italy 1952           337          9.5       1.9            37.5
## 313   Italy 1953           545         10.0       1.4            40.2
## 314   Italy 1954           493          8.7       2.4            40.2
## 315   Italy 1955           511          7.5       2.3            40.2
## 316   Italy 1956           372          9.3       3.4            40.2
##     centralization density
## 311      0.2513799      NA
## 312      0.2489860      NA
## 313      0.2482739      NA
## 314      0.2466577      NA
## 315      0.2540366      NA
## 316      0.2457069      NA

(No one ever says Italy)

italy.fit <- lm(strike.volume ~ left.parliament, data=italy.strikes)
plot(strike.volume ~ left.parliament, data=italy.strikes, main="Italy Strike Volume Versus Left-Wing Alignment")
abline(italy.fit, col=2)

One Down, Seventeen To Go

Tedious and dangerous to do this repeatedly – typos abound. How can we do this in an easier way?

First: We need subsets for every country. split does this nicely:

strikes.split <- split (strikes, strikes$country)
names(strikes.split)

##  [1] "Australia"   "Austria"     "Belgium"     "Canada"      "Denmark"    
##  [6] "Finland"     "France"      "Germany"     "Ireland"     "Italy"      
## [11] "Japan"       "Netherlands" "New.Zealand" "Norway"      "Sweden"     
## [16] "Switzerland" "UK"          "USA"

One Down, Seventeen To Go

Now, let's generalize our function. We want the linear model coefficients:

my.strike.lm <- function (country.df) {
  lm(strike.volume ~ left.parliament, data=country.df)$coefficients
}
my.strike.lm (subset(strikes, country == "Italy"))

##     (Intercept) left.parliament 
##      -738.74531        40.29109

One Down, Seventeen To Go

Could for-loop it:

strike.coefs <- NULL
my.countries <- c("France", "Italy", "USA")
for (this.country in my.countries) {
  strike.coefs <- cbind(strike.coefs, 
    my.strike.lm (subset(strikes, country == this.country)))  
}
colnames(strike.coefs) <- my.countries
strike.coefs

##                      France      Italy        USA
## (Intercept)     202.4261408 -738.74531 111.440651
## left.parliament  -0.4255319   40.29109   5.918647

Easier if we've split!

strike.coefs <- lapply (strikes.split[1:3], my.strike.lm)
strike.coefs

## $Australia
##     (Intercept) left.parliament 
##     414.7712254      -0.8638052 
## 
## $Austria
##     (Intercept) left.parliament 
##      423.077279       -8.210886 
## 
## $Belgium
##     (Intercept) left.parliament 
##      -56.926780        8.447463

Easier if we've split!

Combine step:

do.call(cbind, strike.coefs)

##                   Australia    Austria    Belgium
## (Intercept)     414.7712254 423.077279 -56.926780
## left.parliament  -0.8638052  -8.210886   8.447463

Or, one step:

strike.coefs <- sapply (strikes.split[1:3], my.strike.lm)
strike.coefs

##                   Australia    Austria    Belgium
## (Intercept)     414.7712254 423.077279 -56.926780
## left.parliament  -0.8638052  -8.210886   8.447463

R makes it easy!

We can do it in one line!

coefs <- simplify2array(by (strikes, strikes$country, my.strike.lm))
coefs

##                   Australia    Austria    Belgium    Canada     Denmark
## (Intercept)     414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament  -0.8638052  -8.210886   8.447463   17.6766    34.34477
##                  Finland      France   Germany   Ireland      Italy
## (Intercept)     108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament  12.8422  -0.4255319 -1.312305  55.46721   40.29109
##                     Japan Netherlands New.Zealand     Norway    Sweden
## (Intercept)     964.73750  -32.627678    721.3464 -458.22397 513.16704
## left.parliament -24.07595    1.694387    -10.0106   10.46523  -8.62072
##                 Switzerland        UK        USA
## (Intercept)      -5.1988836 936.10154 111.440651
## left.parliament   0.3203399 -13.42792   5.918647

All together now:

plot(coefs[2,],xaxt="n",xlab="",ylab="Regression coefficient", 
     main="Countrywise Labor Activity By Left-Wing Score")
axis(side=1,at=seq(along=colnames(coefs)),labels=colnames(coefs),
las=2,cex.axis=0.5)
abline(h=0,col="grey")

Add some more coefficients

my.strike.lm.better <- function (country.df) {
  lm(strike.volume ~ left.parliament + unemployment + inflation, 
     data=country.df)$coefficients
}
coefs2 <- simplify2array(by (strikes, strikes$country, 
                             my.strike.lm.better))
coefs2[,1:4]

##                   Australia     Austria      Belgium    Canada
## (Intercept)     157.9191118 600.6777769 -243.4822938 167.07123
## left.parliament   0.5658674 -11.2441604   12.4516118  13.43864
## unemployment     -1.1181489 -10.9216990    0.3578217 -48.17903
## inflation        30.4666061  -0.5923972   10.2673539  27.21807

All together now:

plot(coefs[2,],xaxt="n",xlab="",ylab="Regression coefficient", 
     main="Countrywise Labor Activity By Left-Wing Score")
axis(side=1,at=seq(along=colnames(coefs)),labels=colnames(coefs),
las=2,cex.axis=0.5)
points (coefs2[2,], col="red"); abline(h=0,col="grey")

More functions:

by() might have unexpected side effects when we get to higher dimensions. Consider this real example:

print(load ("all-team-seasons-20132014.RData"))

## [1] "teamtable"

head (teamtable)

##      FAC_WIN FAC_LOSE PENL_TAKEN PENL_DRAWN HIT HIT_TAKEN CF CA FF FA SF
## MTL        4        1          2          1   3         6 12  5 10  4  4
## MTL1       3        2          0          0   0         3 11  9  7  5  6
## MTL2       6        7          2          2   9         7 12 13 11  9 10
## MTL3       5        2          2          1   5         6 21 12 16  8  9
## MTL4       2        1          3          0   0         0  2  1  2  1  2
## MTL5       3        2          2          2   0         1  3  1  1  1  1
##      SA GF GA Off Neu Def   TOI team opponent   season gcode scorediffcat
## MTL   3  1  0   2   3   0 632.0  MTL      TOR 20132014 20001            1
## MTL1  3  0  1   1   3   1 459.5  MTL      TOR 20132014 20001            2
## MTL2  8  1  0   3   4   6 641.0  MTL      TOR 20132014 20001            3
## MTL3  5  0  1   3   2   2 641.0  MTL      TOR 20132014 20001            4
## MTL4  1  0  0   2   1   0  91.0  MTL      TOR 20132014 20001            2
## MTL5  1  0  1   4   1   0 190.5  MTL      TOR 20132014 20001            3
##      gamestate home
## MTL          1    1
## MTL1         1    1
## MTL2         1    1
## MTL3         1    1
## MTL4         2    1
## MTL5         2    1

my.agg <- aggregate (teamtable[,2:7], list(teamtable$gcode), sum)
my.agg.2 <- aggregate (teamtable[,1:8], list(teamtable$gcode, teamtable$team), sum)
my.agg.3 <- by (teamtable[,1:8], list(teamtable$team, teamtable$gcode), colSums)
split.teamtable <- split (teamtable, list(teamtable$team, teamtable$gcode))