Reminder: iterating in R without `for()`

We’ve learned some tools in R for iteration without explicit for() loops:

Indexing with conditionals + vectorization
apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply(): same as the above, but simplify the output (if possible)
tapply(): apply a function to levels of a factor vector

Split-apply-combine

Today we will learn a general strategy that can be summmarized in three conceptual steps:

Split whatever data object we have into meaningful chunks
Apply the function of interest to each element in this division
Combine the results into a new object of the desired structure

These are conceptual steps; often the apply and combine steps can be performed for us by a single call to the appropriate function from the apply() family

Simple but powerful

Does split-apply-combine sound simple? It is, but it’s very powerful when combined with the right data structures

As usual, compared to explicit for() loops, often requires far less code
Fits nicely with our previous ideas of top-down function design, simply at a broader scope
Sets you in the right direction towards learning how to use MapReduce/Hadoop for really, really big data sets

Strikes data set

Data set on 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables:

country, year: country and year of data collection
strike.volume: days on strike per 1000 workers
unemployment: unemployment rate
inflation: inflation rate
left.parliament: leftwing share of the goverment
centralization: centralization of unions
density: density of unions

strikes.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
dim(strikes.df) # Since 18 × 35 = 630, some years missing from some countries

## [1] 625   8

head(strikes.df)

##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

An interesting question

Is there a relationship between a country’s ruling party alignment (left versus right) and the volume of strikes?

washington madison

How could we approach this?

Worst way: by hand, write 18 separate code blocks
Bad way: explicit for() loop, where we loop over countries
Best way: split appropriately, then use sapply()

Work with just one chunk of data

Step 0: design a function that will work on one chunk of data

So let’s write code to do regression on the data from (say) just Italy

strikes.df.italy = strikes.df[strikes.df$country=="Italy", ] # Data for Italy
dim(strikes.df.italy)

## [1] 35  8

head(strikes.df.italy)

##     country year strike.volume unemployment inflation left.parliament
## 311   Italy 1951           437          8.8      14.3            37.5
## 312   Italy 1952           337          9.5       1.9            37.5
## 313   Italy 1953           545         10.0       1.4            40.2
## 314   Italy 1954           493          8.7       2.4            40.2
## 315   Italy 1955           511          7.5       2.3            40.2
## 316   Italy 1956           372          9.3       3.4            40.2
##     centralization density
## 311      0.2513799      NA
## 312      0.2489860      NA
## 313      0.2482739      NA
## 314      0.2466577      NA
## 315      0.2540366      NA
## 316      0.2457069      NA

italy.lm = lm(strike.volume ~ left.parliament, data=strikes.df.italy)
summary(italy.lm)

## 
## Call:
## lm(formula = strike.volume ~ left.parliament, data = strikes.df.italy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -930.2 -411.6 -137.3  387.2 1901.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -738.75    1200.62  -0.615    0.543
## left.parliament    40.29      27.76   1.451    0.156
## 
## Residual standard error: 583.3 on 33 degrees of freedom
## Multiple R-squared:  0.05999,    Adjusted R-squared:  0.0315 
## F-statistic: 2.106 on 1 and 33 DF,  p-value: 0.1562

plot(strikes.df.italy$left.parliament, strikes.df.italy$strike.volume,
     main="Italy strike volume versus leftwing alignment",
     ylab="Strike volume", xlab="Leftwing alignment")
abline(coef(italy.lm), col=2)

(Continued)

Now let’s turn this into a function

my.strike.lm = function(country.df) {
  coef(lm(strike.volume ~ left.parliament, data=country.df))
}
my.strike.lm(strikes.df.italy) # New way

##     (Intercept) left.parliament 
##      -738.74531        40.29109

coef(italy.lm) # Old way, same result

##     (Intercept) left.parliament 
##      -738.74531        40.29109

Split our data into appropriate chunks

Step 1: split our data into appropriate chunks, each of which can be handled by our function. Here, the function split() is often helpful: split(df, f=my.factor) splits a data frame df into several data frames, defined by constant levels of the factor my.factor

So we want to split strikes.df into 18 smaller data frames, each of which has the data for just one country

strikes.by.country = split(strikes.df, f=strikes.df$country)
class(strikes.by.country) # It's a list

## [1] "list"

names(strikes.by.country) # It has one element for each country

##  [1] "Australia"   "Austria"     "Belgium"     "Canada"      "Denmark"    
##  [6] "Finland"     "France"      "Germany"     "Ireland"     "Italy"      
## [11] "Japan"       "Netherlands" "New.Zealand" "Norway"      "Sweden"     
## [16] "Switzerland" "UK"          "USA"

head(strikes.by.country$Italy) # Same as what we saw before

##     country year strike.volume unemployment inflation left.parliament
## 311   Italy 1951           437          8.8      14.3            37.5
## 312   Italy 1952           337          9.5       1.9            37.5
## 313   Italy 1953           545         10.0       1.4            40.2
## 314   Italy 1954           493          8.7       2.4            40.2
## 315   Italy 1955           511          7.5       2.3            40.2
## 316   Italy 1956           372          9.3       3.4            40.2
##     centralization density
## 311      0.2513799      NA
## 312      0.2489860      NA
## 313      0.2482739      NA
## 314      0.2466577      NA
## 315      0.2540366      NA
## 316      0.2457069      NA

Apply our function and combine the results

Steps 2 and 3: apply our function to each chunk of data, and combine the results. Here, the functions lapply() or sapply() are often helpful

So we want to apply my.strikes.lm() to each data frame in strikes.by.country. Think about what the output will be from each function call: vector of length 2 (intercept and slope), so we can use sapply()

strikes.coefs = sapply(strikes.by.country, FUN=my.strike.lm)
strikes.coefs

##                   Australia    Austria    Belgium    Canada     Denmark
## (Intercept)     414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament  -0.8638052  -8.210886   8.447463   17.6766    34.34477
##                  Finland      France   Germany   Ireland      Italy
## (Intercept)     108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament  12.8422  -0.4255319 -1.312305  55.46721   40.29109
##                     Japan Netherlands New.Zealand     Norway    Sweden
## (Intercept)     964.73750  -32.627678    721.3464 -458.22397 513.16704
## left.parliament -24.07595    1.694387    -10.0106   10.46523  -8.62072
##                 Switzerland        UK        USA
## (Intercept)      -5.1988836 936.10154 111.440651
## left.parliament   0.3203399 -13.42792   5.918647

# We don't care about the intercepts, only the slopes (2nd row).
# Some are positive, some are negative! Let's plot them:
plot(1:ncol(strikes.coefs), strikes.coefs[2,], xaxt="n",
     xlab="", ylab="Regression coefficient", 
     main="Countrywise labor activity by leftwing score")
axis(side=1, at=1:ncol(strikes.coefs),
     labels=colnames(strikes.coefs), las=2, cex.axis=0.5)
abline(h=0, col="grey")

Split-Apply-Combine

Reminder: iterating in R without for()