Reminder: split-apply-combine and plyr

d*ply() : the input is a data frame

The signature for all d*ply() functions is:

d*ply(.data, .variables, .fun, ...)

Note that this closely mirrors the signature of tapply():

tapply(X, INDEX, FUN, ...)
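
To make the parallel concrete, here is a minimal sketch on a small made-up data frame (toy.df and its columns are hypothetical, not part of any data set used here). Both calls compute one mean per group. The main difference is that tapply() is handed a vector, whereas daply() hands each piece of the data frame to the function:

```r
library(plyr)

# Hypothetical toy data, made up for illustration
toy.df = data.frame(group = c("a", "a", "b", "b", "b"),
                    value = c(1, 2, 3, 4, 5))

# Base R: split the vector value by group, apply mean() to each piece
means.tapply = tapply(toy.df$value, INDEX=toy.df$group, FUN=mean)

# plyr: split the data frame by group, apply a function to each sub-data-frame
means.daply = daply(toy.df, .(group), function(df) mean(df$value))

means.tapply
means.daply
```

Because the function passed to daply() receives a whole sub-data-frame, it can use several columns at once, which tapply() cannot do directly.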

Strikes data set, revisited

Recall the data set on the political economy of strikes:

strikes.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
head(strikes.df)
##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA
# Function to compute coefficients from regressing number of strikes (per 
# 1000 workers) on leftwing share of the government
my.strike.lm = function(country.df) {
  coef(lm(strike.volume ~ left.parliament, data=country.df))
}

# Getting regression coefficients separately for each country, old way:
strikes.list = split(strikes.df, f=strikes.df$country)
strikes.coefs = sapply(strikes.list, my.strike.lm)
head(strikes.coefs)
##                   Australia    Austria    Belgium    Canada     Denmark
## (Intercept)     414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament  -0.8638052  -8.210886   8.447463   17.6766    34.34477
##                  Finland      France   Germany   Ireland      Italy
## (Intercept)     108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament  12.8422  -0.4255319 -1.312305  55.46721   40.29109
##                     Japan Netherlands New.Zealand     Norway    Sweden
## (Intercept)     964.73750  -32.627678    721.3464 -458.22397 513.16704
## left.parliament -24.07595    1.694387    -10.0106   10.46523  -8.62072
##                 Switzerland        UK        USA
## (Intercept)      -5.1988836 936.10154 111.440651
## left.parliament   0.3203399 -13.42792   5.918647

# Getting regression coefficients separately for each country, new way, in 
# three formats:
library(plyr)
strikes.coefs.a = daply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.a) # Get back an array; note it is transposed relative to sapply()
##            
## country     (Intercept) left.parliament
##   Australia   414.77123      -0.8638052
##   Austria     423.07728      -8.2108864
##   Belgium     -56.92678       8.4474627
##   Canada     -227.82177      17.6766029
##   Denmark   -1399.35735      34.3447662
##   Finland     108.22451      12.8422018
strikes.coefs.d = ddply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.d) # Get back a data frame
##     country (Intercept) left.parliament
## 1 Australia   414.77123      -0.8638052
## 2   Austria   423.07728      -8.2108864
## 3   Belgium   -56.92678       8.4474627
## 4    Canada  -227.82177      17.6766029
## 5   Denmark -1399.35735      34.3447662
## 6   Finland   108.22451      12.8422018
strikes.coefs.l = dlply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.l) # Get back a list
## $Australia
##     (Intercept) left.parliament 
##     414.7712254      -0.8638052 
## 
## $Austria
##     (Intercept) left.parliament 
##      423.077279       -8.210886 
## 
## $Belgium
##     (Intercept) left.parliament 
##      -56.926780        8.447463 
## 
## $Canada
##     (Intercept) left.parliament 
##       -227.8218         17.6766 
## 
## $Denmark
##     (Intercept) left.parliament 
##     -1399.35735        34.34477 
## 
## $Finland
##     (Intercept) left.parliament 
##        108.2245         12.8422
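
The orientation difference between sapply() and daply() is easy to miss. The following sketch, on simulated data (toy.df, toy.lm and the column names are made up for illustration), suggests that the daply() result is the transpose of the sapply() result, with groups running along the rows instead of the columns:

```r
library(plyr)

# Simulated toy data, made up for illustration: three groups, six points each
set.seed(1)
toy.df = data.frame(group = rep(c("a", "b", "c"), each=6),
                    x = rep(1:6, times=3))
toy.df$y = 2 + 3*toy.df$x + rnorm(18)

# Regression coefficients for one group
toy.lm = function(df) { coef(lm(y ~ x, data=df)) }

# Old way: coefficients in rows, groups in columns
coefs.old = sapply(split(toy.df, f=toy.df$group), toy.lm)
# New way: groups in rows, coefficients in columns
coefs.new = daply(toy.df, .(group), toy.lm)

dim(coefs.old)
dim(coefs.new)
# Same numbers, up to a transpose (ignoring dimnames attributes)
all.equal(t(coefs.old), coefs.new, check.attributes=FALSE)
```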

Splitting on two (or more) variables

The d*ply() functions make it very easy to split on two (or more) variables: we just list them, separated by commas, in the .variables argument

# First create a variable that indicates whether the year is 1975 or earlier,
# and add it to the data frame
strikes.df$yearPre1975 = strikes.df$year <= 1975
# Then use (say) ddply() to compute regression coefficients for each country 
# pre and post 1975
strikes.coefs.1975 = ddply(strikes.df, .(country, yearPre1975), my.strike.lm)
dim(strikes.coefs.1975) # Note that there are 18 x 2 = 36 rows
## [1] 36  4
head(strikes.coefs.1975)
##     country yearPre1975 (Intercept) left.parliament
## 1 Australia       FALSE   973.34088     -11.8094991
## 2 Australia        TRUE  -169.59900      12.0170866
## 3   Austria       FALSE    19.51823      -0.3470889
## 4   Austria        TRUE   400.83004      -7.7051918
## 5   Belgium       FALSE -4182.06650     148.0049261
## 6   Belgium        TRUE  -103.67439       9.5802824
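
As a small check on how the pieces are formed, the sketch below splits a made-up data frame on two variables and returns one row per combination (toy.df, early and mean.value are hypothetical names chosen for illustration):

```r
library(plyr)

# Hypothetical toy data: 2 countries, 4 years each
toy.df = data.frame(country = rep(c("A", "B"), each=4),
                    year = rep(c(1970, 1974, 1980, 1984), times=2),
                    value = 1:8)
toy.df$early = toy.df$year <= 1975

# One mean per (country, early) combination
toy.means = ddply(toy.df, .(country, early),
                  function(df) c(mean.value = mean(df$value)))
dim(toy.means) # 2 countries x 2 periods = 4 rows; 2 split columns + 1 result
toy.means
```

By default, only combinations that actually occur in the data produce a row, so a country with no observations in one of the two periods would contribute one row rather than two.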

# We can also create factor variables on-the-fly with I(), as we've seen before
strikes.coefs.1975 = ddply(strikes.df, .(country, I(year<=1975)), my.strike.lm)
dim(strikes.coefs.1975) # Again, there are 18 x 2 = 36 rows
## [1] 36  4
head(strikes.coefs.1975)
##     country I(year <= 1975) (Intercept) left.parliament
## 1 Australia           FALSE   973.34088     -11.8094991
## 2 Australia            TRUE  -169.59900      12.0170866
## 3   Austria           FALSE    19.51823      -0.3470889
## 4   Austria            TRUE   400.83004      -7.7051918
## 5   Belgium           FALSE -4182.06650     148.0049261
## 6   Belgium            TRUE  -103.67439       9.5802824

Parallelization

What happens if we have a really large data set and we want to use split-apply-combine?

If the individual tasks are unrelated, then we should be able to speed up the computation by performing them in parallel

The plyr functions make this quite easy: let’s take a look at the full signature for daply():

daply(.data, .variables, .fun = NULL, ..., .progress = "none",
  .inform = FALSE, .drop_i = TRUE, .drop_o = TRUE, .parallel = FALSE,
  .paropts = NULL)

The second to last argument .parallel (default FALSE) is for parallelization. If set to TRUE, then it performs the individual tasks in parallel, using the foreach package

The last argument .paropts is for more advanced parallelization: it is a list of additional arguments to be passed to foreach

For more, read the foreach package documentation first. Setting up the parallel backend may take some time (this is often system specific)

But once set up, parallelization is simple and beautiful with **ply()! The difference is just, e.g., daply(strikes.df, .(country), my.strike.lm) versus daply(strikes.df, .(country), my.strike.lm, .parallel=TRUE)
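
As one possible setup, the sketch below registers a backend with the doParallel package. This is an assumption on our part: doParallel is just one of several foreach backends, the choice of 2 workers is arbitrary, and the toy data (toy.df, toy.lm) are simulated for illustration. On a problem this small, the parallel version will likely be slower than the serial one because of the overhead of starting the workers; the point is only that the two calls should return the same numbers:

```r
library(plyr)
library(doParallel) # Also attaches foreach and parallel

# Simulated toy data, made up for illustration: four groups of 25 points
set.seed(1)
toy.df = data.frame(group = rep(c("a", "b", "c", "d"), each=25),
                    x = rnorm(100))
toy.df$y = 1 + 2*toy.df$x + rnorm(100)

# Regression coefficients for one group
toy.lm = function(df) { coef(lm(y ~ x, data=df)) }

# Start 2 worker processes and register them as the foreach backend
# (assumption: 2 workers; adjust to your machine)
cl = makeCluster(2)
registerDoParallel(cl)

coefs.serial = daply(toy.df, .(group), toy.lm)
coefs.parallel = daply(toy.df, .(group), toy.lm, .parallel=TRUE)

# Shut down the workers when finished
stopCluster(cl)

# Compare the two results
all.equal(coefs.serial, coefs.parallel)
```

If .parallel=TRUE is used without first registering a backend, the tasks are not actually distributed, so the registration step is the part worth getting right for your system.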