Split-Apply-Combine and Plyr

Statistical Computing, 36-350

Tuesday April 3, 2018

Last week: Fitting models to data

Part I

Split-apply-combine

Reminder: iterating in R without for()

We’ve learned some tools in R for iteration without explicit for() loops:

Split-apply-combine

Today we will learn a general strategy for data analysis/management that can be summmarized in three conceptual steps:

These are conceptual steps; often the apply and combine steps can be performed for us by a single call to the appropriate function from the apply() family

Simple but powerful

Does split-apply-combine sound simple? It is, but it’s very powerful when combined with the right data structures

Strikes data set

Data set on 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables:

strikes.df = 
  read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/strikes.csv")
dim(strikes.df) # Since 18 × 35 = 630, some years missing from some countries
## [1] 625   8
head(strikes.df)
##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

An interesting question

Is there a relationship between a country’s ruling party alignment (left versus right) and the volume of strikes?

washington madison

How could we approach this?

Work with just one chunk of data

Step 0: design a function that will work on one chunk of data

So let’s write code to do regression on the data from (say) just Italy

strikes.df.italy = strikes.df[strikes.df$country=="Italy", ] # Data for Italy
dim(strikes.df.italy)
## [1] 35  8
head(strikes.df.italy)
##     country year strike.volume unemployment inflation left.parliament
## 311   Italy 1951           437          8.8      14.3            37.5
## 312   Italy 1952           337          9.5       1.9            37.5
## 313   Italy 1953           545         10.0       1.4            40.2
## 314   Italy 1954           493          8.7       2.4            40.2
## 315   Italy 1955           511          7.5       2.3            40.2
## 316   Italy 1956           372          9.3       3.4            40.2
##     centralization density
## 311      0.2513799      NA
## 312      0.2489860      NA
## 313      0.2482739      NA
## 314      0.2466577      NA
## 315      0.2540366      NA
## 316      0.2457069      NA
italy.lm = lm(strike.volume ~ left.parliament, data=strikes.df.italy)
summary(italy.lm)
## 
## Call:
## lm(formula = strike.volume ~ left.parliament, data = strikes.df.italy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -930.2 -411.6 -137.3  387.2 1901.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -738.75    1200.62  -0.615    0.543
## left.parliament    40.29      27.76   1.451    0.156
## 
## Residual standard error: 583.3 on 33 degrees of freedom
## Multiple R-squared:  0.05999,    Adjusted R-squared:  0.0315 
## F-statistic: 2.106 on 1 and 33 DF,  p-value: 0.1562
plot(strikes.df.italy$left.parliament, strikes.df.italy$strike.volume,
     main="Italy strike volume versus leftwing alignment",
     ylab="Strike volume", xlab="Leftwing alignment")
abline(coef(italy.lm), col=2)

Functionalization

Now let’s turn this into a function

my.strike.lm = function(country.df) {
  coef(lm(strike.volume ~ left.parliament, data=country.df))
}
my.strike.lm(strikes.df.italy) # New way
##     (Intercept) left.parliament 
##      -738.74531        40.29109
coef(italy.lm) # Old way, same result
##     (Intercept) left.parliament 
##      -738.74531        40.29109

Split our data into appropriate chunks

Step 1: split our data into appropriate chunks, each of which can be handled by our function. Here, the function split() is often helpful: split(df, f=my.factor) splits a data frame df into several data frames, defined by constant levels of the factor my.factor

So we want to split strikes.df into 18 smaller data frames, each of which has the data for just one country

strikes.by.country = split(strikes.df, f=strikes.df$country)
class(strikes.by.country) # It's a list
## [1] "list"
names(strikes.by.country) # It has one element for each country
##  [1] "Australia"   "Austria"     "Belgium"     "Canada"      "Denmark"    
##  [6] "Finland"     "France"      "Germany"     "Ireland"     "Italy"      
## [11] "Japan"       "Netherlands" "New.Zealand" "Norway"      "Sweden"     
## [16] "Switzerland" "UK"          "USA"
head(strikes.by.country$Italy) # Same as what we saw before
##     country year strike.volume unemployment inflation left.parliament
## 311   Italy 1951           437          8.8      14.3            37.5
## 312   Italy 1952           337          9.5       1.9            37.5
## 313   Italy 1953           545         10.0       1.4            40.2
## 314   Italy 1954           493          8.7       2.4            40.2
## 315   Italy 1955           511          7.5       2.3            40.2
## 316   Italy 1956           372          9.3       3.4            40.2
##     centralization density
## 311      0.2513799      NA
## 312      0.2489860      NA
## 313      0.2482739      NA
## 314      0.2466577      NA
## 315      0.2540366      NA
## 316      0.2457069      NA

Apply our function and combine the results

Steps 2 and 3: apply our function to each chunk of data, and combine the results. Here, the functions lapply() or sapply() are often helpful

So we want to apply my.strikes.lm() to each data frame in strikes.by.country. Think about what the output will be from each function call: vector of length 2 (intercept and slope), so we can use sapply()

strikes.coefs = sapply(strikes.by.country, FUN=my.strike.lm)
strikes.coefs 
##                   Australia    Austria    Belgium    Canada     Denmark
## (Intercept)     414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament  -0.8638052  -8.210886   8.447463   17.6766    34.34477
##                  Finland      France   Germany   Ireland      Italy
## (Intercept)     108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament  12.8422  -0.4255319 -1.312305  55.46721   40.29109
##                     Japan Netherlands New.Zealand     Norway    Sweden
## (Intercept)     964.73750  -32.627678    721.3464 -458.22397 513.16704
## left.parliament -24.07595    1.694387    -10.0106   10.46523  -8.62072
##                 Switzerland        UK        USA
## (Intercept)      -5.1988836 936.10154 111.440651
## left.parliament   0.3203399 -13.42792   5.918647
# We don't care about the intercepts, only the slopes (2nd row).
# Some are positive, some are negative! Let's plot them:
plot(1:ncol(strikes.coefs), strikes.coefs[2,], xaxt="n",
     xlab="", ylab="Regression coefficient", 
     main="Countrywise labor activity by leftwing score")
axis(side=1, at=1:ncol(strikes.coefs),
     labels=colnames(strikes.coefs), las=2, cex.axis=0.5)
abline(h=0, col="grey")

Part II

Plyr: a*ply() and l*aply()

The plyr package

Most popular R package of all time (most downloads): plyr. This is true for a good reason! Provides extremely useful family of apply-like functions

a*ply(): input is an array

The signature for all a*ply() functions is:

a*ply(.data, .margins, .fun, ...)

Note that this resembles:

apply(X, MARGIN, FUN, ...)

Examples of a*ply()

my.array = array(1:27, c(3,3,3))
rownames(my.array) = c("Curly", "Larry", "Moe")
colnames(my.array) = c("Groucho", "Harpo", "Zeppo")
dimnames(my.array)[[3]] = c("Bart", "Lisa", "Maggie")
my.array
## , , Bart
## 
##       Groucho Harpo Zeppo
## Curly       1     4     7
## Larry       2     5     8
## Moe         3     6     9
## 
## , , Lisa
## 
##       Groucho Harpo Zeppo
## Curly      10    13    16
## Larry      11    14    17
## Moe        12    15    18
## 
## , , Maggie
## 
##       Groucho Harpo Zeppo
## Curly      19    22    25
## Larry      20    23    26
## Moe        21    24    27
library(plyr)
aaply(my.array, 1, sum) # Get back an array
## Curly Larry   Moe 
##   117   126   135
adply(my.array, 1, sum) # Get back a data frame
##      X1  V1
## 1 Curly 117
## 2 Larry 126
## 3   Moe 135
alply(my.array, 1, sum) # Get back a list
## $`1`
## [1] 117
## 
## $`2`
## [1] 126
## 
## $`3`
## [1] 135
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##      X1
## 1 Curly
## 2 Larry
## 3   Moe
aaply(my.array, 2:3, sum) # Get back a 3 x 3 array
##          X2
## X1        Bart Lisa Maggie
##   Groucho    6   33     60
##   Harpo     15   42     69
##   Zeppo     24   51     78
adply(my.array, 2:3, sum) # Get back a data frame
##        X1     X2 V1
## 1 Groucho   Bart  6
## 2   Harpo   Bart 15
## 3   Zeppo   Bart 24
## 4 Groucho   Lisa 33
## 5   Harpo   Lisa 42
## 6   Zeppo   Lisa 51
## 7 Groucho Maggie 60
## 8   Harpo Maggie 69
## 9   Zeppo Maggie 78
alply(my.array, 2:3, sum) # Get back a list
## $`1`
## [1] 6
## 
## $`2`
## [1] 15
## 
## $`3`
## [1] 24
## 
## $`4`
## [1] 33
## 
## $`5`
## [1] 42
## 
## $`6`
## [1] 51
## 
## $`7`
## [1] 60
## 
## $`8`
## [1] 69
## 
## $`9`
## [1] 78
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##        X1     X2
## 1 Groucho   Bart
## 2   Harpo   Bart
## 3   Zeppo   Bart
## 4 Groucho   Lisa
## 5   Harpo   Lisa
## 6   Zeppo   Lisa
## 7 Groucho Maggie
## 8   Harpo Maggie
## 9   Zeppo Maggie

l*ply(): input is a list

The signature for all l*ply() functions is:

l*ply(.data, .fun, ...)

Note that this resembles:

lapply(X, FUN, ...)

Examples of l*ply()

my.list = list(nums=rnorm(1000), lets=letters, pops=state.x77[,"Population"])
laply(my.list, range) # Get back an array
##      1                   2                 
## [1,] "-2.94398824455701" "2.88538944857663"
## [2,] "a"                 "z"               
## [3,] "365"               "21198"
ldply(my.list, range) # Get back a data frame
##    .id                V1               V2
## 1 nums -2.94398824455701 2.88538944857663
## 2 lets                 a                z
## 3 pops               365            21198
llply(my.list, range) # Get back a list
## $nums
## [1] -2.943988  2.885389
## 
## $lets
## [1] "a" "z"
## 
## $pops
## [1]   365 21198
laply(my.list, summary) # Doesn't work! Outputs have different types/lengths
## Error: Results must have one or more dimensions.
ldply(my.list, summary) # Doesn't work! Outputs have different types/lengths
## Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results do not have equal lengths
llply(my.list, summary) # Works just fine
## $nums
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.943988 -0.624965 -0.009705  0.012965  0.687484  2.885389 
## 
## $lets
##    Length     Class      Mode 
##        26 character character 
## 
## $pops
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     365    1080    2838    4246    4968   21198

The fourth option for *

The fourth option for * is _: the function a_ply() (or l*ply()) has no explicit return object, but still runs the given function over the given array (or list), possibly producing side effects

par(mfrow=c(3,3), mar=c(4,4,1,1))
a_ply(my.array, 2:3, plot, ylim=range(my.array), pch=19, col=6)

Part III

Plyr: d*ply()

d*ply() : the input is a data frame

The signature for all d*ply() functions is:

d*ply(.data, .variables, .fun, ...)

Note that this resembles:

tapply(X, INDEX, FUN, ...)

Strikes data set, revisited

# Regression coefficients separately for each country, old way:
strikes.list = split(strikes.df, f=strikes.df$country)
strikes.coefs = sapply(strikes.list, my.strike.lm)
head(strikes.coefs)
##                   Australia    Austria    Belgium    Canada     Denmark
## (Intercept)     414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament  -0.8638052  -8.210886   8.447463   17.6766    34.34477
##                  Finland      France   Germany   Ireland      Italy
## (Intercept)     108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament  12.8422  -0.4255319 -1.312305  55.46721   40.29109
##                     Japan Netherlands New.Zealand     Norway    Sweden
## (Intercept)     964.73750  -32.627678    721.3464 -458.22397 513.16704
## left.parliament -24.07595    1.694387    -10.0106   10.46523  -8.62072
##                 Switzerland        UK        USA
## (Intercept)      -5.1988836 936.10154 111.440651
## left.parliament   0.3203399 -13.42792   5.918647
# Getting regression coefficient separately for each country, new way:
strikes.coefs.a = daply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.a) # Get back an array, note the difference to sapply()
##            
## country     (Intercept) left.parliament
##   Australia   414.77123      -0.8638052
##   Austria     423.07728      -8.2108864
##   Belgium     -56.92678       8.4474627
##   Canada     -227.82177      17.6766029
##   Denmark   -1399.35735      34.3447662
##   Finland     108.22451      12.8422018
strikes.coefs.d = ddply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.d) # Get back a data frame
##     country (Intercept) left.parliament
## 1 Australia   414.77123      -0.8638052
## 2   Austria   423.07728      -8.2108864
## 3   Belgium   -56.92678       8.4474627
## 4    Canada  -227.82177      17.6766029
## 5   Denmark -1399.35735      34.3447662
## 6   Finland   108.22451      12.8422018
strikes.coefs.l = dlply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.l) # Get back a list
## $Australia
##     (Intercept) left.parliament 
##     414.7712254      -0.8638052 
## 
## $Austria
##     (Intercept) left.parliament 
##      423.077279       -8.210886 
## 
## $Belgium
##     (Intercept) left.parliament 
##      -56.926780        8.447463 
## 
## $Canada
##     (Intercept) left.parliament 
##       -227.8218         17.6766 
## 
## $Denmark
##     (Intercept) left.parliament 
##     -1399.35735        34.34477 
## 
## $Finland
##     (Intercept) left.parliament 
##        108.2245         12.8422

Splitting on two (or more) variables

The function d*ply() makes it very easy to split on two (or more) variables: we just specify them, separated by a “,” in the .variables argument

# First create a variable that indicates whether the year is pre 1975, and add
# it to the data frame
strikes.df$yearPre1975 = strikes.df$year <= 1975
# Then use (say) ddply() to compute regression coefficients for each country 
# pre and post 1975
strikes.coefs.1975 = ddply(strikes.df, .(country, yearPre1975), my.strike.lm)
dim(strikes.coefs.1975) # Note that there are 18 x 2 = 36 rows
## [1] 36  4
head(strikes.coefs.1975)
##     country yearPre1975 (Intercept) left.parliament
## 1 Australia       FALSE   973.34088     -11.8094991
## 2 Australia        TRUE  -169.59900      12.0170866
## 3   Austria       FALSE    19.51823      -0.3470889
## 4   Austria        TRUE   400.83004      -7.7051918
## 5   Belgium       FALSE -4182.06650     148.0049261
## 6   Belgium        TRUE  -103.67439       9.5802824
# We can also create factor variables on-the-fly with I(), as we've seen before
strikes.coefs.1975 = ddply(strikes.df, .(country, I(year<=1975)), my.strike.lm)
dim(strikes.coefs.1975) # Again, there are 18 x 2 = 36 rows
## [1] 36  4
head(strikes.coefs.1975)
##     country I(year <= 1975) (Intercept) left.parliament
## 1 Australia           FALSE   973.34088     -11.8094991
## 2 Australia            TRUE  -169.59900      12.0170866
## 3   Austria           FALSE    19.51823      -0.3470889
## 4   Austria            TRUE   400.83004      -7.7051918
## 5   Belgium           FALSE -4182.06650     148.0049261
## 6   Belgium            TRUE  -103.67439       9.5802824

Parallelization

What happens if we have a really large data set and we want to use split-apply-combine?

If the individual tasks are unrelated, then we should be speed up the computation by performing them in parallel

The plyr functions make this quite easy: let’s take a look at the full signature for daply():

daply(.data, .variables, .fun = NULL, ..., .progress = "none",
  .inform = FALSE, .drop_i = TRUE, .drop_o = TRUE, .parallel = FALSE,
  .paropts = NULL)

The second to last argument .parallel (default FALSE) is for parallelization. If set to TRUE, then it performs the individual tasks in parallel, using the foreach package

The last argument .paropts is for more advanced parallelization, these are additional arguments to be passed to foreach

For more, read the foreach package first. May take some time to set up the parallel backend (this is often system specific)

But once set up, parallelization is simple and beautiful with **ply()! The difference is just, e.g.,

daply(strikes.df, .(country), my.strike.lm)

versus

daply(strikes.df, .(country), my.strike.lm, .parallel=TRUE)

Summary