d*ply()
The plyr package makes life easier by making the input and output types (array, data frame, or list?) as explicit as possible
d*ply(): the input is a data frame
The signature for all d*ply() functions is:
d*ply(.data, .variables, .fun, ...)
.data: a data frame
.variables: variable (or variables) to split the data frame by
.fun: the function to be applied to each piece
...: additional arguments to be passed to the function
Note that this looks like:
tapply(X, INDEX, FUN, ...)
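To make the parallel between the two signatures concrete, here is a minimal sketch on a small made-up data frame (toy.df, its columns, and the argument name mult are hypothetical, chosen only for illustration). It shows the roles of .data, .variables, .fun, and ... next to the corresponding tapply() call.

```r
library(plyr)

# Hypothetical toy data, not part of the strikes data set
toy.df = data.frame(group = c("a", "a", "b", "b", "b"),
                    value = c(1, 3, 2, 4, 6))

# Base R: split a vector by an index, apply a function to each piece
tapply.means = tapply(toy.df$value, INDEX = toy.df$group, FUN = mean)

# plyr: split a data frame by a variable, apply a function to each piece
ddply.means = ddply(toy.df, .(group), function(piece.df) {
  c(mean.value = mean(piece.df$value))
})

# Extra arguments given after .fun are passed on to it through ...
scaled.means = ddply(toy.df, .(group), function(piece.df, mult) {
  c(scaled.mean = mult * mean(piece.df$value))
}, mult = 10)

tapply.means  # Group means: a = 2, b = 4
ddply.means   # Same numbers, returned as a data frame
scaled.means  # Group means multiplied by 10
```

The main difference is that tapply() hands each piece of a vector to the function, whereas d*ply() hands it a whole sub data frame.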
Recall the data set on the political economy of strikes:
strikes.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
head(strikes.df)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
# Function to compute coefficients from regressing number of strikes (per
# 1000 workers) on leftwing share of the government
my.strike.lm = function(country.df) {
  coef(lm(strike.volume ~ left.parliament, data=country.df))
}
# Getting regression coefficients separately for each country, old way:
strikes.list = split(strikes.df, f=strikes.df$country)
strikes.coefs = sapply(strikes.list, my.strike.lm)
head(strikes.coefs)
## Australia Austria Belgium Canada Denmark
## (Intercept) 414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament -0.8638052 -8.210886 8.447463 17.6766 34.34477
## Finland France Germany Ireland Italy
## (Intercept) 108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament 12.8422 -0.4255319 -1.312305 55.46721 40.29109
## Japan Netherlands New.Zealand Norway Sweden
## (Intercept) 964.73750 -32.627678 721.3464 -458.22397 513.16704
## left.parliament -24.07595 1.694387 -10.0106 10.46523 -8.62072
## Switzerland UK USA
## (Intercept) -5.1988836 936.10154 111.440651
## left.parliament 0.3203399 -13.42792 5.918647
# Getting regression coefficients separately for each country, new way, in
# three formats:
library(plyr)
strikes.coefs.a = daply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.a) # Get back an array, note the difference to sapply()
##
## country (Intercept) left.parliament
## Australia 414.77123 -0.8638052
## Austria 423.07728 -8.2108864
## Belgium -56.92678 8.4474627
## Canada -227.82177 17.6766029
## Denmark -1399.35735 34.3447662
## Finland 108.22451 12.8422018
strikes.coefs.d = ddply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.d) # Get back a data frame
## country (Intercept) left.parliament
## 1 Australia 414.77123 -0.8638052
## 2 Austria 423.07728 -8.2108864
## 3 Belgium -56.92678 8.4474627
## 4 Canada -227.82177 17.6766029
## 5 Denmark -1399.35735 34.3447662
## 6 Finland 108.22451 12.8422018
strikes.coefs.l = dlply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.l) # Get back a list
## $Australia
## (Intercept) left.parliament
## 414.7712254 -0.8638052
##
## $Austria
## (Intercept) left.parliament
## 423.077279 -8.210886
##
## $Belgium
## (Intercept) left.parliament
## -56.926780 8.447463
##
## $Canada
## (Intercept) left.parliament
## -227.8218 17.6766
##
## $Denmark
## (Intercept) left.parliament
## -1399.35735 34.34477
##
## $Finland
## (Intercept) left.parliament
## 108.2245 12.8422
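The three output formats hold the same numbers, just packaged differently. Here is a minimal sketch that checks this without downloading anything. It assumes the built-in mtcars data as a stand-in for the strikes data, and my.cars.lm is a hypothetical analogue of my.strike.lm.

```r
library(plyr)

# Stand-in for my.strike.lm: coefficients from regressing mpg on wt
my.cars.lm = function(cars.df) {
  coef(lm(mpg ~ wt, data = cars.df))
}

coefs.a = daply(mtcars, .(cyl), my.cars.lm)  # Array: one row per group
coefs.d = ddply(mtcars, .(cyl), my.cars.lm)  # Data frame: group column first
coefs.l = dlply(mtcars, .(cyl), my.cars.lm)  # List: one element per group

# Old way, transposed so that groups are rows, as in daply()
coefs.s = t(sapply(split(mtcars, f = mtcars$cyl), my.cars.lm))

dim(coefs.a)     # 3 groups by 2 coefficients
dim(coefs.d)     # 3 rows, and 3 columns (cyl plus 2 coefficients)
length(coefs.l)  # 3 list elements
all.equal(as.numeric(coefs.a), as.numeric(coefs.s))
```

Note that sapply() puts the groups in columns, while daply() puts them in rows, hence the transpose.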
The d*ply() functions make it very easy to split on two (or more) variables: we just specify them, separated by commas, in the .variables argument
# First create a variable that indicates whether the year is 1975 or earlier,
# and add it to the data frame
strikes.df$yearPre1975 = strikes.df$year <= 1975
# Then use (say) ddply() to compute regression coefficients for each country
# pre and post 1975
strikes.coefs.1975 = ddply(strikes.df, .(country, yearPre1975), my.strike.lm)
dim(strikes.coefs.1975) # Note that there are 18 x 2 = 36 rows
## [1] 36 4
head(strikes.coefs.1975)
## country yearPre1975 (Intercept) left.parliament
## 1 Australia FALSE 973.34088 -11.8094991
## 2 Australia TRUE -169.59900 12.0170866
## 3 Austria FALSE 19.51823 -0.3470889
## 4 Austria TRUE 400.83004 -7.7051918
## 5 Belgium FALSE -4182.06650 148.0049261
## 6 Belgium TRUE -103.67439 9.5802824
# We can also create the splitting variable on-the-fly with I(), as we've seen before
strikes.coefs.1975 = ddply(strikes.df, .(country, I(year<=1975)), my.strike.lm)
dim(strikes.coefs.1975) # Again, there are 18 x 2 = 36 rows
## [1] 36 4
head(strikes.coefs.1975)
## country I(year <= 1975) (Intercept) left.parliament
## 1 Australia FALSE 973.34088 -11.8094991
## 2 Australia TRUE -169.59900 12.0170866
## 3 Austria FALSE 19.51823 -0.3470889
## 4 Austria TRUE 400.83004 -7.7051918
## 5 Belgium FALSE -4182.06650 148.0049261
## 6 Belgium TRUE -103.67439 9.5802824
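As a further sketch of splitting on two variables, the .() helper also accepts a named expression, which gives the on-the-fly variable a cleaner column name. The example below assumes the built-in mtcars data as a stand-in, and the name heavy is hypothetical.

```r
library(plyr)

# Count the rows in each (cyl, heavy) group of the built-in mtcars data;
# "heavy" is a hypothetical name given to the on-the-fly variable wt > 3
cars.counts = ddply(mtcars, .(cyl, heavy = wt > 3), function(piece.df) {
  c(n = nrow(piece.df))
})

cars.counts
sum(cars.counts$n)  # Every row of mtcars lands in exactly one group
```

By default, only combinations that actually occur in the data get a row, so the number of rows can be smaller than the product of the numbers of levels.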
What happens if we have a really large data set and we want to use split-apply-combine?
If the individual tasks are unrelated, then we should be able to speed up the computation by performing them in parallel
The plyr functions make this quite easy: let’s take a look at the full signature for daply():
daply(.data, .variables, .fun = NULL, ..., .progress = "none",
.inform = FALSE, .drop_i = TRUE, .drop_o = TRUE, .parallel = FALSE,
.paropts = NULL)
The second to last argument .parallel (default FALSE) is for parallelization. If set to TRUE, then it performs the individual tasks in parallel, using the foreach package
The last argument .paropts is for more advanced parallelization: it is a list of additional arguments to be passed to foreach
For more, read the documentation for the foreach package first. It may take some time to set up the parallel backend (this is often system specific)
But once set up, parallelization is simple and beautiful with **ply()! The difference is just, e.g., daply(strikes.df, .(country), my.strike.lm) versus daply(strikes.df, .(country), my.strike.lm, .parallel=TRUE)
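For completeness, here is a minimal sketch of what the backend setup might look like. It assumes the doParallel package as the foreach backend (one of several options, and not something these notes prescribe), 2 worker processes, and the built-in mtcars data with a hypothetical my.cars.lm standing in for the strikes example.

```r
library(plyr)
library(doParallel)  # Assumption: doParallel as the foreach backend

# Stand-in for my.strike.lm, using the built-in mtcars data
my.cars.lm = function(cars.df) {
  coef(lm(mpg ~ wt, data = cars.df))
}

# Set up a backend with 2 worker processes (the system-specific part)
cl = makeCluster(2)
registerDoParallel(cl)

coefs.serial = daply(mtcars, .(cyl), my.cars.lm)
coefs.parallel = daply(mtcars, .(cyl), my.cars.lm, .parallel = TRUE)

# Shut the workers down when finished
stopCluster(cl)

all.equal(as.numeric(coefs.serial), as.numeric(coefs.parallel))
```

On a data set this small the parallel version will likely be slower, since starting workers and shipping data to them has a cost. The payoff comes when each piece takes a long time to process. If the applied function relies on other packages or objects, they may need to be made available to the workers through .paropts.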