Statistical Computing, 36-350
Monday September 9, 2019
plyr package

One of the most downloaded R packages of all time: plyr. This is for good reason! It provides an extremely useful family of apply-like functions
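Its advantage over the built-in apply() family is its consistency

All plyr functions are of the form **ply(), with the two * characters denoting types (a = array, d = data frame, l = list):

First character: input type, one of a, d, l
Second character: output type, one of a, d, l, or _ (drop)

a*ply(): input is an array

The signature for all a*ply() functions is: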
a*ply(.data, .margins, .fun, ...)

.data : an array
.margins : index (or indices) to split the array by
.fun : the function to be applied to each piece
... : additional arguments to be passed to the function

Note that this resembles:
apply(X, MARGIN, FUN, ...)

a*ply()

library(plyr)
head(aaply(state.x77, 1, mean)) # Get back array
##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##   7261.819  71676.601  15039.031   7202.570  22854.839  13937.558

head(adply(state.x77, 1, mean)) # Get back data frame
##           X1        V1
## 1    Alabama  7261.819
## 2     Alaska 71676.601
## 3    Arizona 15039.031
## 4   Arkansas  7202.570
## 5 California 22854.839
## 6   Colorado 13937.558

head(alply(state.x77, 1, mean)) # Get back list
## $`1`
## [1] 7261.819
## 
## $`2`
## [1] 71676.6
## 
## $`3`
## [1] 15039.03
## 
## $`4`
## [1] 7202.57
## 
## $`5`
## [1] 22854.84
## 
## $`6`
## [1] 13937.56

mean.sd = function(x) c("mean"=mean(x), "sd"=sd(x))
head(aaply(state.x77, 1, mean.sd)) # Get back array
##             
## X1                mean        sd
##   Alabama     7261.819  17629.67
##   Alaska     71676.601 199923.19
##   Arizona    15039.031  39784.17
##   Arkansas    7202.570  18123.15
##   California 22854.839  54439.39
##   Colorado   13937.558  36339.05

head(adply(state.x77, 1, mean.sd)) # Get back data frame
##           X1      mean        sd
## 1    Alabama  7261.819  17629.67
## 2     Alaska 71676.601 199923.19
## 3    Arizona 15039.031  39784.17
## 4   Arkansas  7202.570  18123.15
## 5 California 22854.839  54439.39
## 6   Colorado 13937.558  36339.05

head(alply(state.x77, 1, mean.sd)) # Get back list
## $`1`
##      mean        sd 
##  7261.819 17629.674 
## 
## $`2`
##     mean       sd 
##  71676.6 199923.2 
## 
## $`3`
##     mean       sd 
## 15039.03 39784.17 
## 
## $`4`
##     mean       sd 
##  7202.57 18123.15 
## 
## $`5`
##     mean       sd 
## 22854.84 54439.39 
## 
## $`6`
##     mean       sd 
## 13937.56 36339.05

l*ply(): input is a list

The signature for all l*ply() functions is:
l*ply(.data, .fun, ...)

.data : a list
.fun : the function to be applied to each element
... : additional arguments to be passed to the function

Note that this resembles:
lapply(X, FUN, ...)

l*ply()

my.list = list(nums=rnorm(1000), lets=letters, pops=state.x77[,"Population"])
laply(my.list, range) # Get back array
##      1                   2                 
## [1,] "-3.22949780553121" "3.81258033587609"
## [2,] "a"                 "z"               
## [3,] "365"               "21198"

ldply(my.list, range) # Get back data frame
##    .id                V1               V2
## 1 nums -3.22949780553121 3.81258033587609
## 2 lets                 a                z
## 3 pops               365            21198

llply(my.list, range) # Get back list
## $nums
## [1] -3.229498  3.812580
## 
## $lets
## [1] "a" "z"
## 
## $pops
## [1]   365 21198

laply(my.list, summary) # Doesn't work! Outputs have different types/lengths
## Error: Results must have one or more dimensions.

ldply(my.list, summary) # Doesn't work! Outputs have different types/lengths
## Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results do not have equal lengths

llply(my.list, summary) # Works just fine
## $nums
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.229000 -0.696300  0.007416  0.013390  0.744600  3.813000 
## 
## $lets
##    Length     Class      Mode 
##        26 character character 
## 
## $pops
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     365    1080    2838    4246    4968   21200

d*ply(): input is a data frame

The signature for all d*ply() functions is:
d*ply(.data, .variables, .fun, ...)

.data : a data frame
.variables : variable (or variables) to split the data frame by
.fun : the function to be applied to each piece
... : additional arguments to be passed to the function

Note that this resembles:
tapply(X, INDEX, FUN, ...)

d*ply()

state.df = data.frame(state.x77, Region=state.region, Division=state.division)
daply(state.df, .(Region), function(df) mean.sd(df$Frost)) # Get back array
##                
## Region              mean       sd
##   Northeast     132.7778 30.89408
##   South          64.6250 31.30682
##   North Central 138.8333 23.89307
##   West          102.1538 68.87652

ddply(state.df, .(Region), function(df) mean.sd(df$Frost)) # Get back df
##          Region     mean       sd
## 1     Northeast 132.7778 30.89408
## 2         South  64.6250 31.30682
## 3 North Central 138.8333 23.89307
## 4          West 102.1538 68.87652

dlply(state.df, .(Region), function(df) mean.sd(df$Frost)) # Get back list
## $Northeast
##      mean        sd 
## 132.77778  30.89408 
## 
## $South
##     mean       sd 
## 64.62500 31.30682 
## 
## $`North Central`
##      mean        sd 
## 138.83333  23.89307 
## 
## $West
##      mean        sd 
## 102.15385  68.87652 
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##          Region
## 1     Northeast
## 2         South
## 3 North Central
## 4          West

The function d*ply() makes it very easy to split on two (or more) variables: we just specify them, separated by a “,” in the .variables argument
# First create a variable that indicates whether the area is big or not
state.df$AreaBig = state.df$Area > 50000
# Now use (say) ddply() to compute the mean and sd Frost, for each region, but
# separately over big and small areas
ddply(state.df, .(Region, AreaBig), function(df) mean.sd(df$Frost))
##          Region AreaBig     mean        sd
## 1     Northeast   FALSE 132.7778 30.894084
## 2         South   FALSE  76.1000 28.512960
## 3         South    TRUE  45.5000 27.833433
## 4 North Central   FALSE 123.0000  1.414214
## 5 North Central    TRUE 142.0000 25.113078
## 6          West   FALSE   0.0000        NA
## 7          West    TRUE 110.6667 64.401205

# We can also create factor variables on-the-fly with I()
ddply(state.df, .(Region, I(Area > 50000)), function(df) mean.sd(df$Frost))
##          Region I(Area > 50000)     mean        sd
## 1     Northeast           FALSE 132.7778 30.894084
## 2         South           FALSE  76.1000 28.512960
## 3         South            TRUE  45.5000 27.833433
## 4 North Central           FALSE 123.0000  1.414214
## 5 North Central            TRUE 142.0000 25.113078
## 6          West           FALSE   0.0000        NA
## 7          West            TRUE 110.6667 64.401205

The fourth option for * is _: the function a_ply() (or l_ply() or d_ply()) has no explicit return object, but still runs the given function over the given array (or list or data frame), possibly producing side effects
par(mfrow=c(3,3), mar=c(4,4,1,1))
a_ply(state.x77, 2, hist, breaks=30, col="pink")

What happens if we have a really large data set and we want to use split-apply-combine?
If the individual tasks are unrelated, then we should be able to speed up the computation by performing them in parallel
The plyr functions make this quite easy: let’s take a look at the full signature for daply():
daply(.data, .variables, .fun = NULL, ..., .progress = "none",
  .inform = FALSE, .drop_i = TRUE, .drop_o = TRUE, .parallel = FALSE,
  .paropts = NULL)

The second to last argument .parallel (default FALSE) is for parallelization. If set to TRUE, then it performs the individual tasks in parallel, using the foreach package
The last argument .paropts is for more advanced parallelization: these are additional arguments to be passed to foreach
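
To make the .parallel argument concrete, here is a minimal sketch of a parallel ddply() call. It rests on assumptions not stated above: that the doParallel package is installed and used as the foreach backend (other backends would also work), and that 2 workers are available. The helper name frost.summary is made up for illustration. plyr may print harmless warnings when run in parallel.

```r
# Sketch only: assumes the doParallel package is installed; the text above
# names only foreach, so the choice of backend here is an assumption
library(plyr)
library(doParallel)

state.df = data.frame(state.x77, Region=state.region)

# Hypothetical helper: self-contained, so it does not rely on other
# user-defined functions being available on the worker processes
frost.summary = function(df) c(mean=mean(df$Frost), sd=sd(df$Frost))

# Register a parallel backend with 2 workers for foreach to use
registerDoParallel(cores=2)

# Same split-apply-combine task, run serially and then in parallel
res.serial = ddply(state.df, .(Region), frost.summary)
res.parallel = ddply(state.df, .(Region), frost.summary, .parallel=TRUE)

stopImplicitCluster() # Release the workers

# The two results should agree
all.equal(res.serial, res.parallel)
```

For a task this small, the overhead of starting workers will likely outweigh any gain. Parallelization tends to pay off only when each piece takes a meaningful amount of time to process.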