plyr package
One of the most downloaded R packages of all time: plyr. This is true for a good reason! It provides an extremely useful family of apply-like functions
Its advantage over the built-in apply() family is its consistency
plyr functions are of the form **ply()
Replace ** with characters denoting types:
First character: input type, one of a (array), d (data frame), l (list)
Second character: output type, one of a, d, l, or _ (drop)
a*ply(): input is an array
The signature for all a*ply() functions is:
a*ply(.data, .margins, .fun, ...)
.data: an array
.margins: index (or indices) to split the array by
.fun: the function to be applied to each piece
...: additional arguments to be passed to the function
Note that this resembles:
apply(X, MARGIN, FUN, ...)
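The resemblance to apply() can be checked directly. The following is a minimal sketch, assuming plyr is installed; the variable names are made up for illustration. It computes row and column means of state.x77 with both functions and compares the values only.

```r
library(plyr)

# Row means: aaply() with .margins = 1 versus apply() with MARGIN = 1
row.means.plyr = aaply(state.x77, 1, mean)
row.means.base = apply(state.x77, 1, mean)

# Column means: .margins = 2 splits the array by column instead
col.means.plyr = aaply(state.x77, 2, mean)
col.means.base = apply(state.x77, 2, mean)

# Compare values only (as.vector() drops names and other attributes)
all.equal(as.vector(row.means.plyr), as.vector(row.means.base))
all.equal(as.vector(col.means.plyr), as.vector(col.means.base))
```

The difference from apply() is on the output side: swapping aaply() for adply() or alply() changes the return type without changing the rest of the call.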
Examples of a*ply()
library(plyr)
head(aaply(state.x77, 1, mean)) # Get back array
## Alabama Alaska Arizona Arkansas California Colorado
## 7261.819 71676.601 15039.031 7202.570 22854.839 13937.558
head(adply(state.x77, 1, mean)) # Get back data frame
## X1 V1
## 1 Alabama 7261.819
## 2 Alaska 71676.601
## 3 Arizona 15039.031
## 4 Arkansas 7202.570
## 5 California 22854.839
## 6 Colorado 13937.558
head(alply(state.x77, 1, mean)) # Get back list
## $`1`
## [1] 7261.819
##
## $`2`
## [1] 71676.6
##
## $`3`
## [1] 15039.03
##
## $`4`
## [1] 7202.57
##
## $`5`
## [1] 22854.84
##
## $`6`
## [1] 13937.56
mean.sd = function(x) c("mean"=mean(x), "sd"=sd(x))
head(aaply(state.x77, 1, mean.sd)) # Get back array
##
## X1 mean sd
## Alabama 7261.819 17629.67
## Alaska 71676.601 199923.19
## Arizona 15039.031 39784.17
## Arkansas 7202.570 18123.15
## California 22854.839 54439.39
## Colorado 13937.558 36339.05
head(adply(state.x77, 1, mean.sd)) # Get back data frame
## X1 mean sd
## 1 Alabama 7261.819 17629.67
## 2 Alaska 71676.601 199923.19
## 3 Arizona 15039.031 39784.17
## 4 Arkansas 7202.570 18123.15
## 5 California 22854.839 54439.39
## 6 Colorado 13937.558 36339.05
head(alply(state.x77, 1, mean.sd)) # Get back list
## $`1`
## mean sd
## 7261.819 17629.674
##
## $`2`
## mean sd
## 71676.6 199923.2
##
## $`3`
## mean sd
## 15039.03 39784.17
##
## $`4`
## mean sd
## 7202.57 18123.15
##
## $`5`
## mean sd
## 22854.84 54439.39
##
## $`6`
## mean sd
## 13937.56 36339.05
l*ply(): input is a list
The signature for all l*ply() functions is:
l*ply(.data, .fun, ...)
.data: a list
.fun: the function to be applied to each element
...: additional arguments to be passed to the function
Note that this resembles:
lapply(X, FUN, ...)
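As a quick check of this resemblance, here is a small sketch, assuming plyr is installed. The list scores is hypothetical example data. It shows that arguments placed after .fun are forwarded to the function, just as with lapply().

```r
library(plyr)

# A small deterministic list (hypothetical example data)
scores = list(first=c(1, 2, 3, 4), second=c(10, 20, 30))

# Extra arguments after .fun are forwarded to the function, as with lapply()
med.plyr = llply(scores, quantile, probs=0.5)
med.base = lapply(scores, quantile, probs=0.5)

# Both give one median per list element
all.equal(med.plyr, med.base, check.attributes=FALSE)
```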
Examples of l*ply()
my.list = list(nums=rnorm(1000), lets=letters, pops=state.x77[,"Population"])
laply(my.list, range) # Get back array
## 1 2
## [1,] "-3.01216378355869" "3.54114027762577"
## [2,] "a" "z"
## [3,] "365" "21198"
ldply(my.list, range) # Get back data frame
## .id V1 V2
## 1 nums -3.01216378355869 3.54114027762577
## 2 lets a z
## 3 pops 365 21198
llply(my.list, range) # Get back list
## $nums
## [1] -3.012164 3.541140
##
## $lets
## [1] "a" "z"
##
## $pops
## [1] 365 21198
laply(my.list, summary) # Doesn't work! Outputs have different types/lengths
## Error: Results must have one or more dimensions.
ldply(my.list, summary) # Doesn't work! Outputs have different types/lengths
## Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results do not have equal lengths
llply(my.list, summary) # Works just fine
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.012000 -0.677000 -0.003001 0.015490 0.733500 3.541000
##
## $lets
## Length Class Mode
## 26 character character
##
## $pops
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 365 1080 2838 4246 4968 21200
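laply() and ldply() have to combine all results into a single array or data frame, so the results need compatible lengths, and mixed types are coerced to a common type (character in the range example above). The sketch below, assuming plyr is installed and using a hypothetical all-numeric list, shows the case where this combination goes through cleanly.

```r
library(plyr)

# Hypothetical list whose elements are all numeric
num.list = list(small=c(3, 1, 2), large=c(100, 300, 200))

# Each call to range() returns two numbers, so the results stack into a
# numeric matrix with one row per list element
rng = laply(num.list, range)
is.numeric(rng)
dim(rng)

# The same results as a data frame: an .id column plus one column per value
rng.df = ldply(num.list, range)
rng.df
```

When results differ in type or length, as with summary() above, llply() is the safe choice, since a list places no such constraints on its elements.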
d*ply(): input is a data frame
The signature for all d*ply() functions is:
d*ply(.data, .variables, .fun, ...)
.data: a data frame
.variables: variable (or variables) to split the data frame by
.fun: the function to be applied to each piece
...: additional arguments to be passed to the function
Note that this resembles:
tapply(X, INDEX, FUN, ...)
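To make the comparison with tapply() concrete, here is a minimal sketch, assuming plyr is installed; the variable names are illustrative. One difference worth noting: tapply() splits a single vector, while the function given to d*ply() receives a whole piece of the data frame.

```r
library(plyr)

state.df = data.frame(state.x77, Region=state.region)

# Mean Frost by region, base R: split a vector by a factor
frost.base = tapply(state.df$Frost, state.df$Region, mean)

# Same computation with plyr: split the whole data frame by Region, then
# pick out the Frost column inside the function
frost.plyr = daply(state.df, .(Region), function(df) mean(df$Frost))

# Compare values only
all.equal(as.vector(frost.plyr), as.vector(frost.base))
```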
Examples of d*ply()
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
daply(state.df, .(Region), function(df) mean.sd(df$Frost)) # Get back array
##
## Region mean sd
## Northeast 132.7778 30.89408
## South 64.6250 31.30682
## North Central 138.8333 23.89307
## West 102.1538 68.87652
ddply(state.df, .(Region), function(df) mean.sd(df$Frost)) # Get back df
## Region mean sd
## 1 Northeast 132.7778 30.89408
## 2 South 64.6250 31.30682
## 3 North Central 138.8333 23.89307
## 4 West 102.1538 68.87652
dlply(state.df, .(Region), function(df) mean.sd(df$Frost)) # Get back list
## $Northeast
## mean sd
## 132.77778 30.89408
##
## $South
## mean sd
## 64.62500 31.30682
##
## $`North Central`
## mean sd
## 138.83333 23.89307
##
## $West
## mean sd
## 102.15385 68.87652
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## Region
## 1 Northeast
## 2 South
## 3 North Central
## 4 West
The d*ply() functions make it very easy to split on two (or more) variables: we just specify them, separated by a "," in the .variables argument
# First create a variable that indicates whether the area is big or not
state.df$AreaBig = state.df$Area > 50000
# Now use (say) ddply() to compute the mean and sd Frost, for each region, but
# separately over big and small areas
ddply(state.df, .(Region, AreaBig), function(df) mean.sd(df$Frost))
## Region AreaBig mean sd
## 1 Northeast FALSE 132.7778 30.894084
## 2 South FALSE 76.1000 28.512960
## 3 South TRUE 45.5000 27.833433
## 4 North Central FALSE 123.0000 1.414214
## 5 North Central TRUE 142.0000 25.113078
## 6 West FALSE 0.0000 NA
## 7 West TRUE 110.6667 64.401205
# We can also create factor variables on-the-fly with I()
ddply(state.df, .(Region, I(Area > 50000)), function(df) mean.sd(df$Frost))
## Region I(Area > 50000) mean sd
## 1 Northeast FALSE 132.7778 30.894084
## 2 South FALSE 76.1000 28.512960
## 3 South TRUE 45.5000 27.833433
## 4 North Central FALSE 123.0000 1.414214
## 5 North Central TRUE 142.0000 25.113078
## 6 West FALSE 0.0000 NA
## 7 West TRUE 110.6667 64.401205
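An alternative way to write this kind of per-group summary is to pass plyr's summarise() as the function, so that the summary columns are named directly in the call. This is a sketch, assuming plyr is installed and that no other attached package masks summarise(); the column names frost.mean and frost.sd are made up for illustration.

```r
library(plyr)

state.df = data.frame(state.x77, Region=state.region)
state.df$AreaBig = state.df$Area > 50000

# summarise() is evaluated within each piece of the data frame, so Frost
# refers to the Frost column of that piece
frost.summary = ddply(state.df, .(Region, AreaBig), summarise,
                      frost.mean=mean(Frost), frost.sd=sd(Frost))
frost.summary
```

Groups containing a single state get an NA standard deviation, as in the output above.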
The fourth option for * is _: the function a_ply() (or l_ply() or d_ply()) has no explicit return object, but still runs the given function over the given array (or list, or data frame), possibly producing side effects
par(mfrow=c(3,3), mar=c(4,4,1,1))
a_ply(state.x77, 2, hist, breaks=30, col="pink")
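The same idea applies to lists. The sketch below, assuming plyr is installed and using a hypothetical list, runs a function with l_ply() purely for its side effect (printing a line) and checks that nothing useful is returned.

```r
library(plyr)

# Hypothetical list of numeric vectors
vals = list(a=1:3, b=4:6)

# l_ply() calls the function on each element for its side effect only;
# the results of the individual calls are discarded
out = l_ply(vals, function(x) cat("sum:", sum(x), "\n"))

# Nothing useful comes back
is.null(out)
```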
What happens if we have a really large data set and we want to use split-apply-combine?
If the individual tasks are unrelated, then we should be able to speed up the computation by performing them in parallel
The plyr functions make this quite easy: let's take a look at the full signature for daply():
daply(.data, .variables, .fun = NULL, ..., .progress = "none",
.inform = FALSE, .drop_i = TRUE, .drop_o = TRUE, .parallel = FALSE,
.paropts = NULL)
The second-to-last argument .parallel (default FALSE) is for parallelization. If set to TRUE, then it performs the individual tasks in parallel, using the foreach package
The last argument .paropts is for more advanced parallelization: these are additional arguments to be passed to foreach
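Setting .parallel = TRUE only helps once a parallel backend has been registered for foreach; without one, the tasks still run sequentially. The following is a rough sketch, not a recommended setup: it assumes the doParallel package is installed (one of several possible backends, and not mentioned in these notes) and uses two workers. The helper mean.frost is made up for illustration.

```r
library(plyr)
library(doParallel)  # assumption: doParallel is used as the foreach backend

# Register a backend with 2 workers
registerDoParallel(cores=2)

state.df = data.frame(state.x77, Region=state.region)
mean.frost = function(df) mean(df$Frost)  # hypothetical helper

# Same call twice: once sequentially, once with the pieces run in parallel
res.serial = daply(state.df, .(Region), mean.frost)
res.par = daply(state.df, .(Region), mean.frost, .parallel=TRUE)

# The results should agree
all.equal(as.vector(res.serial), as.vector(res.par))

# Shut down any implicitly created cluster
stopImplicitCluster()
```

For a data set this small, the overhead of starting workers outweighs any gain; parallelization pays off only when the individual tasks are expensive. Depending on the platform, the parallel call may also emit warnings or need packages and objects exported to the workers through .paropts.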