lm()
allows you to fit a linear model by specifying a formula, in terms of column names of a given data framecoef()
, fitted()
, residuals()
, summary()
, plot()
, predict()
are very handy and should be used over manual access tricksglm()
with family="binomial"
and all the same utility functionsgam()
and utility functionsSplit-apply-combine
for()
We’ve learned some tools in R for iteration without explicit for()
loops:
apply()
: apply a function to rows or columns of a matrix or data framelapply()
: apply a function to elements of a list or vectorsapply()
: same as the above, but simplify the output (if possible)tapply()
: apply a function to levels of a factor vectorToday we will learn a general strategy for data analysis/management that can be summmarized in three conceptual steps:
These are conceptual steps; often the apply and combine steps can be performed for us by a single call to the appropriate function from the apply()
family
Does split-apply-combine sound simple? It is, but it’s very powerful when combined with the right data structures
for()
loops, often requires far less codeData set on 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables:
country
, year
: country and year of data collectionstrike.volume
: days on strike per 1000 workersunemployment
: unemployment rateinflation
: inflation rateleft.parliament
: leftwing share of the govermentcentralization
: centralization of unionsdensity
: density of unionsstrikes.df =
read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/strikes.csv")
dim(strikes.df) # Since 18 × 35 = 630, some years missing from some countries
## [1] 625 8
head(strikes.df)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
Is there a relationship between a country’s ruling party alignment (left versus right) and the volume of strikes?
How could we approach this?
for()
loop, where we loop over countriessapply()
Step 0: design a function that will work on one chunk of data
So let’s write code to do regression on the data from (say) just Italy
strikes.df.italy = strikes.df[strikes.df$country=="Italy", ] # Data for Italy
dim(strikes.df.italy)
## [1] 35 8
head(strikes.df.italy)
## country year strike.volume unemployment inflation left.parliament
## 311 Italy 1951 437 8.8 14.3 37.5
## 312 Italy 1952 337 9.5 1.9 37.5
## 313 Italy 1953 545 10.0 1.4 40.2
## 314 Italy 1954 493 8.7 2.4 40.2
## 315 Italy 1955 511 7.5 2.3 40.2
## 316 Italy 1956 372 9.3 3.4 40.2
## centralization density
## 311 0.2513799 NA
## 312 0.2489860 NA
## 313 0.2482739 NA
## 314 0.2466577 NA
## 315 0.2540366 NA
## 316 0.2457069 NA
italy.lm = lm(strike.volume ~ left.parliament, data=strikes.df.italy)
summary(italy.lm)
##
## Call:
## lm(formula = strike.volume ~ left.parliament, data = strikes.df.italy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -930.2 -411.6 -137.3 387.2 1901.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -738.75 1200.62 -0.615 0.543
## left.parliament 40.29 27.76 1.451 0.156
##
## Residual standard error: 583.3 on 33 degrees of freedom
## Multiple R-squared: 0.05999, Adjusted R-squared: 0.0315
## F-statistic: 2.106 on 1 and 33 DF, p-value: 0.1562
plot(strikes.df.italy$left.parliament, strikes.df.italy$strike.volume,
main="Italy strike volume versus leftwing alignment",
ylab="Strike volume", xlab="Leftwing alignment")
abline(coef(italy.lm), col=2)
Now let’s turn this into a function
my.strike.lm = function(country.df) {
coef(lm(strike.volume ~ left.parliament, data=country.df))
}
my.strike.lm(strikes.df.italy) # New way
## (Intercept) left.parliament
## -738.74531 40.29109
coef(italy.lm) # Old way, same result
## (Intercept) left.parliament
## -738.74531 40.29109
Step 1: split our data into appropriate chunks, each of which can be handled by our function. Here, the function split()
is often helpful: split(df, f=my.factor)
splits a data frame df
into several data frames, defined by constant levels of the factor my.factor
So we want to split strikes.df
into 18 smaller data frames, each of which has the data for just one country
strikes.by.country = split(strikes.df, f=strikes.df$country)
class(strikes.by.country) # It's a list
## [1] "list"
names(strikes.by.country) # It has one element for each country
## [1] "Australia" "Austria" "Belgium" "Canada" "Denmark"
## [6] "Finland" "France" "Germany" "Ireland" "Italy"
## [11] "Japan" "Netherlands" "New.Zealand" "Norway" "Sweden"
## [16] "Switzerland" "UK" "USA"
head(strikes.by.country$Italy) # Same as what we saw before
## country year strike.volume unemployment inflation left.parliament
## 311 Italy 1951 437 8.8 14.3 37.5
## 312 Italy 1952 337 9.5 1.9 37.5
## 313 Italy 1953 545 10.0 1.4 40.2
## 314 Italy 1954 493 8.7 2.4 40.2
## 315 Italy 1955 511 7.5 2.3 40.2
## 316 Italy 1956 372 9.3 3.4 40.2
## centralization density
## 311 0.2513799 NA
## 312 0.2489860 NA
## 313 0.2482739 NA
## 314 0.2466577 NA
## 315 0.2540366 NA
## 316 0.2457069 NA
Steps 2 and 3: apply our function to each chunk of data, and combine the results. Here, the functions lapply()
or sapply()
are often helpful
So we want to apply my.strikes.lm()
to each data frame in strikes.by.country
. Think about what the output will be from each function call: vector of length 2 (intercept and slope), so we can use sapply()
strikes.coefs = sapply(strikes.by.country, FUN=my.strike.lm)
strikes.coefs
## Australia Austria Belgium Canada Denmark
## (Intercept) 414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament -0.8638052 -8.210886 8.447463 17.6766 34.34477
## Finland France Germany Ireland Italy
## (Intercept) 108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament 12.8422 -0.4255319 -1.312305 55.46721 40.29109
## Japan Netherlands New.Zealand Norway Sweden
## (Intercept) 964.73750 -32.627678 721.3464 -458.22397 513.16704
## left.parliament -24.07595 1.694387 -10.0106 10.46523 -8.62072
## Switzerland UK USA
## (Intercept) -5.1988836 936.10154 111.440651
## left.parliament 0.3203399 -13.42792 5.918647
# We don't care about the intercepts, only the slopes (2nd row).
# Some are positive, some are negative! Let's plot them:
plot(1:ncol(strikes.coefs), strikes.coefs[2,], xaxt="n",
xlab="", ylab="Regression coefficient",
main="Countrywise labor activity by leftwing score")
axis(side=1, at=1:ncol(strikes.coefs),
labels=colnames(strikes.coefs), las=2, cex.axis=0.5)
abline(h=0, col="grey")
Plyr: a*ply()
and l*aply()
plyr
packageMost popular R package of all time (most downloads): plyr
. This is true for a good reason! Provides extremely useful family of apply-like functions
apply()
family is its consistencyplyr
functions are of the form **ply()
**
with characters denoting types:
a
, d
, l
a
, d
, l
, or _
(drop)a*ply()
: input is an arrayThe signature for all a*ply()
functions is:
a*ply(.data, .margins, .fun, ...)
.data
: an array.margins
: index (or indices) to split the array by.fun
: the function to be applied to each piece...
: additional arguments to be passed to the functionNote that this resembles:
apply(X, MARGIN, FUN, ...)
a*ply()
my.array = array(1:27, c(3,3,3))
rownames(my.array) = c("Curly", "Larry", "Moe")
colnames(my.array) = c("Groucho", "Harpo", "Zeppo")
dimnames(my.array)[[3]] = c("Bart", "Lisa", "Maggie")
my.array
## , , Bart
##
## Groucho Harpo Zeppo
## Curly 1 4 7
## Larry 2 5 8
## Moe 3 6 9
##
## , , Lisa
##
## Groucho Harpo Zeppo
## Curly 10 13 16
## Larry 11 14 17
## Moe 12 15 18
##
## , , Maggie
##
## Groucho Harpo Zeppo
## Curly 19 22 25
## Larry 20 23 26
## Moe 21 24 27
library(plyr)
aaply(my.array, 1, sum) # Get back an array
## Curly Larry Moe
## 117 126 135
adply(my.array, 1, sum) # Get back a data frame
## X1 V1
## 1 Curly 117
## 2 Larry 126
## 3 Moe 135
alply(my.array, 1, sum) # Get back a list
## $`1`
## [1] 117
##
## $`2`
## [1] 126
##
## $`3`
## [1] 135
##
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
## X1
## 1 Curly
## 2 Larry
## 3 Moe
aaply(my.array, 2:3, sum) # Get back a 3 x 3 array
## X2
## X1 Bart Lisa Maggie
## Groucho 6 33 60
## Harpo 15 42 69
## Zeppo 24 51 78
adply(my.array, 2:3, sum) # Get back a data frame
## X1 X2 V1
## 1 Groucho Bart 6
## 2 Harpo Bart 15
## 3 Zeppo Bart 24
## 4 Groucho Lisa 33
## 5 Harpo Lisa 42
## 6 Zeppo Lisa 51
## 7 Groucho Maggie 60
## 8 Harpo Maggie 69
## 9 Zeppo Maggie 78
alply(my.array, 2:3, sum) # Get back a list
## $`1`
## [1] 6
##
## $`2`
## [1] 15
##
## $`3`
## [1] 24
##
## $`4`
## [1] 33
##
## $`5`
## [1] 42
##
## $`6`
## [1] 51
##
## $`7`
## [1] 60
##
## $`8`
## [1] 69
##
## $`9`
## [1] 78
##
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
## X1 X2
## 1 Groucho Bart
## 2 Harpo Bart
## 3 Zeppo Bart
## 4 Groucho Lisa
## 5 Harpo Lisa
## 6 Zeppo Lisa
## 7 Groucho Maggie
## 8 Harpo Maggie
## 9 Zeppo Maggie
l*ply()
: input is a listThe signature for all l*ply()
functions is:
l*ply(.data, .fun, ...)
.data
: a list.fun
: the function to be applied to each element...
: additional arguments to be passed to the functionNote that this resembles:
lapply(X, FUN, ...)
l*ply()
my.list = list(nums=rnorm(1000), lets=letters, pops=state.x77[,"Population"])
laply(my.list, range) # Get back an array
## 1 2
## [1,] "-3.07576742494402" "3.26875694312028"
## [2,] "a" "z"
## [3,] "365" "21198"
ldply(my.list, range) # Get back a data frame
## .id V1 V2
## 1 nums -3.07576742494402 3.26875694312028
## 2 lets a z
## 3 pops 365 21198
llply(my.list, range) # Get back a list
## $nums
## [1] -3.075767 3.268757
##
## $lets
## [1] "a" "z"
##
## $pops
## [1] 365 21198
laply(my.list, summary) # Doesn't work! Outputs have different types/lengths
## Error: Results must have one or more dimensions.
ldply(my.list, summary) # Doesn't work! Outputs have different types/lengths
## Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results do not have equal lengths
llply(my.list, summary) # Works just fine
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.07577 -0.64906 0.02912 0.03152 0.71456 3.26876
##
## $lets
## Length Class Mode
## 26 character character
##
## $pops
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 365 1080 2838 4246 4968 21198
*
The fourth option for *
is _
: the function a_ply()
(or l*ply()
) has no explicit return object, but still runs the given function over the given array (or list), possibly producing side effects
par(mfrow=c(3,3), mar=c(4,4,1,1))
a_ply(my.array, 2:3, plot, ylim=range(my.array), pch=19, col=6)
Plyr: d*ply()
d*ply()
: the input is a data frameThe signature for all d*ply()
functions is:
d*ply(.data, .variables, .fun, ...)
.data
: a data frame.variables
: variable (or variables) to split the data frame by.fun
: the function to be applied to each piece...
: additional arguments to be passed to the functionNote that this resembles:
tapply(X, INDEX, FUN, ...)
# Regression coefficients separately for each country, old way:
strikes.list = split(strikes.df, f=strikes.df$country)
strikes.coefs = sapply(strikes.list, my.strike.lm)
head(strikes.coefs)
## Australia Austria Belgium Canada Denmark
## (Intercept) 414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament -0.8638052 -8.210886 8.447463 17.6766 34.34477
## Finland France Germany Ireland Italy
## (Intercept) 108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament 12.8422 -0.4255319 -1.312305 55.46721 40.29109
## Japan Netherlands New.Zealand Norway Sweden
## (Intercept) 964.73750 -32.627678 721.3464 -458.22397 513.16704
## left.parliament -24.07595 1.694387 -10.0106 10.46523 -8.62072
## Switzerland UK USA
## (Intercept) -5.1988836 936.10154 111.440651
## left.parliament 0.3203399 -13.42792 5.918647
# Getting regression coefficient separately for each country, new way:
strikes.coefs.a = daply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.a) # Get back an array, note the difference to sapply()
##
## country (Intercept) left.parliament
## Australia 414.77123 -0.8638052
## Austria 423.07728 -8.2108864
## Belgium -56.92678 8.4474627
## Canada -227.82177 17.6766029
## Denmark -1399.35735 34.3447662
## Finland 108.22451 12.8422018
strikes.coefs.d = ddply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.d) # Get back a data frame
## country (Intercept) left.parliament
## 1 Australia 414.77123 -0.8638052
## 2 Austria 423.07728 -8.2108864
## 3 Belgium -56.92678 8.4474627
## 4 Canada -227.82177 17.6766029
## 5 Denmark -1399.35735 34.3447662
## 6 Finland 108.22451 12.8422018
strikes.coefs.l = dlply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.l) # Get back a list
## $Australia
## (Intercept) left.parliament
## 414.7712254 -0.8638052
##
## $Austria
## (Intercept) left.parliament
## 423.077279 -8.210886
##
## $Belgium
## (Intercept) left.parliament
## -56.926780 8.447463
##
## $Canada
## (Intercept) left.parliament
## -227.8218 17.6766
##
## $Denmark
## (Intercept) left.parliament
## -1399.35735 34.34477
##
## $Finland
## (Intercept) left.parliament
## 108.2245 12.8422
The function d*ply()
makes it very easy to split on two (or more) variables: we just specify them, separated by a “,” in the .variables
argument
# First create a variable that indicates whether the year is pre 1975, and add
# it to the data frame
strikes.df$yearPre1975 = strikes.df$year <= 1975
# Then use (say) ddply() to compute regression coefficients for each country
# pre and post 1975
strikes.coefs.1975 = ddply(strikes.df, .(country, yearPre1975), my.strike.lm)
dim(strikes.coefs.1975) # Note that there are 18 x 2 = 36 rows
## [1] 36 4
head(strikes.coefs.1975)
## country yearPre1975 (Intercept) left.parliament
## 1 Australia FALSE 973.34088 -11.8094991
## 2 Australia TRUE -169.59900 12.0170866
## 3 Austria FALSE 19.51823 -0.3470889
## 4 Austria TRUE 400.83004 -7.7051918
## 5 Belgium FALSE -4182.06650 148.0049261
## 6 Belgium TRUE -103.67439 9.5802824
# We can also create factor variables on-the-fly with I(), as we've seen before
strikes.coefs.1975 = ddply(strikes.df, .(country, I(year<=1975)), my.strike.lm)
dim(strikes.coefs.1975) # Again, there are 18 x 2 = 36 rows
## [1] 36 4
head(strikes.coefs.1975)
## country I(year <= 1975) (Intercept) left.parliament
## 1 Australia FALSE 973.34088 -11.8094991
## 2 Australia TRUE -169.59900 12.0170866
## 3 Austria FALSE 19.51823 -0.3470889
## 4 Austria TRUE 400.83004 -7.7051918
## 5 Belgium FALSE -4182.06650 148.0049261
## 6 Belgium TRUE -103.67439 9.5802824
What happens if we have a really large data set and we want to use split-apply-combine?
If the individual tasks are unrelated, then we should be speed up the computation by performing them in parallel
The plyr
functions make this quite easy: let’s take a look at the full signature for daply()
:
daply(.data, .variables, .fun = NULL, ..., .progress = "none",
.inform = FALSE, .drop_i = TRUE, .drop_o = TRUE, .parallel = FALSE,
.paropts = NULL)
The second to last argument .parallel
(default FALSE) is for parallelization. If set to TRUE, then it performs the individual tasks in parallel, using the foreach
package
The last argument .paropts
is for more advanced parallelization, these are additional arguments to be passed to foreach
For more, read the foreach
package first. May take some time to set up the parallel backend (this is often system specific)
But once set up, parallelization is simple and beautiful with **ply()
! The difference is just, e.g.,
daply(strikes.df, .(country), my.strike.lm)
versus
daply(strikes.df, .(country), my.strike.lm, .parallel=TRUE)
plyr
package makes life easier by making the input and outputs types (array, data frame, or list?) as explicit as possible
a*ply()
: input is an array, *
specifies output typel*ply()
: input is a list, *
specifies output typed*ply()
: input is a data frame, *
specifies output type