Last week: Fitting models to data

Fitting models is critical to both statistical inference and prediction
Exploratory data analysis is a very good first step and gives you a sense of what you’re dealing with before you start modeling
Linear regression is the most basic modeling tool of all, and one of the most ubiquitous
lm() allows you to fit a linear model by specifying a formula, in terms of column names of a given data frame
Utility functions coef(), fitted(), residuals(), summary(), plot(), predict() are very handy and should be used over manual access tricks
Logistic regression is the natural extension of linear regression to binary data; use glm() with family="binomial" and all the same utility functions
Generalized additive models add a level of flexibility in that they allow the predictors to have nonlinear effects; use gam() and utility functions

Part I

Split-apply-combine

Reminder: iterating in R without `for()`

We’ve learned some tools in R for iteration without explicit for() loops:

Indexing with conditionals + vectorization
apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply(): same as the above, but simplify the output (if possible)
tapply(): apply a function to levels of a factor vector

Split-apply-combine

Today we will learn a general strategy for data analysis/management that can be summmarized in three conceptual steps:

Split whatever data object we have into meaningful chunks
Apply the function of interest to each element in this division
Combine the results into a new object of the desired structure

These are conceptual steps; often the apply and combine steps can be performed for us by a single call to the appropriate function from the apply() family

Simple but powerful

Does split-apply-combine sound simple? It is, but it’s very powerful when combined with the right data structures

As usual, compared to explicit for() loops, often requires far less code
Fits nicely with our previous idea of top-down function design, simply at a broader scope
Sets you in the right direction towards learning how to use MapReduce/Hadoop for really, really big data sets

Strikes data set

Data set on 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables:

country, year: country and year of data collection
strike.volume: days on strike per 1000 workers
unemployment: unemployment rate
inflation: inflation rate
left.parliament: leftwing share of the goverment
centralization: centralization of unions
density: density of unions

strikes.df = 
  read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/strikes.csv")
dim(strikes.df) # Since 18 × 35 = 630, some years missing from some countries

## [1] 625   8

head(strikes.df)

##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

An interesting question

Is there a relationship between a country’s ruling party alignment (left versus right) and the volume of strikes?

washington madison

How could we approach this?

Worst way: by hand, write 18 separate code blocks
Bad way: explicit for() loop, where we loop over countries
Best way: split appropriately, then use sapply()

Work with just one chunk of data

Step 0: design a function that will work on one chunk of data

So let’s write code to do regression on the data from (say) just Italy

strikes.df.italy = strikes.df[strikes.df$country=="Italy", ] # Data for Italy
dim(strikes.df.italy)

## [1] 35  8

head(strikes.df.italy)

##     country year strike.volume unemployment inflation left.parliament
## 311   Italy 1951           437          8.8      14.3            37.5
## 312   Italy 1952           337          9.5       1.9            37.5
## 313   Italy 1953           545         10.0       1.4            40.2
## 314   Italy 1954           493          8.7       2.4            40.2
## 315   Italy 1955           511          7.5       2.3            40.2
## 316   Italy 1956           372          9.3       3.4            40.2
##     centralization density
## 311      0.2513799      NA
## 312      0.2489860      NA
## 313      0.2482739      NA
## 314      0.2466577      NA
## 315      0.2540366      NA
## 316      0.2457069      NA

italy.lm = lm(strike.volume ~ left.parliament, data=strikes.df.italy)
summary(italy.lm)

## 
## Call:
## lm(formula = strike.volume ~ left.parliament, data = strikes.df.italy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -930.2 -411.6 -137.3  387.2 1901.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -738.75    1200.62  -0.615    0.543
## left.parliament    40.29      27.76   1.451    0.156
## 
## Residual standard error: 583.3 on 33 degrees of freedom
## Multiple R-squared:  0.05999,    Adjusted R-squared:  0.0315 
## F-statistic: 2.106 on 1 and 33 DF,  p-value: 0.1562

plot(strikes.df.italy$left.parliament, strikes.df.italy$strike.volume,
     main="Italy strike volume versus leftwing alignment",
     ylab="Strike volume", xlab="Leftwing alignment")
abline(coef(italy.lm), col=2)

Functionalization

Now let’s turn this into a function

my.strike.lm = function(country.df) {
  coef(lm(strike.volume ~ left.parliament, data=country.df))
}
my.strike.lm(strikes.df.italy) # New way

##     (Intercept) left.parliament 
##      -738.74531        40.29109

coef(italy.lm) # Old way, same result

##     (Intercept) left.parliament 
##      -738.74531        40.29109

Split our data into appropriate chunks

Step 1: split our data into appropriate chunks, each of which can be handled by our function. Here, the function split() is often helpful: split(df, f=my.factor) splits a data frame df into several data frames, defined by constant levels of the factor my.factor

So we want to split strikes.df into 18 smaller data frames, each of which has the data for just one country

strikes.by.country = split(strikes.df, f=strikes.df$country)
class(strikes.by.country) # It's a list

## [1] "list"

names(strikes.by.country) # It has one element for each country

##  [1] "Australia"   "Austria"     "Belgium"     "Canada"      "Denmark"    
##  [6] "Finland"     "France"      "Germany"     "Ireland"     "Italy"      
## [11] "Japan"       "Netherlands" "New.Zealand" "Norway"      "Sweden"     
## [16] "Switzerland" "UK"          "USA"

head(strikes.by.country$Italy) # Same as what we saw before

##     country year strike.volume unemployment inflation left.parliament
## 311   Italy 1951           437          8.8      14.3            37.5
## 312   Italy 1952           337          9.5       1.9            37.5
## 313   Italy 1953           545         10.0       1.4            40.2
## 314   Italy 1954           493          8.7       2.4            40.2
## 315   Italy 1955           511          7.5       2.3            40.2
## 316   Italy 1956           372          9.3       3.4            40.2
##     centralization density
## 311      0.2513799      NA
## 312      0.2489860      NA
## 313      0.2482739      NA
## 314      0.2466577      NA
## 315      0.2540366      NA
## 316      0.2457069      NA

Apply our function and combine the results

Steps 2 and 3: apply our function to each chunk of data, and combine the results. Here, the functions lapply() or sapply() are often helpful

So we want to apply my.strikes.lm() to each data frame in strikes.by.country. Think about what the output will be from each function call: vector of length 2 (intercept and slope), so we can use sapply()

strikes.coefs = sapply(strikes.by.country, FUN=my.strike.lm)
strikes.coefs

##                   Australia    Austria    Belgium    Canada     Denmark
## (Intercept)     414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament  -0.8638052  -8.210886   8.447463   17.6766    34.34477
##                  Finland      France   Germany   Ireland      Italy
## (Intercept)     108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament  12.8422  -0.4255319 -1.312305  55.46721   40.29109
##                     Japan Netherlands New.Zealand     Norway    Sweden
## (Intercept)     964.73750  -32.627678    721.3464 -458.22397 513.16704
## left.parliament -24.07595    1.694387    -10.0106   10.46523  -8.62072
##                 Switzerland        UK        USA
## (Intercept)      -5.1988836 936.10154 111.440651
## left.parliament   0.3203399 -13.42792   5.918647

# We don't care about the intercepts, only the slopes (2nd row).
# Some are positive, some are negative! Let's plot them:
plot(1:ncol(strikes.coefs), strikes.coefs[2,], xaxt="n",
     xlab="", ylab="Regression coefficient", 
     main="Countrywise labor activity by leftwing score")
axis(side=1, at=1:ncol(strikes.coefs),
     labels=colnames(strikes.coefs), las=2, cex.axis=0.5)
abline(h=0, col="grey")

Part II

Plyr: a*ply() and l*aply()

The `plyr` package

Most popular R package of all time (most downloads): plyr. This is true for a good reason! Provides extremely useful family of apply-like functions

Advantage over the built-in apply() family is its consistency
All plyr functions are of the form **ply()
Replace ** with characters denoting types:
- First character: input type, one of a, d, l
- Second character: output type, one of a, d, l, or _ (drop)

`a*ply()`: input is an array

The signature for all a*ply() functions is:

a*ply(.data, .margins, .fun, ...)

.data : an array
.margins : index (or indices) to split the array by
.fun : the function to be applied to each piece
... : additional arguments to be passed to the function

Note that this resembles:

apply(X, MARGIN, FUN, ...)

Examples of `a*ply()`

my.array = array(1:27, c(3,3,3))
rownames(my.array) = c("Curly", "Larry", "Moe")
colnames(my.array) = c("Groucho", "Harpo", "Zeppo")
dimnames(my.array)[[3]] = c("Bart", "Lisa", "Maggie")
my.array

## , , Bart
## 
##       Groucho Harpo Zeppo
## Curly       1     4     7
## Larry       2     5     8
## Moe         3     6     9
## 
## , , Lisa
## 
##       Groucho Harpo Zeppo
## Curly      10    13    16
## Larry      11    14    17
## Moe        12    15    18
## 
## , , Maggie
## 
##       Groucho Harpo Zeppo
## Curly      19    22    25
## Larry      20    23    26
## Moe        21    24    27

library(plyr)
aaply(my.array, 1, sum) # Get back an array

## Curly Larry   Moe 
##   117   126   135

adply(my.array, 1, sum) # Get back a data frame

##      X1  V1
## 1 Curly 117
## 2 Larry 126
## 3   Moe 135

alply(my.array, 1, sum) # Get back a list

## $`1`
## [1] 117
## 
## $`2`
## [1] 126
## 
## $`3`
## [1] 135
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##      X1
## 1 Curly
## 2 Larry
## 3   Moe

aaply(my.array, 2:3, sum) # Get back a 3 x 3 array

##          X2
## X1        Bart Lisa Maggie
##   Groucho    6   33     60
##   Harpo     15   42     69
##   Zeppo     24   51     78

adply(my.array, 2:3, sum) # Get back a data frame

##        X1     X2 V1
## 1 Groucho   Bart  6
## 2   Harpo   Bart 15
## 3   Zeppo   Bart 24
## 4 Groucho   Lisa 33
## 5   Harpo   Lisa 42
## 6   Zeppo   Lisa 51
## 7 Groucho Maggie 60
## 8   Harpo Maggie 69
## 9   Zeppo Maggie 78

alply(my.array, 2:3, sum) # Get back a list

## $`1`
## [1] 6
## 
## $`2`
## [1] 15
## 
## $`3`
## [1] 24
## 
## $`4`
## [1] 33
## 
## $`5`
## [1] 42
## 
## $`6`
## [1] 51
## 
## $`7`
## [1] 60
## 
## $`8`
## [1] 69
## 
## $`9`
## [1] 78
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##        X1     X2
## 1 Groucho   Bart
## 2   Harpo   Bart
## 3   Zeppo   Bart
## 4 Groucho   Lisa
## 5   Harpo   Lisa
## 6   Zeppo   Lisa
## 7 Groucho Maggie
## 8   Harpo Maggie
## 9   Zeppo Maggie

`l*ply()`: input is a list

The signature for all l*ply() functions is:

l*ply(.data, .fun, ...)

.data : a list
.fun : the function to be applied to each element
... : additional arguments to be passed to the function

Note that this resembles:

lapply(X, FUN, ...)

Examples of `l*ply()`

my.list = list(nums=rnorm(1000), lets=letters, pops=state.x77[,"Population"])
laply(my.list, range) # Get back an array

##      1                   2                 
## [1,] "-2.94398824455701" "2.88538944857663"
## [2,] "a"                 "z"               
## [3,] "365"               "21198"

ldply(my.list, range) # Get back a data frame

##    .id                V1               V2
## 1 nums -2.94398824455701 2.88538944857663
## 2 lets                 a                z
## 3 pops               365            21198

llply(my.list, range) # Get back a list

## $nums
## [1] -2.943988  2.885389
## 
## $lets
## [1] "a" "z"
## 
## $pops
## [1]   365 21198

laply(my.list, summary) # Doesn't work! Outputs have different types/lengths

## Error: Results must have one or more dimensions.

ldply(my.list, summary) # Doesn't work! Outputs have different types/lengths

## Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results do not have equal lengths

llply(my.list, summary) # Works just fine

## $nums
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.943988 -0.624965 -0.009705  0.012965  0.687484  2.885389 
## 
## $lets
##    Length     Class      Mode 
##        26 character character 
## 
## $pops
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     365    1080    2838    4246    4968   21198

The fourth option for `*`

The fourth option for * is _: the function a_ply() (or l*ply()) has no explicit return object, but still runs the given function over the given array (or list), possibly producing side effects

par(mfrow=c(3,3), mar=c(4,4,1,1))
a_ply(my.array, 2:3, plot, ylim=range(my.array), pch=19, col=6)

Part III

Plyr: d*ply()

`d*ply()` : the input is a data frame

The signature for all d*ply() functions is:

d*ply(.data, .variables, .fun, ...)

.data : a data frame
.variables : variable (or variables) to split the data frame by
.fun : the function to be applied to each piece
... : additional arguments to be passed to the function

Note that this resembles:

tapply(X, INDEX, FUN, ...)

Strikes data set, revisited

# Regression coefficients separately for each country, old way:
strikes.list = split(strikes.df, f=strikes.df$country)
strikes.coefs = sapply(strikes.list, my.strike.lm)
head(strikes.coefs)

##                   Australia    Austria    Belgium    Canada     Denmark
## (Intercept)     414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament  -0.8638052  -8.210886   8.447463   17.6766    34.34477
##                  Finland      France   Germany   Ireland      Italy
## (Intercept)     108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament  12.8422  -0.4255319 -1.312305  55.46721   40.29109
##                     Japan Netherlands New.Zealand     Norway    Sweden
## (Intercept)     964.73750  -32.627678    721.3464 -458.22397 513.16704
## left.parliament -24.07595    1.694387    -10.0106   10.46523  -8.62072
##                 Switzerland        UK        USA
## (Intercept)      -5.1988836 936.10154 111.440651
## left.parliament   0.3203399 -13.42792   5.918647

# Getting regression coefficient separately for each country, new way:
strikes.coefs.a = daply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.a) # Get back an array, note the difference to sapply()

##            
## country     (Intercept) left.parliament
##   Australia   414.77123      -0.8638052
##   Austria     423.07728      -8.2108864
##   Belgium     -56.92678       8.4474627
##   Canada     -227.82177      17.6766029
##   Denmark   -1399.35735      34.3447662
##   Finland     108.22451      12.8422018

strikes.coefs.d = ddply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.d) # Get back a data frame

##     country (Intercept) left.parliament
## 1 Australia   414.77123      -0.8638052
## 2   Austria   423.07728      -8.2108864
## 3   Belgium   -56.92678       8.4474627
## 4    Canada  -227.82177      17.6766029
## 5   Denmark -1399.35735      34.3447662
## 6   Finland   108.22451      12.8422018

strikes.coefs.l = dlply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.l) # Get back a list

## $Australia
##     (Intercept) left.parliament 
##     414.7712254      -0.8638052 
## 
## $Austria
##     (Intercept) left.parliament 
##      423.077279       -8.210886 
## 
## $Belgium
##     (Intercept) left.parliament 
##      -56.926780        8.447463 
## 
## $Canada
##     (Intercept) left.parliament 
##       -227.8218         17.6766 
## 
## $Denmark
##     (Intercept) left.parliament 
##     -1399.35735        34.34477 
## 
## $Finland
##     (Intercept) left.parliament 
##        108.2245         12.8422

Splitting on two (or more) variables

The function d*ply() makes it very easy to split on two (or more) variables: we just specify them, separated by a “,” in the .variables argument

# First create a variable that indicates whether the year is pre 1975, and add
# it to the data frame
strikes.df$yearPre1975 = strikes.df$year <= 1975
# Then use (say) ddply() to compute regression coefficients for each country 
# pre and post 1975
strikes.coefs.1975 = ddply(strikes.df, .(country, yearPre1975), my.strike.lm)
dim(strikes.coefs.1975) # Note that there are 18 x 2 = 36 rows

## [1] 36  4

head(strikes.coefs.1975)

##     country yearPre1975 (Intercept) left.parliament
## 1 Australia       FALSE   973.34088     -11.8094991
## 2 Australia        TRUE  -169.59900      12.0170866
## 3   Austria       FALSE    19.51823      -0.3470889
## 4   Austria        TRUE   400.83004      -7.7051918
## 5   Belgium       FALSE -4182.06650     148.0049261
## 6   Belgium        TRUE  -103.67439       9.5802824

# We can also create factor variables on-the-fly with I(), as we've seen before
strikes.coefs.1975 = ddply(strikes.df, .(country, I(year<=1975)), my.strike.lm)
dim(strikes.coefs.1975) # Again, there are 18 x 2 = 36 rows

## [1] 36  4

head(strikes.coefs.1975)

##     country I(year <= 1975) (Intercept) left.parliament
## 1 Australia           FALSE   973.34088     -11.8094991
## 2 Australia            TRUE  -169.59900      12.0170866
## 3   Austria           FALSE    19.51823      -0.3470889
## 4   Austria            TRUE   400.83004      -7.7051918
## 5   Belgium           FALSE -4182.06650     148.0049261
## 6   Belgium            TRUE  -103.67439       9.5802824

Parallelization

What happens if we have a really large data set and we want to use split-apply-combine?

If the individual tasks are unrelated, then we should be speed up the computation by performing them in parallel

The plyr functions make this quite easy: let’s take a look at the full signature for daply():

daply(.data, .variables, .fun = NULL, ..., .progress = "none",
  .inform = FALSE, .drop_i = TRUE, .drop_o = TRUE, .parallel = FALSE,
  .paropts = NULL)

The second to last argument .parallel (default FALSE) is for parallelization. If set to TRUE, then it performs the individual tasks in parallel, using the foreach package

The last argument .paropts is for more advanced parallelization, these are additional arguments to be passed to foreach

For more, read the foreach package first. May take some time to set up the parallel backend (this is often system specific)

But once set up, parallelization is simple and beautiful with **ply()! The difference is just, e.g.,

daply(strikes.df, .(country), my.strike.lm)

versus

daply(strikes.df, .(country), my.strike.lm, .parallel=TRUE)

Summary

The split-apply-combine framework is very helpful for breaking up large data analysis tasks, on ragged data, into manageable chunks
- Step 0: design a function to work on one chunk
- Step 1: split your data into appropriate chunks
- Steps 2 and 3: apply your function on each chunk and combine the results (these steps can often be done together)
Apply functions are very important in the context of the split-apply-combine paradigm
The plyr package makes life easier by making the input and outputs types (array, data frame, or list?) as explicit as possible
- a*ply(): input is an array, * specifies output type
- l*ply(): input is a list, * specifies output type
- d*ply(): input is a data frame, * specifies output type

Split-Apply-Combine and Plyr

Last week: Fitting models to data

Part I

Reminder: iterating in R without for()

Split-apply-combine

Simple but powerful

Strikes data set

An interesting question

Work with just one chunk of data

Functionalization

Split our data into appropriate chunks

Apply our function and combine the results

Part II

The plyr package

a*ply(): input is an array

Examples of a*ply()

l*ply(): input is a list

Examples of l*ply()

The fourth option for *

Part III

d*ply() : the input is a data frame

Strikes data set, revisited

Splitting on two (or more) variables

Parallelization

Summary

Reminder: iterating in R without `for()`

The `plyr` package

`a*ply()`: input is an array

Examples of `a*ply()`

`l*ply()`: input is a list

Examples of `l*ply()`

The fourth option for `*`

`d*ply()` : the input is a data frame