3 November 2014

Agenda

  • A Quick Detour And Reminder of What R Can Do
  • Why We "Hate" For Loops, Part 75

Refresher

We've used some tools for iterating over objects in R without for loops:

  • subset/my.object[pick.these.rows,]: retrieve part of the data according to some condition
  • apply: takes a matrix and a margin, applies a function
  • sapply/lapply: takes a list (or vector), applies a function
  • c/rbind/cbind: concatenate these objects in a known pattern

General strategy

These are tools that, when put together, give us three general steps:

  • Split whatever data object we have into meaningful chunks
  • Apply the function of interest to this division
  • Combine the results into a new object.

Sounds simple? It is, but it's powerful when combined with data structures (see Hadoop/MapReduce for how this makes you billions)

Why Is This Useful Without The Billions?

  • This reinforces the pattern/function approach: What you want to do versus how you want to do it
  • If the full data set is big, and we've already done the splitting, this makes it tractable on smaller machines (see again: Hadoop, our well-meaning intentions for the midterm project)

An Important Application: Ragged Data

Our previous functions work well with well-composed data: apply on matrices, sapply/lapply on lists and vectors. What about ragged data – where the dimensions of each object aren't necessarily the same?

For this: Start with data frames, though we'll be going beyond this eventually.

A Sociologial Application

Politics and Labor Action: Does having a friendlier government make labor action more or less likely?

March on Washington, 1963; Wisconsin Protests, 2011

Data: Political Economy of Strikes

Compiled by Bruce Western, Sociology, Harvard.

  • Data frame of 8 columns: country, year, days on strike per 1000 workers, unemployment, inflation, left-wing share of gov’t, centralization of unions, union density
  • 625 observations from 18 countries, 1951–1985
  • Note that since 18 × 35 = 630 > 625, some years missing from some countries
strikes <- read.csv("strikes.csv")

Data: Political Economy of Strikes

head(strikes)
##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

Research Question

Does having a friendlier government make labor action more or less likely?

becomes

Is there a relationship between a country's ruling party alignment (left vs. right) and the volume of strikes?

Lots of ways to approach this problem: simplest is to split it by country.

Split functions: subset, split, by,

Take Italy:

italy.strikes <- subset(strikes, country == "Italy")

Or, if you prefer,

italy.strikes <- strikes[strikes$country == "Italy",]
head(italy.strikes)
##     country year strike.volume unemployment inflation left.parliament
## 311   Italy 1951           437          8.8      14.3            37.5
## 312   Italy 1952           337          9.5       1.9            37.5
## 313   Italy 1953           545         10.0       1.4            40.2
## 314   Italy 1954           493          8.7       2.4            40.2
## 315   Italy 1955           511          7.5       2.3            40.2
## 316   Italy 1956           372          9.3       3.4            40.2
##     centralization density
## 311      0.2513799      NA
## 312      0.2489860      NA
## 313      0.2482739      NA
## 314      0.2466577      NA
## 315      0.2540366      NA
## 316      0.2457069      NA

(No one ever says Italy)

italy.fit <- lm(strike.volume ~ left.parliament, data=italy.strikes)
plot(strike.volume ~ left.parliament, data=italy.strikes, main="Italy Strike Volume Versus Left-Wing Alignment")
abline(italy.fit, col=2)

One Down, Seventeen To Go

Tedious and dangerous to do this repeatedly – typos abound. How can we do this in an easier way?

First: We need subsets for every country. split does this nicely:

strikes.split <- split (strikes, strikes$country)
names(strikes.split)
##  [1] "Australia"   "Austria"     "Belgium"     "Canada"      "Denmark"    
##  [6] "Finland"     "France"      "Germany"     "Ireland"     "Italy"      
## [11] "Japan"       "Netherlands" "New.Zealand" "Norway"      "Sweden"     
## [16] "Switzerland" "UK"          "USA"

One Down, Seventeen To Go

Now, let's generalize our function. We want the linear model coefficients:

my.strike.lm <- function (country.df) {
  lm(strike.volume ~ left.parliament, data=country.df)$coefficients
}
my.strike.lm (subset(strikes, country == "Italy"))
##     (Intercept) left.parliament 
##      -738.74531        40.29109

One Down, Seventeen To Go

Could for-loop it:

strike.coefs <- NULL
my.countries <- c("France", "Italy", "USA")
for (this.country in my.countries) {
  strike.coefs <- cbind(strike.coefs, 
    my.strike.lm (subset(strikes, country == this.country)))  
}
colnames(strike.coefs) <- my.countries
strike.coefs
##                      France      Italy        USA
## (Intercept)     202.4261408 -738.74531 111.440651
## left.parliament  -0.4255319   40.29109   5.918647

Easier if we've split!

strike.coefs <- lapply (strikes.split[1:3], my.strike.lm)
strike.coefs
## $Australia
##     (Intercept) left.parliament 
##     414.7712254      -0.8638052 
## 
## $Austria
##     (Intercept) left.parliament 
##      423.077279       -8.210886 
## 
## $Belgium
##     (Intercept) left.parliament 
##      -56.926780        8.447463

Easier if we've split!

Combine step:

do.call(cbind, strike.coefs)
##                   Australia    Austria    Belgium
## (Intercept)     414.7712254 423.077279 -56.926780
## left.parliament  -0.8638052  -8.210886   8.447463

Or, one step:

strike.coefs <- sapply (strikes.split[1:3], my.strike.lm)
strike.coefs
##                   Australia    Austria    Belgium
## (Intercept)     414.7712254 423.077279 -56.926780
## left.parliament  -0.8638052  -8.210886   8.447463

R makes it easy!

We can do it in one line!

coefs <- simplify2array(by (strikes, strikes$country, my.strike.lm))
coefs
##                   Australia    Austria    Belgium    Canada     Denmark
## (Intercept)     414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament  -0.8638052  -8.210886   8.447463   17.6766    34.34477
##                  Finland      France   Germany   Ireland      Italy
## (Intercept)     108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament  12.8422  -0.4255319 -1.312305  55.46721   40.29109
##                     Japan Netherlands New.Zealand     Norway    Sweden
## (Intercept)     964.73750  -32.627678    721.3464 -458.22397 513.16704
## left.parliament -24.07595    1.694387    -10.0106   10.46523  -8.62072
##                 Switzerland        UK        USA
## (Intercept)      -5.1988836 936.10154 111.440651
## left.parliament   0.3203399 -13.42792   5.918647

All together now:

plot(coefs[2,],xaxt="n",xlab="",ylab="Regression coefficient", 
     main="Countrywise Labor Activity By Left-Wing Score")
axis(side=1,at=seq(along=colnames(coefs)),labels=colnames(coefs),
las=2,cex.axis=0.5)
abline(h=0,col="grey")

Add some more coefficients

my.strike.lm.better <- function (country.df) {
  lm(strike.volume ~ left.parliament + unemployment + inflation, 
     data=country.df)$coefficients
}
coefs2 <- simplify2array(by (strikes, strikes$country, 
                             my.strike.lm.better))
coefs2[,1:4]
##                   Australia     Austria      Belgium    Canada
## (Intercept)     157.9191118 600.6777769 -243.4822938 167.07123
## left.parliament   0.5658674 -11.2441604   12.4516118  13.43864
## unemployment     -1.1181489 -10.9216990    0.3578217 -48.17903
## inflation        30.4666061  -0.5923972   10.2673539  27.21807

All together now:

plot(coefs[2,],xaxt="n",xlab="",ylab="Regression coefficient", 
     main="Countrywise Labor Activity By Left-Wing Score")
axis(side=1,at=seq(along=colnames(coefs)),labels=colnames(coefs),
las=2,cex.axis=0.5)
points (coefs2[2,], col="red"); abline(h=0,col="grey")

More functions:

by() might have unexpected side effects when we get to higher dimensions. Consider this real example:

print(load ("all-team-seasons-20132014.RData"))
## [1] "teamtable"
head (teamtable)
##      FAC_WIN FAC_LOSE PENL_TAKEN PENL_DRAWN HIT HIT_TAKEN CF CA FF FA SF
## MTL        4        1          2          1   3         6 12  5 10  4  4
## MTL1       3        2          0          0   0         3 11  9  7  5  6
## MTL2       6        7          2          2   9         7 12 13 11  9 10
## MTL3       5        2          2          1   5         6 21 12 16  8  9
## MTL4       2        1          3          0   0         0  2  1  2  1  2
## MTL5       3        2          2          2   0         1  3  1  1  1  1
##      SA GF GA Off Neu Def   TOI team opponent   season gcode scorediffcat
## MTL   3  1  0   2   3   0 632.0  MTL      TOR 20132014 20001            1
## MTL1  3  0  1   1   3   1 459.5  MTL      TOR 20132014 20001            2
## MTL2  8  1  0   3   4   6 641.0  MTL      TOR 20132014 20001            3
## MTL3  5  0  1   3   2   2 641.0  MTL      TOR 20132014 20001            4
## MTL4  1  0  0   2   1   0  91.0  MTL      TOR 20132014 20001            2
## MTL5  1  0  1   4   1   0 190.5  MTL      TOR 20132014 20001            3
##      gamestate home
## MTL          1    1
## MTL1         1    1
## MTL2         1    1
## MTL3         1    1
## MTL4         2    1
## MTL5         2    1
my.agg <- aggregate (teamtable[,2:7], list(teamtable$gcode), sum)
my.agg.2 <- aggregate (teamtable[,1:8], list(teamtable$gcode, teamtable$team), sum)
my.agg.3 <- by (teamtable[,1:8], list(teamtable$team, teamtable$gcode), colSums)
split.teamtable <- split (teamtable, list(teamtable$team, teamtable$gcode))