Statistical Computing, 36-350
Monday November 9, 2016
for()
We’ve learned some tools in R for iteration without explicit for()
loops:
apply()
: apply a function to rows or columns of a matrix or data framelapply()
: apply a function to elements of a list or vectorsapply()
: same as the above, but simplify the output (if possible)tapply()
: apply a function to levels of a factor vectorToday we will learn a general strategy that can be summmarized in three conceptual steps:
These are conceptual steps; often the apply and combine steps can be performed for us by a single call to the appropriate function from the apply()
family
Does split-apply-combine sound simple? It is, but it’s very powerful when combined with the right data structures
for()
loops, often requires far less codeData set on 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables:
country
, year
: country and year of data collectionstrike.volume
: days on strike per 1000 workersunemployment
: unemployment rateinflation
: inflation rateleft.parliament
: leftwing share of the govermentcentralization
: centralization of unionsdensity
: density of unionsstrikes.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
dim(strikes.df) # Since 18 × 35 = 630, some years missing from some countries
## [1] 625 8
head(strikes.df)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
Is there a relationship between a country’s ruling party alignment (left versus right) and the volume of strikes?
How could we approach this?
for()
loop, where we loop over countriessapply()
Step 0: design a function that will work on one chunk of data
So let’s write code to do regression on the data from (say) just Italy
strikes.df.italy = strikes.df[strikes.df$country=="Italy", ] # Data for Italy
dim(strikes.df.italy)
## [1] 35 8
head(strikes.df.italy)
## country year strike.volume unemployment inflation left.parliament
## 311 Italy 1951 437 8.8 14.3 37.5
## 312 Italy 1952 337 9.5 1.9 37.5
## 313 Italy 1953 545 10.0 1.4 40.2
## 314 Italy 1954 493 8.7 2.4 40.2
## 315 Italy 1955 511 7.5 2.3 40.2
## 316 Italy 1956 372 9.3 3.4 40.2
## centralization density
## 311 0.2513799 NA
## 312 0.2489860 NA
## 313 0.2482739 NA
## 314 0.2466577 NA
## 315 0.2540366 NA
## 316 0.2457069 NA
italy.lm = lm(strike.volume ~ left.parliament, data=strikes.df.italy)
summary(italy.lm)
##
## Call:
## lm(formula = strike.volume ~ left.parliament, data = strikes.df.italy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -930.2 -411.6 -137.3 387.2 1901.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -738.75 1200.62 -0.615 0.543
## left.parliament 40.29 27.76 1.451 0.156
##
## Residual standard error: 583.3 on 33 degrees of freedom
## Multiple R-squared: 0.05999, Adjusted R-squared: 0.0315
## F-statistic: 2.106 on 1 and 33 DF, p-value: 0.1562
plot(strikes.df.italy$left.parliament, strikes.df.italy$strike.volume,
main="Italy strike volume versus leftwing alignment",
ylab="Strike volume", xlab="Leftwing alignment")
abline(coef(italy.lm), col=2)
Now let’s turn this into a function
my.strike.lm = function(country.df) {
coef(lm(strike.volume ~ left.parliament, data=country.df))
}
my.strike.lm(strikes.df.italy) # New way
## (Intercept) left.parliament
## -738.74531 40.29109
coef(italy.lm) # Old way, same result
## (Intercept) left.parliament
## -738.74531 40.29109
Step 1: split our data into appropriate chunks, each of which can be handled by our function. Here, the function split()
is often helpful: split(df, f=my.factor)
splits a data frame df
into several data frames, defined by constant levels of the factor my.factor
So we want to split strikes.df
into 18 smaller data frames, each of which has the data for just one country
strikes.by.country = split(strikes.df, f=strikes.df$country)
class(strikes.by.country) # It's a list
## [1] "list"
names(strikes.by.country) # It has one element for each country
## [1] "Australia" "Austria" "Belgium" "Canada" "Denmark"
## [6] "Finland" "France" "Germany" "Ireland" "Italy"
## [11] "Japan" "Netherlands" "New.Zealand" "Norway" "Sweden"
## [16] "Switzerland" "UK" "USA"
head(strikes.by.country$Italy) # Same as what we saw before
## country year strike.volume unemployment inflation left.parliament
## 311 Italy 1951 437 8.8 14.3 37.5
## 312 Italy 1952 337 9.5 1.9 37.5
## 313 Italy 1953 545 10.0 1.4 40.2
## 314 Italy 1954 493 8.7 2.4 40.2
## 315 Italy 1955 511 7.5 2.3 40.2
## 316 Italy 1956 372 9.3 3.4 40.2
## centralization density
## 311 0.2513799 NA
## 312 0.2489860 NA
## 313 0.2482739 NA
## 314 0.2466577 NA
## 315 0.2540366 NA
## 316 0.2457069 NA
Steps 2 and 3: apply our function to each chunk of data, and combine the results. Here, the functions lapply()
or sapply()
are often helpful
So we want to apply my.strikes.lm()
to each data frame in strikes.by.country
. Think about what the output will be from each function call: vector of length 2 (intercept and slope), so we can use sapply()
strikes.coefs = sapply(strikes.by.country, FUN=my.strike.lm)
strikes.coefs
## Australia Austria Belgium Canada Denmark
## (Intercept) 414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
## left.parliament -0.8638052 -8.210886 8.447463 17.6766 34.34477
## Finland France Germany Ireland Italy
## (Intercept) 108.2245 202.4261408 95.657134 -94.78661 -738.74531
## left.parliament 12.8422 -0.4255319 -1.312305 55.46721 40.29109
## Japan Netherlands New.Zealand Norway Sweden
## (Intercept) 964.73750 -32.627678 721.3464 -458.22397 513.16704
## left.parliament -24.07595 1.694387 -10.0106 10.46523 -8.62072
## Switzerland UK USA
## (Intercept) -5.1988836 936.10154 111.440651
## left.parliament 0.3203399 -13.42792 5.918647
# We don't care about the intercepts, only the slopes (2nd row).
# Some are positive, some are negative! Let's plot them:
plot(1:ncol(strikes.coefs), strikes.coefs[2,], xaxt="n",
xlab="", ylab="Regression coefficient",
main="Countrywise labor activity by leftwing score")
axis(side=1, at=1:ncol(strikes.coefs),
labels=colnames(strikes.coefs), las=2, cex.axis=0.5)
abline(h=0, col="grey")