Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Thursday 10pm, this week.

This week’s agenda: practicing split-apply-combine, getting familiar with plyr functions.

Strikes data set

Data on the political economy of strikes (from Bruce Western, in the Sociology Department at Harvard University) is up at http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/strikes.csv. The data set covers 18 countries over 35 years. The measured variables:

country, year: country and year of data collection
strike.volume: days on strike per 1000 workers
unemployment: unemployment rate
inflation: inflation rate
left.parliament: leftwing share of the government
centralization: centralization of unions
density: density of unions

We read it into our R session below.

strikes.df = 
  read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/strikes.csv")
head(strikes.df, 3)

##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8              43
## 2 Australia 1952           397          2.2      17.2              43
## 3 Australia 1953           360          2.5       4.3              43
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA

Splitting by country

1a. Split strikes.df by country, using the split() function. Call the resulting list strikes.by.country, and display the names of the elements of the list, as well as the first 3 rows of the data frame for Canada.
1b. Using strikes.by.country and sapply(), compute the average unemployment rate for each country. What country has the highest average unemployment rate? The lowest?
1c. Using strikes.by.country and sapply(), compute a summary (min, quartiles, max) of the unemployment rate for each country. Display the output matrix; do its dimensions make sense to you?
1d. Using strikes.by.country and just one call to sapply(), compute the average unemployment rate, inflation rate, and strike volume for each country. The output should be a matrix of dimension 3 x 18; display it. Challenge: with just the one call to sapply(), figure out how to make the output matrix have appropriate row names (of your choosing).
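As a reminder of how split() and sapply() fit together, here is a minimal sketch on a toy data frame. The names toy.df, group, x, and y are made up for illustration and are not part of the strikes data; the same pattern should carry over, but treat this as a sketch rather than a solution.

```r
# Toy data (hypothetical values, not the strikes data)
toy.df = data.frame(
  group = c("a", "a", "b", "b", "b"),
  x = c(1, 3, 2, 4, 6),
  y = c(10, 20, 30, 40, 50)
)

# Split: one data frame per level of group
toy.by.group = split(toy.df, f = toy.df$group)
names(toy.by.group)

# Apply and combine: one number per group gives a named vector
x.means = sapply(toy.by.group, function(df) mean(df$x))
x.means

# Apply and combine: a vector per group gives a matrix,
# with one column per group and one row per returned entry
xy.means = sapply(toy.by.group, function(df) {
  c(x.mean = mean(df$x), y.mean = mean(df$y))
})
xy.means
dim(xy.means)
```

Note that sapply() simplifies to a matrix only when every group returns a vector of the same length; otherwise it falls back to a list.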

Splitting by year

2a. Using split() and sapply(), compute the average unemployment rate, inflation rate, and strike volume for each year in the strikes.df data set. The output should be a matrix of dimension 3 x 35; display the columns for 1960, 1977, 1980, 1985.
2b. Display the average unemployment rate by year and the average inflation rate by year, in the same plot. Label the axes and title the plot appropriately. Include an informative legend.
2c. Using split() and sapply(), compute the average unemployment rate for each country, pre and post 1975. The output should be a numeric vector of length 36; display the first 5 entries. Hint: the hard part here is the splitting. There are several ways to do this. One way is as follows: define a new column (say) yearPre1975 to be the indicator that the year column is less than or equal to 1975. Then define a new column (say) countryPre1975 to be the string concatenation of the country and yearPre1975 columns. Then split on countryPre1975 and proceed as usual.
2d. Compute for each country the difference in average unemployment post and pre 1975. Which country had the biggest increase in average unemployment from pre to post 1975? The biggest decrease?
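The splitting strategy described in the hint for Q2c can be sketched on toy data as follows. The data frame and the column names (group, time, early, group.early) and the cutoff of 2 are hypothetical stand-ins chosen for illustration; this shows only the mechanics of splitting on a concatenated key.

```r
# Toy data (hypothetical values, not the strikes data)
toy.df = data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  time = c(1, 2, 3, 1, 2, 3),
  x = c(1, 3, 5, 2, 4, 9)
)

# Indicator column: is the time at or before the cutoff?
toy.df$early = toy.df$time <= 2

# Concatenated key: one distinct string per (group, early) pair
toy.df$group.early = paste(toy.df$group, toy.df$early)

# Split on the key, then average within each piece
toy.by.key = split(toy.df, f = toy.df$group.early)
x.means = sapply(toy.by.key, function(df) mean(df$x))
x.means

# Late minus early average for each group, by indexing on the key names
x.diffs = x.means[c("a FALSE", "b FALSE")] - x.means[c("a TRUE", "b TRUE")]
names(x.diffs) = c("a", "b")
x.diffs
```

The result has one entry per key, so with G groups and a two-level indicator it has length 2G, provided every combination actually occurs in the data.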

Many linear regressions

3a. In part I of this week’s lecture, we computed the coefficients from regressing strike.volume onto left.parliament, separately for each country in the strikes.df data frame. Following this code example, regress strike.volume onto left.parliament, unemployment, and inflation, separately for each country. The output should be a matrix of dimension 4 x 18 (1 row for the intercept, then 3 rows for the coefficients of left.parliament, unemployment, inflation). Display the columns for Belgium, Canada, UK, and USA.
3b. Again following the code example from lecture, plot the coefficients of left.parliament, from the countrywise regressions of strike.volume onto left.parliament, unemployment, inflation. Does this plot look all that different from the one in lecture?
Challenge. Modify your code for Q3a so that instead of just reporting regression coefficients, you also report their standard errors. Hint: you’ll need to remember how to extract the standard errors from the result of calling summary() on the object returned by lm(); look back at Q2b from last week’s homework. The output should be a matrix of dimension 8 x 18 (1 row for the intercept, 3 rows for the coefficients of left.parliament, unemployment, inflation, and 4 rows for their standard errors). Display the columns for Belgium, Canada, UK, and USA.
Challenge. Reproduce your plot from Q3b, and now on top of each point (which denotes the coefficient value of left.parliament for a particular country), draw a vertical line segment through this point, extending from the coefficient value minus one standard error to the coefficient value plus one standard error. Hint: use segments(). Make sure that these line segments do not extend past the y limits on your plot. For how many countries do their line segments (from the coefficient value minus one standard error to the coefficient value plus one standard error) not intersect the 0 line? Which ones are they?
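For the two challenges above, the mechanics of pulling standard errors out of summary() and drawing error bars with segments() can be sketched on simulated data. Everything here (toy.df, x1, x2, y, and the simulated coefficients) is made up for illustration; it is one possible approach, not necessarily the one used in lecture.

```r
# Simulated toy data (hypothetical, not the strikes data)
set.seed(1)
toy.df = data.frame(x1 = rnorm(20), x2 = rnorm(20))
toy.df$y = 1 + 2 * toy.df$x1 - toy.df$x2 + rnorm(20)

toy.lm = lm(y ~ x1 + x2, data = toy.df)

# summary() of an lm object has a coefficients matrix with columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coef.table = summary(toy.lm)$coefficients
est = coef.table[, "Estimate"]
se = coef.table[, "Std. Error"]

# Estimates followed by standard errors, as one vector
est.and.se = c(est, se)
est.and.se

# Plot estimates with +/- one standard error bars; setting ylim from
# the bar endpoints keeps the segments inside the plotting region
idx = seq_along(est)
plot(idx, est, ylim = range(c(est - se, est + se)), xaxt = "n",
     xlab = "Coefficient", ylab = "Estimate",
     main = "Estimates with one standard error bars")
axis(side = 1, at = idx, labels = names(est))
segments(x0 = idx, y0 = est - se, x1 = idx, y1 = est + se)
abline(h = 0, lty = 2)
```

A bar that does not cross the dashed line at 0 corresponds to an estimate more than one standard error away from zero.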

Plyr practice

4a. Install the package plyr if you haven’t done so already, and load it into your R session with library(plyr).
4b. Repeat Q1b, but now using an appropriate function from the plyr package to solve the question. Hint: you shouldn’t have to use strikes.by.country at all, you should only need one call to d*ply() (where * is of your choosing).
4c. Repeat Q1c, again using an appropriate plyr function. Hint: use dlply(). Challenge: using daply() or ddply() likely won’t work. That is, if the .fun argument is a function that returns the output of summary() directly, then they won’t work. Explain why. Then show how to fix this, and use them to produce an array or a data frame with the correct summary statistics for each country.
4d. Repeat Q2c, again using an appropriate plyr function. Hint: your solution should be particularly simple compared to your solution to Q2c, as you can just use a single call to daply() without creating any additional columns in strikes.df, like you did in Q2c. Also, remember that you can use I() to create indicator variables on-the-fly.
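As a quick reference for the d*ply() family, the sketch below applies daply(), ddply(), and dlply() to the same toy data frame, so that the three output types can be compared side by side. It assumes the plyr package is installed; toy.df, group, and x are hypothetical names used only for illustration.

```r
library(plyr)

# Toy data (hypothetical values, not the strikes data)
toy.df = data.frame(
  group = c("a", "a", "b", "b", "b"),
  x = c(1, 3, 2, 4, 6)
)

# daply(): data frame in, array out (here a named vector)
x.means = daply(toy.df, .(group), function(df) mean(df$x))
x.means

# ddply(): data frame in, data frame out (one row per group)
x.means.df = ddply(toy.df, .(group), function(df) c(x.mean = mean(df$x)))
x.means.df

# dlply(): data frame in, list out (one element per group)
x.summaries = dlply(toy.df, .(group), function(df) summary(df$x))
x.summaries
```

In each call the splitting variable is given inside .(), so there is no need to call split() beforehand.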

Lab 10: Plyr and Split-Apply-Combine

Statistical Computing, 36-350

Week of Tuesday April 3, 2018
