Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 10 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 10 questions from other labs. Your homework writeup must start the same way as this one, by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 11:59pm on Tuesday November 15. This document contains 22 of the 45 total points for Homework 10.

Split-apply-combine practice with the strikes data

Data on the political economy of strikes, as described in the “Split-Apply-Combine” mini-lectures, is up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv. Read this into your R session and call the resulting data frame strikes.df. Check that it has 625 rows and 8 columns, and display its first 5 rows.
Split strikes.df by country, using the split() function. Call the resulting list strikes.by.country, and show the names of the elements of the list, as well as the first 5 rows of the data frame for Canada.
Using strikes.by.country and sapply(), compute the average unemployment rate for each country. What country has the highest average unemployment rate? The lowest?
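As a reminder of the pattern, here is a minimal sketch of split() followed by sapply() on a small made-up data frame. The names toy.df, toy.by.group, and group.means are hypothetical and are not part of the strikes data.

```r
# Toy data frame (made up for illustration; not the strikes data)
toy.df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Split: a list with one data frame per distinct value of group
toy.by.group <- split(toy.df, f = toy.df$group)
names(toy.by.group)

# Apply and combine: the mean of value within each group,
# returned by sapply() as a named numeric vector
group.means <- sapply(toy.by.group, function(df) mean(df$value))
group.means

# Names of the groups with the largest and smallest means
names(group.means)[which.max(group.means)]
names(group.means)[which.min(group.means)]
```

Because sapply() keeps the list names, which.max() and which.min() can be used to look up the group names directly.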

Hw10 Bonus. Using the map() function from the maps package, draw a map of the world, with the countries in the strikes.df data frame colored according to their average unemployment rate. For the color palette, use terrain.colors(). For all countries not found in the strikes.df data frame, color them in gray.

Using strikes.by.country and sapply(), compute a summary (min, quartiles, max) of the unemployment rate for each country. Study the output—do its dimensions make sense to you?
Using strikes.by.country and just one call to sapply(), compute the average unemployment rate, inflation rate, and strike volume for each country. The output should be a matrix of dimension 3 x 18. Also, with just the one call to sapply(), figure out how to make the output matrix have appropriate row names (of your choosing).
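When the function passed to sapply() returns a vector of the same length for every list element, sapply() binds the results into a matrix with one column per element. The sketch below illustrates this on made-up data; toy.df and the column names x and y are hypothetical.

```r
# Toy data frame (made up for illustration; not the strikes data)
toy.df <- data.frame(
  group = c("a", "a", "b", "b", "c", "c"),
  x = c(1, 3, 10, 20, 5, 7),
  y = c(2, 4, 6, 8, 10, 12)
)
toy.by.group <- split(toy.df, f = toy.df$group)

# summary() of a numeric vector returns 6 numbers, so the result
# has 6 rows and one column per group
x.summaries <- sapply(toy.by.group, function(df) summary(df$x))
dim(x.summaries)

# Returning a named vector from the function gives the matrix its row names
toy.means <- sapply(toy.by.group, function(df) {
  c(mean.x = mean(df$x), mean.y = mean(df$y))
})
toy.means
```

Naming the elements of the returned vector is one way to get row names without a second step; other approaches are possible.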

Hw10 Q1 (8 points). Using split() and sapply(), compute the average unemployment rate, inflation rate, and strike volume for each year in the strikes.df data set. The output should be a matrix of dimension 3 x 35. Show the columns for 1960, 1977, 1980, 1985. Then, display the average unemployment rate by year and the average inflation rate by year, in the same plot. Label the axes and title the plot appropriately. Include an informative legend.
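For the plotting part, one common base R pattern for drawing two series on the same axes is plot() for the first series, lines() for the second, and legend() to tell them apart. The sketch below uses made-up numbers; years, series.1, and series.2 are hypothetical.

```r
# Made-up series for illustration (not computed from the strikes data)
years <- 2001:2005
series.1 <- c(2.0, 2.5, 3.1, 2.8, 3.5)
series.2 <- c(5.0, 4.2, 4.8, 6.1, 5.5)

# Set ylim from both series so that neither is cut off
plot(years, series.1, type = "l", col = "black", lty = 1,
     ylim = range(series.1, series.2),
     xlab = "Year", ylab = "Value", main = "Two series by year")

# Add the second series to the existing plot
lines(years, series.2, col = "red", lty = 2)

# The legend entries must match the colors and line types used above
legend("topleft", legend = c("Series 1", "Series 2"),
       col = c("black", "red"), lty = c(1, 2))
```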

Using strikes.df, split(), and sapply(), compute the average inflation rate for each country, pre and post 1975. The output should be a numeric vector of length 36. (Hint: the hard part here is the splitting. There are several ways to do this. One way is as follows: define a new column (say) yearPre1975 to be the indicator that the year column is less than or equal to 1975. Then define a new column (say) countryPre1975 to be the string concatenation of the country and yearPre1975 columns. Then split on countryPre1975 and proceed as usual.)
Using the result from the last question, compute for each country the difference in average inflation rate between post and pre 1975. Which country had the biggest increase in average inflation rate from pre to post 1975? The biggest decrease?
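The concatenated-key approach described in the hint can be sketched on a toy data frame as follows. All names here (toy.df, early, group.early) and the cutoff of 2 are hypothetical choices made for illustration.

```r
# Toy data frame (made up for illustration; not the strikes data)
toy.df <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  time = c(1, 2, 3, 1, 2, 3),
  value = c(1, 2, 6, 10, 20, 60)
)

# Indicator column: TRUE for the early period (time <= 2)
toy.df$early <- toy.df$time <= 2

# Key column: string concatenation of group and the indicator,
# e.g. "a TRUE" or "b FALSE"
toy.df$group.early <- paste(toy.df$group, toy.df$early)

# Split on the key and average within each piece
toy.by.key <- split(toy.df, f = toy.df$group.early)
key.means <- sapply(toy.by.key, function(df) mean(df$value))
key.means

# Late minus early, per group, looked up by name rather than by position
late.minus.early <- c(
  a = key.means[["a FALSE"]] - key.means[["a TRUE"]],
  b = key.means[["b FALSE"]] - key.means[["b TRUE"]]
)
late.minus.early
```

Indexing by name avoids relying on the order in which split() arranges the pieces.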

Hw10 Q2 (4 points). Show how to compute the average inflation rate for each country pre and post 1975, from strikes.df, using a single call to daply(), i.e., without using any auxiliary columns in strikes.df, like the yearPre1975 and countryPre1975 columns you created above. You will need to have gone through the “Plyr: d*ply()” mini-lecture to do this question, so you might want to come to this one after class on Wednesday or Friday. (Hint: recall the function I().) Check that the results are the same as those you computed above, with split() and sapply().

Linear regressions over the strikes data

In the “Split-Apply-Combine” mini-lecture, we computed the coefficients from regressing strike.volume onto left.parliament, separately for each country in the strikes.df data frame. Following this code structure, regress strike.volume onto left.parliament, unemployment, and inflation, separately for each country. The output should be a matrix of dimension 4 x 18 (1 row for the intercept, then 3 rows for the coefficients of left.parliament, unemployment, inflation). Display the columns for Belgium, Canada, UK, and USA.
Following the code at the end of the “Split-Apply-Combine” mini-lecture, plot the coefficients of left.parliament, from the countrywise regressions of strike.volume onto left.parliament, unemployment, inflation.
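The general pattern of fitting one regression per group and collecting the coefficients into a matrix can be sketched on simulated data as below. The names toy.df, fit.coefs, and toy.coefs, and the variables y, x1, x2, are hypothetical; the mini-lecture code may be organized differently.

```r
# Simulated toy data (not the strikes data): 3 groups, 10 rows each
set.seed(1)
toy.df <- data.frame(
  group = rep(c("a", "b", "c"), each = 10),
  x1 = rnorm(30),
  x2 = rnorm(30)
)
toy.df$y <- 1 + 2 * toy.df$x1 - toy.df$x2 + rnorm(30, sd = 0.1)

# Fit a linear model on one piece and return its coefficient vector
fit.coefs <- function(df) coef(lm(y ~ x1 + x2, data = df))

# One column per group; rows are the intercept and the two slopes
toy.coefs <- sapply(split(toy.df, f = toy.df$group), fit.coefs)
dim(toy.coefs)
rownames(toy.coefs)

# Plot the x1 coefficient for each group, labeling the x axis by group name
plot(toy.coefs["x1", ], xaxt = "n", xlab = "Group",
     ylab = "Coefficient of x1", main = "Groupwise coefficients of x1")
axis(side = 1, at = 1:ncol(toy.coefs), labels = colnames(toy.coefs))
abline(h = 0, col = "grey")
```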

Hw10 Q3 (10 points). Modify your code for computing the coefficients from regressing strike.volume onto left.parliament, unemployment, and inflation, separately for each country in the strikes.df data frame, so that instead of just reporting the coefficients, you also report their standard errors. (Hint: you will need to figure out how to extract the standard errors from the result of calling summary() on the object returned by lm(). Look at the solution to one of the bonus questions on Hw9.) The output should be a matrix of dimension 8 x 18 (1 row for the intercept, 3 rows for the coefficients of left.parliament, unemployment, inflation, and 4 rows for their standard errors). Display the columns for Belgium, Canada, UK, and USA.
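As background for the hint, the sketch below shows on simulated data what the coefficient table of a summary() of an lm() fit looks like. This is one possible route and not necessarily the one used in the Hw9 solution; toy.df, fit, coef.table, and std.errs are hypothetical names.

```r
# Simulated toy data (not the strikes data)
set.seed(1)
toy.df <- data.frame(x = rnorm(20))
toy.df$y <- 1 + 2 * toy.df$x + rnorm(20)

fit <- lm(y ~ x, data = toy.df)

# coef() applied to the summary returns a matrix with one row per
# coefficient and columns for the estimate, standard error, t value, p-value
coef.table <- coef(summary(fit))
colnames(coef.table)

# Pull out the standard error column as a named vector
std.errs <- coef.table[, "Std. Error"]
std.errs
```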

Finally, reproduce your plot from the last question of the coefficients of left.parliament, from the countrywise regressions of strike.volume onto left.parliament, unemployment, inflation. But now, on top of each point (each one denoting the coefficient value of left.parliament for a different country), draw a vertical line segment through this point, extending from the coefficient value minus one standard error to the coefficient value plus one standard error. (Hint: segments().) Make sure that these line segments do not extend past the y limits on your plot. For how many countries do their line segments (from the coefficient value minus one standard error to the coefficient value plus one standard error) not intersect the 0 line? Which ones are they?
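A minimal sketch of drawing plus-or-minus one standard error bars with segments() is given below. The estimates and standard errors are made-up numbers, and est, se, lower, and upper are hypothetical names.

```r
# Made-up estimates and standard errors (not from the strikes data)
est <- c(a = 0.5, b = -1.2, c = 2.0)
se <- c(a = 0.2, b = 0.4, c = 2.5)
lower <- est - se
upper <- est + se

# Choose ylim from the segment endpoints so that no segment is clipped
plot(seq_along(est), est, ylim = range(lower, upper), xaxt = "n",
     xlab = "Group", ylab = "Estimate", main = "Estimates with +/- 1 SE")
axis(side = 1, at = seq_along(est), labels = names(est))

# One vertical segment per point, from (x, lower) to (x, upper)
segments(x0 = seq_along(est), y0 = lower, x1 = seq_along(est), y1 = upper)
abline(h = 0, lty = 2)

# A segment misses the 0 line when both of its endpoints lie on the same side of 0
excludes.zero <- lower > 0 | upper < 0
names(est)[excludes.zero]
```

Setting ylim before calling segments() matters because segments() draws on the existing plot and does not rescale the axes.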

Lab 11m: Split-Apply-Combine

Statistical Computing, 36-350

Monday November 7, 2016
