Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Thursday 10pm, this week.
This week’s agenda: practicing split-apply-combine, getting familiar with plyr functions.
Data on the political economy of strikes (from Bruce Western, in the Sociology Department at Harvard University) is available at http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/strikes.csv. The data cover 18 countries over 35 years. The measured variables:
country, year: country and year of data collection
strike.volume: days on strike per 1000 workers
unemployment: unemployment rate
inflation: inflation rate
left.parliament: leftwing share of the government
centralization: centralization of unions
density: density of unions
We read it into our R session below.
strikes.df =
  read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/strikes.csv")
head(strikes.df, 3)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43
## 2 Australia 1952 397 2.2 17.2 43
## 3 Australia 1953 360 2.5 4.3 43
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
1a. Split strikes.df by country, using the split() function. Call the resulting list strikes.by.country, and display the names of the elements of the list, as well as the first 3 rows of the data frame for Canada.
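As a reminder of how split() behaves, here is a minimal sketch on a small made-up data frame. The names toy.df and toy.by.group, and the data themselves, are hypothetical and are not part of the strikes data.

```r
# Toy data frame (hypothetical, not the strikes data)
toy.df = data.frame(group = c("a", "b", "a", "b", "c"),
                    value = c(1, 2, 3, 4, 5))

# split() returns a list with one data frame per level of the splitting variable
toy.by.group = split(toy.df, f = toy.df$group)

names(toy.by.group)        # names of the list elements
head(toy.by.group$a, 3)    # first rows (up to 3) of the piece for group "a"
```

Each element of the list is itself a data frame with the same columns as the original.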
1b. Using strikes.by.country and sapply(), compute the average unemployment rate for each country. What country has the highest average unemployment rate? The lowest?
1c. Using strikes.by.country and sapply(), compute a summary (min, quartiles, max) of the unemployment rate for each country. Display the output matrix; do its dimensions make sense to you?
1d. Using strikes.by.country and just one call to sapply(), compute the average unemployment rate, inflation rate, and strike volume for each country. The output should be a matrix of dimension 3 x 18; display it. Challenge: with just the one call to sapply(), figure out how to make the output matrix have appropriate row names (of your choosing).
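When the function passed to sapply() returns a vector of the same length for every list element, the results are bound together as the columns of a matrix. The sketch below illustrates this on made-up data; toy.df, toy.by.group, and the statistic names are hypothetical, and this is only one possible way to get row names.

```r
# Toy data (hypothetical): two groups, two numeric columns
toy.df = data.frame(group = c("a", "a", "b", "b"),
                    x = c(1, 3, 5, 7),
                    y = c(10, 20, 30, 50))
toy.by.group = split(toy.df, f = toy.df$group)

# The function returns a named vector of length 2, so sapply() produces a
# matrix with one column per group; the vector names become the row names
toy.means = sapply(toy.by.group, function(df) {
  c(mean.x = mean(df$x), mean.y = mean(df$y))
})

toy.means
dim(toy.means)   # rows are statistics, columns are groups
```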
2a. Using split() and sapply(), compute the average unemployment rate, inflation rate, and strike volume for each year in the strikes.df data set. The output should be a matrix of dimension 3 x 35; display the columns for 1960, 1977, 1980, 1985.
2b. Display the average unemployment rate by year and the average inflation rate by year, in the same plot. Label the axes and title the plot appropriately. Include an informative legend.
2c. Using split() and sapply(), compute the average unemployment rate for each country, pre and post 1975. The output should be a numeric vector of length 36; display the first 5 entries. Hint: the hard part here is the splitting. There are several ways to do this. One way is as follows: define a new column (say) yearPre1975 to be the indicator that the year column is less than or equal to 1975. Then define a new column (say) countryPre1975 to be the string concatenation of the country and yearPre1975 columns. Then split on countryPre1975 and proceed as usual.
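The three steps in the hint can be sketched on a small made-up data frame as follows. The column names yrEarly and groupEarly, the cutoff of 2, and the data are all hypothetical stand-ins, chosen only to show the pattern.

```r
# Toy data (hypothetical): two groups observed over three "years"
toy.df = data.frame(group = c("a", "a", "a", "b", "b", "b"),
                    yr = c(1, 2, 3, 1, 2, 3),
                    x = c(2, 4, 9, 1, 3, 8))

# Step 1: indicator column, TRUE when yr is at or below the cutoff
toy.df$yrEarly = toy.df$yr <= 2
# Step 2: concatenate the group label and the indicator into one splitting key
toy.df$groupEarly = paste(toy.df$group, toy.df$yrEarly)
# Step 3: split on the combined key, then apply as usual
toy.means = sapply(split(toy.df, f = toy.df$groupEarly),
                   function(df) mean(df$x))

toy.means   # one entry per (group, indicator) combination
```

With 2 groups and 2 indicator values the result has 4 entries, in the same way that 18 countries and 2 periods give 36.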
2d. Compute for each country the difference in average unemployment post and pre 1975. Which country had the biggest increase in average unemployment from pre to post 1975? The biggest decrease?
3a. In part I of this week’s lecture, we computed the coefficients from regressing strike.volume onto left.parliament, separately for each country in the strikes.df data frame. Following this code example, regress strike.volume onto left.parliament, unemployment, and inflation, separately for each country. The output should be a matrix of dimension 4 x 18 (1 row for the intercept, then 3 rows for the coefficients of left.parliament, unemployment, inflation). Display the columns for Belgium, Canada, UK, and USA.
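For reference, a groupwise regression with split() and sapply() might look like the sketch below. It uses made-up data with a single predictor, and is not necessarily the same as the code shown in lecture; the names toy.df and toy.coefs are hypothetical.

```r
# Toy data (hypothetical): y is exactly linear in x within each group
toy.df = data.frame(group = rep(c("a", "b"), each = 5),
                    x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                    y = c(2, 4, 6, 8, 10, 5, 4, 3, 2, 1))

# Fit one regression per group and keep only the coefficient vector;
# sapply() binds the vectors into a matrix with one column per group
toy.coefs = sapply(split(toy.df, f = toy.df$group),
                   function(df) coef(lm(y ~ x, data = df)))

toy.coefs   # rows: "(Intercept)" and "x"; columns: groups
```

Adding more predictors to the formula adds one row per predictor to the output matrix.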
3b. Again following the code example from lecture, plot the coefficients of left.parliament, from the countrywise regressions of strike.volume onto left.parliament, unemployment, inflation. Does this plot look all that different from the one in lecture?
Challenge. Modify your code for Q3a so that instead of just reporting regression coefficients, you also report their standard errors. Hint: you’ll need to remember how to extract the standard errors from calling summary() on the object returned by lm(); look back at Q2b from last week’s homework. The output should be a matrix of dimension 8 x 18 (1 row for the intercept, 3 rows for the coefficients of left.parliament, unemployment, inflation, and 4 rows for their standard errors). Display the columns for Belgium, Canada, UK, and USA.
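As a refresher on where the standard errors live, the sketch below fits a single regression to simulated data and pulls both columns out of the coefficient table returned by summary(). The data and the names toy.lm, coef.table, and toy.out are hypothetical; this is one way to do it, not the only one.

```r
# Simulated toy data (hypothetical)
set.seed(1)
toy.df = data.frame(x1 = rnorm(20), x2 = rnorm(20))
toy.df$y = 1 + 2 * toy.df$x1 - toy.df$x2 + rnorm(20)

toy.lm = lm(y ~ x1 + x2, data = toy.df)

# summary(...)$coefficients is a matrix with columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coef.table = summary(toy.lm)$coefficients

# Stack the estimates on top of their standard errors in one vector
toy.out = c(coef.table[, "Estimate"], coef.table[, "Std. Error"])
toy.out
```

Returning a vector like toy.out from the function given to sapply() would yield one column per group, with estimates in the top half and standard errors in the bottom half.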
Challenge. Reproduce your plot from Q3b, and now on top of each point (each denoting the coefficient value of left.parliament for a different country) draw a vertical line segment through this point, extending from the coefficient value minus one standard error to the coefficient value plus one standard error. Hint: use segments(). Make sure that these line segments do not extend past the y limits on your plot. For how many countries do their line segments (from the coefficient value minus one standard error to the coefficient value plus one standard error) not intersect the 0 line? Which ones are they?
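A minimal sketch of the segments() idea, using made-up estimates and standard errors (the vectors est and se and their values are hypothetical):

```r
# Hypothetical coefficient estimates and standard errors for three groups
est = c(a = 0.5, b = -1.2, c = 2.0)
se = c(a = 0.8, b = 0.4, c = 0.5)
lower = est - se
upper = est + se

# Set ylim from the segment endpoints so that no segment is cut off
plot(seq_along(est), est, ylim = range(lower, upper), xaxt = "n",
     xlab = "Group", ylab = "Coefficient estimate",
     main = "Estimates plus/minus one standard error")
axis(1, at = seq_along(est), labels = names(est))
segments(x0 = seq_along(est), y0 = lower, x1 = seq_along(est), y1 = upper)
abline(h = 0, lty = 2)   # reference line at 0

# Groups whose segment lies entirely above or entirely below 0
excludes.zero = names(est)[lower > 0 | upper < 0]
excludes.zero
```

Computing the y limits from the lower and upper endpoints, rather than from the estimates alone, is what keeps the segments inside the plotting region.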
4a. Install the package plyr if you haven’t done so already, and load it into your R session with library(plyr).
4b. Repeat Q1b, but now using an appropriate function from the plyr package to solve the question. Hint: you shouldn’t have to use strikes.by.country at all; you should only need one call to d*ply() (where * is of your choosing).
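To recall how the output type depends on the letter chosen for *, here is a small sketch on made-up data, assuming plyr is installed. The names toy.df, toy.fun, and the result objects are hypothetical.

```r
library(plyr)

# Toy data (hypothetical)
toy.df = data.frame(group = c("a", "a", "b", "b"),
                    x = c(1, 3, 5, 7))
toy.fun = function(df) mean(df$x)

# Same input and function; only the output format changes
res.array = daply(toy.df, .(group), toy.fun)   # array (here, one entry per group)
res.df = ddply(toy.df, .(group), toy.fun)      # data frame, one row per group
res.list = dlply(toy.df, .(group), toy.fun)    # list, one element per group

res.array
res.df
res.list
```

The first letter (d) says the input is a data frame; the second letter says what the pieces are combined into.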
4c. Repeat Q1c, again using an appropriate plyr function. Hint: use dlply(). Challenge: using daply() or ddply() likely won’t work. That is, if the .fun argument is a function that returns the output of summary() directly, then they won’t work. Explain why. Then show how to fix this, and use them to produce an array or a data frame with the correct summary statistics for each country.
4d. Repeat Q2c, again using an appropriate plyr function. Hint: your solution should be particularly simple compared to your solution to Q2c, as you can just use a single call to daply() without creating any additional columns in strikes.df, like you did in Q2c. Also, remember that you can use I() to create indicator variables on-the-fly.
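The on-the-fly indicator from the hint can be sketched on made-up data as follows, assuming plyr is installed. The data, the cutoff of 2, and the names toy.df and toy.means are hypothetical.

```r
library(plyr)

# Toy data (hypothetical): two groups observed over three "years"
toy.df = data.frame(group = rep(c("a", "b"), each = 3),
                    yr = rep(1:3, times = 2),
                    x = c(2, 4, 9, 1, 3, 8))

# Split on group and on an indicator built inside .(), with no new columns
toy.means = daply(toy.df, .(group, I(yr <= 2)), function(df) mean(df$x))

toy.means
dim(toy.means)   # one row per group, one column per indicator value
```

Splitting on two variables gives a two-dimensional array, so the result may need to be flattened (for example with as.numeric()) if a plain vector is wanted.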