Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 6 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 6 questions from other labs. Your homework writeup must start as this one does: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 16. This document contains 15 of the 45 total points for Homework 6.

The states data frame

Below we construct a data frame of 50 states x 10 variables. The first 8 are numeric, and the last 2 are factors. You don’t have to do anything yet. (For more information on exactly what is being measured here, type ?state in your console.)

state.df = data.frame(state.x77, Region=state.region, Division=state.division)
head(state.df)

##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area Region           Division
## Alabama     50708  South East South Central
## Alaska     566432   West            Pacific
## Arizona    113417   West           Mountain
## Arkansas    51945  South West South Central
## California 156361   West            Pacific
## Colorado   103766   West           Mountain

Add a column to state.df, containing the state abbreviations that are stored in the (existing) vector state.abb. Name this column Abbr. You can do this in (at least) two ways: by using a call to data.frame(), or by directly defining state.df$Abbr. Display the first 3 rows and all 11 columns of state.df.
Remove the Region column from state.df. You can do this in (at least) two ways: by using negative indexing, or by directly setting state.df$Region to be NULL. Display the first 3 rows and all 10 columns of state.df.
Add two columns to state.df, containing the x and y coordinates (longitude and latitude) of the center of the states, that are stored in the (existing) list state.center. (Hint: take a look at this list in the console, to see what its elements are named.) Name these two columns Center.x and Center.y. Display the first 3 rows and all 12 columns of state.df.
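
The column operations described in the three tasks above can be tried first on a small example. The sketch below is only an illustration of the mechanics: toy.df, toy.abb, and toy.center are hypothetical objects made up for this example, not part of the lab data.

```r
# Hypothetical toy data, not part of the lab
toy.df = data.frame(Height=c(1.5, 1.8, 1.7), Group=factor(c("a", "b", "a")),
                    row.names=c("P1", "P2", "P3"))
toy.abb = c("p1", "p2", "p3")
toy.center = list(x=c(10, 20, 30), y=c(-1, -2, -3))

# Add a column by direct assignment
toy.df$Abbr = toy.abb

# Remove a column by setting it to NULL (negative indexing would also work)
toy.df$Group = NULL

# Add two columns taken from the named elements of a list
toy.df = data.frame(toy.df, Center.x=toy.center$x, Center.y=toy.center$y)

head(toy.df, 3)
```

Direct assignment with $ modifies the data frame in place, while the call to data.frame() builds a new one. Either approach should work for the tasks above.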

Hw6 Q1 (6 points). Plot the state centers in state.df, i.e., plot the Center.y column (on the y-axis) versus the Center.x column (on the x-axis). Use the regular point type, but set cex=5 to get very large empty circles. Then, in the center of these empty circles, draw the state abbreviations. (Hint: use text().) Label the axes and title the plot appropriately.

Now let’s do something more interesting with colors. Plot the state centers, with cex=5 again, but this time use filled circles, with colors that reflect the values in the Frost column. The highest Frost value should be assigned a light blue color, and the lowest Frost value a pink color, with appropriate interpolation of colors in between. (Hint: recall colorRampPalette(), and the function get.col.from.val(), from the “Curves, Surfaces, and Colors” mini-lecture.) Then, again, in the center of these filled circles, draw the state abbreviations, label the axes, and title the plot appropriately. Does the plot make sense to you, i.e., do you see an expected geographic pattern in where Frost (the average number of days with minimum temperature below freezing) tends to be highest?
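
One possible way to turn numeric values into interpolated colors is sketched below. It does not reproduce get.col.from.val() from the mini-lecture; it is an assumed alternative that bins the values with cut() and looks up each bin in a palette built by colorRampPalette(). The vector vals and the choice of 20 colors are hypothetical.

```r
# Hypothetical values to be mapped to colors
vals = c(0, 50, 100, 188)

# Build a palette of 20 colors interpolated from pink to light blue
n.cols = 20
pal = colorRampPalette(c("pink", "lightblue"))(n.cols)

# Cut the range of vals into n.cols equal-width bins;
# each value then receives the color of its bin
bins = cut(vals, breaks=n.cols, labels=FALSE)
cols = pal[bins]

# Filled circles (pch=19), colored by value
plot(seq_along(vals), vals, pch=19, cex=5, col=cols,
     xlab="Index", ylab="Value", main="Values mapped to colors")
```

With this binning, the smallest value gets the first palette color (pink) and the largest gets the last (light blue).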

Access tasks with the states data frame

What is the illiteracy percentage for Pennsylvania? The illiteracy percentage for West Virginia? Display the illiteracy percentages for all states in the Middle Atlantic division, and for all states in the South Atlantic division. Finally, report the median illiteracy percentage for all states in the Middle Atlantic division, and for all states in the South Atlantic division. (Hint: your entire answer to this part should be 6 lines of code.)
Plot, using state.df, the life expectancy versus average number of days with frost, across the states. Also plot the life expectancy versus the murder rate of the states. In each case, use appropriate axis labels. Do you notice a trend, in each case?
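
The lookups in the first task above combine indexing by row name with logical indexing on a factor column. A minimal sketch on a hypothetical data frame toy.df (not the lab data) might look like this:

```r
# Hypothetical toy data, not part of the lab
toy.df = data.frame(Score=c(1.0, 2.5, 0.5, 3.0, 2.0),
                    Division=factor(c("North", "South", "North", "South", "South")),
                    row.names=c("A", "B", "C", "D", "E"))

# One value, selected by row name and column name
toy.df["B", "Score"]

# All values in one division, selected with a logical index
south = toy.df[toy.df$Division == "South", "Score"]
south

# Median within that division
median(south)
```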

Hw6 Bonus. In each of the last two plots, add the line of best fit.
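
A common way to add a line of best fit to an existing scatterplot is to fit a linear model with lm() and pass the fit to abline(). The sketch below uses made-up vectors x and y rather than the state data, and the red color is an arbitrary choice.

```r
# Hypothetical toy data, not part of the lab
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)

# Scatterplot first, then overlay the least squares line
plot(x, y, xlab="x", ylab="y", main="Toy scatterplot with best fit line")
fit = lm(y ~ x)
abline(fit, col="red")

# Intercept and slope of the fitted line
coef(fit)
```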

Calculate, from state.df, the following: the gross income of the states (the amount of total money made in each state), named (say) state.income.gross, and then the income per square mile of the states (the total amount of money divided by the area of each state), named (say) state.income.area. Assign names to these variables to match the state names; display the first 5 entries of each. Report the income per capita, gross income, and income per square mile of Texas. Which state has the highest income per capita, which has the highest gross income, and which has the highest income per square mile?
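
The computation above relies on elementwise arithmetic between columns, on names() to label the result, and on which.max() to find the largest entry. Here is a small sketch on a hypothetical data frame toy.df, with made-up columns Pop, Inc, and Area:

```r
# Hypothetical toy data, not part of the lab
toy.df = data.frame(Pop=c(10, 200, 50), Inc=c(5, 3, 4), Area=c(100, 400, 25),
                    row.names=c("A", "B", "C"))

# Elementwise product of two columns, labeled by row name
gross = toy.df$Pop * toy.df$Inc
names(gross) = rownames(toy.df)

# Elementwise ratio; the names carry over from gross
per.area = gross / toy.df$Area

# Look up one entry by name
gross["B"]

# Name of the entry with the largest value
names(which.max(per.area))
```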

Hw6 Q2 (4 points). Recall in the “Data Frames” mini-lecture we saw that the apply() function could be used on columns (or rows, as well) of data frames. E.g., the code below calculates the maximum value of each of the first 8 numeric variables in state.df, which are, recall, just taken from the matrix state.x77.

head(state.x77) # We'll consider only the numeric variables

##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

max.vals = apply(state.x77, 2, max) # Compute the max of each column
max.vals

## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##    21198.0     6315.0        2.8       73.6       15.1       67.3 
##      Frost       Area 
##      188.0   566432.0

Using apply(), compute which state achieves the max value in each column, saving this as max.state. (Hint: this should only take one line of code.) Then compute the min value in each column, and which state achieves the min for each column, saving these as min.vals and min.state, respectively. Finally, create a new data frame, called state.extremes: it should have 8 rows, and 4 columns. The rows should be assigned the names of the first 8 numeric columns in state.df, and the columns should be assigned the names “Max.Value”, “Max.State”, “Min.Value”, and “Min.State”. The columns should be populated using max.vals, max.state, min.vals, and min.state. Display the entries of this new data frame.
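
To see how apply() can return row names instead of values, consider the sketch below. It works on a hypothetical matrix toy.mat with made-up row and column names, and shows only the maximum; it is an illustration of the idea, not the required answer.

```r
# Hypothetical toy matrix (filled column by column), not part of the lab
toy.mat = matrix(c(3, 9, 5,
                   7, 2, 8), nrow=3,
                 dimnames=list(c("A", "B", "C"), c("V1", "V2")))

# Largest value in each column
top.vals = apply(toy.mat, 2, max)

# Row name at which each column attains its largest value
top.rows = apply(toy.mat, 2, function(v) rownames(toy.mat)[which.max(v)])

# Collect both in a data frame, one row per column of toy.mat
toy.extremes = data.frame(Top.Value=top.vals, Top.Row=top.rows)
toy.extremes
```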

The strikes data frame

Below we read in data on the political economy of strikes. The data was collected by Bruce Western, in the Sociology Department at Harvard University, and cleaned by Cosma Shalizi from the Statistics Department here at CMU. We create a data frame strike.df of 625 rows x 8 columns: country, year, days on strike per 1000 workers, unemployment, inflation, leftwing share of government, centralization of unions, union density. You don’t have to do anything yet.

strike.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
class(strike.df)

## [1] "data.frame"

head(strike.df)

##     country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951           296          1.3      19.8            43.0
## 2 Australia 1952           397          2.2      17.2            43.0
## 3 Australia 1953           360          2.5       4.3            43.0
## 4 Australia 1954             3          1.7       0.7            47.0
## 5 Australia 1955           326          1.4       2.0            38.5
## 6 Australia 1956           352          1.8       6.3            38.5
##   centralization density
## 1      0.3748588      NA
## 2      0.3751829      NA
## 3      0.3745076      NA
## 4      0.3710170      NA
## 5      0.3752675      NA
## 6      0.3716072      NA

What are the unique countries represented in the strike.df data frame? How many years of data are there available for each country? (Hint: you should only need two lines of code here. For the second, consider using table().)
Define strike.df.canada to be a data frame of dimension 35 x 8, obtained by extracting the rows of strike.df that contain data on Canada. Display its first 5 rows and all 8 columns. Plot the unemployment rate in Canada versus the year, with appropriately labeled axes and title. What was the highest unemployment rate, and in what year did it occur?
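
The two tasks above use unique(), table(), row extraction with a logical index, and which.max(). A minimal sketch on a hypothetical data frame toy.df, with made-up columns country, year, and rate, might look like this:

```r
# Hypothetical toy data, not the strikes data
toy.df = data.frame(country=c("X", "X", "Y", "X", "Y"),
                    year=c(1951, 1952, 1951, 1953, 1952),
                    rate=c(2.0, 4.5, 1.0, 3.0, 1.5))

# Distinct groups, and the number of rows in each
unique(toy.df$country)
counts = table(toy.df$country)
counts

# Keep only the rows for one group (all columns)
x.df = toy.df[toy.df$country == "X", ]

# Largest value in the subset, and the year in which it occurs
max(x.df$rate)
peak.year = x.df$year[which.max(x.df$rate)]
peak.year
```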

Hw6 Q3 (5 points). Write a function called country.var.summary() that takes the following inputs: strike.df, the strikes data frame; where, a string giving the name of a country that appears in the strikes data frame, with a default value of “USA”; what, a string giving the name of a variable that appears in the strikes data frame, in columns 3 through 8, with a default value of “strike.volume”; and plot.it, a boolean signaling whether we should produce a plot, with a default value of TRUE. As a side effect, if plot.it is TRUE, then the function should produce a plot of the specified variable versus the year, for the specified country. The labels and title should be set appropriately. The output of the function should be a vector of summary statistics on the specified variable, in the specified country, as computed by summary(). As an example, your function should produce the same plot as in the last question when where="Canada" and what="unemployment", and its output should be as follows.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.400   4.350   5.600   6.043   7.250  11.800

After you’ve written it, use your function to produce plots and summaries of the strike volume in France and the US. Then use it to produce plots and summaries of the unemployment rate in Italy and Germany. Then use it to produce summaries (no plots) of the inflation rate in Denmark and Finland.
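
For reference, the general pattern of a function with default arguments, an optional plotting side effect, and a returned summary can be sketched as follows. The function col.summary() and the data frame toy.df are hypothetical and deliberately simpler than country.var.summary(): there is no country argument, and the labels are generic.

```r
# Hypothetical helper, not the required country.var.summary()
col.summary = function(df, what="rate", plot.it=TRUE) {
  vals = df[[what]]
  # Side effect: draw the chosen column against year when requested
  if (plot.it) {
    plot(df$year, vals, type="l", xlab="Year", ylab=what,
         main=paste(what, "over time"))
  }
  # Return value: summary statistics of the chosen column
  return(summary(vals))
}

# Hypothetical toy data, not the strikes data
toy.df = data.frame(year=c(1951, 1952, 1953), rate=c(2, 4, 6), other=c(9, 8, 7))

col.summary(toy.df)                               # defaults: what="rate", with a plot
col.summary(toy.df, what="other", plot.it=FALSE)  # summary only, no plot
```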

Lab 7m: Data Frames

Statistical Computing, 36-350

Monday October 10, 2016
