Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 6 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 6 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 16. This document contains 15 of the 45 total points for Homework 6.
(To learn more about the built-in state data sets used here, type ?state in your console.)
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
head(state.df)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area Region Division
## Alabama 50708 South East South Central
## Alaska 566432 West Pacific
## Arizona 113417 West Mountain
## Arkansas 51945 South West South Central
## California 156361 West Pacific
## Colorado 103766 West Mountain
Add a column to state.df, containing the state abbreviations that are stored in the (existing) vector state.abb. Name this column Abbr. You can do this in (at least) two ways: by using a call to data.frame(), or by directly defining state.df$Abbr. Display the first 3 rows and all 11 columns of state.df.
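As a rough illustration of the two approaches named above, the sketch below rebuilds the data frame from the built-in state data sets and adds the column both ways. The names state.df.a and state.df.b are hypothetical, used only so the two results can be compared side by side.

```r
# Rebuild the data frame from the built-in state data sets
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

# Way 1: pass the existing data frame plus the new column to data.frame()
state.df.a = data.frame(state.df, Abbr=state.abb)

# Way 2: define the new column directly with $
state.df.b = state.df
state.df.b$Abbr = state.abb

# Both versions now have 11 columns; show the first 3 rows of one of them
head(state.df.a, 3)
dim(state.df.b)
```

Depending on your R version, data.frame() may store the new column as a factor while $ stores it as a character vector, so the two results need not be strictly identical.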
Remove the Region column from state.df. You can do this in (at least) two ways: by using negative indexing, or by directly setting state.df$Region to be NULL. Display the first 3 rows and all 10 columns of state.df.
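The two removal strategies can be sketched as follows. This sketch starts from the original 10-column data frame (without Abbr), so 9 columns remain here; state.df.a and state.df.b are hypothetical names used only for comparison.

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

# Way 1: negative indexing, dropping the column whose name is "Region"
state.df.a = state.df[, -which(names(state.df) == "Region")]

# Way 2: set the column to NULL
state.df.b = state.df
state.df.b$Region = NULL

# Region is gone from both versions
names(state.df.a)
head(state.df.b, 3)
```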
Add two columns to state.df, containing the x and y coordinates (longitude and latitude) of the center of the states, that are stored in the (existing) list state.center. (Hint: take a look at this list in the console, to see what its elements are named.) Name these two columns Center.x and Center.y. Display the first 3 rows and all 12 columns of state.df.
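One possible way to follow the hint is sketched below: inspect the element names of the list, then copy each element into its own column. The data frame here is rebuilt from scratch for illustration, so its column count differs from the one in the exercise.

```r
# Look at how the list elements are named
names(state.center)

# Illustrative data frame built from the built-in state data
state.df = data.frame(state.x77, Division=state.division)

# Copy each list element into a new column
state.df$Center.x = state.center$x
state.df$Center.y = state.center$y

head(state.df[, c("Center.x", "Center.y")], 3)
```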
Hw6 Q1 (6 points). Plot the state centers in state.df, i.e., plot the Center.y column (on the y-axis) versus the Center.x column (on the x-axis). Use the regular point type, but set cex=5 to get very large empty circles. Then, in the center of these empty circles, draw the state abbreviations. (Hint: use text().) Label the axes and title the plot appropriately.
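To show how text() layers labels on top of an existing plot, here is a small sketch on made-up points. The data frame pts and its labels are hypothetical and are not the state data.

```r
# Toy data (hypothetical, not the state data)
pts = data.frame(x=c(1, 2, 3), y=c(2, 1, 3), lab=c("AA", "BB", "CC"))

# Default point type with a large cex gives big empty circles
plot(pts$x, pts$y, cex=5, xlab="x coordinate", ylab="y coordinate",
     main="Toy labeled points")

# text() draws each label centered at the matching (x, y) location
text(pts$x, pts$y, labels=pts$lab)
```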
Now let’s do something more interesting with colors. Plot the state centers, with cex=5 again, but this time use filled circles, with colors that reflect the values in the Frost column. The highest Frost value should be assigned a light blue color, and the lowest Frost value a pink color, with appropriate interpolation of colors in between. (Hint: recall colorRampPalette(), and the function get.col.from.val(), from the “Curves, Surfaces, and Colors” mini-lecture.) Then, again, in the center of these circles, draw the state abbreviations, label the axes, and title the plot appropriately. Does the plot make sense to you, i.e., do you see an expected geographic pattern in where Frost (the average number of days with minimum temperature below freezing) tends to be highest?
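The idea of mapping numeric values to interpolated colors can be sketched as below, using base R's colorRampPalette(). The helper val.to.col() is a hypothetical stand-in written for this sketch; it is not the mini-lecture's get.col.from.val(), whose definition may differ. It assumes the values are not all equal.

```r
# Hypothetical helper (NOT the lecture's get.col.from.val()): assign each
# value in vals one of the colors in col.vec, by linear binning over the
# range of vals. Assumes max(vals) > min(vals).
val.to.col = function(vals, col.vec) {
  n = length(col.vec)
  rng = range(vals)
  idx = 1 + floor((n - 1) * (vals - rng[1]) / (rng[2] - rng[1]))
  col.vec[idx]
}

# 50 colors interpolated from pink (low values) to light blue (high values)
pal = colorRampPalette(c("pink", "lightblue"))(50)

# Toy values (hypothetical); pch=19 gives filled circles
vals = c(0, 10, 25, 50)
cols = val.to.col(vals, pal)
plot(1:4, vals, pch=19, cex=5, col=cols, xlab="Index", ylab="Value",
     main="Toy points colored by value")
```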
What is the illiteracy percentage for Pennsylvania? The illiteracy percentage for West Virginia? Display the illiteracy percentages for all states in the Middle Atlantic division, and for all states in the South Atlantic division. Finally, report the median illiteracy percentage for all states in the Middle Atlantic division, and for all states in the South Atlantic division. (Hint: your entire answer to this part should be 6 lines of code.)
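The indexing patterns involved (by row name, and by a logical condition on a factor column) can be sketched on a different variable and division than the ones asked about; Ohio, Income, and East North Central are chosen here purely for illustration.

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

# Index a single entry by row name and column name
state.df["Ohio", "Income"]

# Logical vector marking the states in one division
in.enc = state.df$Division == "East North Central"

# Use it to pull out a column for just those states, then summarize
state.df[in.enc, "Income"]
median(state.df[in.enc, "Income"])
```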
Plot, using state.df, the life expectancy versus average number of days with frost, across the states. Also plot the life expectancy versus the murder rate of the states. In each case, use appropriate axis labels. Do you notice a trend, in each case?
Hw6 Bonus. In each of the last two plots, add the line of best fit.
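One common way to add a line of best fit is to fit a linear model with lm() and pass it to abline(). The sketch below uses the built-in cars data set rather than the state data, purely as an illustration of the pattern.

```r
# Scatter plot of the built-in cars data
plot(cars$speed, cars$dist, xlab="Speed (mph)", ylab="Stopping distance (ft)",
     main="Stopping distance versus speed")

# Least squares fit of dist on speed
fit = lm(dist ~ speed, data=cars)

# abline() reads the intercept and slope from the fitted model
abline(fit, col="red")
```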
Compute, using state.df, the following: the gross income of the states (the total amount of money made in each state), named (say) state.income.gross, and then the income per square mile of the states (the total amount of money divided by the area of each state), named (say) state.income.area. Assign names to these variables to match the state names; display the first 5 entries of each. Report the income per capita, gross income, and income per square mile of Texas. Which state has the highest income per capita, which has the highest gross income, and which has the highest income per square mile?
Hw6 Q2 (4 points). Recall in the “Data Frames” mini-lecture we saw that the apply() function could be used on columns (or rows, as well) of data frames. E.g., the code below calculates the maximum value of each of the first 8 numeric variables in state.df, which are, recall, just taken from the matrix state.x77.
head(state.x77) # We'll consider only the numeric variables
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
max.vals = apply(state.x77, 2, max) # Compute the max of each column
max.vals
## Population Income Illiteracy Life Exp Murder HS Grad
## 21198.0 6315.0 2.8 73.6 15.1 67.3
## Frost Area
## 188.0 566432.0
Using apply(), compute which state achieves the max value in each column, saving this as max.state. (Hint: this should only take one line of code.) Then compute the min value in each column, and which state achieves the min for each column, saving these as min.vals and min.state, respectively. Finally, create a new data frame, called state.extremes: it should have 8 rows, and 4 columns. The rows should be assigned the names of the first 8 numeric columns in state.df, and the columns should be assigned the names “Max.Value”, “Max.State”, “Min.Value”, and “Min.State”. The columns should be populated using max.vals, max.state, min.vals, and min.state. Display the entries of this new data frame.
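The pattern of finding which row attains a column's maximum, and collecting the results in a data frame, might look like the following on a toy matrix. The matrix m and the names top.row and extremes are hypothetical, and this is only one of several ways to do it.

```r
# Toy matrix with named rows and columns (filled column by column)
m = matrix(c(3, 9, 1,
             8, 2, 5), nrow=3,
           dimnames=list(c("a", "b", "c"), c("u", "v")))

# which.max gives the row index of the max in each column;
# use it to look up the corresponding row name
top.row = rownames(m)[apply(m, 2, which.max)]

# Collect values and row names, one row per column of m
extremes = data.frame(Max.Value=apply(m, 2, max), Max.Row=top.row,
                      row.names=colnames(m))
extremes
```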
The code below reads in a data frame strike.df of 625 rows x 8 columns: country, year, days on strike per 1000 workers, unemployment, inflation, left-wing share of government, centralization of unions, union density. You don’t have to do anything yet.
strike.df = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/strikes.csv")
class(strike.df)
## [1] "data.frame"
head(strike.df)
## country year strike.volume unemployment inflation left.parliament
## 1 Australia 1951 296 1.3 19.8 43.0
## 2 Australia 1952 397 2.2 17.2 43.0
## 3 Australia 1953 360 2.5 4.3 43.0
## 4 Australia 1954 3 1.7 0.7 47.0
## 5 Australia 1955 326 1.4 2.0 38.5
## 6 Australia 1956 352 1.8 6.3 38.5
## centralization density
## 1 0.3748588 NA
## 2 0.3751829 NA
## 3 0.3745076 NA
## 4 0.3710170 NA
## 5 0.3752675 NA
## 6 0.3716072 NA
What are the unique countries represented in the strike.df data frame? How many years of data are there available for each country? (Hint: you should only need two lines of code here. For the second, consider using table().)
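To see what unique() and table() return, here is a sketch on a small made-up data frame; toy and its contents are hypothetical and much smaller than the strikes data.

```r
# Toy data (hypothetical): one row per country-year pair
toy = data.frame(country=c("A", "A", "B", "A", "B", "C"),
                 year=c(1951, 1952, 1951, 1953, 1952, 1951))

# Distinct values appearing in the column
unique(toy$country)

# Number of rows (here, years) for each distinct value
table(toy$country)
```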
Define strike.df.canada to be a data frame of dimension 35 x 8, obtained by extracting the rows of strike.df that contain data from Canada. Display its first 5 rows and all 8 columns. Plot the unemployment rate in Canada versus the year, with appropriately labeled axes and title. What was the highest unemployment rate and in what year did it occur?
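The row extraction and "in what year did the max occur" steps can be sketched on a toy data frame as follows; toy, toy.a, and the column rate are hypothetical names standing in for the real data.

```r
# Toy data (hypothetical)
toy = data.frame(country=c("A", "A", "A", "B", "B"),
                 year=c(1951, 1952, 1953, 1951, 1952),
                 rate=c(2.0, 3.5, 1.0, 4.0, 0.5))

# Keep only the rows for one country (all columns)
toy.a = toy[toy$country == "A", ]

plot(toy.a$year, toy.a$rate, type="o", xlab="Year", ylab="Rate",
     main="Rate in country A by year")

# Largest value, and the year in which it occurred
max(toy.a$rate)
toy.a$year[which.max(toy.a$rate)]
```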
Hw6 Q3 (5 points). Write a function called country.var.summary() that takes the following inputs: strike.df, the strikes data frame; country, a string giving the name of a country that appears in the strikes data frame, with a default value of “USA”; var, a string giving the name of a variable that appears in the strikes data frame, in columns 3 through 8, with a default value of “strike.volume”; and plot.it, a boolean signaling whether we should produce a plot, with a default value of TRUE. As a side effect, if plot.it is TRUE, then the function should produce a plot of the specified variable versus the year, for the specified country. The labels and title should be set appropriately. The output of the function should be a vector of summary statistics on the specified variable, in the specified country, as computed by summary(). As an example, your function should produce the same plot as in the last question when country="Canada" and var="unemployment", and its output should be as follows.
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.400 4.350 5.600 6.043 7.250 11.800
After you’ve written it, use your function to produce plots and summaries of the strike volume in France and the US. Then use it to produce plots and summaries of the unemployment rate in Italy and Germany. Then use it to produce summaries (no plots) of the inflation rate in Denmark and Finland.
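As a generic illustration of default arguments combined with an optional plotting side effect, consider the toy function below. The function group.summary(), its arguments which.group and which.var, and the data frame toy are all hypothetical; this is a sketch of the pattern, not the required function.

```r
# Toy data (hypothetical)
toy = data.frame(group=c("A", "A", "A", "B", "B"),
                 year=c(1951, 1952, 1953, 1951, 1952),
                 score=c(2.0, 3.5, 1.0, 4.0, 0.5))

# Summarize one variable within one group; optionally plot it against year
group.summary = function(df, which.group="A", which.var="score", plot.it=TRUE) {
  rows = df[df$group == which.group, ]
  if (plot.it) {
    plot(rows$year, rows[, which.var], type="o", xlab="Year", ylab=which.var,
         main=paste(which.var, "in group", which.group))
  }
  summary(rows[, which.var])
}

# Defaults: group "A", variable "score", with a plot
group.summary(toy)

# Override the defaults and suppress the plot
group.summary(toy, which.group="B", plot.it=FALSE)
```

Returning the result of summary() as the last expression makes it the function's output, while the plot is produced only when plot.it is TRUE.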