Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 6 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 6 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 16. This document contains 14 of the 45 total points for Homework 6.
Consider the state.x77 matrix below, which, recall, is a matrix of 50 states x 8 numeric variables. Create a data frame out of this matrix, called state.df, by stacking the factors state.region and state.division onto its columns, and naming these two new columns appropriately. Using your new data frame, compute the average income in each of the 4 regions of the US. (Hint: use tapply().) Which region has the largest? Compute the average income in each of the 9 divisions of the US. Which division has the largest?
head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
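As a reminder of how the two ingredients in this question behave, here is a minimal sketch on toy data. All names and values in it (toy.mat, toy.df, grp, the numbers) are made up for illustration and are not part of the lab data. It shows data.frame() attaching a factor to a matrix, and tapply() computing a per-group average.

```r
# Toy numeric matrix (hypothetical values), with a column name containing a space
toy.mat <- matrix(c(80, 90, 70, 60, 85, 95, 1, 2, 3, 4, 5, 6), nrow = 6,
                  dimnames = list(NULL, c("score", "id num")))
grp <- factor(c("a", "b", "a", "c", "b", "c"))

# data.frame() keeps the factor as a factor; by default it also converts
# column names to syntactic ones, so "id num" becomes "id.num"
toy.df <- data.frame(toy.mat, group = grp)
names(toy.df)

# tapply(X, INDEX, FUN) applies FUN to the entries of X within each level of INDEX
avg.by.group <- tapply(toy.df$score, toy.df$group, mean)
avg.by.group

# Name of the group with the largest average
names(which.max(avg.by.group))
```

Note that the same renaming applies to any matrix column names that contain spaces, which is worth checking with names() after building a data frame.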
Write a function called grad.by.lit.median(), which takes a single input: df, a data frame that is expected to have the same 10 column types and column names as the state.df data frame. This function should compute, for each row in the data frame df, the percentage of high school graduates divided by the percentage of literate (note: literate, not illiterate) individuals, times 100. (Hint: you should be able to do this with simple vectorization; you shouldn’t need a for() loop.) Let’s call this the “graduate-by-literate” percentage; think of it as, roughly, the percentage of literate individuals who graduated high school. Your function should then return the median of these computed graduate-by-literate percentages. Check that grad.by.lit.median(state.df) gives 53.59844.
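To illustrate what "simple vectorization" means in the hint above, here is a small sketch on a toy data frame. The data frame toy.df, its columns passed and absent, and the function median.ratio() are all hypothetical and chosen only for illustration; this is not the function asked for above.

```r
# Toy data frame (hypothetical columns, both measured in percent)
toy.df <- data.frame(passed = c(40, 60, 75), absent = c(20, 0, 25))

# Arithmetic on whole columns works elementwise, so no for() loop is needed:
# each row's ratio is passed / (100 - absent), expressed as a percentage
ratio <- toy.df$passed / (100 - toy.df$absent) * 100
ratio

# The same computation wrapped in a function that returns the median ratio
median.ratio <- function(df) {
  median(df$passed / (100 - df$absent) * 100)
}
median.ratio(toy.df)
```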
Split the rows of the data frame state.df by division, and call the resulting list state.df.by.div, having length 9, with one element per division. To be clear, each element of this list should be a data frame whose rows have been extracted from state.df and correspond to states in a particular division. (Hint: use the split() function, as in the mini-lecture “More Apply”.) Then display just the first 2 rows of each data frame in the list state.df.by.div. (Hint: use lapply().)
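The split() and lapply() hints above can be sketched on a toy data frame as follows. The names toy.df and toy.by.group and their values are hypothetical, used only to show the shape of the result.

```r
# Toy data frame (hypothetical values) with a grouping factor
toy.df <- data.frame(value = 1:6,
                     group = factor(c("a", "b", "a", "b", "a", "b")))

# split() returns a list with one data frame per level of the factor
toy.by.group <- split(toy.df, toy.df$group)
length(toy.by.group)
names(toy.by.group)

# lapply() passes extra arguments on to the function, here n = 2 for head()
lapply(toy.by.group, head, 2)
```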
Hw6 Q7 (2 points). For each division, compute and display the median graduate-by-literate percentage. (Hint: use state.df.by.div and sapply().) Which division has the highest median graduate-by-literate percentage?
Hw6 Q8 (2 points). For each division, compute and display the median HS graduation percentage. Do so using sapply() on state.df.by.div, with the FUN input defined “on-the-fly”. Are these percentages generally higher or lower than the median graduate-by-literate percentages, and are you surprised by this result? Which division has the highest median HS graduation percentage?
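Defining FUN "on-the-fly" means passing an anonymous function directly in the call to sapply(), rather than naming it beforehand. A minimal sketch on toy data (toy.df, toy.by.group, and the values are hypothetical):

```r
# Toy list of data frames (hypothetical values), built with split()
toy.df <- data.frame(value = c(1, 5, 3, 10, 20, 60),
                     group = factor(rep(c("a", "b"), each = 3)))
toy.by.group <- split(toy.df, toy.df$group)

# The function is written inline; d stands for one data frame in the list
med.by.group <- sapply(toy.by.group, function(d) median(d$value))
med.by.group
```

Because each call returns a single number, sapply() simplifies the result to a named numeric vector, with names taken from the list.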
The code below reads in a data table of men’s 100m sprint times, saving it as a data frame sprint.df. Note that this is indeed a data frame. (You don’t have to do anything yet.)
sprint.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
  sep="\t", quote="", header=TRUE)
class(sprint.df)
## [1] "data.frame"
head(sprint.df)
##   Rank Time Wind        Name Country Birthdate     City       Date
## 1    1 9.58  0.9  Usain Bolt     JAM  21.08.86   Berlin 16.08.2009
## 2    2 9.63  1.5  Usain Bolt     JAM  21.08.86   London 05.08.2012
## 3    3 9.69  0.0  Usain Bolt     JAM  21.08.86  Beijing 16.08.2008
## 4    3 9.69  2.0   Tyson Gay     USA  09.08.82 Shanghai 20.09.2009
## 5    3 9.69 -0.1 Yohan Blake     JAM  26.12.89 Lausanne 23.08.2012
## 6    6 9.71  0.9   Tyson Gay     USA  09.08.82   Berlin 16.08.2009
Extract the last four digits of each entry of the Date column. (Hint: you will have to use as.character() to convert to character type.) Append this as a column to sprint.df, under the name Year, and display the first 5 rows and all (now) 9 columns of sprint.df. Then, check: what is the class of the newly created Year column?
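The as.character() hint matters because a column read by read.table() may be stored as a factor (this was the default before R 4.0.0), and string functions expect character input. A small sketch on a toy factor, where dates and its values are hypothetical stand-ins:

```r
# Toy factor of dates in dd.mm.yyyy format (hypothetical values)
dates <- factor(c("16.08.2009", "05.08.2012", "16.08.2008"))

# Convert to character first, then take the last four characters of each entry
dates.chr <- as.character(dates)
n <- nchar(dates.chr)
years <- substring(dates.chr, n - 3, n)
years

# The result is a character vector, not numeric
class(years)
```

If numeric years are needed later (for example, on a plot axis), as.numeric() can convert the extracted strings.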
Compute, from sprint.df and the newly created Year column, the median 100m sprint time in each year of the data frame. (Hint: use tapply().) Call the resulting vector med.time.by.year, and plot its entries versus the years, with appropriate axis labels and an appropriate title. Does it look like the median sprint time has roughly gone down, or gone up, over the years?
Hw6 Q9 (2 points). Compute, from sprint.df and the newly created Year column, the fastest 100m sprint time in each year of the data frame, calling the result fast.time.by.year. Plot this by year, as in the last question. Has the fastest sprint time roughly gone down, or gone up, over the years?
Hw6 Bonus. Given a set of x,y pairs, the greatest convex minorant is defined as the biggest convex function that lies below the graph of the x,y pairs. We call particular x,y points extreme points if the greatest convex minorant passes through these points. Looking at your plot from the last question (so that x denotes years and y the fastest sprint times), which points are extreme points? You can answer this visually, or programmatically. Which runner was responsible for such times, and what does that roughly suggest about their performances?
Hw6 Q10 (8 points). Read in the data table at “http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.w.dat”, on 2018 women’s 100m sprint times, saving it as a data frame sprint.w.df. Repeat the steps leading up through Hw6 Q9, to produce fast.w.time.by.year, a vector with the fastest women’s sprint time in each year, and plot this by year. Does it look like the fastest sprint time for women has roughly gone down, or gone up, over the years?
Finally, produce a single plot that shows both the trends fast.time.by.year by year and fast.w.time.by.year by year. Make sure to set the x and y limits so that all points are visible. Use different colors for the men’s times and the women’s times, and draw a legend indicating what is what. Label the axes and title the plot appropriately.
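One way to make sure all points are visible is to compute the axis limits from both series with range() before plotting. The sketch below uses two toy named vectors, series.a and series.b, with hypothetical values that are not taken from the sprint data; the names play the role of years, as they would in the output of tapply().

```r
# Toy named vectors (hypothetical values); the names are the years
series.a <- c("2000" = 9.9, "2001" = 9.8, "2003" = 9.7)
series.b <- c("2001" = 10.8, "2002" = 10.9, "2003" = 10.7)

# Names are character strings, so convert them to numbers for the x axis
years.a <- as.numeric(names(series.a))
years.b <- as.numeric(names(series.b))

# Limits that cover both series
x.lim <- range(years.a, years.b)
y.lim <- range(series.a, series.b)

# Draw the first series, then add the second to the same axes
plot(years.a, series.a, type = "o", col = "blue", xlim = x.lim, ylim = y.lim,
     xlab = "Year", ylab = "Value", main = "Two toy series by year")
lines(years.b, series.b, type = "o", col = "red")
legend("right", legend = c("Series A", "Series B"),
       col = c("blue", "red"), lty = 1, pch = 1)
```

Without the explicit xlim and ylim, plot() would size the axes to the first series only, and points from the second series could fall outside the plotting region.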