Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 6 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 6 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 16. This document contains 14 of the 45 total points for Homework 6.

Back to the states data frame

For your convenience, we preview the state.x77 matrix below, which, recall, is a matrix of 50 states x 8 numeric variables. Create a data frame out of this matrix, called state.df, by appending the factors state.region and state.division as two additional columns, and naming these two new columns appropriately. Using your new data frame, compute the average income in each of the 4 regions of the US. (Hint: use tapply().) Which region has the largest? Compute the average income in each of the 9 divisions of the US. Which division has the largest?

head(state.x77)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766
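
As a minimal sketch of the tapply() pattern hinted at above, the code below works on a small made-up matrix and factor rather than on state.x77. The names m, grp, toy.df, and avg.by.grp are hypothetical and only illustrate the mechanics.

```r
# Toy matrix with row names, standing in for state.x77 (values are made up)
m = matrix(c(10, 20, 30, 40,
             1,  2,  3,  4), nrow=4,
           dimnames=list(c("a","b","c","d"), c("Income","Other")))

# Toy factor with one entry per row, standing in for state.region
grp = factor(c("East","West","East","West"))

# data.frame() converts the matrix and appends the factor as a named column
toy.df = data.frame(m, Group=grp)

# Mean of Income within each level of Group
avg.by.grp = tapply(toy.df$Income, toy.df$Group, mean)
avg.by.grp

# Name of the group with the largest mean
names(which.max(avg.by.grp))
```

One thing to keep in mind: with its default settings, data.frame() turns column names that contain spaces into syntactic names, so a column such as "HS Grad" would be accessed as HS.Grad afterwards.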

Write a function called grad.by.lit.median(), which takes a single input: df, a data frame that is expected to have the same 10 column types and column names as the state.df data frame. This function should compute, for each row in the data frame df, the percentage of high school graduates divided by the percentage of literate (note: literate, not illiterate) individuals, times 100. (Hint: you should be able to do this with simple vectorization; you shouldn’t need a for() loop.) Let’s call this the “graduate-by-literate” percentage. Think of this as, roughly, the percentage of literate individuals who graduated high school. Your function should then return the median of these computed graduate-by-literate percentages. Check that grad.by.lit.median(state.df) gives 53.59844.
Split the rows of the data frame state.df by division, and call the resulting list state.df.by.div, having length 9, with one element per division. To be clear, each element of this list should be a data frame whose rows have been extracted from state.df and correspond to states in a particular division. (Hint: use the split() function, as in the mini-lecture “More Apply”.) Then display just the first 2 rows of each data frame in the list state.df.by.div. (Hint: use lapply().)
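
The two ideas above, vectorized arithmetic inside a function and splitting a data frame by a factor, can be sketched on made-up data as follows. Everything here (toy.df, its columns, ratio.median(), toy.by.grp) is a hypothetical stand-in, not part of the lab data.

```r
# Toy data frame (made-up values), standing in for state.df
toy.df = data.frame(Pass=c(50, 60, 80, 90),
                    Fail.Pct=c(0, 20, 20, 10),
                    Group=factor(c("East","West","East","West")))

# Hypothetical helper: arithmetic on whole columns gives one ratio per row,
# so no for() loop is needed; the function returns the median of the ratios
ratio.median = function(df) {
  ratio = df$Pass / (100 - df$Fail.Pct) * 100
  return(median(ratio))
}
ratio.median(toy.df)

# split() returns a list with one data frame per level of the factor
toy.by.grp = split(toy.df, toy.df$Group)
length(toy.by.grp)

# lapply() applies head() to each element; extra arguments are passed along
lapply(toy.by.grp, head, 1)
```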

Hw6 Q7 (2 points). For each division, compute and display the median graduate-by-literate percentage. (Hint: use state.df.by.div and sapply().) Which division has the highest median graduate-by-literate percentage?

Hw6 Q8 (2 points). For each division, compute and display the median HS graduation percentage. Do so using sapply() on state.df.by.div, with the FUN input defined “on-the-fly”. Are these percentages generally higher or lower than the median graduate-by-literate percentages, and are you surprised by this result? Which division has the highest median HS graduation percentage?
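
For reference, "on-the-fly" here means passing an anonymous function directly as the FUN argument instead of defining a named function first. A minimal sketch on made-up data (toy.df, toy.by.grp, and med.pass are hypothetical names):

```r
# Toy data frame (made-up values), split into a list by a factor
toy.df = data.frame(Pass=c(50, 60, 80, 90),
                    Group=factor(c("East","West","East","West")))
toy.by.grp = split(toy.df, toy.df$Group)

# The function is written inline; sapply() calls it once per list element
# and simplifies the results into a named vector
med.pass = sapply(toy.by.grp, function(df) { median(df$Pass) })
med.pass
```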

The sprints data frame

Below, we read in a data table on 2829 men’s 100m sprint times, saved as sprint.df. Note that this is indeed a data frame. (You don’t have to do anything yet.)

sprint.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
                       sep="\t", quote="", header=TRUE)
class(sprint.df)

## [1] "data.frame"

head(sprint.df)

##   Rank Time Wind        Name Country Birthdate     City       Date
## 1    1 9.58  0.9  Usain Bolt     JAM  21.08.86   Berlin 16.08.2009
## 2    2 9.63  1.5  Usain Bolt     JAM  21.08.86   London 05.08.2012
## 3    3 9.69  0.0  Usain Bolt     JAM  21.08.86  Beijing 16.08.2008
## 4    3 9.69  2.0   Tyson Gay     USA  09.08.82 Shanghai 20.09.2009
## 5    3 9.69 -0.1 Yohan Blake     JAM  26.12.89 Lausanne 23.08.2012
## 6    6 9.71  0.9   Tyson Gay     USA  09.08.82   Berlin 16.08.2009

Extract the last four digits of each entry of the Date column. (Hint: you will have to use as.character() to convert to character type.) Append this as a column to sprint.df, under the name Year, and display the first 5 rows and all (now) 9 columns of sprint.df. Then, check: what is the class of the newly created Year column?
Compute, from sprint.df and the newly created Year column, the median 100m sprint time in each year of the data frame. (Hint: use tapply().) Call the resulting vector med.time.by.year, and plot its entries versus the years, with appropriate axes labels and an appropriate title. Does it look like the median sprint time has roughly gone down, or gone up, over the years?
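
One possible way to approach the two steps above is sketched here on a made-up data frame; toy.df and its values are hypothetical and only mimic the dd.mm.yyyy date layout shown in the preview.

```r
# Toy data frame (made-up values) with dates in dd.mm.yyyy form
toy.df = data.frame(Time=c(9.9, 10.1, 9.8, 10.0),
                    Date=c("01.06.2001", "15.07.2001",
                           "03.08.2002", "20.08.2002"))

# as.character() guards against the column having been read in as a factor
date.chr = as.character(toy.df$Date)

# substr() is vectorized: take the last four characters of each string
n = nchar(date.chr)
toy.df$Year = substr(date.chr, n-3, n)
class(toy.df$Year)

# Median time within each year; the result is named by year
med = tapply(toy.df$Time, toy.df$Year, median)
med

# The names are strings, so convert them before using them as x-coordinates
years = as.numeric(names(med))
```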

Hw6 Q9 (2 points). Compute, from sprint.df and the newly created Year column, the fastest 100m sprint time in each year of the data frame, calling the result fast.time.by.year. Plot this by year, as in the last question. Has the fastest sprint time roughly gone down, or gone up, over the years?

Hw6 Bonus. Given a set of x,y pairs, the greatest convex minorant is defined as the biggest convex function that lies below the graph of the x,y pairs. We call particular x,y points extreme points if the greatest convex minorant passes through these points. Looking at your plot from the last question (so that x denotes years and y the fastest sprint times), which points are extreme points? You can answer this visually, or programmatically. Which runner was responsible for such times, and what does that roughly suggest about their performances?
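
To make the definition concrete: for a finite set of points, the greatest convex minorant is the piecewise-linear lower boundary of their convex hull. The sketch below finds the points on that boundary for a few made-up points, using a standard left-to-right scan; lower.hull() is a hypothetical helper, and it assumes x is sorted in strictly increasing order.

```r
# Hypothetical helper: indices of the points that the lower convex hull
# passes through, assuming x is sorted in strictly increasing order
lower.hull = function(x, y) {
  hull = integer(0)
  for (i in seq_along(x)) {
    # Drop the last kept point while it lies strictly above the chord
    # from the second-to-last kept point to point i; collinear points
    # are kept, since the minorant still passes through them
    while (length(hull) >= 2) {
      a = hull[length(hull)-1]
      b = hull[length(hull)]
      cross = (x[b]-x[a])*(y[i]-y[a]) - (y[b]-y[a])*(x[i]-x[a])
      if (cross < 0) hull = hull[-length(hull)] else break
    }
    hull = c(hull, i)
  }
  return(hull)
}

# Made-up points: the third one sits above the chord joining its neighbors
x = 1:5
y = c(3, 1, 2, 0, 4)
ext = lower.hull(x, y)
ext
```

The slopes between consecutive kept points increase from left to right, which is what convexity of the minorant requires.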

Hw6 Q10 (8 points). Read in the data table at “http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.w.dat”, on 2018 women’s 100m sprint times, saving it as a data frame sprint.w.df. Repeat the steps leading up through Hw6 Q9, to produce fast.w.time.by.year, a vector with the fastest women’s sprint time in each year, and plot this by year. Does it look like the fastest sprint time for women has roughly gone down, or gone up, over the years?

Finally, produce a single plot that shows both trends: fast.time.by.year by year and fast.w.time.by.year by year. Make sure to set the x and y limits so that all points are visible. Use different colors for the men’s times and the women’s times, and draw a legend indicating which is which. Label the axes and title the plot appropriately.
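
A rough sketch of the plotting mechanics, using two made-up series in place of the real results (years.m, t.m, years.w, t.w, and the colors are all hypothetical choices):

```r
# Made-up series standing in for the men's and women's fastest times
years.m = 2000:2004
t.m = c(9.9, 9.8, 9.85, 9.8, 9.7)
years.w = 2001:2005
t.w = c(10.9, 10.8, 10.85, 10.7, 10.75)

# Limits computed over both series, so that no point falls outside the plot
x.lim = range(c(years.m, years.w))
y.lim = range(c(t.m, t.w))

# First series sets up the axes; the second is added on top with points()
plot(years.m, t.m, type="o", col="blue", xlim=x.lim, ylim=y.lim,
     xlab="Year", ylab="Time (seconds)", main="Toy example: two series")
points(years.w, t.w, type="o", col="red")

# Legend entries are matched to colors by position
legend("right", legend=c("Series 1", "Series 2"),
       col=c("blue", "red"), lty=1, pch=1)
```

Computing the limits from the combined data, rather than hard-coding them, keeps the plot correct if either series changes.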

Lab 7f: More on Apply

Statistical Computing, 36-350

Friday October 14, 2016
