Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 8 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 8 questions from other labs. Your homework writeup must start the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 11:59pm on Tuesday November 1. This document contains 30 of the 45 total points for Homework 8.

Reading in, converting data

sprint.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
                       sep="\t", header=TRUE, quote="")
sprint.w.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.w.dat",
                       sep="\t", header=TRUE, quote="")
head(sprint.df, 3)
##   Rank Time Wind       Name Country Birthdate    City       Date
## 1    1 9.58  0.9 Usain Bolt     JAM  21.08.86  Berlin 16.08.2009
## 2    2 9.63  1.5 Usain Bolt     JAM  21.08.86  London 05.08.2012
## 3    3 9.69  0.0 Usain Bolt     JAM  21.08.86 Beijing 16.08.2008
head(sprint.w.df, 3)
##   Rank  Time Wind                     Name Country Birthdate         City
## 1    1 10.49  0.0 Florence Griffith-Joyner     USA  21.12.59 Indianapolis
## 2    2 10.61 +1.2 Florence Griffith-Joyner     USA  21.12.59 Indianapolis
## 3    3 10.62 +1.0 Florence Griffith-Joyner     USA  21.12.59        Seoul
##         Date
## 1 16.07.1988
## 2 17.07.1988
## 3 24.09.1988
wind.measurements = as.factor(sample(seq(-2.0, 2.0, by=0.1), 20))
wind.measurements
##  [1] -1.4 -1.1 1.2  -2   0.2  2    -1.6 -1.2 -0.4 1.5  -1.8 -0.7 -0.5 1.9 
## [15] 1.7  -1.3 0.9  -0.3 1.3  -1  
## 20 Levels: -2 -1.8 -1.6 -1.4 -1.3 -1.2 -1.1 -1 -0.7 -0.5 -0.4 -0.3 ... 2

Reordering data

Hw8 Q3 (7 points). For both the men’s and women’s data frames, sprint.df and sprint.w.df, sorting by the Date column would give meaningless results, because this column is a factor. We’re going to convert this column to a numeric variable, called DateNum, that we can meaningfully sort/order. Here is the strategy: the entries of Date are all of the form “DD.MM.YYYY”, so we’re going to convert each one to the numeric value (DD) + (MM)*10^2 + (YYYY)*10^4, e.g., we’re going to convert “16.07.2015” to 20150716. Implement this strategy (hint: use strsplit() and sapply()) on both sprint.df and sprint.w.df, adding such a DateNum column to these data frames. Display the first 5 rows of each.
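As an illustration of the conversion strategy, here is a minimal sketch on a small made-up factor of dates rather than on the sprint data. The names toy.dates, date.to.num, and toy.date.num are hypothetical, and this is only one of several ways the hint could be followed.

```r
# Toy dates in "DD.MM.YYYY" form (made-up values, not the sprint data)
toy.dates = factor(c("16.07.2015", "05.08.2012", "24.09.1988"))

# Hypothetical helper: convert one "DD.MM.YYYY" string to DD + MM*10^2 + YYYY*10^4
date.to.num = function(s) {
  # fixed=TRUE makes "." a literal dot rather than a regex wildcard
  parts = as.numeric(strsplit(s, split=".", fixed=TRUE)[[1]])
  parts[1] + parts[2]*10^2 + parts[3]*10^4
}

# Convert the factor to character first, since strsplit() needs character input
toy.date.num = sapply(as.character(toy.dates), date.to.num, USE.NAMES=FALSE)
toy.date.num
```

Splitting on "." needs fixed=TRUE (or an escaped pattern), because strsplit() otherwise treats its split argument as a regular expression in which "." matches any character.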

Hw8 Q4 (4 points). Reorder the two data frames sprint.df, sprint.w.df so that their rows are in order of increasing date, calling the results sprint.df.by.date, sprint.w.df.by.date, respectively. Display the first 3 and last 3 rows of each. When was the earliest 100m sprint recorded on the men’s side? The women’s side? Then, for each of the men’s and women’s data frames, plot the 100m sprint time versus the calculated DateNum variable, with appropriately labeled axes and an appropriate title. Do you notice a trend—do sprinters appear to be getting faster over time?
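The reordering step can be sketched with order() on a toy data frame. The values and the names toy.df and toy.df.by.date are made up for illustration; the same pattern would presumably apply once a DateNum column exists.

```r
# Toy data frame with a numeric date column (made-up values)
toy.df = data.frame(Time=c(9.58, 9.69, 9.63),
                    DateNum=c(20090816, 20080816, 20120805))

# order() returns the row permutation that sorts DateNum increasingly
toy.df.by.date = toy.df[order(toy.df$DateNum), ]

head(toy.df.by.date, 1)  # earliest row
tail(toy.df.by.date, 1)  # latest row
```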

Collapsing by track meet

Hw8 Q5 (8 points). Compute a reduced version of the men’s data frame sprint.df that only keeps the fastest time from any one track meet. For example, of all the rows that correspond to sprint times recorded at the “Berlin 16.08.2009” track meet, we will only keep Usain Bolt’s row, since his time of 9.58 is fastest. (Hint: you can use tapply() to do this task; there may be several other ways too.) Call the result sprint.df.best, check that its number of rows is the same as the number of unique men’s track meets, and display its first 5 rows.

Then do the same for the women’s data frame sprint.w.df: call the result sprint.w.df.best, check that its number of rows is the same as the number of unique women’s track meets, and display its first 5 rows.
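One possible way to use tapply() for this kind of collapsing is to apply it to row indices, grouped by meet, and keep the index of the fastest time in each group. The sketch below uses a made-up data frame with placeholder names; toy.df, best.inds, and toy.df.best are hypothetical, and a CityDate column identifying the meet is assumed to exist.

```r
# Toy data: five times from three meets (made-up values and placeholder names)
toy.df = data.frame(
  Time=c(9.58, 9.71, 9.63, 9.75, 9.69),
  Name=c("A", "B", "A", "C", "A"),
  CityDate=c("Berlin 16.08.2009", "Berlin 16.08.2009",
             "London 05.08.2012", "London 05.08.2012",
             "Beijing 16.08.2008"),
  stringsAsFactors=FALSE)

# For each meet, return the row index with the smallest Time in that meet
best.inds = tapply(1:nrow(toy.df), toy.df$CityDate,
                   function(i) i[which.min(toy.df$Time[i])])

# Keep only those rows; as.vector() drops the names that tapply() attaches
toy.df.best = toy.df[as.vector(best.inds), ]

# Sanity check: one row per unique meet
nrow(toy.df.best) == length(unique(toy.df$CityDate))
```

Because tapply() groups by the levels of the grouping variable, the rows of the result come out ordered by meet rather than in their original order.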

Merging data

Hw8 Q6 (4 points). We will manually merge together the men’s and women’s data frames, sprint.df.best and sprint.w.df.best, over rows that correspond to times recorded at the same track meet. First, find the common track meets between the two data frames, i.e., the common entries in the CityDate columns of these two data frames. (Hint: use intersect().) Call the result common.meets, and check that it has 385 elements.

Then, compute the indices of rows in the men’s data frame sprint.df.best that correspond to these common track meets. (Hint: use which() and is.element().) Call the result inds.m. Do the same for the women, and call the result inds.w. Check that both inds.m and inds.w have length 385.
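The two hints can be illustrated on a pair of short character vectors standing in for the CityDate columns. The vectors and the names meets.m, meets.w, toy.common, toy.inds.m, and toy.inds.w are made up for this sketch.

```r
# Toy meet labels for men and women (made-up stand-ins for CityDate columns)
meets.m = c("Berlin 16.08.2009", "London 05.08.2012", "Beijing 16.08.2008")
meets.w = c("Seoul 24.09.1988", "London 05.08.2012", "Berlin 16.08.2009")

# Meets that appear in both vectors
toy.common = intersect(meets.m, meets.w)

# is.element() gives TRUE/FALSE per entry; which() turns that into indices
toy.inds.m = which(is.element(meets.m, toy.common))
toy.inds.w = which(is.element(meets.w, toy.common))

meets.m[toy.inds.m]
meets.w[toy.inds.w]
```

Note that each index vector follows the row order of its own data frame. In this toy example the two selected vectors list the same meets in different orders, so the selected rows line up meet by meet only when both data frames are ordered the same way by meet.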

Hw8 Q7 (7 points). Now, create a new data frame that merges the columns of sprint.df.best and of sprint.w.df.best, but only keeping the rows that correspond to the common track meets in these two data frames (which, recall, you already know are indexed by inds.m and inds.w, respectively). Call the result sprint.df.common, and arrange it so that this data frame only has 3 columns: MensTime (the men’s sprint time), WomensTime (the women’s sprint time), and CityDate (the common track meet). Display the first 5 rows of sprint.df.common.

Then, plot the WomensTime variable versus the MensTime variable from the data frame sprint.df.common, with appropriately labeled axes and an appropriate title. This plot is showing the women’s versus men’s times from the common track meets—is there a positive correlation here, i.e., is there a “track meet effect”? This might suggest that there is something about the track meet itself (e.g., the weather, the atmosphere, the crowd, the specific way the track has been constructed/set up, some combination of these) that helps sprinters run faster.
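A rough sketch of the manual merge on toy data follows. All values and the names toy.m, toy.w, and toy.df.common are made up, and both toy data frames are deliberately sorted by CityDate so that the selected rows line up; the sketch checks that alignment explicitly before combining columns.

```r
# Toy "best per meet" data frames, both sorted by CityDate (made-up values)
toy.m = data.frame(
  Time=c(9.69, 9.58, 9.63),
  CityDate=c("Beijing 16.08.2008", "Berlin 16.08.2009", "London 05.08.2012"),
  stringsAsFactors=FALSE)
toy.w = data.frame(
  Time=c(10.78, 10.85, 10.75, 10.90),
  CityDate=c("Beijing 16.08.2008", "Berlin 16.08.2009",
             "London 05.08.2012", "Seoul 24.09.1988"),
  stringsAsFactors=FALSE)

# Rows of each data frame that belong to a common meet
toy.common = intersect(toy.m$CityDate, toy.w$CityDate)
toy.inds.m = which(is.element(toy.m$CityDate, toy.common))
toy.inds.w = which(is.element(toy.w$CityDate, toy.common))

# Guard: the selected rows must refer to the same meets, in the same order
stopifnot(all(toy.m$CityDate[toy.inds.m] == toy.w$CityDate[toy.inds.w]))

# Three-column result: men's time, women's time, and the shared meet
toy.df.common = data.frame(MensTime=toy.m$Time[toy.inds.m],
                           WomensTime=toy.w$Time[toy.inds.w],
                           CityDate=toy.m$CityDate[toy.inds.m])
toy.df.common

# A numeric summary to accompany the scatterplot
cor(toy.df.common$MensTime, toy.df.common$WomensTime)
```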

Hw8 Bonus. Lastly, as an alternative to the manual strategy carried out over the last two questions, use the merge() function to merge the data frames sprint.df.best and sprint.w.df.best, over rows that correspond to the common track meets in these two. (Hint: for help on merge(), read the mini-lecture “Merging Data” for its description and for examples, too.) Call the result sprint.df.merged, and check that it has 385 rows.

Plot the Time.y variable versus the Time.x variable from the data frame sprint.df.merged, with appropriately labeled axes and an appropriate title. This plot should look exactly the same as your plot in the last question, because it should be portraying exactly the same data.
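For comparison, here is a small sketch of merge() on toy data frames that share a CityDate column. The values and the names toy.m, toy.w, and toy.df.merged are made up for illustration.

```r
# Toy data frames sharing the columns Time and CityDate (made-up values)
toy.m = data.frame(
  Time=c(9.69, 9.58, 9.63),
  CityDate=c("Beijing 16.08.2008", "Berlin 16.08.2009", "London 05.08.2012"),
  stringsAsFactors=FALSE)
toy.w = data.frame(
  Time=c(10.90, 10.75, 10.85, 10.78),
  CityDate=c("Seoul 24.09.1988", "London 05.08.2012",
             "Berlin 16.08.2009", "Beijing 16.08.2008"),
  stringsAsFactors=FALSE)

# Keep only meets present in both; the clashing Time columns get .x and .y suffixes
toy.df.merged = merge(x=toy.m, y=toy.w, by="CityDate")
toy.df.merged
```

Specifying by="CityDate" matters here: by default merge() joins on every column name the two data frames share, which in this toy example would include Time and would match rows only when the times are also equal. Unlike the index-based approach, merge() matches rows by key, so the two inputs do not need to be in the same order.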