Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 8 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 8 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 11:59pm on Tuesday November 1. This document contains 30 of the 45 total points for Homework 8.

Reading in, converting data

Below we read in data on the 2829 fastest men’s 100m sprint times, as well as the 2018 fastest women’s 100m sprint times, saved as data frames sprint.df and sprint.w.df. (You don’t have to do anything yet.)

sprint.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
                       sep="\t", header=TRUE, quote="")
sprint.w.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.w.dat",
                       sep="\t", header=TRUE, quote="")
head(sprint.df, 3)

##   Rank Time Wind       Name Country Birthdate    City       Date
## 1    1 9.58  0.9 Usain Bolt     JAM  21.08.86  Berlin 16.08.2009
## 2    2 9.63  1.5 Usain Bolt     JAM  21.08.86  London 05.08.2012
## 3    3 9.69  0.0 Usain Bolt     JAM  21.08.86 Beijing 16.08.2008

head(sprint.w.df, 3)

##   Rank  Time Wind                     Name Country Birthdate         City
## 1    1 10.49  0.0 Florence Griffith-Joyner     USA  21.12.59 Indianapolis
## 2    2 10.61 +1.2 Florence Griffith-Joyner     USA  21.12.59 Indianapolis
## 3    3 10.62 +1.0 Florence Griffith-Joyner     USA  21.12.59        Seoul
##         Date
## 1 16.07.1988
## 2 17.07.1988
## 3 24.09.1988

Reorder the data frame sprint.df so that its rows are in order of decreasing wind assistance, calling the result sprint.df.by.wind. A value of, e.g., 2.0 in the Wind column means a 2.0 m/s tailwind (helping the sprinter), and a value of, e.g., -2.0 means a 2.0 m/s headwind (hurting the sprinter). Display the first 5 rows of sprint.df.by.wind.
Do the same for sprint.w.df, reordering it so that its rows are in order of decreasing wind assistance. Display the first 5 rows. What do you notice about the wind measurements? You should find that they are not in decreasing order. Why did this happen with the women’s data frame, and not the men’s data frame? (Hint: to answer this, look at the class of sprint.w.df$Wind versus that of sprint.df$Wind.)
The function read.table() made a decision to treat the Wind column as a numeric variable for the men’s data frame and a factor variable for the women’s data frame. We want to convert the Wind column in the women’s data frame to be a numeric variable. Converting factors to numerics can be surprisingly frustrating in general, so it’s good to practice it. Below is a factor variable wind.measurements for you to play around with, as a testing ground. Show how to properly convert it to a numeric variable. (Hint: it is perhaps not obvious how to do this at first, but the answer shouldn’t take more than one line of code; use the as.xxx() family of functions.)

wind.measurements = as.factor(sample(seq(-2.0, 2.0, by=0.1), 20))
wind.measurements

##  [1] -1.4 -1.1 1.2  -2   0.2  2    -1.6 -1.2 -0.4 1.5  -1.8 -0.7 -0.5 1.9 
## [15] 1.7  -1.3 0.9  -0.3 1.3  -1  
## 20 Levels: -2 -1.8 -1.6 -1.4 -1.3 -1.2 -1.1 -1 -0.7 -0.5 -0.4 -0.3 ... 2
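
As a hedged illustration of the pitfall, on a made-up factor rather than the lab data: calling as.numeric() directly on a factor returns its internal integer level codes, while going through as.character() first recovers the printed values. The names f, codes, and vals are hypothetical. Note also that from R 4.0.0 onward, read.table() defaults to stringsAsFactors = FALSE, so a column like this would be read in as character rather than factor unless stringsAsFactors = TRUE is passed.

```r
# Toy factor whose labels look like numbers (made-up values)
f = factor(c("1.5", "-2", "0.2", "-2"))

# Direct conversion gives the integer level codes, not the labels
codes = as.numeric(f)

# Converting to character first recovers the numeric values
vals = as.numeric(as.character(f))
vals
```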

Using the strategy you developed in the last question, convert the Wind column of sprint.w.df into a numeric variable. You should get exactly one NA from this process. Which wind entry failed to be converted into a numeric (hence becoming NA)? Now, can you explain why read.table() chose to read in the Wind column as a factor variable for the women’s data frame (as opposed to a numeric variable, as it did for the men’s data frame)?
For each of the men’s and women’s data frames, plot the 100m sprint time versus the wind measurement. Label the axes and title the plot appropriately. Do you notice a trend: does more wind assistance mean faster sprint times? Where do the fastest men’s time and the fastest women’s time lie relative to this trend? (Remark: there’s an interesting story behind the wind measurement that was recorded for the fastest women’s time; you might enjoy reading about it online …)
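
A minimal sketch of a labeled scatterplot, using only the three men’s rows printed by head(sprint.df, 3) above; the vector names wind and time and the label text are illustrative choices, not required ones.

```r
# First three men's rows from the head(sprint.df, 3) output
wind = c(0.9, 1.5, 0.0)
time = c(9.58, 9.63, 9.69)

# xlab, ylab, and main set the axis labels and the title
plot(wind, time,
     xlab="Wind (m/s, positive = tailwind)",
     ylab="100m time (seconds)",
     main="Men's 100m time versus wind")
```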

Reordering data

Hw8 Q3 (7 points). For both the men’s and women’s data frames, sprint.df and sprint.w.df, sorting by the Date column would give meaningless results, because this column is a factor. We’re going to convert this column to a numeric variable, called DateNum, that we can meaningfully sort/order. Here is the strategy: the entries of Date are all of the form “DD.MM.YYYY”, so we’re going to convert this to the numeric value (DD) + (MM)*10^2 + (YYYY)*10^4, e.g., we’re going to convert “16.07.2015” to 20150716. Implement this strategy (hint: use strsplit() and sapply()) on both sprint.df and sprint.w.df, adding such a DateNum column to each of these data frames. Display the first 5 rows of each.
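
The conversion formula can be sketched on two standalone date strings, as below. This is only one possible approach under stated assumptions: the names dates, parts, and date.num are hypothetical, and the input here is a plain character vector, so a factor column would need as.character() applied first.

```r
# Toy input as plain character strings
dates = c("16.07.2015", "05.08.2012")

# fixed=TRUE makes "." a literal dot rather than a regex wildcard
parts = strsplit(dates, split=".", fixed=TRUE)

# Each list element is c(DD, MM, YYYY); combine as DD + MM*10^2 + YYYY*10^4
date.num = sapply(parts, function(p) {
  p = as.numeric(p)
  p[1] + p[2]*10^2 + p[3]*10^4
})
date.num
```

Without fixed=TRUE, strsplit() would interpret "." as a regular expression matching any character, which is a common source of confusion.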

Hw8 Q4 (4 points). Reorder the two data frames sprint.df, sprint.w.df so that their rows are in order of increasing date, calling the results sprint.df.by.date, sprint.w.df.by.date, respectively. Display the first 3 and last 3 rows of each. When was the earliest 100m sprint recorded on the men’s side? The women’s side? Then, for each of the men’s and women’s data frames, plot the 100m sprint time versus the calculated DateNum variable, with appropriately labeled axes and an appropriate title. Do you notice a trend—do sprinters appear to be getting faster over time?
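
For reference, a small sketch of reordering rows with order(), on a made-up data frame (the names toy, toy.by.date, and toy.by.date.desc are hypothetical). order() returns the permutation of row indices that sorts its argument, and that permutation is then used to index the rows.

```r
# Made-up data frame with a numeric date column
toy = data.frame(Name=c("A", "B", "C"),
                 DateNum=c(20120805, 20080816, 20090816))

# order() gives the row permutation that sorts DateNum increasingly
toy.by.date = toy[order(toy$DateNum), ]

# decreasing=TRUE reverses the direction
toy.by.date.desc = toy[order(toy$DateNum, decreasing=TRUE), ]

head(toy.by.date, 3)
```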

Collapsing by track meet

For each of the men’s and women’s data frames, sprint.df and sprint.w.df, append a column called CityDate that is defined by concatenating the string entries in the City and Date columns. For example, entries “Berlin” and “16.08.2009” in the City and Date columns, respectively, produce an entry of “Berlin 16.08.2009” in the CityDate column. From here on, we’re going to assume that every unique combination of the city and date, in the CityDate column, corresponds to a unique track meet. How many unique track meets occur in the men’s data frame, and how many in the women’s data frame? And how many other sprint times from the men’s data frame were recorded at the same track meet as Usain Bolt’s legendary time of 9.58 seconds?
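
A brief sketch of the concatenate-and-count idea on toy vectors (the names city, date, city.date, n.meets, and n.same are hypothetical, and the vectors are not the lab data):

```r
# Toy city and date entries
city = c("Berlin", "Berlin", "London")
date = c("16.08.2009", "16.08.2009", "05.08.2012")

# paste() is vectorized and separates its arguments with a space by default
city.date = paste(city, date)

# Number of distinct meets
n.meets = length(unique(city.date))

# Number of rows sharing the first row's meet (this count includes row 1 itself)
n.same = sum(city.date == city.date[1])
```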

Hw8 Q5 (8 points). Compute a reduced version of the men’s data frame sprint.df that only keeps the fastest time from any one track meet. For example, of all the rows that correspond to sprint times recorded at the “Berlin 16.08.2009” track meet, we will only keep Usain Bolt’s row, since his time of 9.58 is fastest. (Hint: you can use tapply() to do this task; there may be several other ways too.) Call the result sprint.df.best, check that its number of rows is the same as the number of unique men’s track meets, and display its first 5 rows.

Then do the same, but for the women’s data frame sprint.w.df; again call the result sprint.w.df.best, check that its number of rows is the same as the number of unique women’s track meets, and display its first 5 rows.
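
One way the tapply() hint might be used is sketched below on a made-up data frame: group the row indices by meet, and within each group keep the index of the smallest time. The names toy, best.inds, and toy.best are hypothetical, and apart from 9.58, 9.63, and 9.69 the times are invented for illustration.

```r
# Toy data: two meets with two times each, one meet with a single time
toy = data.frame(Time=c(9.58, 9.71, 9.63, 9.75, 9.69),
                 CityDate=c("Berlin 16.08.2009", "Berlin 16.08.2009",
                            "London 05.08.2012", "London 05.08.2012",
                            "Beijing 16.08.2008"),
                 stringsAsFactors=FALSE)

# For each meet, find the row index of its fastest (smallest) time
best.inds = tapply(seq_len(nrow(toy)), toy$CityDate,
                   function(i) i[which.min(toy$Time[i])])

# Keep one row per meet; rows come out ordered by meet name
toy.best = toy[as.vector(best.inds), ]
nrow(toy.best) == length(unique(toy$CityDate))
```

Because tapply() orders its output by the grouping levels, the reduced data frame ends up sorted by CityDate rather than in the original row order.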

Merging data

Hw8 Q6 (4 points). We will manually merge together the men’s and women’s data frames, sprint.df.best and sprint.w.df.best, over rows that correspond to times recorded at the same track meet. First, find the common track meets between the two data frames, i.e., the common entries in the CityDate columns of these two data frames. (Hint: use intersect().) Call the result common.meets, and check that it has 385 elements.

Then, compute the indices of rows in the men’s data frame sprint.df.best that correspond to these common track meets. (Hint: use which() and is.element().) Call the result inds.m. Do the same for the women, and call the result inds.w. Check that both inds.m and inds.w have length 385.
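
The hinted functions can be tried out on two toy character vectors, as sketched here (meets.m, meets.w, common, inds.toy.m, and inds.toy.w are hypothetical names, not the lab objects). One point worth checking in your own solution: which(is.element()) returns indices in each vector’s own order, so the two sets of rows line up meet by meet only if both data frames are sorted by meet in the same way. The match() call at the end is shown as one possible alternative that follows the order of the common meets instead.

```r
# Toy meet labels for two groups
meets.m = c("Berlin 16.08.2009", "London 05.08.2012", "Beijing 16.08.2008")
meets.w = c("Seoul 24.09.1988", "London 05.08.2012", "Berlin 16.08.2009")

# Meets present in both vectors
common = intersect(meets.m, meets.w)

# Positions of the common meets within each vector, in that vector's own order
inds.toy.m = which(is.element(meets.m, common))
inds.toy.w = which(is.element(meets.w, common))

# Alternative: positions in meets.w listed in the order of common
inds.toy.w.aligned = match(common, meets.w)
```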

Hw8 Q7 (7 points). Now, create a new data frame that merges the columns of sprint.df.best and of sprint.w.df.best, but only keeping the rows that correspond to the common track meets in these two data frames (which, recall, you already know are indexed by inds.m and inds.w, respectively.) Call the result sprint.df.common, and arrange it so that this data frame only has 3 columns: MensTime (the men’s sprint time), WomensTime (the women’s sprint time), and CityDate (the common track meet). Display the first 5 rows of sprint.df.common.

Then, plot the WomensTime variable versus the MensTime variable from the data frame sprint.df.common, with appropriately labeled axes and an appropriate title. This plot is showing the women’s versus men’s times from the common track meets—is there a positive correlation here, i.e., is there a “track meet effect”? This might suggest that there is something about the track meet itself (e.g., the weather, the atmosphere, the crowd, the specific way the track has been constructed/set up, some combination of these) that helps sprinters run faster.

Hw8 Bonus. Lastly, as an alternative to the manual strategy carried out over the last two questions, use the merge() function to merge the data frames sprint.df.best and sprint.w.df.best, over rows that correspond to the common track meets in these two. (Hint: for help on merge(), read the mini-lecture “Merging Data” for its description and for examples, too.) Call the result sprint.df.merged, and check that it has 385 rows.

Plot the Time.y variable versus the Time.x variable from the data frame sprint.df.merged, with appropriately labeled axes and an appropriate title. This plot should look exactly the same as your plot in the last question, because it should be portraying exactly the same data.
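
To see where the Time.x and Time.y names come from, here is a small sketch of merge() on two toy data frames (toy.m, toy.w, and toy.merged are hypothetical names, and the women’s times for London and Berlin are invented for illustration). By default merge() keeps only rows whose key appears in both inputs, and it appends the suffixes .x and .y to non-key columns that share a name.

```r
# Toy men's and women's data frames sharing the columns Time and CityDate
toy.m = data.frame(Time=c(9.58, 9.63, 9.69),
                   CityDate=c("Berlin 16.08.2009", "London 05.08.2012",
                              "Beijing 16.08.2008"),
                   stringsAsFactors=FALSE)
toy.w = data.frame(Time=c(10.75, 10.85, 10.62),   # first two values are made up
                   CityDate=c("London 05.08.2012", "Berlin 16.08.2009",
                              "Seoul 24.09.1988"),
                   stringsAsFactors=FALSE)

# Inner join on CityDate; Time from toy.m becomes Time.x, from toy.w Time.y
toy.merged = merge(toy.m, toy.w, by="CityDate")
toy.merged
```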

Lab 9f: Reading in, Reordering, Merging Data

Statistical Computing, 36-350

Friday October 28, 2016
