Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Sunday 11:59pm, this week. Make sure to complete your weekly check-in (which can be done by coming to lecture, recitation, lab, or any office hour), as this will count a small number of points towards your lab score.
This week’s agenda: practicing grouping, pivoting wider and longer, and joins.
# Load the tidyverse!
library(tidyverse)
assertthat::assert_that(utils::packageVersion("tidyr") > "0.8.99")
## [1] TRUE
Below we read in a data frame sprint.m.df containing the top men's times in the 100m sprint, as seen in previous labs. In the following, unless stated otherwise, use pipes and dplyr verbs to solve each part as cleanly/succinctly as you can.
sprint.m.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.m.dat",
sep="\t", header=TRUE, quote="", stringsAsFactors=TRUE)
1a. Compute, for each country, the fastest time among athletes who come from that country, and display the first 10 results, ordered alphabetically by country. Also compute, for each city, the fastest time among athletes who ran in that city, and display the first 10 results, ordered alphabetically by city. Hint: group_by(), summarise().
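For reference, the grouped-summary pattern being hinted at looks roughly like the sketch below (one possible approach for the per-country half; your own solution may look different, and the column name Fastest is just an illustrative choice):

sprint.m.df %>%
  group_by(Country) %>%                # one group per country
  summarise(Fastest = min(Time)) %>%   # fastest time within each group
  head(10)                             # groups come out alphabetically here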
1b. With the most minor modification to your code possible, do the same computations as in the last part, but now display the first 10 results ordered by increasing time. Hint: arrange().
1c. Rewrite your solution in the last part using base R. Hint: tapply() gives probably the easiest route here. Note: your code here shouldn't be too much more complicated than your code in the last part.
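As a rough sketch of the tapply() route (not necessarily the cleanest version):

fastest.by.country = tapply(sprint.m.df$Time, sprint.m.df$Country, min)  # named vector of minima
head(sort(fastest.by.country), 10)                                       # first 10 by increasing time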
1d. Compute, for each country, the quadruple (Name, City, Country, Time) corresponding to the athlete with the fastest time among athletes from that country. Display the first 10 results, ordered by increasing time. If there are ties, then show all the results that correspond to the fastest time. Repeat the same computation, but for the fastest time per city. Hint: group_by(), filter(), select().
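A minimal sketch of the group_by()/filter()/select() idea for the per-country case (ties kept, as asked):

sprint.m.df %>%
  group_by(Country) %>%
  filter(Time == min(Time)) %>%          # keep all rows tied for the fastest time
  select(Name, City, Country, Time) %>%
  arrange(Time) %>%
  head(10)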
1e. Rewrite the rest of your solution in the last part using base R. You should end up with two data frames (per country, and per city) with the exact same structure as in the last part, and display the top 10 rows of each, ordered by increasing time. Hint: there are various routes to go; one strategy is to use split(), followed by lapply() with a custom function call, and then rbind() to get things in data frame form. Note: your code here will probably be more complicated, or at least less intuitive, than your code in the last part.
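One possible base-R shape for this, as a sketch (split(), then lapply() with a small helper, then rbind()):

by.country = split(sprint.m.df, sprint.m.df$Country)
fastest.rows = lapply(by.country, function(df) {
  df[df$Time == min(df$Time), c("Name", "City", "Country", "Time")]   # tied rows per country
})
per.country = do.call(rbind, fastest.rows)
head(per.country[order(per.country$Time), ], 10)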
1f. With the most minor modification to your code possible, do the same computations as in Q1d, but now when there are ties, pick only one of the relevant results arbitrarily (e.g., uniformly at random is fine).
In the following, use pipes and dplyr or tidyr verbs to solve each part as cleanly/succinctly as you can. In some parts, it might make more sense to use direct indexing, and that's perfectly fine.
2a. From sprint.m.df, define a reduced data frame dat.reduced as follows. For each athlete, and each city, compute the median of all times they recorded in this city. Your new data frame dat.reduced should have 1787 rows and 3 columns (Name, City, Time). Confirm that it has these dimensions, and display its first 10 entries.
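A sketch of the kind of pipeline this is asking for (grouping by both athlete and city, then summarising with the median):

dat.reduced = sprint.m.df %>%
  group_by(Name, City) %>%
  summarise(Time = median(Time)) %>%   # one median time per athlete-city pair
  ungroup()
dim(dat.reduced)                       # should be 1787 x 3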
2b. The data frame dat.reduced is said to be in "long" format: it has observations on the rows, and variables (Name, City, Time) on the columns. Use pivot_wider() to convert this into "wide" format, and call the result dat.wide. Here the first column should be the athlete names, and the remaining columns should correspond to the cities. Please use the arrange function (2x) to get the columns (besides the Name column) and rows in alphabetical order. Apart from the first column, each entry gives the median time recorded by the athlete in this city. What are the dimensions of dat.wide, and do these make sense to you?
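A minimal pivot_wider() sketch, assuming dat.reduced from Q2a (the column ordering asked for above is left out here):

dat.wide = dat.reduced %>%
  pivot_wider(names_from = City, values_from = Time) %>%  # one column per city
  arrange(Name)                                            # rows alphabetical by athlete
dim(dat.wide)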
2c. Not counting the names in the first column, how many non-NA values does dat.wide have? Does this make sense to you? It should. Explain how you could have guessed this number ahead of time, without even calling pivot_wider(), based only on dat.reduced.
2d. From dat.wide, look at the row for "Usain Bolt", and determine the city names that do not have NA values. These should be the cities in which he raced. Determine these cities directly from dat.reduced, and confirm that they match.
2e. Use pivot_longer() to convert dat.wide back into "long" format, and call the result dat.long. Remove rows that have NA values (hint: you can do this by setting values_drop_na=TRUE in the call to pivot_longer()), and order the rows alphabetically by athlete and city name. Once you've done this, dat.long should have matching entries to dat.reduced; confirm that this is the case.
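A sketch of the reverse transformation with pivot_longer(), assuming dat.wide from Q2b:

dat.long = dat.wide %>%
  pivot_longer(cols = -Name, names_to = "City", values_to = "Time",
               values_drop_na = TRUE) %>%   # drop athlete-city pairs with no recorded time
  arrange(Name, City)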
Below we read in a data frame sprint.w.df containing the top women's times in the 100m sprint, as seen in previous labs. In the following, use pipes and dplyr verbs to solve each part as cleanly/succinctly as you can. Note: you'll receive warnings when you make joins about the conversion of factors to characters, and that's fine, don't worry about it.
sprint.w.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.w.dat",
sep="\t", header=TRUE, quote="", stringsAsFactors=TRUE)
3a. As in Q1f, compute for each country the triplet (Name, Country, Time) corresponding to the male athlete with the fastest time among athletes from that country, breaking ties arbitrarily. Instead of displaying the results, save the resulting data frame as dat.m. Importantly, at the end of your flow of pipe commands used to define dat.m, make sure to call ungroup(). This will ensure that dat.m has no groupings associated with it. Do the same for the women, and call the result dat.w. Report the dimensions of dat.m and dat.w, and check that they make sense to you.
3b. Perform an inner join, using inner_join(), of dat.m and dat.w, with the join done by the Country column. Call the resulting data frame dat.ij, and display its first 10 rows. How many rows does it have in total? Show how you could have arrived at this number ahead of time, from dat.m$Country and dat.w$Country (hint: intersect()). Count the number of NA values in dat.ij: this should be zero.
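For reference, the join and the sanity checks might look roughly like this (a sketch, assuming dat.m and dat.w from Q3a):

dat.ij = inner_join(dat.m, dat.w, by = "Country")
nrow(dat.ij)                                     # rows in the joined frame
length(intersect(dat.m$Country, dat.w$Country))  # should agree with the row count
sum(is.na(dat.ij))                               # should be zero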
3c. Perform a left join, using left_join(), of dat.m and dat.w, with the join again done by the Country column. Call the resulting data frame dat.lj, and display its first 10 rows. How many rows does it have in total? Explain why this makes sense. Count the number of NA values in dat.lj: this should be 50. Show how you could have arrived at this number from dat.m$Country and dat.w$Country (hint: setdiff()).
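The left join itself is a one-liner; as a sketch, with the setdiff() count alongside:

dat.lj = left_join(dat.m, dat.w, by = "Country")
length(setdiff(dat.m$Country, dat.w$Country))  # men-only countries, each contributing NA values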
3d. Finally, perform a full join, using full_join(), of dat.m and dat.w, with the join again done by the Country column. Call the resulting data frame dat.fj. How many rows does it have in total? Show how you could have arrived at this number from dat.m$Country and dat.w$Country (hint: union()). Count the number of NA values in dat.fj: this should be 80. Challenge: show how you could have arrived at this number from dat.m$Country and dat.w$Country.
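And the full join, sketched the same way (union() counts all countries appearing in either frame):

dat.fj = full_join(dat.m, dat.w, by = "Country")
nrow(dat.fj)
length(union(dat.m$Country, dat.w$Country))  # should agree with the row count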
Below is some solution code from Lab 8, where we convert the Birthdate and Date columns in the sprint.m.df and sprint.w.df data frames to numeric form. In what follows, you will resolve some of the questions from Lab 8, but using pipes and dplyr/tidyr.
date.to.numeric = function(val) {
  # Convert a "day.month.year" date string into a number of the form YYYYMMDD
  val = as.character(val)
  vec = strsplit(val, split = "\\.")[[1]]
  if (nchar(vec[3]) == 2) vec[3] = paste0("19", vec[3])  # expand 2-digit years to 19YY
  vec = as.numeric(vec)
  vec[3]*10^4 + vec[2]*10^2 + vec[1]
}
sprint.m.df$Birthdate = sapply(sprint.m.df$Birthdate, date.to.numeric)
sprint.m.df$Date = sapply(sprint.m.df$Date, date.to.numeric)
sprint.w.df$Birthdate = sapply(sprint.w.df$Birthdate, date.to.numeric)
sprint.w.df$Date = sapply(sprint.w.df$Date, date.to.numeric)
head(sprint.m.df, 5)
## Rank Time Wind Name Country Birthdate City Date
## 1 1 9.58 0.9 Usain Bolt JAM 19860821 Berlin 20090816
## 2 2 9.63 1.5 Usain Bolt JAM 19860821 London 20120805
## 3 3 9.69 0.0 Usain Bolt JAM 19860821 Beijing 20080816
## 4 3 9.69 2.0 Tyson Gay USA 19820809 Shanghai 20090920
## 5 3 9.69 -0.1 Yohan Blake JAM 19891226 Lausanne 20120823
head(sprint.w.df, 5)
## Rank Time Wind Name Country Birthdate City
## 1 1 10.49 0,0 Florence Griffith-Joyner USA 19591221 Indianapolis
## 2 2 10.61 +1,2 Florence Griffith-Joyner USA 19591221 Indianapolis
## 3 3 10.62 +1,0 Florence Griffith-Joyner USA 19591221 Seoul
## 4 4 10.64 +1,2 Carmelita Jeter USA 19791124 Shanghai
## 5 5 10.65 +1,1 Marion Jones USA 19751012 Johannesburg
## Date
## 1 19880716
## 2 19880717
## 3 19880924
## 4 20090920
## 5 19980912
4a. Here you'll effectively resolve Q2c and Q2d from Lab 8, using one single flow of pipe commands, for each of the sprint.m.df and sprint.w.df data frames. In particular, define a new column CityDate given by concatenating the City and Date columns separated by a "." (hint: unite()), then keep only the row with the fastest time for each value of CityDate (breaking ties arbitrarily), then sort the rows by increasing Time. Call the resulting data frames dat.m.cd and dat.w.cd. Make sure that in the last line of pipe commands used to define them, you call ungroup(). Check that these data frames have dimensions 1253 x 7 and 921 x 7, respectively, and display the first 5 rows of each.
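One possible shape for this pipeline, as a sketch (here ties are broken by simply keeping the first tied row with slice(); other tie-breaking choices are fine too):

dat.m.cd = sprint.m.df %>%
  unite(CityDate, City, Date, sep = ".") %>%  # new column combining City and Date
  group_by(CityDate) %>%
  filter(Time == min(Time)) %>%               # fastest time per CityDate
  slice(1) %>%                                # break ties arbitrarily
  arrange(Time) %>%
  ungroup()
dim(dat.m.cd)                                 # should be 1253 x 7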
4b. Now you'll effectively resolve Q3 on Lab 8, using one single flow of pipe commands, for each of the sprint.m.df and sprint.w.df data frames. In particular, do an inner join between dat.m.cd and dat.w.cd by CityDate, then drop the Rank.x, Rank.y, Birthdate.x, Birthdate.y columns. Call the resulting data frame dat.cd and check that its dimensions are 377 x 9. Display its first 10 rows, and check that it has no NA values.
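A sketch of the join-and-drop step, assuming dat.m.cd and dat.w.cd from Q4a:

dat.cd = inner_join(dat.m.cd, dat.w.cd, by = "CityDate") %>%
  select(-Rank.x, -Rank.y, -Birthdate.x, -Birthdate.y)
dim(dat.cd)  # should be 377 x 9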
4c. Reproduce the plot you made in Q3d on Lab 8, of Time.y (women's time) versus Time.x (men's time), from the dat.cd data frame. As a reminder, a positive correlation here would indicate some kind of "track meet effect". Call cor.test() on Time.x and Time.y and report the p-value. This should all look exactly the same as in Q3d from Lab 8; it's just a check of reproducibility.
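As a reminder of the base plotting and testing calls (a sketch, assuming dat.cd from Q4b; the axis labels and title are just illustrative choices):

plot(dat.cd$Time.x, dat.cd$Time.y,
     xlab = "Men's time (s)", ylab = "Women's time (s)",
     main = "Women's vs men's times at shared city-dates")
cor.test(dat.cd$Time.x, dat.cd$Time.y)  # report the p-value from this output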
Challenge. In one single flow of pipe commands, for each of sprint.m.df and sprint.w.df (i.e., without saving an intermediate object dat.cd), reproduce the results in Q4b and Q4c. You don't have to worry about reporting the dimensions of the joined data frame or displaying its first 10 rows; just complete the inner join, produce the plot, and report the p-value from cor.test(). Hint: to produce the plot before you report the p-value from cor.test(), you're going to have to use the "tee" operator %T>% so that the pipe flow doesn't terminate prematurely.
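If you haven't seen %T>% before, here is a toy illustration of how it behaves (it comes from the magrittr package; the call on its right is run for its side effect, and the left-hand value keeps flowing down the pipe):

library(magrittr)
rnorm(100) %T>%  # hist() is called for its side effect (the plot) ...
  hist() %>%     # ... but the original vector continues down the pipe
  summary()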
nesting (optional)

Sometimes you'd like to perform analysis conditional on a set of groups (think back to the times you've used tapply). There's a paradigm called "split-apply-combine" that defines the steps you'd need to take to perform this type of analysis. In the "tidyverse" this approach can be done using the nesting commands from tidyr.
More specifically, this problem will introduce you to nesting (nest and unnest) as well as the function purrr::map and some functions from the package broom. Lecture slide #21 provides a link to a lecture (Rmd, html) that covers most of the material in this problem.
For this problem we’ll be looking at a slightly different dataset that can be loaded in using the following:
sprint.best.full.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.best.full.dat",
header=TRUE, sep="\t", quote="", stringsAsFactors=TRUE)
This dataset contains information about the best sprinters (conditional on gender) for each year. It contains 3 new columns compared to the above data frames:
Gender (factor): indicates which gender the runner was
Year (integer): which year the time was recorded
Year.centered (integer): relative year

Suppose we were interested in examining the relationship between the best time relative to the year and wind speed, conditional on gender. In a linear model, we could model

Time ~ Wind*Gender + Year.centered*Gender + Gender

but today we will instead look at making 2 models (filtering the data by gender) and then looking at the relationship below:
Time ~ Wind + Year.centered
6a. Run the code below. Then look at nested.df$data[[1]] and describe it (please also identify what subgroup it belongs to).

nested.df = sprint.best.full.df %>%
  group_by(Gender) %>%
  nest()
The nest function "nested" the portion of sprint.best.full.df associated with each gender into the column data of nested.df. The nest function, along with the map function from purrr, allows us to perform similar operations in a "tidyverse" way, as you learned when you used things like tapply and lapply.

Suppose, at the end of the day, we wanted to compare linear model \(\beta\) coefficients between the two models (one built with male data, one with female data). The first thing we'd need to do would be to fit the linear models described above. For a single dataset we could do something like what is demonstrated below.
purrr::map(nested.df$data[1],function(df) lm(Time ~ Wind + Year.centered, data = df))
#or
lapply(nested.df$data[1],function(df) lm(Time ~ Wind + Year.centered, data = df))
6b. In "tidyverse" land, let's use purrr::map. We can create (and store) these linear models in our data frame using mutate; specifically, we can do the following (make sure to change the chunk option from eval = F to eval = T):
nested.df = nested.df %>%
mutate(model = map(data, function(df) lm(Time ~ Wind + Year.centered, data = df)))
# if for some reason the above doesn't work, try:
my.lm.func = function(df) lm(Time ~ Wind + Year.centered, df)
nested.df = nested.df %>%
mutate(model = map(data, my.lm.func))
Check what columns nested.df contains. What is the new column's name? What class of object is stored in each element?
6c. Now, we want to grab out the coefficients (and for now, suppose also the full summary). Update nested.df such that we have a summary of each model in a new column called sum. Remember you should use map, and that you're applying summary to the models, not the data.
6d. (No work, just reading) What you should be noticing is that this approach allows you to iteratively write your code (which has its benefits). Sadly we need a final step (which we provide for you). We will discuss why in the last part of this question (summary). (Make sure to change eval = F to eval = T.)
nested.df <- nested.df %>%
mutate(sum2 = map(sum, broom::tidy))
6e. The last step is to unnest. We provide the code for you below. Why do you think we use select(Gender, sum2)? Express in words how the unnested data frame changes if we don't include that line of code.

unnested.df <- nested.df %>%
  select(Gender, sum2) %>%
  unnest(sum2)
6f. Finally, create a table that has 2 rows (one for each gender) and contains the beta coefficients of each of the terms in the model (define this "table" as beta.model.df and print it out). Hint: you'll probably use a pivot_* and a select call. Looking at this table and back at unnested.df, does it appear that the effect of year (conditional on Wind speed) is stronger for male runners or female runners? (Note: these models aren't super amazing, so you shouldn't really see this as a takeaway.)
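A sketch of the kind of reshaping this hints at, assuming unnested.df from Q6e (with the term and estimate columns produced by broom::tidy):

beta.model.df = unnested.df %>%
  select(Gender, term, estimate) %>%                      # keep just the coefficient estimates
  pivot_wider(names_from = term, values_from = estimate)  # one row per gender
beta.model.df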
Summary. You've now gotten a taste of the "split-apply-combine" paradigm using nest/unnest, purrr, and a few functions from the broom library. This approach should appear similar to apply style coding, but a bit more iterative. You may have noticed that we had an extra step in Q6d that you might not have expected; as the tidyverse's emphasis is on data frames, we end up needing to work with data frames to make sure that the unnesting works as expected.
Finally, we call this approach "split-apply-combine" based on the sequence of steps one takes; in this example we could separate these sequences into:

the group_by call
mutate(purrr::map) style steps
pivot_* to alter the final output