Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file on Canvas, by Sunday 11:59pm, this week. Make sure to complete your weekly check-in (which can be done by coming to lecture, recitation, lab, or any office hour), as this will count a small number of points towards your lab score.

This week’s agenda: practicing grouping, pivoting wider and longer, and joins.

# Load the tidyverse!
library(tidyverse)
assertthat::assert_that(utils::packageVersion("tidyr") > "0.8.99") 
## [1] TRUE

Practice with grouping

Below we read in a data frame sprint.m.df containing the top men’s times in the 100m sprint, as seen in previous labs. In the following, unless stated otherwise, use pipes and dplyr verbs to solve each part as cleanly/succintly as you can.

sprint.m.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.m.dat",
  sep="\t", header=TRUE, quote="", stringsAsFactors=TRUE)

Practice with pivoting wider and longer

In the following, use pipes and dplyr or tidyr verbs to solve each part as cleanly/succintly as you can. In some parts, it might make more sense to use direct indexing, and that’s perfectly fine.

Practice with joins

Below we read in a data frame sprint.w.df containing the top women’s times in the 100m sprint, as seen in previous labs. In the following, use pipes and dplyr verbs to solve each part as cleanly/succintly as you can. Note: you’ll receive warnings when you make joins about the conversion of factors to characters, and that’s fine, don’t worry about it.

sprint.w.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.w.dat",
  sep="\t", header=TRUE, quote="", stringsAsFactors=TRUE)

More grouping and joining

Below is some solution code from Lab 8, where we convert the Birthdate and Date columns in the sprint.m.df and sprint.w.df data frames to numeric form. In what follows, you will resolve some of the questions from Lab 8, but using pipes and dplyr, tidyr.

date.to.numeric = function(val) {
  val = as.character(val)
  vec = strsplit(val, split  = "\\.")[[1]]
  if (nchar(vec[3]) == 2) vec[3] = paste0("19", vec[3])
  vec = as.numeric(vec)
  vec[3]*10^4 + vec[2]*10^2 + vec[1]
}

sprint.m.df$Birthdate = sapply(sprint.m.df$Birthdate, date.to.numeric)
sprint.m.df$Date = sapply(sprint.m.df$Date, date.to.numeric)
sprint.w.df$Birthdate = sapply(sprint.w.df$Birthdate, date.to.numeric)
sprint.w.df$Date = sapply(sprint.w.df$Date, date.to.numeric)

head(sprint.m.df, 5)
##   Rank Time Wind        Name Country Birthdate     City     Date
## 1    1 9.58  0.9  Usain Bolt     JAM  19860821   Berlin 20090816
## 2    2 9.63  1.5  Usain Bolt     JAM  19860821   London 20120805
## 3    3 9.69  0.0  Usain Bolt     JAM  19860821  Beijing 20080816
## 4    3 9.69  2.0   Tyson Gay     USA  19820809 Shanghai 20090920
## 5    3 9.69 -0.1 Yohan Blake     JAM  19891226 Lausanne 20120823
head(sprint.w.df, 5)
##   Rank  Time Wind                     Name Country Birthdate         City
## 1    1 10.49  0,0 Florence Griffith-Joyner     USA  19591221 Indianapolis
## 2    2 10.61 +1,2 Florence Griffith-Joyner     USA  19591221 Indianapolis
## 3    3 10.62 +1,0 Florence Griffith-Joyner     USA  19591221        Seoul
## 4    4 10.64 +1,2          Carmelita Jeter     USA  19791124     Shanghai
## 5    5 10.65 +1,1             Marion Jones     USA  19751012 Johannesburg
##       Date
## 1 19880716
## 2 19880717
## 3 19880924
## 4 20090920
## 5 19980912

Split-apply-combine with nesting (optional)

Sometimes you’d like to preform analysis conditional on a set of groups (think back to the times you’ve used tapply). There’s a paradigm called “split-apply-combine” that defines the steps you’d need to take to preform this type of analysis. In the “tidyverse” this approach can be done using the nesting commands from tidyr.

More specifically, this problem with introduce you to nesting (nest and unnest) as well as the functions purrr::map and some functions from the package broom. Lecture slide #21 provides a link to a lecture (Rmd, html) that covers most of the material in this problem.


For this problem we’ll be looking at a slightly different dataset that can be loaded in using the following:

sprint.best.full.df = read.table(
  file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.best.full.dat", 
  header=TRUE, sep="\t", quote="", stringsAsFactors=TRUE)

This dataset contains information about the best sprinters (conditional on gender) for each year. It contains 3 new columns compared to the above data frames:

  1. Gender (factor): indicates which gender the runner was
  2. Year (integer): which year the time was recorded
  3. Year.centered (integer): relative year

Suppose we were interested in examine the relationship between the best time relative to the year and wind speed conditional on gender. In a linear model, we could model

Time ~ Wind*Gender + Year.centered*Gender + Gender

but today we will instead look at making 2 models (filtering the data by gender) and then looking at the below relationship:

Time ~ Wind + Year.centered

nested.df = sprint.best.full.df %>% 
  group_by(Gender) %>%
  nest()

Suppose, at the end of the day we wanted to compare linear model \(\beta\) coefficients between the two models (1 built with male data, one with female data). The first thing we’d need to do would be to run the linear also as described above. For a single dataset we could do something like what is demonstrated below.

purrr::map(nested.df$data[1],function(df) lm(Time ~ Wind + Year.centered, data = df))
#or
lapply(nested.df$data[1],function(df) lm(Time ~ Wind + Year.centered, data = df))

In “tidyverse” land, let’s use purrr::map. We can create (and store) these linear models into our data frames using mutate, specifically we can do the following (make sure to change the “eval = T”:

nested.df = nested.df %>%
  mutate(model = map(data, function(df) lm(Time ~ Wind + Year.centered, data = df)))

# if for some reason the above doesn't work, try:
my.lm.func = function(df) lm(Time ~ Wind + Year.centered, df)
nested.df = nested.df %>%
  mutate(model = map(data, my.lm.func))

Check what columns nested.df contains. What is the new column’s name? What class of object is stored in each element?

nested.df <- nested.df %>% 
  mutate(sum2 = map(sum, broom::tidy))
unnested.df <- nested.df %>%
  select(Gender, sum2) %>%
  unnest(sum2)

Finally, we call this approach “split-apply-combine” based on the sequence of steps one takes, in this example we could seperate these sequences into:

  1. split: group_by call
  2. apply: all the mutate(purrr::map) style steps
  3. combine: the use of pivot_* to alter the final output