Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Sunday 11:59pm, this week. Make sure to complete your weekly check-in (which can be done by coming to lecture, recitation, lab, or any office hour), as this will count a small number of points towards your lab score.
This week’s agenda: practicing grouping, pivoting wider and longer, and joins.
# Load the tidyverse!
library(tidyverse)
assertthat::assert_that(utils::packageVersion("tidyr") > "0.8.99")
## [1] TRUE
Below we read in a data frame sprint.m.df containing the top men's times in the 100m sprint, as seen in previous labs. In the following, unless stated otherwise, use pipes and dplyr verbs to solve each part as cleanly/succinctly as you can.
sprint.m.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.m.dat",
sep="\t", header=TRUE, quote="", stringsAsFactors=TRUE)
1a. Compute, for each country, the fastest time among athletes who come from that country, and display the first 10 results, ordered alphabetically by country. Also compute, for each city, the fastest time among athletes who ran in that city, and display the first 10 results, ordered alphabetically by city. Hint: group_by(), summarise().
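For reference, the grouped-summary pattern being hinted at looks roughly like the sketch below (one possible approach for the per-country half; your own solution may look different, and the column name Fastest is just an illustrative choice):

sprint.m.df %>%
  group_by(Country) %>%                # one group per country
  summarise(Fastest = min(Time)) %>%   # fastest time within each group
  head(10)                             # groups come out alphabetically here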
1b. With the most minor modification to your code possible, do the same computations as in the last part, but now display the first 10 results ordered by increasing time. Hint: arrange().
1c. Rewrite your solution in the last part using base R. Hint: tapply() gives probably the easiest route here. Note: your code here shouldn't be too much more complicated than your code in the last part.
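As a rough sketch of the tapply() route (not necessarily the cleanest version):

fastest.by.country = tapply(sprint.m.df$Time, sprint.m.df$Country, min)  # named vector of minima
head(sort(fastest.by.country), 10)                                       # first 10 by increasing time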
1d. Compute, for each country, the quadruple (Name, City, Country, Time) corresponding to the athlete with the fastest time among athletes from that country. Display the first 10 results, ordered by increasing time. If there are ties, then show all the results that correspond to the fastest time. Repeat the same computation, but for the fastest time per city. Hint: group_by(), filter(), select().
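A minimal sketch of the group_by()/filter()/select() idea for the per-country case (ties kept, as asked):

sprint.m.df %>%
  group_by(Country) %>%
  filter(Time == min(Time)) %>%          # keep all rows tied for the fastest time
  select(Name, City, Country, Time) %>%
  arrange(Time) %>%
  head(10)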
1e. Rewrite the rest of your solution in the last part using base R. You should end up with two data frames (per country, and per city) with the exact same structure as in the last part, and display the top 10 rows of each, ordered by increasing time. Hint: there are various routes to go; one strategy is to use split(), followed by lapply() with a custom function call, and then rbind() to get things in data frame form. Note: your code here will probably be more complicated, or at least less intuitive, than your code in the last part.
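One possible base-R shape for this, as a sketch (split(), then lapply() with a small helper, then rbind()):

by.country = split(sprint.m.df, sprint.m.df$Country)
fastest.rows = lapply(by.country, function(df) {
  df[df$Time == min(df$Time), c("Name", "City", "Country", "Time")]   # tied rows per country
})
per.country = do.call(rbind, fastest.rows)
head(per.country[order(per.country$Time), ], 10)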
1f. With the most minor modification to your code possible, do the same computations as in Q1d, but now when there are ties, pick only one of the relevant results arbitrarily (e.g., uniformly at random is fine).
In the following, use pipes and dplyr or tidyr verbs to solve each part as cleanly/succinctly as you can. In some parts, it might make more sense to use direct indexing, and that's perfectly fine.
2a. From sprint.m.df, define a reduced data frame dat.reduced as follows. For each athlete, and each city, compute the median of all times they recorded in this city. Your new data frame dat.reduced should have 1787 rows and 3 columns (Name, City, Time). Confirm that it has these dimensions, and display its first 10 entries.
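A sketch of the kind of pipeline this is asking for (grouping by both athlete and city, then summarising with the median):

dat.reduced = sprint.m.df %>%
  group_by(Name, City) %>%
  summarise(Time = median(Time)) %>%   # one median time per athlete-city pair
  ungroup()
dim(dat.reduced)                       # should be 1787 x 3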
2b. The data frame dat.reduced is said to be in "long" format: it has observations on the rows, and variables (Name, City, Time) on the columns. Use pivot_wider() to convert this into "wide" format, and call the result dat.wide. Here the first column should be the athlete names, and the remaining columns should correspond to the cities. Please use the arrange function (2x) to get the columns (besides the Name column) and rows in alphabetical order. Apart from the first column, each entry gives the median time recorded by the athlete in this city. What are the dimensions of dat.wide, and do these make sense to you?
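A minimal pivot_wider() sketch, assuming dat.reduced from Q2a (the column ordering asked for above is left out here):

dat.wide = dat.reduced %>%
  pivot_wider(names_from = City, values_from = Time) %>%  # one column per city
  arrange(Name)                                            # rows alphabetical by athlete
dim(dat.wide)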
2c. Not counting the names in the first column, how many non-NA values does dat.wide have? Does this make sense to you? It should. Explain how you could have guessed this number ahead of time, without even calling pivot_wider(), based only on dat.reduced.
2d. From dat.wide, look at the row for "Usain Bolt", and determine the city names that do not have NA values. These should be the cities in which he raced. Determine these cities directly from dat.reduced, and confirm that they match.
2e. Use pivot_longer() to convert dat.wide back into "long" format, and call the result dat.long. Remove rows that have NA values (hint: you can do this by setting values_drop_na=TRUE in the call to pivot_longer()), and order the rows alphabetically by athlete and city name. Once you've done this, dat.long should have matching entries to dat.reduced; confirm that this is the case.
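A sketch of the reverse transformation with pivot_longer(), assuming dat.wide from Q2b:

dat.long = dat.wide %>%
  pivot_longer(cols = -Name, names_to = "City", values_to = "Time",
               values_drop_na = TRUE) %>%   # drop athlete-city pairs with no recorded time
  arrange(Name, City)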
Below we read in a data frame sprint.w.df containing the top women's times in the 100m sprint, as seen in previous labs. In the following, use pipes and dplyr verbs to solve each part as cleanly/succinctly as you can. Note: you'll receive warnings when you make joins about the conversion of factors to characters, and that's fine, don't worry about it.
sprint.w.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.w.dat",
sep="\t", header=TRUE, quote="", stringsAsFactors=TRUE)
3a. As in Q1f, compute for each country the triplet (Name, Country, Time) corresponding to the male athlete with the fastest time among athletes from that country, breaking ties arbitrarily. Instead of displaying the results, save the resulting data frame as dat.m. Importantly, at the end of your flow of pipe commands used to define dat.m, make sure to call ungroup(). This will ensure that dat.m has no groupings associated with it. Do the same for the women, and call the result dat.w. Report the dimensions of dat.m and dat.w, and check that they make sense to you.
3b. Perform an inner join, using inner_join(), of dat.m and dat.w, with the join done by the Country column. Call the resulting data frame dat.ij, and display its first 10 rows. How many rows does it have in total? Show how you could have arrived at this number ahead of time, from dat.m$Country and dat.w$Country (hint: intersect()). Count the number of NA values in dat.ij: this should be zero.
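For reference, the join and the sanity checks might look roughly like this (a sketch, assuming dat.m and dat.w from Q3a):

dat.ij = inner_join(dat.m, dat.w, by = "Country")
nrow(dat.ij)                                     # rows in the joined frame
length(intersect(dat.m$Country, dat.w$Country))  # should agree with the row count
sum(is.na(dat.ij))                               # should be zero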
3c. Perform a left join, using left_join(), of dat.m and dat.w, with the join again done by the Country column. Call the resulting data frame dat.lj, and display its first 10 rows. How many rows does it have in total? Explain why this makes sense. Count the number of NA values in dat.lj: this should be 50. Show how you could have arrived at this number from dat.m$Country and dat.w$Country (hint: setdiff()).
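The left join itself is a one-liner; as a sketch, with the setdiff() count alongside:

dat.lj = left_join(dat.m, dat.w, by = "Country")
length(setdiff(dat.m$Country, dat.w$Country))  # men-only countries, each contributing NA values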
3d. Finally, perform a full join, using full_join(), of dat.m and dat.w, with the join again done by the Country column. Call the resulting data frame dat.fj. How many rows does it have in total? Show how you could have arrived at this number from dat.m$Country and dat.w$Country (hint: union()). Count the number of NA values in dat.fj: this should be 80. Challenge: show how you could have arrived at this number from dat.m$Country and dat.w$Country.
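And the full join, sketched the same way (union() counts all countries appearing in either frame):

dat.fj = full_join(dat.m, dat.w, by = "Country")
nrow(dat.fj)
length(union(dat.m$Country, dat.w$Country))  # should agree with the row count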
Below is some solution code from Lab 8, where we convert the Birthdate and Date columns in the sprint.m.df and sprint.w.df data frames to numeric form. In what follows, you will resolve some of the questions from Lab 8, but using pipes and dplyr/tidyr.
date.to.numeric = function(val) {
  # Convert a "day.month.year" date string into a number of the form YYYYMMDD
  val = as.character(val)
  vec = strsplit(val, split = "\\.")[[1]]
  if (nchar(vec[3]) == 2) vec[3] = paste0("19", vec[3])  # expand 2-digit years to 19YY
  vec = as.numeric(vec)
  vec[3]*10^4 + vec[2]*10^2 + vec[1]
}
sprint.m.df$Birthdate = sapply(sprint.m.df$Birthdate, date.to.numeric)
sprint.m.df$Date = sapply(sprint.m.df$Date, date.to.numeric)
sprint.w.df$Birthdate = sapply(sprint.w.df$Birthdate, date.to.numeric)
sprint.w.df$Date = sapply(sprint.w.df$Date, date.to.numeric)
head(sprint.m.df, 5)
## Rank Time Wind Name Country Birthdate City Date
## 1 1 9.58 0.9 Usain Bolt JAM 19860821 Berlin 20090816
## 2 2 9.63 1.5 Usain Bolt JAM 19860821 London 20120805
## 3 3 9.69 0.0 Usain Bolt JAM 19860821 Beijing 20080816
## 4 3 9.69 2.0 Tyson Gay USA 19820809 Shanghai 20090920
## 5 3 9.69 -0.1 Yohan Blake JAM 19891226 Lausanne 20120823
head(sprint.w.df, 5)
## Rank Time Wind Name Country Birthdate City
## 1 1 10.49 0,0 Florence Griffith-Joyner USA 19591221 Indianapolis
## 2 2 10.61 +1,2 Florence Griffith-Joyner USA 19591221 Indianapolis
## 3 3 10.62 +1,0 Florence Griffith-Joyner USA 19591221 Seoul
## 4 4 10.64 +1,2 Carmelita Jeter USA 19791124 Shanghai
## 5 5 10.65 +1,1 Marion Jones USA 19751012 Johannesburg
## Date
## 1 19880716
## 2 19880717
## 3 19880924
## 4 20090920
## 5 19980912
4a. Here you'll effectively resolve Q2c and Q2d from Lab 8, using one single flow of pipe commands, for each of the sprint.m.df and sprint.w.df data frames. In particular, define a new column CityDate given by concatenating the City and Date columns separated by a "." (hint: unite()), then keep only the row with the fastest time for each value of CityDate (breaking ties arbitrarily), then sort the rows by increasing Time. Call the resulting data frames dat.m.cd and dat.w.cd. Make sure that in the last line of pipe commands used to define them, you call ungroup(). Check that these data frames have dimensions 1253 x 7 and 921 x 7, respectively, and display the first 5 rows of each.
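One possible shape for this pipeline, as a sketch (here ties are broken by simply keeping the first tied row with slice(); other tie-breaking choices are fine too):

dat.m.cd = sprint.m.df %>%
  unite(CityDate, City, Date, sep = ".") %>%  # new column combining City and Date
  group_by(CityDate) %>%
  filter(Time == min(Time)) %>%               # fastest time per CityDate
  slice(1) %>%                                # break ties arbitrarily
  arrange(Time) %>%
  ungroup()
dim(dat.m.cd)                                 # should be 1253 x 7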
4b. Now you'll effectively resolve Q3 on Lab 8, using one single flow of pipe commands, for each of the sprint.m.df and sprint.w.df data frames. In particular, do an inner join between dat.m.cd and dat.w.cd by CityDate, then drop the Rank.x, Rank.y, Birthdate.x, Birthdate.y columns. Call the resulting data frame dat.cd and check that its dimensions are 377 x 9. Display its first 10 rows, and check that it has no NA values.
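A sketch of the join-and-drop step, assuming dat.m.cd and dat.w.cd from Q4a:

dat.cd = inner_join(dat.m.cd, dat.w.cd, by = "CityDate") %>%
  select(-Rank.x, -Rank.y, -Birthdate.x, -Birthdate.y)
dim(dat.cd)  # should be 377 x 9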
4c. Reproduce the plot you made in Q3d on Lab 8, of Time.y (women's time) versus Time.x (men's time), from the dat.cd data frame. As a reminder, a positive correlation here would indicate some kind of "track meet effect". Call cor.test() on Time.x and Time.y and report the p-value. This should all look exactly the same as in Q3d from Lab 8; it's just a check of reproducibility.
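As a reminder of the base plotting and testing calls (a sketch, assuming dat.cd from Q4b; the axis labels and title are just illustrative choices):

plot(dat.cd$Time.x, dat.cd$Time.y,
     xlab = "Men's time (s)", ylab = "Women's time (s)",
     main = "Women's vs men's times at shared city-dates")
cor.test(dat.cd$Time.x, dat.cd$Time.y)  # report the p-value from this output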
Challenge. In one single flow of pipe commands, for each of sprint.m.df and sprint.w.df (i.e., without saving an intermediate object dat.cd), reproduce the results in Q4b and Q4c. You don't have to worry about reporting the dimensions of the joined data frame or displaying its first 10 rows; just complete the inner join, produce the plot, and report the p-value from cor.test(). Hint: to produce the plot before you report the p-value from cor.test(), you're going to have to use the "tee" operator %T>% so that the pipe flow doesn't terminate prematurely.
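If you haven't seen %T>% before, here is a toy illustration of how it behaves (it comes from the magrittr package; the call on its right is run for its side effect, and the left-hand value keeps flowing down the pipe):

library(magrittr)
rnorm(100) %T>%  # hist() is called for its side effect (the plot) ...
  hist() %>%     # ... but the original vector continues down the pipe
  summary()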
nesting (optional)

Sometimes you'd like to perform analysis conditional on a set of groups (think back to the times you've used tapply). There's a paradigm called "split-apply-combine" that defines the steps you'd need to take to perform this type of analysis. In the "tidyverse" this approach can be done using the nesting commands from tidyr.
More specifically, this problem will introduce you to nesting (nest and unnest) as well as the function purrr::map and some functions from the package broom. Lecture slide #21 provides a link to a lecture (Rmd, html) that covers most of the material in this problem.
For this problem we’ll be looking at a slightly different dataset that can be loaded in using the following:
sprint.best.full.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/sprint.best.full.dat",
header=TRUE, sep="\t", quote="", stringsAsFactors=TRUE)
This dataset contains information about the best sprinters (conditional on gender) for each year. It contains 3 new columns compared to the above data frames:
Gender (factor): indicates which gender the runner was
Year (integer): which year the time was recorded
Year.centered (integer): relative year

Suppose we were interested in examining the relationship between the best time relative to the year and wind speed, conditional on gender. In a linear model, we could model

Time ~ Wind*Gender + Year.centered*Gender + Gender

but today we will instead look at making 2 models (filtering the data by gender) and then looking at the relationship below:
Time ~ Wind + Year.centered
6a. Run the code below. Then look at nested.df$data[[1]] and describe it (please also identify what subgroup it belongs to).

nested.df = sprint.best.full.df %>%
  group_by(Gender) %>%
  nest()
The nest function "nested" the portion of sprint.best.full.df associated with each gender into the column data of nested.df. The nest function, along with the map function from purrr, allows us to perform similar operations in a "tidyverse" way, as you learned when you used things like tapply and lapply.

Suppose, at the end of the day, we wanted to compare linear model \(\beta\) coefficients between the two models (one built with male data, one with female data). The first thing we'd need to do would be to fit the linear models described above. For a single dataset we could do something like what is demonstrated below.
purrr::map(nested.df$data[1],function(df) lm(Time ~ Wind + Year.centered, data = df))
#or
lapply(nested.df$data[1],function(df) lm(Time ~ Wind + Year.centered, data = df))
6b. In "tidyverse" land, let's use purrr::map. We can create (and store) these linear models in our data frame using mutate; specifically, we can do the following (make sure to change the chunk option from eval = F to eval = T):
nested.df = nested.df %>%
mutate(model = map(data, function(df) lm(Time ~ Wind + Year.centered, data = df)))
# if for some reason the above doesn't work, try:
my.lm.func = function(df) lm(Time ~ Wind + Year.centered, df)
nested.df = nested.df %>%
mutate(model = map(data, my.lm.func))
Check what columns nested.df contains. What is the new column's name? What class of object is stored in each element?
6c. Now, we want to grab out the coefficients (and for now, suppose also the full summary). Update nested.df such that we have a summary of each model in a new column called sum. Remember you should use map, and that you're applying summary to the models, not the data.
6d. (No work, just reading) What you should be noticing is that this approach allows you to iteratively write your code (which has its benefits). Sadly we need a final step (which we provide for you). We will discuss why in the last part of this question (summary). (Make sure to change eval = F to eval = T.)
nested.df <- nested.df %>%
mutate(sum2 = map(sum, broom::tidy))
6e. The last step is to unnest. We provide the code for you below. Why do you think we use select(Gender, sum2)? Express in words how the unnested data frame changes if we don't include that line of code.

unnested.df <- nested.df %>%
  select(Gender, sum2) %>%
  unnest(sum2)
6f. Finally, create a table that has 2 rows (one for each gender) and contains the beta coefficients of each of the terms in the model (define this "table" as beta.model.df and print it out). Hint: you'll probably use a pivot_* and a select call. Looking at this table and back at unnested.df, does it appear that the effect of year (conditional on Wind speed) is stronger for male runners or female runners? (Note: these models aren't super amazing, so you shouldn't really see this as a takeaway.)
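A sketch of the kind of reshaping this hints at, assuming unnested.df from Q6e (with the term and estimate columns produced by broom::tidy):

beta.model.df = unnested.df %>%
  select(Gender, term, estimate) %>%                      # keep just the coefficient estimates
  pivot_wider(names_from = term, values_from = estimate)  # one row per gender
beta.model.df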
Summary. You've now gotten a taste of the "split-apply-combine" paradigm using nest/unnest, purrr, and a few functions from the broom library. This approach should appear similar to apply style coding, but a bit more iterative. You may have noticed that we had an extra step in Q6d that you might not have expected; as the tidyverse's emphasis is on data frames, we end up needing to work with data frames to make sure that the unnesting works as expected.
Finally, we call this approach "split-apply-combine" based on the sequence of steps one takes; in this example we could separate these sequences into:

the group_by call
mutate(purrr::map) style steps
pivot_* to alter the final output