Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Sunday 11:59pm this week.
This week’s agenda: practicing grouping, spreading and gathering, and joins.
# Load the tidyverse!
library(tidyverse)
Below we read in a data frame sprint.m.df containing the top men's times in the 100m sprint, as seen in previous labs. In the following, unless stated otherwise, use pipes and dplyr verbs to solve each part as cleanly/succinctly as you can.
sprint.m.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F18/data/sprint.m.dat",
sep="\t", header=TRUE, quote="", stringsAsFactors=TRUE)
1a. Compute, for each country, the fastest time among athletes who come from that country, and display the first 10 results, ordered alphabetically by country. Also compute, for each city, the fastest time among athletes who ran in that city, and display the first 10 results, ordered alphabetically by city. Hint: group_by(), summarise().
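For reference, here is a minimal sketch of one way the per-country computation could go (the per-city version is analogous, grouping by City instead):

# Fastest time for each country, first 10 countries alphabetically
sprint.m.df %>%
  group_by(Country) %>%
  summarise(Time = min(Time)) %>%
  head(10)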
1b. With the most minor modification to your code possible, do the same computations as in the last part, but now display the first 10 results ordered by increasing time. Hint: arrange().
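A sketch of that modification, adding a single arrange() step to the flow above:

sprint.m.df %>%
  group_by(Country) %>%
  summarise(Time = min(Time)) %>%
  arrange(Time) %>%    # the only new step
  head(10)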
1c. Rewrite your solution in the last part using base R. Hint: tapply() probably gives the easiest route here. Note: your code here shouldn't be too much more complicated than your code in the last part.
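One possible base R version, using tapply() as the hint suggests:

# Fastest time per country as a named vector, then sort by increasing time
time.by.country = tapply(sprint.m.df$Time, sprint.m.df$Country, min)
head(sort(time.by.country), 10)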
1d. Compute, for each country, the quadruple: (Name, City, Country, Time) corresponding to the athlete with the fastest time among athletes from that country. Display the first 10 results, ordered by increasing time. If there are ties, then show all the results that correspond to the fastest time. Repeat the same computation, but for the fastest time per city. Hint: group_by(), filter(), select().
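A sketch of one possible flow for the per-country version (the per-city version is analogous):

sprint.m.df %>%
  group_by(Country) %>%
  filter(Time == min(Time)) %>%     # keeps all ties, if any
  select(Name, City, Country, Time) %>%
  arrange(Time) %>%
  head(10)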
1e. Rewrite the rest of your solution in the last part using base R. You should end up with two data frames (per country, and per city) with the exact same structure as in the last part, and display the top 10 rows of each, ordered by increasing time. Hint: there are various routes to go; one strategy is to use split(), followed by lapply() with a custom function call, and then rbind() to get things in data frame form. Note: your code here will probably be more complicated, or at least less intuitive, than your code in the last part.
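One possible base R route for the per-country data frame, following the split()/lapply()/rbind() strategy from the hint:

# Split by country, keep the fastest row(s) in each piece, then stack back together
sprint.m.list = split(sprint.m.df, sprint.m.df$Country)
sprint.m.fast = lapply(sprint.m.list, function(df) {
  df[df$Time == min(df$Time), c("Name", "City", "Country", "Time")]
})
sprint.m.country = do.call(rbind, sprint.m.fast)
head(sprint.m.country[order(sprint.m.country$Time), ], 10)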
1f. With the most minor modification to your code possible, do the same computations as in Q1d, but now when there are ties, pick only one of the relevant results arbitrarily (e.g., uniformly at random is fine).
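A sketch of one minimal modification to the Q1d flow: a sample_n(1) step after the filter keeps exactly one (randomly chosen) row per group:

sprint.m.df %>%
  group_by(Country) %>%
  filter(Time == min(Time)) %>%
  sample_n(1) %>%                   # new step: one row per country, chosen at random
  select(Name, City, Country, Time) %>%
  arrange(Time) %>%
  head(10)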
In the following, use pipes and dplyr or tidyr verbs to solve each part as cleanly/succinctly as you can. In some parts, it might make more sense to use direct indexing, and that's perfectly fine.
2a. From sprint.m.df, define a reduced data frame dat.reduced as follows. For each athlete, and each city, compute the median of all times they recorded in that city. Your new data frame dat.reduced should have 1787 rows and 3 columns (Name, City, Time). Confirm that it has these dimensions, and display its first 10 entries.
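A sketch of one way to build dat.reduced:

dat.reduced = sprint.m.df %>%
  group_by(Name, City) %>%
  summarise(Time = median(Time)) %>%
  ungroup()
dim(dat.reduced)           # should be 1787 x 3
head(dat.reduced, 10)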
2b. The data frame dat.reduced is said to be in "long" format: it has observations on the rows, and variables (Name, City, Time) on the columns. Use spread() to convert this into "wide" format, and call the result dat.wide. Here the first column should be the athlete names, and the remaining columns should correspond to the cities. Apart from the first column, each entry gives the median time recorded by the athlete in that city. What are the dimensions of dat.wide, and do these make sense to you?
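A sketch of the spread() call:

dat.wide = dat.reduced %>%
  spread(key = City, value = Time)   # one row per athlete, one column per city
dim(dat.wide)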
2c. Not counting the names in the first column, how many non-NA values does dat.wide have? Does this make sense to you? It should. Explain how you could have guessed this number ahead of time, based only on dat.reduced, without even calling spread().
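One way to count the non-NA values, dropping the Name column first:

sum(!is.na(dat.wide[, -1]))    # compare this with nrow(dat.reduced)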
2d. From dat.wide, look at the row for "Usain Bolt", and determine the city names that do not have NA values. These should be the cities in which he raced. Determine these cities directly from dat.reduced, and confirm that they match.
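A sketch of both computations:

# Cities with non-NA entries in Bolt's row of the wide format
bolt.wide = dat.wide %>% filter(Name == "Usain Bolt")
colnames(dat.wide)[-1][!is.na(bolt.wide[, -1])]
# The same cities, read directly off the long format
dat.reduced %>% filter(Name == "Usain Bolt") %>% pull(City)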
2e. Use gather() to convert dat.wide back into "long" format, and call the result dat.long. Remove rows that have NA values (hint: you can do this by setting na.rm=TRUE in the call to gather()), and order the rows alphabetically by athlete name. Once you've done this, dat.long should have matching entries to dat.reduced; confirm that this is the case.
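A sketch of the gather() call, plus one possible (row-order-dependent) check:

dat.long = dat.wide %>%
  gather(key = "City", value = "Time", -Name, na.rm = TRUE) %>%
  arrange(Name)
all(dat.long$Time == dat.reduced$Time)   # assumes the rows line up in the same order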
Below we read in a data frame sprint.w.df containing the top women's times in the 100m sprint, as seen in previous labs. In the following, use pipes and dplyr verbs to solve each part as cleanly/succinctly as you can. Note: when you make joins, you'll receive warnings about the conversion of factors to characters; that's fine, don't worry about it.
sprint.w.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp-F18/data/sprint.w.dat",
sep="\t", header=TRUE, quote="", stringsAsFactors=TRUE)
3a. As in Q1f, compute for each country the triplet (Name, Country, Time) corresponding to the male athlete with the fastest time among athletes from that country, breaking ties arbitrarily. Instead of displaying the results, save the resulting data frame as dat.m. Importantly, at the end of your flow of pipe commands used to define dat.m, make sure to call ungroup(). This will ensure that dat.m has no groupings associated with it. Do the same for the women, and call the result dat.w. Report the dimensions of dat.m and dat.w, and check that they make sense to you.
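A sketch for the men; the women's version is the same flow applied to sprint.w.df:

dat.m = sprint.m.df %>%
  group_by(Country) %>%
  filter(Time == min(Time)) %>%
  sample_n(1) %>%
  select(Name, Country, Time) %>%
  ungroup()                    # last step, so dat.m carries no grouping
dim(dat.m)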
3b. Perform an inner join, using inner_join(), of dat.m and dat.w, with the join done by the Country column. Call the resulting data frame dat.ij, and display its first 10 rows. How many rows does it have in total? Show how you could have arrived at this number ahead of time, from dat.m$Country and dat.w$Country (hint: intersect()). Count the number of NA values in dat.ij: this should be zero.
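A sketch of the join and the accompanying checks:

dat.ij = inner_join(dat.m, dat.w, by = "Country")
head(dat.ij, 10)
nrow(dat.ij)
length(intersect(dat.m$Country, dat.w$Country))   # should equal nrow(dat.ij)
sum(is.na(dat.ij))                                # should be 0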
3c. Perform a left join, using left_join(), of dat.m and dat.w, with the join again done by the Country column. Call the resulting data frame dat.lj, and display its first 10 rows. How many rows does it have in total? Explain why this makes sense. Count the number of NA values in dat.lj: this should be 50. Show how you could have arrived at this number from dat.m$Country and dat.w$Country (hint: setdiff()).
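A sketch of the left join and the NA accounting (this assumes each unmatched country contributes NAs in exactly the two women's columns, Name.y and Time.y):

dat.lj = left_join(dat.m, dat.w, by = "Country")
head(dat.lj, 10)
nrow(dat.lj)                                       # same as nrow(dat.m)
sum(is.na(dat.lj))                                 # should be 50
2 * length(setdiff(dat.m$Country, dat.w$Country))  # two NAs per unmatched country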
3d. Finally, perform a full join, using full_join(), of dat.m and dat.w, with the join again done by the Country column. Call the resulting data frame dat.fj. How many rows does it have in total? Show how you could have arrived at this number from dat.m$Country and dat.w$Country (hint: union()). Count the number of NA values in dat.fj: this should be 80. Challenge: show how you could have arrived at this number from dat.m$Country and dat.w$Country.
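A sketch of the full join, with one way to account for the NA count (assuming two NAs per country that appears on only one side):

dat.fj = full_join(dat.m, dat.w, by = "Country")
nrow(dat.fj)
length(union(dat.m$Country, dat.w$Country))       # should equal nrow(dat.fj)
sum(is.na(dat.fj))                                # should be 80
2 * length(setdiff(dat.m$Country, dat.w$Country)) +
  2 * length(setdiff(dat.w$Country, dat.m$Country))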
Below is some solution code from Lab 8, where we convert the Birthdate and Date columns in the sprint.m.df and sprint.w.df data frames to numeric form. In what follows, you will re-solve some of the questions from Lab 8, but using pipes and dplyr/tidyr.
date.to.numeric = function(val) {
  val = as.character(val)
  vec = strsplit(val, split = "\\.")[[1]]                 # split "day.month.year" into its pieces
  if (nchar(vec[3]) == 2) vec[3] = paste0("19", vec[3])   # expand two-digit years to "19xx"
  vec = as.numeric(vec)
  vec[3]*10^4 + vec[2]*10^2 + vec[1]                      # return the date as yyyymmdd
}
sprint.m.df$Birthdate = sapply(sprint.m.df$Birthdate, date.to.numeric)
sprint.m.df$Date = sapply(sprint.m.df$Date, date.to.numeric)
sprint.w.df$Birthdate = sapply(sprint.w.df$Birthdate, date.to.numeric)
sprint.w.df$Date = sapply(sprint.w.df$Date, date.to.numeric)
head(sprint.m.df, 5)
## Rank Time Wind Name Country Birthdate City Date
## 1 1 9.58 0.9 Usain Bolt JAM 19860821 Berlin 20090816
## 2 2 9.63 1.5 Usain Bolt JAM 19860821 London 20120805
## 3 3 9.69 0.0 Usain Bolt JAM 19860821 Beijing 20080816
## 4 3 9.69 2.0 Tyson Gay USA 19820809 Shanghai 20090920
## 5 3 9.69 -0.1 Yohan Blake JAM 19891226 Lausanne 20120823
head(sprint.w.df, 5)
## Rank Time Wind Name Country Birthdate City
## 1 1 10.49 0,0 Florence Griffith-Joyner USA 19591221 Indianapolis
## 2 2 10.61 +1,2 Florence Griffith-Joyner USA 19591221 Indianapolis
## 3 3 10.62 +1,0 Florence Griffith-Joyner USA 19591221 Seoul
## 4 4 10.64 +1,2 Carmelita Jeter USA 19791124 Shanghai
## 5 5 10.65 +1,1 Marion Jones USA 19751012 Johannesburg
## Date
## 1 19880716
## 2 19880717
## 3 19880924
## 4 20090920
## 5 19980912
4a. Here you'll effectively re-solve Q2c and Q2d from Lab 8, using one single flow of pipe commands, for each of the sprint.m.df and sprint.w.df data frames. In particular, define a new column CityDate given by concatenating the City and Date columns separated by a "." (hint: unite()), then keep only the row with the fastest time for each value of CityDate (breaking ties arbitrarily), then sort the rows by increasing Time. Call the resulting data frames dat.m.cd and dat.w.cd. Make sure that in the last line of pipe commands used to define them, you call ungroup(). Check that these data frames have dimensions 1253 x 7 and 921 x 7, respectively, and display the first 5 rows of each.
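A sketch for the men's data frame (the women's version is analogous):

dat.m.cd = sprint.m.df %>%
  unite(CityDate, City, Date, sep = ".") %>%
  group_by(CityDate) %>%
  filter(Time == min(Time)) %>%
  sample_n(1) %>%                  # break ties arbitrarily
  arrange(Time) %>%
  ungroup()
dim(dat.m.cd)                      # should be 1253 x 7
head(dat.m.cd, 5)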
4b. Now you'll effectively re-solve Q3 from Lab 8, using one single flow of pipe commands. In particular, do an inner join between dat.m.cd and dat.w.cd by CityDate, then drop the Rank.x, Rank.y, Birthdate.x, Birthdate.y columns. Call the resulting data frame dat.cd and check that its dimensions are 377 x 9. Display its first 10 rows, and check that it has no NA values.
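A sketch of the join and the column drops:

dat.cd = inner_join(dat.m.cd, dat.w.cd, by = "CityDate") %>%
  select(-Rank.x, -Rank.y, -Birthdate.x, -Birthdate.y)
dim(dat.cd)            # should be 377 x 9
head(dat.cd, 10)
sum(is.na(dat.cd))     # should be 0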
4c. Reproduce the plot you made in Q3d on Lab 8, of Time.y (women's time) versus Time.x (men's time), from the dat.cd data frame. As a reminder, a positive correlation here would indicate some kind of "track meet effect". Call cor.test() on Time.x and Time.y and report the p-value. This should all look exactly the same as in Q3d from Lab 8; it's just a check of reproducibility.
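A sketch, assuming the Lab 8 plot was a base R scatterplot (a ggplot version works just as well):

plot(dat.cd$Time.x, dat.cd$Time.y,
     xlab = "Men's time (seconds)", ylab = "Women's time (seconds)")
cor.test(dat.cd$Time.x, dat.cd$Time.y)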
Challenge. In one single flow of pipe commands, for each of sprint.m.df and sprint.w.df (i.e., without saving an intermediate object dat.cd), reproduce the results in Q4b and Q4c. You don't have to worry about reporting the dimensions of the joined data frame or displaying its first 10 rows; just complete the inner join, produce the plot, and report the p-value from cor.test(). Hint: to produce the plot before you report the p-value from cor.test(), you're going to have to use the "tee" operator %T>% so that the pipe flow doesn't terminate prematurely.
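A sketch of one possible flow. Note that %T>% comes from magrittr and is not attached by library(tidyverse), so it is loaded explicitly here; the select() output is teed off to plot() and then handed on to cor.test():

library(magrittr)    # for the tee operator %T>%
sprint.m.df %>%
  unite(CityDate, City, Date, sep = ".") %>%
  group_by(CityDate) %>%
  filter(Time == min(Time)) %>%
  sample_n(1) %>%
  ungroup() %>%
  inner_join(sprint.w.df %>%
               unite(CityDate, City, Date, sep = ".") %>%
               group_by(CityDate) %>%
               filter(Time == min(Time)) %>%
               sample_n(1) %>%
               ungroup(),
             by = "CityDate") %>%
  select(-Rank.x, -Rank.y, -Birthdate.x, -Birthdate.y) %T>%
  plot(Time.y ~ Time.x, data = .,
       xlab = "Men's time (seconds)", ylab = "Women's time (seconds)") %>%
  cor.test(~ Time.x + Time.y, data = .)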