Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class time if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted PDF file on Gradescope, by Friday 9pm, this week.
This week’s agenda: practicing how to use the map family of functions from purrr
, and how to perform basic computations on data frames using dplyr
.
Below we install tidyverse
which gives us the packages we need (purrr
and dplyr
) needed to complete this lab. We also install the repurrrsive
package which has the Game of Thrones data set that we’ll use for the first couple of questions. Since this may be the first time installing packages for some of you, we’ll show you how. If you already have these packages installed, then you can of course skip this part. Note: do not remove eval=FALSE
from the above code chunk, just run the lines below in your console. You can also select “Tools” –> “Install Packages” from the RStudio menu.
install.packages(tidyverse)
install.packages(repurrrsive)
Now we’ll load these packages. Note: the code chunk below will cause errors if you try to knit this file without installing the packages first.
Post edit. Loading the tidyverse
package in its entirety includes plyr
, and this can cause namespace issues with the dplyr
package. Better to just load only what you need.
library(purrr)
library(dplyr)
library(repurrrsive)
Below we inspect a data set on the 30 characters from Game of Thrones from the repurrrsive
package. It’s stored in a list called got_chars
, which is automatically loaded into your R session when you load the `repurrrsive`` package.
class(got_chars)
## [1] "list"
length(got_chars)
## [1] 30
names(got_chars[[1]])
## [1] "url" "id" "name" "gender" "culture" "born"
## [7] "died" "alive" "titles" "aliases" "father" "mother"
## [13] "spouse" "allegiances" "books" "povBooks" "tvSeries" "playedBy"
got_chars[[1]]$name
## [1] "Theon Greyjoy"
got_chars[[1]]$aliases
## [1] "Prince of Fools" "Theon Turncloak" "Reek" "Theon Kinslayer"
1a. Using the map functions from the purrr
package, extract the names of the characters in got_chars
so that you produce a character vector of length 30. Do this four different ways: (i) using map()
, defining a custom function on-the-fly, and casting the resulting list into an appropriate data structure; (ii) using one of the map_***()
functions, but still defining a custom function on-the-fly; (iii) using one of the map_***()
functions, and using one of `[`()
or `[[`()
functions, as well as an additional argument; (iv) using one of the map_***()
functions, and passing a string instead of a function (relying on its ability to define an appropriate extractor accordingly).
Store each of the results in a different vector and check that they are all identical.
# YOUR CODE GOES HERE
# YOUR CODE GOES HERE
1c. Run the code below in your console. What does it do?
1:5 %in% 3:6
Using the logic you can infer about the %in%
operator (you can also read its help file), craft a single line of code to compute a Boolean vector of length 6 that checks whether the first Game of Thrones character, stored in got_chars[[1]]
, has appeared in each of the 6 TV seasons. Print the result to the console.
# YOUR CODE GOES HERE
1d. Run the two lines of code below in their console. What do they do?
rbind(1:5, 6:10, 11:15)
do.call(rbind, list(1:5, 6:10, 11:15))
Using the logic you can infer about the do.call()
function (you can also read its help file), as well as the logic from the last question, complete the following task. Using map()
, a custom-defined function, as well as some post-processing of its results, produce a matrix that has dimension 30 x 6, with each column representing a TV season, and each row a character. The matrix should have a value of TRUE
in position (i,j) if character i was in season j, and FALSE
otherwise. Print the first 6 rows of the result to the console.
# YOUR CODE GOES HERE
map_df()
and no post-processing. The result will now be a data frame (not a matrix). Print the first 6 rows of the result to the console. Hint: map_dfr()
will throw an error if it can’t infer column names.# YOUR CODE GOES HERE
map_dfr()
, create a data frame of dimension 30 x 5, whose columns represent, for each Game of Thrones character, their name, birth date, death date, gender, and culture. Store it as got_df
and print the last 3 rows to the console.# YOUR CODE GOES HERE
got_df
, show that you can compute whether each character is alive or not, and compare this to what is stored in got_chars
, demonstrating that the two ways of checking whether each character is alive lead to equal results.# YOUR CODE GOES HERE
filter()
, print the subset of the rows of got_df
that correspond to Ironborn characters. Then print the subset that correspond to female Northmen.# YOUR CODE GOES HERE
got_df
. Print the results to the console. Hint: what happens if you pass table()
two arguments?# YOUR CODE GOES HERE
group_by()
and summarize()
on got_df
, compute how many characters in each culture have died. Which culture—aside from the unknown category represented by ""—has the most deaths?# YOUR CODE GOES HERE
This is data set from the Rio Olympics data set that we saw in Lab 3. In the next question, we’re going to repeat some calculations from Lab 3 but using dplyr
.
rio = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp/data/rio.csv")
group_by()
and summarize()
, compute how many athletes competed for each country in the rio
data frame? Print the results for the first 10 countries to the console. Building off your here answer, use an additional call to filter()
to compute which country had the most number of athletes and how many that was. Hint: consider using n()
from the dplyr
package for the first part here.# YOUR CODE GOES HERE
group_by()
, summarize()
, and filter()
, compute which country had the most numnber of total medals and many that was.# YOUR CODE GOES HERE
group_by()
, summarize()
, and filter()
, compute which country—among those with zero total medals—had the most number of athletes. Hint: you will need to modify your summarize()
command to compute the number of athletes; and you might need two calls to filter()
.# YOUR CODE GOES HERE
group_by()
, summarize()
, and filter()
, compute the average weight of athletes in each sport, separately for men and women, and report the two sport with the highest average weights (one for each of men and women). Hint: group_by()
can accept more than one grouping variable. Also, consider using na.rm=TRUE
as an additional argument to certain arithmetic summary functions so that they will not be thrown off by NA
or NaN
values.# YOUR CODE GOES HERE
Below, we read two data sets of the 1000 fastest times ever recorded for the 100m sprint, in men’s and women’s track. We scraped this data from http://www.alltime-athletics.com/m_100ok.htm and http://www.alltime-athletics.com/w_100ok.htm, in early September 2021. (Interestingly, the 2nd, 3rd, 4th, 7th, and 8th fastest women’s times were all set at the most recent Tokyo Olympics, or after! Meanwhile, the top 10 men’s times are all from about a decade ago.)
sprint.m.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.m.txt",
sep="\t", quote="", header=TRUE)
sprint.w.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.w.txt",
sep="\t", quote="", header=TRUE)
sprint.m.df
and sprint.w.df
are data frames. Delete the Rank
column from each data frame, then display the first and last 3 rows of each.# YOUR CODE GOES HERE
Time
column and add them back as a Rank
column to sprint.m.df
. Do the same for the women’s data set. After adding back the rank columns, print out the first 10 rows of each data frame, but only the Time
, Name
, Date
, and Rank
columns. Hint: consider using rank()
.# YOUR CODE GOES HERE
sprint.m.counts
. Do the same for the women’s data set, and call the result sprint.w.counts
. What are the 5 most represented countries, for the men, and for the women? (Interesting side note: go look up the population of Jamaica, compared to that of the US. Pretty impressive, eh?)# YOUR CODE GOES HERE
dplyr
functions, and print out again the 5 most represented countries for men and women. (No need to save new variables.) Hint: consider using arrange()
from the dplyr
library.# YOUR CODE GOES HERE
%in%
operator.# YOUR CODE GOES HERE
dplyr
functions, compute, for each country, the fastest time among athletes who come from that country. Do this for each of the men’s and women’s data sets, and display the first 10 rows of the result.# YOUR CODE GOES HERE
arrange()
.# YOUR CODE GOES HERE
tapply()
gives probably the easiest route here. Note: your code here shouldn’t be too much more complicated than your code in the last part.# YOUR CODE GOES HERE
dplyr
functions, compute, for each country, the quadruple: name, city, country, and time, corresponding to the athlete with the fastest time among athletes from that country. Do this for each of the men’s and women’s data sets, and display the first 10 rows of the result, ordered by increasing time. If there are ties, then show all the results that correspond to the fastest time. Hint: consider using select()
from the dplyr
library.# YOUR CODE GOES HERE
split()
, followed by lapply()
with a custom function call, and then rbind()
to get things in a data frame form. Note: your code here will probably be more complicated, or at least less intuitive, than your code in the last part.# YOUR CODE GOES HERE