Name:
Andrew ID:
Collaborated with:

On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as a knitted HTML file on Canvas, by Sunday 10pm, this week.

States data set

Below we construct a data frame of 50 states x 10 variables. The first 8 variables are numeric and the last 2 are factors. The numeric variables come from the built-in state.x77 matrix, which records various demographic factors on the 50 US states, measured in the 1970s. You can learn more about this state data set by typing ?state.x77 into your R console.

state.df <- data.frame(state.x77, Region=state.region, Division=state.division)
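A quick way to confirm the description above (50 rows, 10 columns, 8 numeric and 2 factor) is to inspect the dimensions and column classes. This is only a sanity-check sketch; it relies solely on data sets that ship with base R.

```r
# Rebuild the data frame from the built-in objects
state.df <- data.frame(state.x77, Region = state.region, Division = state.division)

# Dimensions: rows (states) by columns (variables)
dim(state.df)

# Class of each column, as a named character vector
col.classes <- sapply(state.df, class)
col.classes
```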

Basic data frame manipulations

1a. Add a column to state.df, containing the state abbreviations that are stored in the built-in vector state.abb. Name this column Abbr. You can do this in (at least) two ways: by using a call to data.frame(), or by directly defining state.df$Abbr. Display the first 3 rows and all 11 columns of the new state.df.
1b. Remove the Region column from state.df. You can do this in (at least) two ways: by using negative indexing, or by directly setting state.df$Region to be NULL. Display the first 3 rows and all 10 columns of state.df.
1c. Add two columns to state.df, containing the x and y coordinates (longitude and latitude, respectively) of the center of the states, that are stored in the (existing) list state.center. Hint: take a look at this list in the console, to see what its elements are named. Name these two columns Center.x and Center.y. Display the first 3 rows and all 12 columns of state.df.
1d. Make a new data.frame which contains only those states whose longitude is less than -100. Do this in two different ways and check that they are equal to each other, using an appropriate function call.
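As a warm-up for the questions above, the sketch below shows the same kinds of manipulation on a small toy data frame. The names toy.df, z, and the values are hypothetical and chosen only for illustration; they are not part of the homework data.

```r
# Toy data frame (hypothetical values, for illustration only)
toy.df <- data.frame(x = 1:4, y = c(2.5, 3.5, 1.0, 4.0))

# Add a column in two ways: via data.frame(), or by assigning with $
toy.a <- data.frame(toy.df, z = c(10, 20, 30, 40))
toy.b <- toy.df
toy.b$z <- c(10, 20, 30, 40)

# Remove a column in two ways: negative indexing, or setting it to NULL
toy.c <- toy.a[, -which(names(toy.a) == "y")]
toy.d <- toy.a
toy.d$y <- NULL

# Keep only some rows in two ways: logical indexing, or subset()
sub.1 <- toy.a[toy.a$x < 3, ]
sub.2 <- subset(toy.a, x < 3)

# all.equal() compares two objects and reports any differences
all.equal(sub.1, sub.2)
```

all.equal() tolerates tiny numeric differences, while identical() demands an exact match; either may be appropriate depending on how strict the comparison should be.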

Prostate cancer data set

Let’s return to the prostate cancer data set that we looked at in the lab/homework from Week 2 (taken from the book The Elements of Statistical Learning). Below we read in a data frame of 97 men x 9 variables. You can remind yourself about what’s been measured by looking back at the lab/homework (or by visiting the URL linked above in your web browser, clicking on “Data” on the left-hand menu, and clicking “Info” under “Prostate”).

pros.dat <- 
  read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/pros.dat")

Practice with the apply family

2a. Using sapply(), calculate the mean of each variable. Also, calculate the standard deviation of each variable. Each should require just one line of code. Display your results.
2b. Let’s plot each variable against SVI. Using lapply(), plot each column, excluding SVI, on the y-axis with SVI on the x-axis. This should require just one line of code. Challenge: label the y-axes in your plots appropriately. Your solution should still consist of just one line of code and use an appropriate apply function. Hint: for this part, consider using mapply().
2c. Now, use lapply() to perform t-tests for each variable in the data set, between SVI and non-SVI groups. To be precise, you will perform a t-test for each variable excluding the SVI variable itself. For convenience, we’ve defined a function t.test.by.ind() below, which takes a numeric variable x, and then an indicator variable ind (of 0s and 1s) that defines the groups. Run this function on the columns of pros.dat, and save the result as tests. What kind of data structure is tests? Print it to the console.

t.test.by.ind <- function(x, ind) {
  stopifnot(all(ind %in% c(0, 1)))
  return(t.test(x[ind == 0], x[ind == 1]))
}

Challenge. Using an appropriate apply function again, extract the p-values from the tests object you created in the last question, with just a single line of code. Hint: run the command "[["(pros.dat, "lcavol") in your console—what does this do?
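To see how the apply functions used in this section differ, here is a minimal sketch on a hypothetical toy data frame. The names toy, fits, and the helper list elements total and n are made up for illustration and do not appear in the homework data.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8), g = c(0, 0, 1, 1))

# sapply() simplifies its result, here to a named numeric vector
col.means <- sapply(toy, mean)

# lapply() always returns a list, with one element per column
col.ranges <- lapply(toy, range)

# mapply() steps through several arguments in parallel:
# here, each column together with its name
labels <- mapply(function(col, nm) paste(nm, "has max", max(col)),
                 toy, names(toy))

# "[[" is itself a function, so it can be handed to sapply() to pull
# the same named element out of every item in a list
fits <- lapply(toy[c("a", "b")],
               function(col) list(total = sum(col), n = length(col)))
totals <- sapply(fits, "[[", "total")
```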

Rio Olympics data set

It’s Winter Olympics time! To get into the Olympics spirit, we’re going to examine data from the 2016 Summer Olympics in Rio de Janeiro, taken from https://github.com/flother/rio2016 (itself put together by scraping the official Summer Olympics website for information about the athletes). Below we read in the data and store it as rio.

rio <- read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/rio.csv")

More practice with data frames and apply

3a. Call summary() on rio and display the result. Is there any missing data?
3b. Use rio to answer the following questions. How many athletes competed in the 2016 Summer Olympics? How many countries were represented? What were these countries, and how many athletes competed for each one? Which country brought the most athletes, and how many was this?
3c. How many medals of each type—gold, silver, bronze—were awarded at this Olympics? Is this result surprising, and can you explain what you are seeing?
3d. Create a column called total which adds the number of gold, silver, and bronze medals for each athlete, and add this column to rio. Which athlete had the most medals, and how many was this? Which athlete had the most silver medals, and how many was this? (Ouch! So close, so many times …) In the case of ties, display all the relevant athletes.
3e. Using tapply(), calculate the total medal count for each country. Save the result as total.by.nat, and print it to the console. Which country had the most medals, and how many was this? How many countries had zero medals? Challenge: among the countries that had zero medals, which had the most athletes, and how many athletes was this? (Ouch!)
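For readers new to tapply(), the sketch below shows grouped sums on a hypothetical toy data frame. The names scores, team, and points are invented for illustration and are unrelated to the columns of rio.

```r
# Toy data (hypothetical values, for illustration only)
scores <- data.frame(team = c("red", "blue", "red", "green", "blue"),
                     points = c(3, 0, 2, 0, 4))

# tapply(values, groups, function) applies the function within each group
by.team <- tapply(scores$points, scores$team, sum)
by.team

# Group with the largest sum (which.max() returns the first maximum only)
top.team <- names(by.team)[which.max(by.team)]

# Groups whose sum is zero
zero.teams <- names(by.team)[by.team == 0]

# table() counts the rows in each group
team.sizes <- table(scores$team)
```

Because which.max() reports only the first maximum, a comparison such as by.team == max(by.team) is the safer choice when ties must all be shown.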

Some advanced practice with apply

4a. The variable date_of_birth contains strings of the date of birth of each athlete. Use text processing commands to extract the year of birth, and create a new numeric variable called age, equal to 2016 - (the year of birth). (Here we’re ignoring days and months for simplicity.) Add the age variable to the rio data frame. Who is the oldest athlete, and how old is he/she? Who is the youngest athlete, and how old is he/she? In the case of ties, display all the relevant athletes. Challenge: answer the same questions, but now only among athletes who won a medal.
4b. Using an appropriate apply function, answer: how old is the oldest athlete, for each sport? How old is the youngest, for each sport? Challenge: determine the names of the oldest and youngest athletes in each sport. No explicit iteration allowed. In the case of ties, just return one relevant athlete name.
4c. Create a new data.frame called sports, which we’ll populate with information about each sporting event at the Summer Olympics. Initially, define sports to contain a single variable called sport which contains the names of the sporting events in alphabetical order. Then, add a column called n_participants which contains the number of participants in each sport. Use one of the apply functions to determine the number of gold medals given out for each sport, and add this as a column called n_gold. Using your newly created sports data frame, calculate the ratio of the number of gold medals to participants for each sport. Which sport has the highest ratio? Which has the lowest?
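The text processing step in 4a can be sketched on a few made-up date strings. The vector dob below is hypothetical, and the "YYYY-MM-DD" layout is an assumption; check how date_of_birth is actually formatted in rio before adapting either approach.

```r
# Made-up dates, ASSUMING a "YYYY-MM-DD" layout (verify against the real data)
dob <- c("1990-05-17", "1985-12-01", "2000-01-30")

# Approach 1: take the first four characters
year.a <- as.numeric(substr(dob, 1, 4))

# Approach 2: split on "-" and keep the first piece of each string
year.b <- as.numeric(sapply(strsplit(dob, "-"), "[", 1))

# Age as defined in 4a, ignoring days and months
age <- 2016 - year.a

# Logical indexing keeps every entry tied for the maximum
oldest <- dob[age == max(age)]
```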

Homework 3: Data Frames and Apply

Statistical Computing, 36-350