Name:
Andrew ID:
Collaborated with:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Sunday 11:59pm, this week. Make sure to complete your weekly check-in (which can be done by coming to lecture, recitation, lab, or any office hour), as this will count for a small number of points towards your lab score.

This week’s agenda: getting familiar with data frames; practicing how to use the apply family of functions.

States data set

Below we construct a data frame, of 50 states x 10 variables. The first 8 variables are numeric and the last 2 are factors. The numeric variables here come from the built-in state.x77 matrix, which records various demographic factors on 50 US states, measured in the 1970s. You can learn more about this state data set by typing ?state.x77 into your R console.

state.df = data.frame(state.x77, Region=state.region, Division=state.division)
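As a quick sanity check on the description above, the following sketch confirms the dimensions and column classes of the data frame. It rebuilds state.df so that it runs on its own, and it uses only base R functions (dim(), sapply(), head()).

```r
# Rebuild the data frame so this block is self-contained
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

dim(state.df)                              # number of rows, number of columns
col.classes = sapply(state.df, class)      # class of each column, by name
col.classes
head(state.df, 3)                          # first few rows
```

Note that data.frame() adjusts column names that are not valid R names, so it is worth looking at names(state.df) before referring to columns by name.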

Basic data frame manipulations

Prostate cancer data set

Let’s return to the prostate cancer data set that we looked at in Week 2 (taken from the book The Elements of Statistical Learning). Below we read in a data frame of 97 men x 9 variables. You can remind yourself about what’s been measured by looking back at the lab/homework (or by visiting the URL linked above in your web browser, clicking on “Data” on the left-hand menu, and clicking “Info” under “Prostate”).

pros.dat = 
  read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/pros.dat")

Practice with the apply family

t.test.by.ind = function(x, ind) {
  stopifnot(all(ind %in% c(0, 1)))
  return(t.test(x[ind == 0], x[ind == 1]))
}
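To see how a helper like this combines with the apply family, here is a minimal sketch on made-up data. The data frame toy, its columns a and b, and the indicator ind are hypothetical and are not part of the lab's data sets; the function definition is repeated only so the block runs on its own.

```r
t.test.by.ind = function(x, ind) {
  stopifnot(all(ind %in% c(0, 1)))
  return(t.test(x[ind == 0], x[ind == 1]))
}

set.seed(1)
# Hypothetical toy data: 20 observations of 2 made-up measurements
toy = data.frame(a = rnorm(20), b = rnorm(20, mean = 1))
# Hypothetical 0/1 group indicator: first 10 rows are group 0, last 10 group 1
ind = rep(c(0, 1), each = 10)

# lapply() passes each column as x; extra arguments (ind) are forwarded
tests = lapply(toy, t.test.by.ind, ind = ind)

# Pull the p-value out of each returned "htest" object
p.vals = sapply(tests, function(res) res$p.value)
p.vals
```

Because lapply() returns a list, each element keeps the full test result, and sapply() can then reduce that list to a named numeric vector.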

Rio Olympics data set

Now we’re going to examine data from the 2016 Summer Olympics in Rio de Janeiro, taken from https://github.com/flother/rio2016 (itself put together by scraping the official Summer Olympics website for information about the athletes). Below we read in the data and store it as rio.

rio = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/rio.csv")

More practice with data frames and apply

Young and old folks

Add the age variable to the rio data frame. Who is the oldest athlete, and how old is he/she? Youngest athlete, and how old is he/she? In the case of ties, here, display all the relevant athletes.
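One way to keep every tied row when looking for an extreme value can be sketched on a small made-up data frame. Everything here is an assumption for illustration: the column names name and date_of_birth, the reference date, and the definition of age as days elapsed divided by 365.25 may all differ from what the lab intends for rio.

```r
# Hypothetical toy data; column names are assumptions, not taken from rio
toy = data.frame(
  name = c("A", "B", "C", "D"),
  date_of_birth = c("1990-05-01", "1985-03-12", "1985-03-12", "2000-08-20"),
  stringsAsFactors = FALSE
)

# Assumed reference date for computing age
ref.date = as.Date("2016-08-05")

# Subtracting Dates gives a difference in days; convert to approximate years
toy$age = as.numeric(ref.date - as.Date(toy$date_of_birth)) / 365.25

# Comparing against max()/min() keeps every tied row, unlike which.max()
oldest = toy[which(toy$age == max(toy$age)), c("name", "age")]
youngest = toy[which(toy$age == min(toy$age)), c("name", "age")]
oldest
youngest
```

The comparison toy$age == max(toy$age) is used instead of which.max() because which.max() returns only the first index attaining the maximum, which would silently drop ties.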

Sport by sport