Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 3 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 3 questions from other labs. Your homework writeup must start the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday, September 25. This document contains 12 of the 45 total points for Homework 3.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.

Histograms

clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")

Hw3 Q5 (4 points). Below we read in a data table, on the 2829 fastest 100m sprint times, saved as sprint.tab. We also extract the years in which the sprint times were recorded from the data table, saved as sprint.years. Now, similarly, extract the birth years of the sprinters from the data table. To do so, define sprint.bdates to be the 6th column of sprint.tab, converted into a character vector. Then define a character vector sprint.byears to contain the last 2 characters of each entry of sprint.bdates. Convert sprint.byears into a numeric vector, add 1900 to each entry, and redefine sprint.byears to be the result.

Next, compute a vector sprint.ages containing the age (in years) of each sprinter when their sprint time was recorded. (Hint: use sprint.byears and sprint.years.) Then plot a histogram of the ages of the 2829 sprinters in the data table, with break locations occurring at every age between 17 and 40. Color the histogram to your liking; label the x-axis, and title the histogram appropriately. What is the mode, i.e., the most common age? Also, describe what you see around the mode: do we see more sprinters who are younger, or older?

sprint.tab = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
                        sep="\t", quote="", header=TRUE)
sprint.dates = as.character(sprint.tab[,8])
sprint.years = substr(sprint.dates, nchar(sprint.dates)-3, nchar(sprint.dates))
sprint.years = as.numeric(sprint.years)
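
As a small illustration of the string-to-year steps described in Q5, the sketch below works on a made-up vector of birth dates rather than on sprint.tab. The names toy.bdates, toy.years, toy.byears, and toy.ages are hypothetical, and the assumption that each date string ends in a two-digit year is ours; check the actual entries of the 6th column before relying on it.

```r
# Toy birth dates (hypothetical; assumed to end in a two-digit year)
toy.bdates = c("21.08.86", "09.02.82", "23.11.82")
# Toy recording years (hypothetical), one per birth date
toy.years = c(2009, 2012, 2007)

# Take the last 2 characters of each string
toy.byears = substr(toy.bdates, nchar(toy.bdates)-1, nchar(toy.bdates))
# Convert to numeric and shift into the 1900s
toy.byears = as.numeric(toy.byears) + 1900

# Age in years at the time of recording
toy.ages = toy.years - toy.byears

# Histogram with a break at every integer age from 17 to 40
hist(toy.ages, breaks=17:40, col="lightblue",
     xlab="Age (years)", main="Toy histogram of ages")
```

Note that hist() signals an error if any value falls outside the range of the supplied breaks, so the break vector has to cover every age in the data.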

Hw3 Q6 (8 points). A data table on the 2018 fastest women’s 100m times is up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.w.dat. This has precisely the same format as the 2829 fastest men’s 100m times you have already analyzed (just with fewer rows). Following the same strategy as in the previous question, compute a vector sprint.w.ages containing the age (in years) of each woman sprinter when their sprint time was recorded. Plot a histogram of sprint.ages, overlaid with a histogram of sprint.w.ages, both being on the probability scale. Set the break locations so that the plot captures the full range of the very youngest to the very oldest sprinter present among both men and women. Choose colors of your liking, but use transparency as appropriate so that the shapes of both histograms are visible; label the x-axis, and title the histogram appropriately. Add a legend to the histogram, identifying the histogram bars from the men and women. Compare, roughly, the shapes of the two histograms: is there a difference between the age distributions of the world’s fastest men and fastest women?
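
The overlay technique described in Q6 can be sketched on simulated data, as below. Everything here is an assumption made for illustration: the vectors toy.a and toy.b are random draws, not the sprint data, and the group labels and colors are arbitrary. The sketch uses common breaks for both groups, the probability scale, semi-transparent colors from rgb(), and add=TRUE to draw the second histogram on top of the first.

```r
# Simulated ages for two groups (hypothetical data, not the sprint tables)
set.seed(1)
toy.a = round(rnorm(200, mean=25, sd=3))
toy.b = round(rnorm(150, mean=27, sd=4))

# Common breaks spanning the youngest to the oldest age in either group
toy.breaks = min(toy.a, toy.b):max(toy.a, toy.b)

# Semi-transparent colors: the 4th argument of rgb() is the alpha level
col.a = rgb(0, 0, 1, 0.4)
col.b = rgb(1, 0, 0, 0.4)

# Compute both histograms without plotting, to find a shared y-axis limit
hist.a = hist(toy.a, breaks=toy.breaks, plot=FALSE)
hist.b = hist(toy.b, breaks=toy.breaks, plot=FALSE)
y.max = max(hist.a$density, hist.b$density)

# Draw the first histogram, then overlay the second with add=TRUE
hist(toy.a, breaks=toy.breaks, probability=TRUE, col=col.a,
     ylim=c(0, y.max), xlab="Age (years)", main="Toy overlaid histograms")
hist(toy.b, breaks=toy.breaks, probability=TRUE, col=col.b, add=TRUE)
legend("topright", legend=c("Group A", "Group B"), fill=c(col.a, col.b))
```

Computing the densities first with plot=FALSE is one way to keep the taller histogram from being clipped; setting ylim by eye would also work.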

Heatmaps