Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 3 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 3 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 25. This document contains 12 of the 45 total points for Homework 3.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.

Histograms

Below, we read in lines from Hillary Clinton’s speech at the 2016 Democratic National Convention. Follow the steps of the “Histograms and Heatmaps” mini-lecture (which were themselves a review of what we learned in our mini-lectures on text and regular expressions) to split up the speech into words, called clinton.words, and compute word lengths, called clinton.wlens.

clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
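
As a minimal sketch of the splitting step, the following works on two made-up lines rather than on the speech itself. It assumes that words are separated by whitespace or punctuation; the mini-lecture may use a different pattern. The names toy.lines, toy.text, toy.words, and toy.wlens are hypothetical.

```r
# Toy input standing in for the lines returned by readLines() (made up for illustration)
toy.lines = c("Thank you all, very much!", "", "It is good to be here.")

# Collapse the lines into one string, then split on whitespace or punctuation
toy.text = paste(toy.lines, collapse=" ")
toy.words = strsplit(toy.text, split="[[:space:]]|[[:punct:]]")[[1]]

# Splitting leaves empty strings wherever two separators are adjacent; drop them
toy.words = toy.words[toy.words != ""]

# Word lengths, one per word
toy.wlens = nchar(toy.words)
```

The same pattern should carry over to a longer character vector of lines, though the choice of separators affects how contractions and hyphenated words are counted.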

Produce a histogram of Clinton’s word lengths, on the frequency scale. Set the break locations in the histogram to occur at every integer between 0 and 20. Color the histogram bars blue. Set the x-axis label to be “Word lengths”, and the title to be “Clinton word lengths”.
Below, we read in lines from Donald Trump’s speech at the 2016 Republican National Convention. As we did in the “Histograms and Heatmaps” mini-lecture (and above with Clinton’s speech), split up Trump’s speech into words, called trump.words, and compute word lengths, called trump.wlens. Reproduce your histogram of Clinton’s word lengths, from the previous question, but with the title changed to be “Clinton versus Trump word lengths”. Overlaid on top of this histogram, plot a histogram of Trump’s word lengths, again on the frequency scale, with the same break locations. For Trump’s histogram bars, color them in transparent red (hint: recall rgb()).

trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
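
The overlay technique can be sketched on made-up data as follows. The vectors a.wlens and b.wlens and the name wlen.breaks are hypothetical; the point is only to show shared breaks, add=TRUE, and the alpha argument of rgb().

```r
# Made-up word lengths for two hypothetical speakers
a.wlens = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7)
b.wlens = c(2, 3, 3, 4, 4, 4, 5, 6)

# Shared breaks at every integer from 0 to 20, so the bars of both histograms line up
wlen.breaks = 0:20

# First histogram, on the frequency scale (the default for equally spaced breaks)
a.hist = hist(a.wlens, breaks=wlen.breaks, col="blue",
              xlab="Word lengths", main="Speaker A versus speaker B word lengths")

# Second histogram drawn on the same axes; the 4th argument of rgb() is alpha (opacity)
b.hist = hist(b.wlens, breaks=wlen.breaks, col=rgb(1, 0, 0, 0.5), add=TRUE)
```

Because the second call uses add=TRUE, the axes and title come from the first call only, so the first histogram should carry the labels intended for the combined plot.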

Why isn’t it a good idea to compare the Clinton and Trump histograms on the frequency scale (as we did in the last question)?
Modify your plot of Clinton’s word lengths versus Trump’s, so that both histograms are on the probability (rather than frequency) scale. Add a legend to the plot, with text: “Clinton” and “Trump”, and corresponding symbols: a blue box and a transparent red box, respectively. (Hint: check out the fill argument.) Why does the probability scale make more sense for the comparison? And, who appears to use longer words more often, Clinton or Trump?
As we’re statisticians, naturally, we can test for equality of these two histograms—more precisely, equality of the distributions of word lengths used by Clinton and Trump. Run the command chisq.test(x=c(clinton.wlens,trump.wlens), y=c(rep(0,length(clinton.wlens)), rep(1,length(trump.wlens)))). (You can ignore a warning about the chi-squared approximation possibly being incorrect.) This computes a chi-squared test for equality of the two word length distributions. What is the p-value, and what does this suggest about the two distributions? We will (likely) revisit the chi-squared test when we cover simulation, later in the course.
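
A minimal sketch of the probability scale, the legend, and the chi-squared call, on made-up samples of unequal size. The names x1, x2, common.breaks, group, and test are hypothetical, and the toy p-value says nothing about the speeches.

```r
# Made-up samples of different sizes from two hypothetical groups
x1 = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7)
x2 = c(2, 3, 3, 4, 4, 4, 5, 6, 2, 3, 3, 4, 4, 4, 5, 6)
common.breaks = 0:8

# probability=TRUE rescales bar heights so that each histogram has total area 1;
# ylim is set by hand so that the taller second histogram is not clipped
h1 = hist(x1, breaks=common.breaks, probability=TRUE, col="blue", ylim=c(0, 0.5),
          xlab="Value", main="Group 1 versus group 2")
h2 = hist(x2, breaks=common.breaks, probability=TRUE, col=rgb(1, 0, 0, 0.5), add=TRUE)

# fill= draws a colored box next to each legend entry
legend("topright", legend=c("Group 1", "Group 2"), fill=c("blue", rgb(1, 0, 0, 0.5)))

# Chi-squared test of independence between the observed value and the group label;
# with counts this small R warns that the approximation may be inaccurate
group = c(rep(0, length(x1)), rep(1, length(x2)))
test = suppressWarnings(chisq.test(x=c(x1, x2), y=group))
test$p.value
```

On the frequency scale the larger sample would dominate the plot simply because it has more observations; rescaling to area 1 removes that effect.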

Hw3 Q5 (4 points). Below we read in a data table, on the 2829 fastest 100m sprint times, saved as sprint.tab. We also extract the years in which the sprint times were recorded from the data table, saved as sprint.years. Now, similarly, extract the birth years of the sprinters from the data table. To do so, define sprint.bdates to be the 6th column of sprint.tab, converted into a character vector. Then define a character vector sprint.byears to contain the last 2 characters of each entry of sprint.bdates. Convert sprint.byears into a numeric vector, add 1900 to each entry, and redefine sprint.byears to be the result.

Next, compute a vector sprint.ages containing the age (in years) of each sprinter when their sprint time was recorded. (Hint: use sprint.byears and sprint.years.) Then plot a histogram of the ages of the 2829 sprinters in the data table, with break locations occurring at every age between 17 and 40. Color the histogram to your liking; label the x-axis, and title the histogram appropriately. What is the mode, i.e., the most common age? Also, describe what you see around the mode: do we see more sprinters who are younger, or older?

sprint.tab = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.dat",
                        sep="\t", quote="", header=TRUE)
sprint.dates = as.character(sprint.tab[,8])
sprint.years = substr(sprint.dates, nchar(sprint.dates)-3, nchar(sprint.dates))
sprint.years = as.numeric(sprint.years)
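
The substr() idiom used above generalizes to any fixed number of trailing characters. Here is a minimal sketch on made-up strings; the date layout shown is an assumption and may not match the actual column, and all toy.* names are hypothetical.

```r
# Made-up birth dates ending in a two-digit year (the real column layout may differ)
toy.bdates = c("21.08.86", "09.06.82", "13.12.89")
toy.years = c(2009, 2007, 2012)   # made-up years in which the times were recorded

# Last 2 characters of each string: start at nchar-1, stop at nchar
toy.byears = substr(toy.bdates, nchar(toy.bdates)-1, nchar(toy.bdates))

# Character -> numeric, then shift the two-digit years into the 1900s
toy.byears = as.numeric(toy.byears) + 1900

# Age in years when each time was recorded
toy.ages = toy.years - toy.byears

# One way to find the most common value: tabulate, then take the largest count
toy.mode = as.numeric(names(which.max(table(toy.ages))))
```

Note that which.max() returns only the first maximum, so ties between equally common values would need to be checked separately.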

Hw3 Q6 (8 points). A data table on the 2018 fastest women’s 100m times is up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/sprint.w.dat. This has precisely the same format as the 2829 fastest men’s 100m times you have already analyzed (just with fewer rows). Following the same strategy as in the previous question, compute a vector sprint.w.ages containing the age (in years) of each woman sprinter when their sprint time was recorded. Plot a histogram of sprint.ages, overlaid with a histogram of sprint.w.ages, both being on the probability scale. Set the break locations so that the plot captures the full range of the very youngest to the very oldest sprinter present among both men and women. Choose colors of your liking, but use transparency as appropriate so that the shapes of both histograms are visible; label the x-axis, and title the histogram appropriately. Add a legend to the histogram, identifying the histogram bars from the men and women. Compare, roughly, the shapes of the two histograms: is there a difference between the age distributions of the world’s fastest men and fastest women?
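
One way to choose breaks that cover two samples at once is to take the minimum and maximum over both. The following sketch uses made-up ages; ages.a, ages.b, and age.breaks are hypothetical names.

```r
# Made-up ages for two hypothetical groups
ages.a = c(19, 22, 23, 23, 25, 31)
ages.b = c(17, 21, 21, 24, 36)

# Breaks at every integer from the overall minimum to the overall maximum,
# so that no value in either group falls outside the break range
age.breaks = min(ages.a, ages.b):max(ages.a, ages.b)

# plot=FALSE computes the histogram without drawing it, which is handy for checking
ha = hist(ages.a, breaks=age.breaks, plot=FALSE)
hb = hist(ages.b, breaks=age.breaks, plot=FALSE)
```

If the breaks did not span the data, hist() would stop with an error rather than silently drop values, so a shared range also acts as a sanity check.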

Heatmaps

The volcano object in R is a matrix of dimension 87 x 61. It is a digitized version of a topographic map of the Maungawhau volcano in Auckland, New Zealand. Plot a heatmap of the volcano, with 25 colors from the terrain color scale.
Each row of volcano corresponds to a grid line running east to west. Each column of volcano corresponds to a grid line running south to north. Define a matrix volcano.rev by reversing the order of the rows, as well as the order of the columns, of volcano. Therefore, each row of volcano.rev should now correspond to a grid line running west to east, and each column of volcano.rev to a grid line running north to south.
If we printed out the matrix volcano.rev to the console, then the elements would follow proper geographic order: left to right means west to east, and top to bottom means north to south. Now, produce a heatmap of the volcano that follows the same geographic order. (Hint: recall that the image() function rotates a matrix 90 degrees counterclockwise before displaying it; and recall the function clockwise90() from the “Histograms and Heatmaps” mini-lecture.) Label the x-axis “West --> East”, and the y-axis “South --> North”. Title the plot “Heatmap of Maungawhau Volcano”.
Reproduce the previous plot, now drawing contour lines on top of the heatmap.
The function filled.contour() provides an alternative way to create a heatmap with contour lines on top. It uses the same orientation as image() when plotting a matrix. Use filled.contour() to plot a heatmap of the volcano, with (light) contour lines automatically included. Make sure the orientation of the plot matches proper geographic orientation, as in the previous question. Use a color scale of your choosing (hint: it might help to take a look at the help file for filled.contour()), and label the axes and title the plot appropriately.
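
The orientation issue in the questions above can be illustrated on a small made-up matrix. The helper rotate.cw() below is a hypothetical stand-in written for this sketch; the definition of clockwise90() from the mini-lecture is not shown here and may differ.

```r
# Rotate a matrix 90 degrees clockwise (hypothetical helper, written for this sketch)
rotate.cw = function(m) t(m[nrow(m):1, , drop=FALSE])

# Small made-up matrix; when printed, its largest value sits in the top-right corner
m = matrix(1:6, nrow=2, byrow=TRUE)
m[1, 3] = 10

# Reversing both the row order and the column order
m.rev = m[nrow(m):1, ncol(m):1]

# image() draws its input rotated 90 degrees counterclockwise, so rotating
# clockwise first makes the picture match the printed layout of m
image(rotate.cw(m), col=terrain.colors(25), xlab="Left -> Right",
      ylab="Bottom -> Top", main="Toy heatmap")

# Contour lines can be drawn on top of an existing image() plot
contour(rotate.cw(m), add=TRUE)

# filled.contour() uses the same orientation as image()
filled.contour(rotate.cw(m), color.palette=terrain.colors,
               plot.title=title(main="Toy filled contour"))
```

In the toy heatmap the cell with value 10 should appear in the top-right corner, matching where it sits when m is printed to the console.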

Lab 4w: Histograms and Heatmaps

Statistical Computing, 36-350

Wednesday September 21, 2016
