Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 1 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 1 questions from other labs. Your homework writeup must start the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 11. This document contains 13 of the 45 total points for Homework 1.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file. So for Homework 1, e.g., you should paste your solutions (including solutions to homework questions) from Lab 1w, Lab 1f, Lab 2w, and Lab 2f into one big Rmd file, knit it, and submit it.

Reading in text

Hw1 Q6 (4 points). Two speeches, by Vice Presidential Candidates Mike Pence and Tim Kaine at the recent 2016 Republican and Democratic National Conventions, are available at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt and http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt, respectively. Read them into your R session with readLines(), and repeat the above steps: compare the number of lines, display the first 3 and last 3 lines of each speech, delete any blank lines if needed, and compare the total character counts.
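As a reminder of the mechanics, here is a minimal sketch of the read, inspect, and clean steps on a made-up three-sentence "speech". The names toy.con and toy.lines and the text itself are hypothetical; a textConnection() stands in for the file so the snippet runs without network access. For the real speeches you would pass the URL string to readLines() instead.

```r
# Hypothetical toy text standing in for a speech file (not real data)
toy.con <- textConnection(c("Thank you all.", "", "We will win.", "", "Good night."))
toy.lines <- readLines(toy.con)
close(toy.con)

# Number of lines, blank ones included
toy.nlines.raw <- length(toy.lines)
toy.nlines.raw

# First 3 and last 3 lines
head(toy.lines, 3)
tail(toy.lines, 3)

# Drop blank lines, then recount lines and total characters
toy.lines <- toy.lines[toy.lines != ""]
length(toy.lines)
toy.chartot <- sum(nchar(toy.lines))  # nchar() is vectorized over the lines
toy.chartot
```

Note that the comparison toy.lines != "" only removes lines that are exactly empty; a line containing just spaces would need a different check.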

Reconstitution

Hw1 Q7 (2 points). Count the total number of characters in trump.words and clinton.words, and call these trump.chartot and clinton.chartot, respectively. (Hint: take advantage of vectorization, and use sum().) Also, count the total number of characters in trump.lines and clinton.lines, in a similar manner. Explain: why do these two approaches give different total counts for each speech, and why is the first approach preferred?
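The two counting approaches can be tried side by side on a toy example before applying them to the speeches. This sketch assumes the words vector is built by pasting the lines together and splitting on single spaces, which may differ in detail from how trump.words and clinton.words were created in the lab; all toy.* names are hypothetical.

```r
# Hypothetical toy lines (not real speech data)
toy.lines <- c("We will win.", "Good night.")

# Collapse the lines into one string, then split into words on single spaces
toy.text <- paste(toy.lines, collapse = " ")
toy.words <- strsplit(toy.text, split = " ")[[1]]
toy.words

# Total characters, counted over words and over lines
toy.words.chartot <- sum(nchar(toy.words))
toy.lines.chartot <- sum(nchar(toy.lines))
c(words = toy.words.chartot, lines = toy.lines.chartot)
```

Printing nchar(toy.words) and nchar(toy.lines) individually is a quick way to see where the two totals part ways.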

Basic summaries

Hw1 Q8 (2 points). What is Trump’s total character count (according to trump.chartot) divided by his total word count? And similarly, for Clinton? So, roughly speaking, who uses longer words?

Hw1 Q9 (5 points). Repeat the analysis in the mini-lecture “Summarizing Text”, but applied to Clinton’s speech: use table() to compute word counts, display the first 10 word counts, plot them, count how many times “America”, “great”, “wall”, and “Canada” are mentioned, sort them in decreasing order of frequency, plot a frequency versus rank curve, and overlay the curve from Zipf’s formula with \(C=215\) and \(a=0.57\). Comment on the last plot: does Zipf’s law look to approximately hold, as it did with Trump’s speech?
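For reference, the sequence of steps can be sketched on a tiny made-up word vector. Everything here is hypothetical: the toy.* names, the ten words, and the values C = 4 and a = 1, which were chosen only so the overlay is visible on the toy data. The Zipf curve is taken in its usual form, frequency at rank \(r\) approximately equal to \(C(1/r)^a\).

```r
# Hypothetical toy words (not real speech data)
toy.words <- c("the", "a", "the", "of", "the", "a", "great", "the", "of", "a")

# Word counts; table() orders the entries alphabetically by word
toy.wordtab <- table(toy.words)
toy.wordtab[1:3]

# Look up particular words by name; a word that never occurs gives NA
toy.wordtab["great"]
toy.wordtab["wall"]

# Sort in decreasing order of frequency
toy.sorted <- sort(toy.wordtab, decreasing = TRUE)
toy.sorted

# Frequency versus rank, with a Zipf curve overlaid (toy parameters)
toy.ranks <- 1:length(toy.sorted)
C <- 4
a <- 1
plot(toy.ranks, as.numeric(toy.sorted), type = "l",
     xlab = "Rank", ylab = "Frequency")
lines(toy.ranks, C * (1 / toy.ranks)^a, col = "red")
```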

Hw1 Bonus. Can you find better-fitting parameters \(C,a\) so that Zipf’s law looks even more believable with the Trump and Clinton speeches? What are they, and what evidence do you have that they fit better? (Hint: it may help to look at the frequency versus rank curves on a log-log plot, i.e., use log="xy" as an argument to plot().)
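One way to see why the log-log plot helps: taking logs of \(f = C r^{-a}\) gives \(\log f = \log C - a \log r\), a straight line in \(\log r\) with intercept \(\log C\) and slope \(-a\). The sketch below uses that fact on made-up frequencies, with a least squares line fit via lm() as one possible (not the only) way to pick parameters. The toy.* names, the numbers, and the helper sse.log() are all hypothetical.

```r
# Hypothetical sorted frequencies (made-up numbers, not from either speech)
toy.freq <- c(40, 21, 13, 10, 8, 7, 6, 5, 5, 4)
toy.rank <- seq_along(toy.freq)

# Fit log(freq) = log(C) - a * log(rank) by least squares
toy.fit <- lm(log(toy.freq) ~ log(toy.rank))
C.hat <- unname(exp(coef(toy.fit)[1]))
a.hat <- unname(-coef(toy.fit)[2])
c(C = C.hat, a = a.hat)

# Sum of squared errors on the log scale, for comparing parameter choices
sse.log <- function(C, a) {
  sum((log(toy.freq) - (log(C) - a * log(toy.rank)))^2)
}
sse.log(C.hat, a.hat)  # fitted parameters
sse.log(40, 1)         # an arbitrary alternative, for comparison

# Visual check on log-log axes
plot(toy.rank, toy.freq, log = "xy", xlab = "Rank", ylab = "Frequency")
lines(toy.rank, C.hat * toy.rank^(-a.hat), col = "red")
```

The fitted line minimizes error on the log scale only, so it is not guaranteed to look best on the original scale; both views are worth checking.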