Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 1 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 1 questions from other labs. Your homework writeup must start the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 11. This document contains 13 of the 45 total points for Homework 1.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file. So for Homework 1, e.g., you should paste your solutions (including solutions to homework questions) from Lab 1w, Lab 1f, Lab 2w, and Lab 2f into one big Rmd file, knit it, and submit it.

Reading in text

There are two speeches, from Presidential Candidates Donald Trump and Hillary Clinton, at the recent 2016 Republican and Democratic National Conventions, up at: http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt and http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt, respectively. Read these into your R session with the readLines() function, saved as trump.lines and clinton.lines, respectively. How many lines does each speech have?
It looks as if the Clinton speech is much longer than the Trump speech. Display the first 3 lines from each speech, and the last 3 lines from each speech. (Hint: use head() and tail().) What does this tell you, i.e., do you still think the Clinton speech is obviously much longer?
How many blank lines are there in trump.lines, i.e., how many lines are equal to the empty string “”? How many in clinton.lines? (Hint: use a logical comparison, and then sum().)
Delete the blank lines in trump.lines, i.e., redefine this character vector so that it doesn’t contain any blank lines. (Hint: there are many ways to do this, but perhaps the simplest is to index using a logical comparison.) As a sanity check, compute the total number of lines in the new trump.lines, and confirm that it is what you’d expect given your answers to the previous two questions. Do the same for clinton.lines. Do the speeches now appear closer in length?
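
The hints above (a logical comparison, sum(), and logical indexing) can be sketched on a made-up vector. This is only an illustration: toy.lines and the other toy.* names are hypothetical and are not part of the lab data.

```r
# Toy character vector standing in for the output of readLines();
# its name and contents are made up for illustration
toy.lines <- c("First line", "", "Second line", "", "", "Third line")

# Logical comparison: TRUE wherever a line equals the empty string
is.blank <- toy.lines == ""

# sum() counts the TRUE values, i.e., the number of blank lines
num.blank <- sum(is.blank)

# Index with the negated logical vector to keep only nonblank lines
toy.lines.nonblank <- toy.lines[!is.blank]

# Sanity check: new length should be old length minus number of blanks
length(toy.lines.nonblank) == length(toy.lines) - num.blank
```

The same pattern should carry over to any character vector produced by readLines().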

Hw1 Q6 (4 points). There are two speeches, from Vice Presidential Candidates Mike Pence and Tim Kaine, at the recent 2016 Republican and Democratic National Conventions, up at: http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt and http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt, respectively. Read them into your R session with readLines(), and repeat the above steps: compare the number of lines, display the first 3 and last 3 lines of each speech, delete any blank lines if needed, and compare the total character counts.

Reconstitution

Follow the steps for reconstitution from the mini-lecture “Reading Text”, for each speech. That is, for trump.lines, collapse this into one long string, call it trump.text; then split on spaces to retrieve a vector of individual words, call it trump.words (note: the result of strsplit() is always a list; here we are looking for the vector that occupies the first element of this list). Do the same for clinton.lines, calling the results clinton.text and clinton.words. Compute the total word counts for the two speeches, call them trump.wordtot and clinton.wordtot, respectively. What are the results, and which speech is longer?
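
A minimal sketch of the collapse-then-split pattern on a toy vector, assuming the lines are joined with a single space (the toy.* names and their contents are hypothetical):

```r
# Toy lines standing in for a speech; made up for illustration
toy.lines <- c("the quick brown fox", "jumps over", "the lazy dog")

# Collapse all lines into one long string, separated by single spaces
toy.text <- paste(toy.lines, collapse = " ")

# strsplit() returns a list with one element per input string;
# [[1]] extracts the character vector of words for our single string
toy.words <- strsplit(toy.text, split = " ")[[1]]

# Total word count is the length of the word vector
toy.wordtot <- length(toy.words)
toy.wordtot
```

Note that splitting on a single space treats every space as a separator, so consecutive spaces in real text would produce empty strings in the result.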

Hw1 Q7 (2 points). Count the total number of characters in trump.words and clinton.words, and call these trump.chartot and clinton.chartot, respectively. (Hint: take advantage of vectorization, and use sum().) Also, count the total number of characters in trump.lines and clinton.lines, in a similar manner. Explain: why do these two approaches give different total counts for each speech, and why is the first approach preferred?
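
To illustrate the vectorization hint only, here is a small sketch on a made-up word vector (toy.words, toy.nchar, and toy.chartot are hypothetical names):

```r
# Made-up word vector, for illustration only
toy.words <- c("stat", "computing", "is", "fun")

# nchar() is vectorized: it returns one character count per element
toy.nchar <- nchar(toy.words)

# sum() then adds up the per-element counts
toy.chartot <- sum(toy.nchar)
toy.chartot
```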

Basic summaries

Hw1 Q8 (2 points). What is Trump’s total character count (according to trump.chartot) divided by his total word count? And similarly, for Clinton? So, roughly speaking, who uses longer words?

Calculate the unique words used in the Trump and Clinton speeches, calling the results trump.words.unique and clinton.words.unique, respectively. Sort each vector so that it is in increasing alphabetical order. Then display the first 5 elements of each.
Using trump.words.unique and clinton.words.unique, and the total word counts from each of the Trump and Clinton speeches, calculate the percentage of unique words in each of the speeches. Who repeats themselves more often?
Split the vectors trump.text and clinton.text on the empty string, giving you two vectors of character sequences for these strings, call them trump.chars and clinton.chars, respectively. Transform each of trump.chars and clinton.chars so that all entries are in lower case. Display the first 5 elements of each.
Use table() to get counts of per-character occurrences for trump.chars and clinton.chars, call the results trump.chartab and clinton.chartab, respectively.
Sort trump.chartab in decreasing order, and display the 10 most commonly occurring characters. Do the same for clinton.chartab. Is there an overlap here?
Plot the tables trump.chartab and clinton.chartab. Do they look similar?
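
The building blocks used in the steps above (unique(), sort(), splitting on the empty string, tolower(), and table()) can be sketched on a toy string. All toy.* names and the string itself are made up for illustration.

```r
# Toy string standing in for a collapsed speech
toy.text <- "the cat and the Hat"
toy.words <- strsplit(toy.text, split = " ")[[1]]

# Unique words, sorted alphabetically (the exact ordering of
# upper- versus lowercase letters depends on the locale)
toy.words.unique <- sort(unique(toy.words))

# Percentage of words that are unique
toy.pct.unique <- 100 * length(toy.words.unique) / length(toy.words)

# Splitting on the empty string gives one element per character;
# tolower() then maps every entry to lower case
toy.chars <- tolower(strsplit(toy.text, split = "")[[1]])

# Per-character counts, then sorted from most to least common
toy.chartab <- table(toy.chars)
toy.chartab.sorted <- sort(toy.chartab, decreasing = TRUE)
head(toy.chartab.sorted, 3)
```

Calling plot() on a one-dimensional table such as toy.chartab draws one vertical bar per character, which is one way to compare two tables visually.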

Hw1 Q9 (5 points). Repeat the analysis in the mini-lecture “Summarizing Text”, but applied to Clinton’s speech: use table() to compute word counts, display the first 10 word counts, plot them, count how many times “America”, “great”, “wall”, and “Canada” are mentioned, sort the word counts in decreasing order of frequency, plot a frequency versus rank curve, and overlay the curve from Zipf’s formula with \(C=215\) and \(a=0.57\). Comment on the last plot: does Zipf’s law look to approximately hold, as it did with Trump’s speech?
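
A rough sketch of the frequency versus rank workflow on made-up data, assuming Zipf’s formula takes the common form frequency \(= C (1/\text{rank})^a\). The toy.* names and the parameter values C.toy and a.toy are hypothetical and chosen only for this toy example; they are not the values from the question.

```r
# Made-up word vector, purely for illustration
toy.words <- c("a", "b", "a", "c", "a", "b", "d", "a", "b", "c")

# Word counts; indexing by name looks up specific words, and a word
# that never appears in the table comes back as NA
toy.wordtab <- table(toy.words)
toy.lookup <- toy.wordtab[c("a", "z")]

# Sort counts in decreasing order and assign ranks 1, 2, ...
toy.wordtab.sorted <- sort(toy.wordtab, decreasing = TRUE)
toy.ranks <- 1:length(toy.wordtab.sorted)

# Assumed Zipf form: C * (1/rank)^a, with hypothetical parameters
C.toy <- 4
a.toy <- 1
toy.zipf <- C.toy * (1 / toy.ranks)^a.toy

# Frequency versus rank curve, with the Zipf curve overlaid (dashed)
plot(toy.ranks, as.numeric(toy.wordtab.sorted), type = "l",
     xlab = "Rank", ylab = "Frequency")
lines(toy.ranks, toy.zipf, lty = 2)
```

Passing log = "xy" to plot() would put both axes on a log scale, under which a curve of this assumed form appears as a straight line.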

Hw1 Bonus. Can you find better-fitting parameters \(C,a\) so that Zipf’s law looks even more believable with the Trump and Clinton speeches? What are they, and what evidence do you have that they fit better? (Hint: it may help to look at the frequency versus rank curves on a log-log plot, i.e., use log="xy" as an argument to plot().)

Lab 2w: Reading in Text, Summarizing Text

Statistical Computing, 36-350

Wednesday September 7, 2016
