Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 1 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 1 questions from other labs. Your homework writeup must start the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 11. This document contains 18 of the 45 total points for Homework 1.
Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file. So for Homework 1, e.g., you should paste your solutions (including solutions to homework questions) from Lab 1w, Lab 1f, Lab 2w, and Lab 2f into one big Rmd file, knit it, and submit it.
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
pence.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt")
kaine.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt")
Following the steps from the “Bag-of-Words” mini-lecture (or previous mini-lectures), construct a table of word counts for each speech; call them `trump.wordtab`, `clinton.wordtab`, `pence.wordtab`, and `kaine.wordtab`. As in the mini-lecture, make sure to convert the words to lower case before creating the tables. Display the first 5 elements from each table.
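The construction pattern can be sketched on a toy character vector (not the real speech data); the mini-lecture may also split on punctuation rather than spaces alone:

```r
# Toy example: build a word count table from a vector of lines,
# converting to lower case before tabulating.
toy.lines = c("Make America great again", "America is great")
toy.text = paste(toy.lines, collapse = " ")                 # collapse lines into one string
toy.words = strsplit(tolower(toy.text), split = " ")[[1]]   # lower case, then split on spaces
toy.wordtab = table(toy.words)                              # counts per distinct word
toy.wordtab[1:3]                                            # display the first 3 elements
```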
You might notice that the Trump, Clinton, Pence, and Kaine word tables all have an entry for the empty string “”. As done in the “Bag-of-Words” mini-lecture, get rid of such entries in these tables.
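One way to drop the empty-string entry, sketched on a toy table, is to index by the non-empty names:

```r
# Toy example: remove the entry for the empty string "" from a word table.
toy.wordtab = table(c("", "", "great", "america"))
toy.wordtab = toy.wordtab[names(toy.wordtab) != ""]  # keep only non-empty names
```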
How many words did each candidate speak? What percentage of each candidate’s words were unique?
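Both quantities fall out of a word table directly; here is the pattern on toy counts:

```r
# Toy example: total word count and percentage of unique words from a word table.
toy.wordtab = table(c("make", "america", "great", "again", "america", "great"))
total.words = sum(toy.wordtab)                # total words spoken: 6
n.unique = length(toy.wordtab)                # distinct words: 4
perc.unique = n.unique / total.words * 100    # about 66.7%
</r>
```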
How many times did the Republican candidates use the word “great” total? How many times for the Democratic candidates total? And how about the same comparison, but for “proud”?
Did Pence mention his running mate more, either by first or last name, or did Kaine?
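One caution for these cross-table comparisons: indexing a table by a name it doesn’t contain, as in `tab[["word"]]`, throws a subscript error. A small helper (hypothetical, not from the lecture) sidesteps this by returning 0 for absent words:

```r
# Hypothetical helper: look up a word's count in a table, returning 0
# when the word never appears, so missing words don't cause an error.
get.count = function(tab, word) {
  if (word %in% names(tab)) tab[[word]] else 0
}

tab1 = table(c("great", "great", "proud"))
tab2 = table(c("great", "wall"))
total.great = get.count(tab1, "great") + get.count(tab2, "great")  # 3
total.proud = get.count(tab1, "proud") + get.count(tab2, "proud")  # 1
```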
Hw1 Q10 (4 points). Divide the entries in each of the word tables by their respective totals, then multiply by 100. Call the results `trump.wordtab.perc`, `clinton.wordtab.perc`, `pence.wordtab.perc`, and `kaine.wordtab.perc`. Now the counts become percentages; e.g., Trump used the word “great” 8 times out of 4384 words total, so the percentage is \(8 / 4384 \times 100 \approx 0.182\). Repeat the comparisons in the past two questions; did anything change?
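Since a table divides elementwise, the conversion is a one-liner; on toy counts:

```r
# Toy example: convert a count table to percentages of the total.
toy.wordtab = table(c("great", "great", "again", "america"))
toy.wordtab.perc = toy.wordtab / sum(toy.wordtab) * 100
toy.wordtab.perc[["great"]]  # 50, since "great" is 2 of 4 words
```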
As in the “Bag-of-Words” mini-lecture, the first step in building our bag-of-words representation is to collect all of the unique words used in the 4 documents (i.e., 4 speeches, for us). Note: in lecture, we considered a query string, but here we don’t; there is nothing special about the query, it’s just another document. Create a vector of the unique words from the 4 speeches, sorted alphabetically in increasing order, and call the result `all.words.unique`. (Hint: use the word tables you already computed above, and then `names()`, `unique()`, and `sort()`.) How many unique words are there? Display the first 10 and last 10 entries of `all.words.unique`.
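The hinted combination of `names()`, `unique()`, and `sort()` looks like this on two toy tables:

```r
# Toy example: collect the sorted unique words across several word tables.
tab1 = table(c("apple", "banana", "banana"))
tab2 = table(c("banana", "cherry"))
all.words.toy = sort(unique(c(names(tab1), names(tab2))))
all.words.toy  # "apple" "banana" "cherry"
```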
Following the steps in the “Bag-of-Words” mini-lecture, build the document-term matrix, of dimension \(4 \times (\text{# of unique words})\), and call it `dt.mat`. Assign row names according to the last names of the candidates, and column names according to the words. Display all 4 rows and the first 10 columns.
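One way to fill such a matrix (a sketch on toy tables; the mini-lecture’s exact steps may differ) is to start from zeros and assign each table’s counts into its row by column name:

```r
# Toy example: build a document-term matrix from a list of word tables.
tabs = list(doc1 = table(c("apple", "banana", "banana")),
            doc2 = table(c("banana", "cherry")))
all.words.toy = sort(unique(unlist(lapply(tabs, names))))
dt.toy = matrix(0, nrow = length(tabs), ncol = length(all.words.toy),
                dimnames = list(names(tabs), all.words.toy))
for (i in seq_along(tabs)) {
  dt.toy[i, names(tabs[[i]])] = tabs[[i]]  # fill counts by column name
}
dt.toy
```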
For each word, add together the counts from the two Republican candidates, i.e., add together the appropriate rows of `dt.mat`, and call the resulting vector `dt.repub`. Do the same for the Democratic candidates, calling the resulting vector `dt.democ`. Now, answer: how many times did the Republicans together say “immigration”, and how many times did the Democrats?
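Because the rows share column names, adding them is elementwise; on a toy matrix:

```r
# Toy example: add the rows for one party's candidates.
dt.toy = matrix(c(1, 2, 0,
                  0, 1, 3), nrow = 2, byrow = TRUE,
                dimnames = list(c("Trump", "Pence"),
                                c("great", "proud", "wall")))
dt.repub.toy = dt.toy["Trump", ] + dt.toy["Pence", ]
dt.repub.toy[["great"]]  # 1
```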
What happens if you try to answer the last question (how many times total the word “immigration” was used by the Republicans versus the Democrats) with the individual word tables `trump.wordtab`, `clinton.wordtab`, `pence.wordtab`, and `kaine.wordtab`? Why do you get such an answer?
Sort each of the vectors `dt.repub` and `dt.democ` in decreasing order of word counts, and call the results `dt.repub.sorted` and `dt.democ.sorted`, respectively. Display the first 8 elements of each. How many of these 8 most common words are shared between the parties?
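Sorting keeps the names attached, so the top words can be compared with `intersect()`; a sketch on toy vectors:

```r
# Toy example: sort named count vectors in decreasing order,
# then find words shared among the tops.
dt.repub.toy = c(great = 3, wall = 1, america = 5)
dt.democ.toy = c(together = 4, america = 2, hope = 1)
repub.sorted = sort(dt.repub.toy, decreasing = TRUE)
democ.sorted = sort(dt.democ.toy, decreasing = TRUE)
shared = intersect(names(repub.sorted)[1:2], names(democ.sorted)[1:2])
shared  # "america"
```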
Hw1 Q11 (2 points). Compute the correlation between the word counts for each pair among the Trump, Clinton, Pence, and Kaine speeches. (Hint: you can do this with a single call to `cor()`; read the help file to see how `cor()` handles matrices.) Display these correlations. According to this metric, which two speeches are the most similar, and which are the most different?
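The key fact from the help file is that `cor()` on a matrix correlates its columns; since the documents are rows of the document-term matrix, one approach is to transpose first. A sketch on a toy matrix:

```r
# Toy example: cor() correlates columns, so transpose the
# document-term matrix to correlate documents (rows) instead.
dt.toy = matrix(c(1, 2, 3,
                  2, 4, 6,
                  3, 1, 0), nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"), c("w1", "w2", "w3")))
cor.toy = cor(t(dt.toy))  # 3 x 3 matrix of document-document correlations
cor.toy["A", "B"]         # 1, since row B is exactly 2 * row A
```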
Hw1 Q12 (6 points). Search the web for 4 more speeches from the conventions: specifically, the speeches from Newt Gingrich and Melania Trump at the 2016 Republican National Convention, and the speeches from President Barack Obama and Bernie Sanders at the 2016 Democratic National Convention. Copy and paste a transcript of each speech into a text file, and save these on your computer. Then read each one into R using `readLines()`.
As above, construct a word table of counts for each speech, making sure to transform the words to lower case before creating the tables. Remove any empty strings “” from the tables. What is the word count total for the Gingrich, Melania Trump, Obama, and Sanders speeches? Who has the longest speech?
You should now have 8 word tables in total, for the Trump, Clinton, Pence, Kaine, Gingrich, Melania Trump, Obama, and Sanders speeches. As above, make a document-term matrix from these 8 word tables. (Note: you will have to reconstruct the vector of all unique words, now across the 8 documents.) Call this `dt.mat.big`. Label its rows and columns as appropriate. Display all 8 rows and the first 10 columns.
Hw1 Q13 (1 point). As a sanity check, show that for each of the first 4 rows of `dt.mat.big`, the sum of this row (i.e., the total word count) equals that of the corresponding row in `dt.mat`. (Hint: `rowSums()` and appropriate indexing of the rows should save you some typing.)
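The hinted pattern, on toy matrices (here the small matrix’s rows are literally the first rows of the big one, which keeps the row sums equal):

```r
# Toy example: check that the first rows of a bigger matrix have the
# same row sums as a smaller matrix, via rowSums() and row indexing.
dt.small = matrix(c(1, 2,
                    3, 4), nrow = 2, byrow = TRUE)
dt.big = rbind(dt.small, matrix(0, nrow = 2, ncol = 2))
all(rowSums(dt.big[1:2, ]) == rowSums(dt.small))  # TRUE
```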
Hw1 Q14 (3 points). Construct two vectors `dt.repub.big` and `dt.democ.big`, in a similar fashion to what you did above with `dt.repub` and `dt.democ`. That is, for each word, `dt.repub.big` gives the total count across all 4 Republican candidates, and similarly for `dt.democ.big` across the 4 Democratic candidates. (Hint: `colSums()` and appropriate indexing of the columns should save you some typing.) Did the Republican candidates mention Hillary Clinton more (either by first or last name), or did the Democratic candidates mention Donald Trump more (either by first or last name)? Which party mentioned veterans more? Which party mentioned women more?
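Note the hint points at `colSums()` with *row* indexing here: selecting one party’s rows and summing down each column gives that party’s per-word totals. On a toy matrix:

```r
# Toy example: per-word totals for a block of rows, via colSums().
dt.toy = matrix(c(1, 0,
                  2, 1,
                  0, 3,
                  1, 1), nrow = 4, byrow = TRUE,
                dimnames = list(c("R1", "R2", "D1", "D2"),
                                c("hillary", "trump")))
dt.repub.toy = colSums(dt.toy[1:2, ])  # hillary = 3, trump = 1
dt.democ.toy = colSums(dt.toy[3:4, ])  # hillary = 1, trump = 4
```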
Hw1 Q15 (2 points). Compute the correlation between the word counts for each pair among the 8 speeches. (Hint: as in Hw1 Q11, this should only involve one call to `cor()`.) Display these correlations. Which speeches are the most similar, and which the most different, according to this metric?
P.S. Do you find it annoying to retype similar commands over and over again in R, like what you had to do to read in all those speeches, and turn them into word tables, etc.? When we learn more about iteration, and also functions, you’ll find these concepts can both save you a lot of excessive typing!