Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 2 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 2 questions from other labs. Your homework writeup must start the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 18. This document contains 17 of the 45 total points for Homework 2.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.

Splitting with regexes

trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
pence.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt")
kaine.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt")

trump.text = paste(trump.lines, collapse=" ")
clinton.text = paste(clinton.lines, collapse=" ")
pence.text = paste(pence.lines, collapse=" ")
kaine.text = paste(kaine.lines, collapse=" ")

split.pattern = "PATTERN GOES HERE"
trump.words = strsplit(trump.text, split=split.pattern)[[1]]
clinton.words = strsplit(clinton.text, split=split.pattern)[[1]]
pence.words = strsplit(pence.text, split=split.pattern)[[1]]
kaine.words = strsplit(kaine.text, split=split.pattern)[[1]]

trump.words = tolower(trump.words[trump.words != ""])
clinton.words = tolower(clinton.words[clinton.words != ""])
pence.words = tolower(pence.words[pence.words != ""])
kaine.words = tolower(kaine.words[kaine.words != ""])

trump.wordtab = table(trump.words)
clinton.wordtab = table(clinton.words)
pence.wordtab = table(pence.words)
kaine.wordtab = table(kaine.words)
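To see what the pipeline above does before running it on the full speeches, here is a minimal sketch on a made-up sentence. The names `toy.text`, `toy.pattern`, `toy.words`, and `toy.wordtab` are hypothetical, and the pattern is only one plausible choice, not necessarily the one intended for `split.pattern`.

```r
# Toy example only: the text and the pattern are assumptions for illustration
toy.text = "Hello, world! Hello again."
toy.pattern = "[[:space:]]|[[:punct:]]"  # split on a single space or punctuation mark

# Same steps as above: split, drop empty strings, lowercase, tabulate
toy.words = strsplit(toy.text, split=toy.pattern)[[1]]
toy.words = tolower(toy.words[toy.words != ""])
toy.wordtab = table(toy.words)
toy.wordtab
```

Splitting on ", " produces an empty string between the comma and the space, which is why the step that removes `""` entries is needed.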

Hw2 Q9 (1 point). For each of the four word tables, remove the entries that are for words containing any numeric characters.

Hw2 Q10 (2 points). Among the four word tables, which one has the longest word (the most characters), and what is that word? (Hint: max() and which.max() will be useful.)
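As a reminder of how the hinted functions behave, here is a small sketch on a made-up word table. The names `toy.wordtab` and `toy.lengths` are hypothetical and the data is invented.

```r
# Toy word table; the words are stored in the names of the table
toy.wordtab = table(c("a", "banana", "fig", "banana"))

# Number of characters in each distinct word
toy.lengths = nchar(names(toy.wordtab))

max(toy.lengths)                              # length of the longest word
names(toy.wordtab)[which.max(toy.lengths)]    # the longest word itself
```

Note that `which.max()` returns only the first position of the maximum, so ties need to be checked separately.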

Hw2 Q11 (7 points). Create a bag-of-words model, as we did in the “Bag-of-Words” mini-lecture and Lab 2f. Specifically, do the following. Calculate the total number of unique words across the 4 speeches. Create the document-term matrix, of dimension \(4 \times (\text{number of unique words})\). Assign row names according to the last names of the candidates, and column names according to the words. Display all 4 rows and the first 20 columns. Compute the correlations between word counts for each pair of speeches. For each speaker, answer: which other speech was most correlated with his/her speech, and does this make sense to you?
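The construction can be sketched on two tiny made-up documents. This is one possible approach under stated assumptions, not necessarily the one used in the mini-lecture; all names here (`doc1.wordtab`, `doc2.wordtab`, `toy.unique`, `toy.dt.mat`) are hypothetical.

```r
# Two toy word tables standing in for the speeches
doc1.wordtab = table(c("the", "cat", "sat", "the"))
doc2.wordtab = table(c("the", "dog", "sat"))

# Unique words across both documents
toy.unique = sort(unique(c(names(doc1.wordtab), names(doc2.wordtab))))

# Document-term matrix: one row per document, one column per unique word
toy.dt.mat = matrix(0, nrow=2, ncol=length(toy.unique))
rownames(toy.dt.mat) = c("doc1", "doc2")
colnames(toy.dt.mat) = toy.unique

# Fill in counts by indexing columns with the word names
toy.dt.mat["doc1", names(doc1.wordtab)] = as.vector(doc1.wordtab)
toy.dt.mat["doc2", names(doc2.wordtab)] = as.vector(doc2.wordtab)
toy.dt.mat

# cor() correlates columns, so transpose to correlate documents
cor(t(toy.dt.mat))
```

Words that a document never uses keep the count 0 from the initial matrix, which is what makes the rows directly comparable.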

Problems with splitting

trial.str.vec = c("we've.", "I'll ", "don't!")
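To make the problem concrete, the sketch below splits the trial strings on any space or punctuation mark. The pattern `naive.pattern` and the name `naive.split` are assumptions chosen for illustration.

```r
trial.str.vec = c("we've.", "I'll ", "don't!")

# A naive pattern (an assumption): split on any single space or punctuation mark
naive.pattern = "[[:space:]]|[[:punct:]]"
naive.split = strsplit(trial.str.vec, split=naive.pattern)
naive.split
```

Because the apostrophe is itself a punctuation mark, each contraction is broken into two pieces (for example "we" and "ve"), which is usually not what we want when counting words.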

Hw2 Bonus. Come up with a strategy that splits on punctuation marks or spaces, but keeps intact words like “I’ve” or “wasn’t” that have a punctuation mark in the middle, in between two letters. Implement it on Trump’s speech, and display the results, verifying that the only punctuation marks left over are those in between two letters.

Searching with regexes

sprint.lines = readLines("http://www.alltime-athletics.com/m_100ok.htm")
i.first = min(grep("Usain Bolt", sprint.lines))
i.last = max(grep("Julian Forte", sprint.lines))
sprint.lines = sprint.lines[i.first:i.last] # Throw away everything else

Hw2 Q12 (7 points). Now extract the years in which the sprint times were recorded (this is the last column). (Hint: it will help to use anchoring here; this is because there are actually two dates occurring on each line; the first is the birth date of the sprinter.) Call the resulting vector sprint.years, and check it has length 2841.
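The effect of anchoring can be seen on a single made-up line. The line below is hypothetical and only roughly imitates the layout of the sprint data; the names `toy.line`, `first.date`, and `toy.year` are also hypothetical.

```r
# Invented line with two dates: a birth date and, at the end, the race date
toy.line = "1   9.99   +0.5   Some Sprinter   XYZ   01.02.90   1   Sometown   03.04.2005"

# Without an anchor, regexpr() returns the first date-like match (the birth date)
first.date = regmatches(toy.line, regexpr("[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}", toy.line))
first.date

# With the end-of-string anchor $, only the four digits ending the line match
toy.year = regmatches(toy.line, regexpr("[0-9]{4}$", toy.line))
as.numeric(toy.year)
```

This sketch assumes the line ends exactly at the year. If the real lines carry trailing whitespace, the pattern would need to allow for it.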

Use sprint.years to answer the following questions. What percentage of the 2841 total sprint times were recorded in 2000 or later? In what year were the most sprint times recorded? Lastly, plot the vector of actual times sprint.times (recreated from the “Splitting and Searching with Regexes” mini-lecture, below, for your convenience) versus the vector of years sprint.years. Do you notice a trend?

Hw2 Bonus. What percentage of the 2841 total sprint times were recorded in June, July, or August?