Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 2 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 2 questions from other labs. Your homework writeup must start the same way as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 18. This document contains 17 of the 45 total points for Homework 2.
Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.
Design a regular expression, stored as a string called split.pattern, that matches a space or a punctuation mark, as in the “Splitting and Searching with Regexes” mini-lecture. Then the rest is taken care of for you: splitting on spaces or punctuation marks, getting rid of empty strings, converting words to lowercase, and computing word counts.
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
pence.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt")
kaine.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt")
trump.text = paste(trump.lines, collapse=" ")
clinton.text = paste(clinton.lines, collapse=" ")
pence.text = paste(pence.lines, collapse=" ")
kaine.text = paste(kaine.lines, collapse=" ")
split.pattern = "PATTERN GOES HERE"
trump.words = strsplit(trump.text, split=split.pattern)[[1]]
clinton.words = strsplit(clinton.text, split=split.pattern)[[1]]
pence.words = strsplit(pence.text, split=split.pattern)[[1]]
kaine.words = strsplit(kaine.text, split=split.pattern)[[1]]
trump.words = tolower(trump.words[trump.words != ""])
clinton.words = tolower(clinton.words[clinton.words != ""])
pence.words = tolower(pence.words[pence.words != ""])
kaine.words = tolower(kaine.words[kaine.words != ""])
trump.wordtab = table(trump.words)
clinton.wordtab = table(clinton.words)
pence.wordtab = table(pence.words)
kaine.wordtab = table(kaine.words)
Check, as we did in the “Splitting and Searching with Regexes” mini-lecture, that the words we got from splitting the Trump, Clinton, Pence, and Kaine speeches don’t contain any punctuation marks.
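For instance, the check can be sketched on a toy word vector (toy.words below is made up, standing in for, say, trump.words): grep() for any punctuation character and verify that nothing matches.

```r
# Toy stand-in for one of the four word vectors (e.g. trump.words)
toy.words = c("make", "america", "great", "again")
grep("[[:punct:]]", toy.words)        # integer(0): no word has punctuation
any(grepl("[[:punct:]]", toy.words))  # FALSE
```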
For each of the four word tables, count how many of the entries are for words that contain any numeric characters, i.e., any of the digits 0 through 9.
Hw2 Q9 (1 point). For each of the four word tables, remove the entries that are for words containing any numeric characters.
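Both steps can be sketched on a hypothetical toy table (wordtab below is made up; the same idea applies to each of the four real word tables):

```r
# Hypothetical toy table standing in for one of the four word tables
wordtab = table(c("voters", "2016", "jobs", "45th", "jobs"))
has.num = grepl("[0-9]", names(wordtab))
sum(has.num)                 # number of entries whose word contains a digit
wordtab = wordtab[!has.num]  # remove those entries (Hw2 Q9)
names(wordtab)               # "jobs" "voters"
```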
Hw2 Q10 (2 points). For each of the four word tables, which has the longest word (the greatest number of characters), and what was it? (Hint: max() and which.max() will be useful.)
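For example, on a hypothetical toy table (the same calls apply to each real table), nchar() on the names gives the word lengths:

```r
# Hypothetical toy table; nchar() on the names gives word lengths
wordtab = table(c("a", "immigration", "the", "the"))
max(nchar(names(wordtab)))                        # 11 characters
names(wordtab)[which.max(nchar(names(wordtab)))]  # "immigration"
```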
Hw2 Q11 (7 points). Create a bag-of-words model, as we did in the “Bag-of-Words” mini-lecture and Lab 2f. Specifically, do the following. Calculate the total number of unique words across the 4 speeches. Create the document-term matrix, of dimension \(4 \times (\text{# of unique words})\). Assign row names according to the last names of the candidates, and column names according to the words. Display all 4 rows and the first 20 columns. Compute the correlations between word counts for each pair of speeches. For each speaker, answer: which other speech was most correlated with his/her speech, and does this make sense to you?
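To fix ideas, here is a minimal bag-of-words sketch on two tiny made-up “speeches” (the names docs, all.words, and dtm are illustrative, not prescribed; the real version uses the four word vectors):

```r
# Two tiny made-up word vectors standing in for the four real speeches
docs = list(a = c("great", "jobs", "jobs"), b = c("great", "wall"))
all.words = sort(unique(unlist(docs)))  # vocabulary across documents
# Document-term matrix: one row per document, one column per unique word
dtm = t(sapply(docs, function(w) table(factor(w, levels=all.words))))
dtm
cor(t(dtm))  # correlations between word counts of each pair of documents
```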
Let’s revisit trump.wordtab. How many appearances of the word “ve” are there? How many appearances of “ll”? How many of “t”? Explain why these (which aren’t really words used by Trump in his speech) occur at all. (Hint: look at lines 23 or 86 of trump.lines.)
Design a regex to match all patterns of the following form: any letter (1 or more), then a punctuation mark, then “ve” or “ll” or “t”, then a space or a punctuation mark. Call it my.pattern. Check, with grep(), that it matches all the strings below.
trial.str.vec = c("we've.", "I'll ", "don't!")
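One possible pattern (an illustration, not necessarily the only correct answer), checked with grep() on the vector above:

```r
# One candidate regex: letters, a punctuation mark, "ve"/"ll"/"t", then a
# space or punctuation mark
my.pattern = "[a-zA-Z]+[[:punct:]](ve|ll|t)[[:space:][:punct:]]"
trial.str.vec = c("we've.", "I'll ", "don't!")
grep(my.pattern, trial.str.vec)  # 1 2 3: all three strings match
```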
Use my.pattern, along with regexpr() and regmatches(), to extract all the occurrences of such a pattern in trump.lines. Count how many resulting strings you get back. Are there 7 total (which should be the total from the word table of occurrences of “ve”, “ll”, and “t”, from two questions back)? There shouldn’t be. Why not? (Hint: look at line 23 of trump.lines again. What do you notice about “I’ve” and “wasn’t”? Why is the first captured, but not the second?)
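The phenomenon can be reproduced on a made-up line (the pattern here is one candidate answer, and toy.lines is fabricated): regexpr() reports only the first match per string, so a second occurrence on the same line is missed.

```r
my.pattern = "[a-zA-Z]+[[:punct:]](ve|ll|t)[[:space:][:punct:]]"
# Toy line with TWO occurrences; regexpr() only finds the first
toy.lines = c("I've said it and wasn't wrong.", "We'll win.")
r = regexpr(my.pattern, toy.lines)
regmatches(toy.lines, r)  # "I've "  "We'll " -- "wasn't " is missed
```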
Use my.pattern, now along with gregexpr() and regmatches(), to extract all occurrences of such a pattern in trump.lines. Remember, this gets all occurrences (not just the first one on each line). Call my.list the return value from regmatches(). Note that it is a list of length 113, the length of trump.lines, and most entries are empty. Do not print it out to the console; instead, show the result of unlist(my.list). Does this have length 7? It should.
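On the same made-up toy line as before (pattern and toy.lines are illustrative, not from the real data), gregexpr() recovers every occurrence:

```r
my.pattern = "[a-zA-Z]+[[:punct:]](ve|ll|t)[[:space:][:punct:]]"
toy.lines = c("I've said it and wasn't wrong.", "We'll win.")
toy.list = regmatches(toy.lines, gregexpr(my.pattern, toy.lines))
unlist(toy.list)  # "I've "  "wasn't "  "We'll "
```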
Just a comment (you don’t have to do anything here). In speeches like Trump’s (and the others’), we generally want to split this into words, getting rid of punctuation marks as appropriate, but we don’t want to split up words like “I’ve”, “wasn’t”, “can’t”, “we’ll”, “wouldn’t”. This is actually not all that easy, but still, it can be done with more advanced regexes or with a multi-step splitting process.
Hw2 Bonus. Come up with a strategy that splits on punctuation marks or spaces, except that it keeps intact words like “I’ve” or “wasn’t”, which have a punctuation mark in the middle, in between two letters. Implement it on Trump’s speech, and display the results, verifying that the only punctuation marks left over are those in between two letters.
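As a hint at one possible strategy (a Perl-style lookaround pattern; multi-step approaches work too), sketched on a made-up sentence:

```r
# Split on runs of non-word characters, or on an apostrophe NOT flanked by
# letters on both sides; apostrophes inside words like "I've" survive
txt = "I've said it, and I wasn't wrong!"
pat = "[^'[:alnum:]]+|'(?![[:alpha:]])|(?<![[:alpha:]])'"
words = strsplit(txt, pat, perl=TRUE)[[1]]
words = words[words != ""]  # drop any empty strings from leading matches
words  # "I've" "said" "it" "and" "I" "wasn't" "wrong"
```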
sprint.lines = readLines("http://www.alltime-athletics.com/m_100ok.htm")
i.first = min(grep("Usain Bolt", sprint.lines))
i.last = max(grep("Julian Forte", sprint.lines))
sprint.lines = sprint.lines[i.first:i.last] # Throw away everything else
By following steps analogous to those we took in order to extract the actual sprint times themselves, extract the countries of origin (for the sprinters). Call the resulting vector sprint.countries, and check that it has length 2841.
What country has the most appearances in this list of 2841, and how many appearances? List the next 4 most commonly occurring countries, along with their appearances. What percentage of the 2841 total times are accounted for by USA and Jamaica (combined)?
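The counting step can be sketched on a hypothetical toy vector (standing in for sprint.countries, which has length 2841):

```r
# Hypothetical toy vector of country codes
countries = c("JAM", "USA", "USA", "JAM", "USA", "GBR")
tab = sort(table(countries), decreasing=TRUE)
head(tab, 5)                                         # most common countries
sum(tab[c("USA", "JAM")]) / length(countries) * 100  # combined percentage
```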
Hw2 Q12 (7 points). Now extract the years in which the sprint times were recorded (this is the last column). (Hint: it will help to use anchoring here; this is because there are actually two dates occurring on each line; the first is the birth date of the sprinter.) Call the resulting vector sprint.years, and check that it has length 2841.
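To illustrate the anchoring idea on a fabricated line (the format only loosely resembles the real file), “$” ties the match to the race date at the end of the line, not the birth date:

```r
# Fabricated sample line; the birth date (21.08.86) comes first, the race
# date (16.08.2009) last, so anchoring with "$" targets the right one
toy.line = "1  9.58  Usain Bolt  JAM  21.08.86  Berlin  16.08.2009"
r = regexpr("[0-9]{4}$", toy.line)
as.numeric(regmatches(toy.line, r))  # 2009
```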
Use sprint.years to answer the following questions. What percentage of the 2841 total sprint times were recorded in 2000 or later? In what year were the most sprint times recorded? Lastly, plot the vector of actual times sprint.times (recreated from the “Splitting and Searching with Regexes” mini-lecture, below, for your convenience) versus the vector of years sprint.years. Do you notice a trend?
Hw2 Bonus. What percentage of the 2841 total sprint times were recorded in June, July, or August?