Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 2 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 2 questions from other labs. Your homework writeup must start the same way this one does: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 18. This document contains 11 of the 45 total points for Homework 2.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.

Regexp literals

clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
clinton.lines = clinton.lines[clinton.lines != ""]

Hw2 Q1 (2 points). Display the lines of text in clinton.lines that contain both of the words “America” and “great” (in any order, separated by any amount of text). Do so using only regexp literals, i.e., no metacharacters or quantifiers (since we haven’t learned these yet). (Hint: you can use more than one call to grep().)
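
As an illustration of the hint, the sketch below chains two grep() calls on a small made-up vector. The vector toy.lines and the search words are hypothetical and are not taken from the speech; the point is only that filtering the output of one literal match by a second literal match keeps the lines containing both words, in either order.

```r
# Toy data (hypothetical, not from clinton.txt)
toy.lines = c("apple pie and banana bread",
              "banana split",
              "fresh apple juice",
              "a banana, then an apple")

# First pass: keep the lines that contain the literal "apple"
has.apple = grep("apple", toy.lines, value=TRUE)

# Second pass: among those, keep the lines that also contain "banana"
has.both = grep("banana", has.apple, value=TRUE)
has.both
```

With value=TRUE, grep() returns the matching strings rather than their indices, which is what lets the second call operate directly on the result of the first.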

Hw2 Q2 (4 points). Retrieve (but don’t display) the lines of text in clinton.lines that contain “Trump”. Reconstitute to get the words from these lines, then compute word counts. How many unique words are there? Display the first 5 words and their counts. Also display the top 5 most commonly occurring words and their counts.
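
The word-count workflow can be sketched on toy data as follows. This assumes that "reconstitute" means collapsing the lines into one long string and splitting it on single spaces; the vector toy.lines and the other toy.* names are hypothetical.

```r
# Toy data (hypothetical, not from clinton.txt)
toy.lines = c("the cat sat", "the dog sat on the mat")

# Collapse the lines into one string, then split it on single spaces
toy.text = paste(toy.lines, collapse=" ")
toy.words = strsplit(toy.text, split=" ")[[1]]

# Count how often each distinct word occurs
toy.wordtab = table(toy.words)
length(toy.wordtab)   # number of unique words

# Reorder the counts from largest to smallest and show the top 2
toy.sorted = sort(toy.wordtab, decreasing=TRUE)
head(toy.sorted, 2)
```

Note that table() orders its entries by word, not by count, so the first few entries of the table and the most frequent words are generally different things.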

Metacharacters

Hw2 Q3 (3 points). Display the lines of text in clinton.lines that contain a number, then a space, then “years”. Also display the lines of text that contain a number, then a space, then “years” or “million”. Finally, display the lines of text that contain a number, then anything but a space or another number.
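
The three kinds of metacharacter involved here (character classes, alternation, and negated classes) can be sketched on toy data. The vector toy.lines and the words searched for are hypothetical, and "number" is taken to mean a single digit character.

```r
# Toy data (hypothetical, not from clinton.txt)
toy.lines = c("I bought 3 apples",
              "She has 12 pears",
              "No fruit here",
              "Room 7b is open")

# Character class: a digit, then a space, then "apples"
m1 = grep("[0-9] apples", toy.lines, value=TRUE)

# Grouping with alternation: a digit, a space, then "apples" or "pears"
m2 = grep("[0-9] (apples|pears)", toy.lines, value=TRUE)

# Negated class: a digit followed by a character that is
# neither a space nor another digit
m3 = grep("[0-9][^ 0-9]", toy.lines, value=TRUE)

m1; m2; m3
```

The negated class still has to match some character, so a digit at the very end of a line would not be picked up by the last pattern.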

Escape sequences

Hw2 Q4 (2 points). Consider the string below:

str = "[/\\]"
cat(str)
## [/\]

Note how it prints to the console as the result of cat(). Design a regex pattern, call it reg, that uses escape sequences to match this string, and verify that your pattern works by showing that cat(grep(reg, str, value=TRUE)) returns the above string. Explain why your regex pattern needs the number of backslashes it contains.
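
As a reminder of how the two layers of escaping interact, here is a minimal sketch on a different, made-up string (toy.str is hypothetical). R first processes the escapes in the string literal, and the regex engine then processes the escapes in whatever characters remain.

```r
# Toy string (hypothetical): five characters, a . b \ c
toy.str = "a.b\\c"
cat(toy.str)
n.chars = nchar(toy.str)

# "\\." in R source becomes the regex \. which matches a literal period
dot.literal = grepl("a\\.b", toy.str)
# The escaped period no longer acts as a wildcard
dot.wild = grepl("a\\.b", "axb")

# "\\\\" in R source becomes the regex \\ which matches one literal backslash
slash.literal = grepl("b\\\\c", toy.str)

c(n.chars, dot.literal, dot.wild, slash.literal)
```

Printing a pattern with cat() shows the characters the regex engine actually receives, which can help when counting backslashes.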