Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 2 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 2 questions from other labs. Your homework writeup must start the same way this one does, by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday September 18. This document contains 11 of the 45 total points for Homework 2.
Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.
# Read the text file from the course site, one vector element per line
clinton.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt")
# Drop the empty lines
clinton.lines = clinton.lines[clinton.lines != ""]
Using a regex literal and grep(), display the line numbers in clinton.lines that contain “Bill”. Also, display the lines of text that contain “Bill”. (Hint: try setting value=FALSE and value=TRUE.)
Using the line numbers returned by grep() with value=FALSE in the last problem, show how the lines of text containing “Bill” can be retrieved from clinton.lines through direct indexing, i.e., without a call to grep() with value=TRUE. Note: this is just to check your understanding of indexing; when it comes to retrieving and displaying matched text, using grep() with value=TRUE is easier, and you should stick to it as the default below.
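As a minimal sketch of the mechanics behind the two exercises above, the following uses a small made-up vector, toy.lines, in place of clinton.lines. The vector and the names bill.idx and bill.txt are assumptions for illustration only; they are not part of the lab data.

```r
# Hypothetical toy vector standing in for clinton.lines
toy.lines = c("Thank you, Bill.", "Good evening.", "Bill and I met in law school.")

# value=FALSE (the default): positions of the matching elements
bill.idx = grep("Bill", toy.lines, value=FALSE)
bill.idx

# value=TRUE: the matching elements themselves
bill.txt = grep("Bill", toy.lines, value=TRUE)
bill.txt

# Indexing by the returned positions recovers the same text
identical(toy.lines[bill.idx], bill.txt)
```

Here the first and third elements match, so indexing toy.lines by bill.idx gives the same character vector as the call with value=TRUE.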
Display the lines of text in clinton.lines that contain “Bill” or “Chelsea”. Also, display the lines of text that contain “Bill” or “Chelsea” or “Tim”.
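One way to express "or" in a regex is the alternation operator |. The sketch below shows it on a made-up vector; toy.lines, two.names and three.names are hypothetical names, not part of the lab.

```r
# Hypothetical toy vector standing in for clinton.lines
toy.lines = c("Bill spoke first.", "Chelsea introduced me.",
              "Tim is my running mate.", "Thank you all.")

# "A|B" matches any element containing A or B (or both)
two.names = grep("Bill|Chelsea", toy.lines, value=TRUE)
two.names

# Alternation chains to any number of alternatives
three.names = grep("Bill|Chelsea|Tim", toy.lines, value=TRUE)
three.names
```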
Hw2 Q1 (2 points). Display the lines of text in clinton.lines that contain both of the words “America” and “great” (in any order, separated by any amount of text). Do so using only regex literals, i.e., no fancy metacharacters or quantifiers (since we haven’t learned these yet). (Hint: you can use more than one call to grep().)
Hw2 Q2 (4 points). Retrieve (but don’t display) the lines of text in clinton.lines that contain “Trump”. Reconstitute to get the words from these lines, then compute word counts. How many unique words are there? Display the first 5 words and their counts. Also display the top 5 most commonly occurring words and their counts.
Display the lines of text in clinton.lines that contain “Trump” and then a punctuation mark.
Display the lines of text in clinton.lines that contain “Trump” and then anything but a punctuation mark.
Display the lines of text in clinton.lines that contain “Bill” or “Trump”, and then a punctuation mark.
Display the lines of text in clinton.lines that contain “Bill”, or “Trump” and then a punctuation mark.
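The last two exercises differ only in where the comma falls, which corresponds to whether the alternation is grouped with parentheses. The sketch below, on a made-up vector, illustrates the POSIX class [[:punct:]], its negation, and the effect of grouping. All variable names here are hypothetical.

```r
# Hypothetical toy vector standing in for clinton.lines
toy.lines = c("I met Bill years ago.", "Then came Trump.",
              "Trump said so", "Bill, thank you")

# "Trump" followed by a punctuation mark
trump.punct = grep("Trump[[:punct:]]", toy.lines, value=TRUE)

# "Trump" followed by any character that is not punctuation
trump.nonpunct = grep("Trump[^[:punct:]]", toy.lines, value=TRUE)

# Grouped: ("Bill" or "Trump"), then punctuation
grouped = grep("(Bill|Trump)[[:punct:]]", toy.lines, value=TRUE)

# Ungrouped: "Bill" alone, or "Trump" then punctuation
ungrouped = grep("Bill|Trump[[:punct:]]", toy.lines, value=TRUE)

grouped
ungrouped
```

In this toy data the first element contains “Bill” followed by a space, so it appears in ungrouped but not in grouped. Note also that a negated class still has to match some character, so “Trump” at the very end of a line would not match the second pattern.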
Display the lines of text in clinton.lines that contain three-digit numbers.
Display the lines of text in clinton.lines that contain a number, then “%”, a percent symbol.
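A rough sketch of digit matching with the character class [0-9], again on a made-up vector with hypothetical names. One caveat, stated as an assumption about intent: three consecutive digit classes also match inside longer runs of digits, so this reads "three-digit numbers" as "at least three digits in a row".

```r
# Hypothetical toy vector standing in for clinton.lines
toy.lines = c("We created 150 jobs.", "Nearly 90% agreed.",
              "It took 8 years.", "No figures here.")

# Three digits in a row (would also match inside a longer number such as 2016)
three.digits = grep("[0-9][0-9][0-9]", toy.lines, value=TRUE)
three.digits

# A digit immediately followed by a percent symbol; "%" is not a metacharacter
percent = grep("[0-9]%", toy.lines, value=TRUE)
percent
```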
Hw2 Q3 (3 points). Display the lines of text in clinton.lines that contain a number, then a space, then “years”. Also display the lines of text that contain a number, then a space, then “years” or “million”. Finally, display the lines of text that contain a number, then anything but a space or another number.
Display the lines of text in clinton.lines that contain “!”, an exclamation mark.
Display the lines of text in clinton.lines that contain “Trump.” (note the period).
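The period is a regex metacharacter (it matches any single character), while the exclamation mark is not. The sketch below, on a made-up vector with hypothetical names, contrasts an unescaped period with two ways of matching it literally: a backslash escape, and the fixed=TRUE argument of grep().

```r
# Hypothetical toy vector standing in for clinton.lines
toy.lines = c("We can do this!", "I disagree with Trump.", "Trumps everywhere")

# "!" has no special meaning, so it matches itself
bang = grep("!", toy.lines, value=TRUE)

# Unescaped "." matches any character, including the "s" in "Trumps"
any.char = grep("Trump.", toy.lines, value=TRUE)

# Escaped: the R string "Trump\\." holds the regex Trump\. (a literal period)
escaped = grep("Trump\\.", toy.lines, value=TRUE)

# Alternative: treat the whole pattern as a literal string
literal = grep("Trump.", toy.lines, value=TRUE, fixed=TRUE)

any.char
escaped
```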
Hw2 Q4 (2 points). Consider the string below:
str = "[/\\]"
cat(str)
## [/\]
Note how it prints to the console (the result of cat()). Design a regex pattern, call it reg, using escape sequences, to match this string, and ensure that your pattern works by showing that cat(grep(reg, str, value=TRUE)) returns the above string. Explain why you need the number of backslashes that are present in your regex pattern.