Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 4 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 4 questions from other labs. Your homework writeup must start the same way this one does: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 2. This document contains 17 of the 45 total points for Homework 4.

You guessed it, Huber function!

huber = function(x, a=1) {
  x.squared = x^2
  ifelse(abs(x) <= a, x.squared, 2*a*abs(x)-a^2)
}
huber.sloppy = function(x) {
  ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}
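As a quick sketch of how these two definitions differ: huber() takes a as an argument, so it uses whatever value is passed in (or the default of 1), while huber.sloppy() has no argument a and must look a up in the environment where it was defined, which here is the global environment. The particular values of a below are chosen only for illustration.

```r
huber = function(x, a=1) {
  x.squared = x^2
  ifelse(abs(x) <= a, x.squared, 2*a*abs(x)-a^2)
}
huber.sloppy = function(x) {
  ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}

a = 1
r1 = huber(3)         # argument default a=1, so 2*1*3 - 1^2 = 5
r2 = huber.sloppy(3)  # global a=1, so also 5

a = 10
r3 = huber(3)         # unchanged: the default a=1 is still used, giving 5
r4 = huber.sloppy(3)  # global a is now 10 and |3| <= 10, so 3^2 = 9
```

Changing the global a silently changes what huber.sloppy() returns, which is why relying on global variables inside a function is discouraged.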

Hw4 Q8 (1 point). At last, a difference between = and <-, explained! Many of you have asked about this. The equal sign = and assignment operator <- are often used interchangeably in R, and some people say that the choice between the two is mostly a matter of stylistic taste. This is not the full story. In fact, = and <- behave very differently when used to set input arguments in a function call. As we showed above, setting, say, a=5 as the input to huber() has no effect on the global assignment for a. Just to demonstrate this once more:

a = 10
huber(x=3, a=5)
## [1] 9
a
## [1] 10

However, replacing a=5 with a<-5 in the call to huber() leads to a very different result, in that it does affect the global assignment for a. Do so, and show that this is indeed the case.

Hw4 Q9 (1 point). The story now gets even more subtle. It turns out that the assignment operator <- allows us to define global variables even when we are specifying inputs to a function. Pick a variable name that has not been defined yet in your workspace, say b (or something else, if this has already been used in your R Markdown document). Call huber(x=3, b<-20). Then display the value of b: this variable should now exist in the global environment, and it should be equal to 20!
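The mechanism at work here can be sketched with a small example of our own. The helper first.arg() and the variable names new.var and never.made are hypothetical and do not appear elsewhere in this lab. An expression passed as an argument is evaluated in the caller's environment, but only at the moment the function actually uses that argument, because R evaluates arguments lazily.

```r
# Hypothetical helper: returns x and never touches its second argument
first.arg = function(x, unused) x

# median() uses its argument, so new.var <- c(1, 5, 9) is evaluated,
# and the assignment takes place in the calling environment
m = median(new.var <- c(1, 5, 9))
made = exists("new.var", inherits=FALSE)

# first.arg() never uses its second argument, so never.made <- 20
# is not evaluated and no such variable is created
r = first.arg(1, never.made <- 20)
not.made = !exists("never.made", inherits=FALSE)
```

So an assignment written inside a function call takes effect only if the function gets around to evaluating that argument.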

Hw4 Q10 (2 points). The property of the assignment operator <- demonstrated in the last question, although tricky, can also be pretty useful. Leverage this property to plot the function \(y=0.05x^2 - \sin(x)\cos(x) + 0.1\exp(1+\log(x))\) over 50 x values between 0 and 2, using only one line of R code and one call to the function seq() (or one use of the colon operator :).

Hw4 Q11 (2 points). Finally, give an example to show that the property of the assignment operator <- demonstrated in the last two questions does not hold in the body of a function. That is, give an example in which <- is used in the body of a function to define a variable, but this doesn’t translate into global assignment.

Get document-term matrix function

get.dt.mat = function(str.urls, split="[[:space:]]|[[:punct:]]",
                      tolower=TRUE, keep.numbers=FALSE) {
  # First, compute all the individual word tables
  wordtabs = get.wordtabs(str.urls, split, tolower, keep.numbers)
  # Then, build the document-term matrix from these, and return it
  return(dt.mat.from.wordtabs(wordtabs))
}

get.wordtabs = function(str.urls, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE) {
  wordtabs = list()
  for (i in seq_along(str.urls)) {
    wordtabs[[i]] = get.wordtab(str.urls[i], split, tolower, keep.numbers)
  }
  return(wordtabs)
}

get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
    
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  
  # Get rid of words with numbers, if we're asked to
  if (!keep.numbers)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  
  table(words)
}
wordtabs = get.wordtabs(c(
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt"))
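To see what get.wordtab() does to a document without fetching anything from a URL, the same steps can be traced on a short in-memory string. This is only a sketch: the sentence below is made up, and readLines() is skipped because the text is already in hand.

```r
# A made-up one-line "document" standing in for the result of readLines()
text = "The cat sat. The dog sat on 2 mats!"

# Split on whitespace or punctuation, then drop the empty strings that
# arise wherever two delimiters are adjacent (such as ". ")
words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
words = words[words != ""]

# Same defaults as get.wordtab(): lower case, and drop words with digits
words = tolower(words)
words = grep("[0-9]", words, invert=TRUE, value=TRUE)

# Count how often each distinct word appears
wordtab = table(words)
wordtab
```

The names of the resulting table are the distinct words in alphabetical order, and its entries are their counts. Each element of the list wordtabs has this same structure.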

Hw4 Q12 (6 points). Now let’s complete the second task: building the document-term matrix row by row. Again, using the example list of word tables wordtabs from the previous question, write code to build the document-term matrix. (Hint: it will again help to use a for() loop.) Name the rows of the document-term matrix “Candidate 1” through “Candidate 4”. The reason for this will be clear shortly. Show all 4 rows and the first 10 columns of the document-term matrix.

Hw4 Q13 (5 points). Finally, define the function dt.mat.from.wordtabs(), by substituting the code you wrote in the last two questions into the initial code sketch you wrote. This function should produce a document-term matrix with row names being “Candidate 1” through “Candidate N”, where N is the total number of documents being considered. With this, you should now be able to run get.dt.mat(). Run this function on the speeches from Trump, Clinton, and Pence, and show all 3 rows and the first 25 columns of the resulting document-term matrix.

Hw4 Bonus. Modify the above functions as appropriate so that the row names of the document-term matrix, of the form “Candidate 1” through “Candidate N”, are replaced by something more descriptive that better reflects the identity of the documents. For example, look back at Hw4 Q5, where you set a clever default y label; the same strategy could be used here. Run your modification on an example to show what it produces.