Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 4 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 4 questions from other labs. Your homework writeup must start as this one does: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 2. This document contains 16 of the 45 total points for Homework 4.

Huber function

huber = function(x, a=1) {
  ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}
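As a quick check of the definition above, here is a minimal sketch that evaluates huber() at a few points. The input vector is an arbitrary choice made for illustration; it is not part of the lab.

```r
huber = function(x, a=1) {
  ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}

x = c(-3, -1, 0, 0.5, 2)  # arbitrary test inputs

# With the default a=1: quadratic where |x| <= 1, linear beyond
h1 = huber(x)
h1  # 5.00 1.00 0.00 0.25 3.00

# With a=2, more of the inputs fall in the quadratic regime
h2 = huber(x, a=2)
h2  # 8.00 1.00 0.00 0.25 4.00
```

Note that the two pieces agree at |x| = a (both give a^2), so the function is continuous there.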

Hw4 Q4 (3 points). Further modify your huber() function so that, as another side effect, it produces a plot of Switzerland’s national flag. (Hint: look it up on Google; you should be able to do this using a few calls to rect().) Call your function on an input of your choosing, to demonstrate its side effects.
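If you have not used rect() before, the following minimal sketch shows how it layers filled rectangles on an empty plot. The coordinates and colors here are arbitrary and deliberately do not reproduce the flag; they only illustrate the calling convention.

```r
# Open an empty square plotting region with no axes or labels
plot(0, 0, type="n", xlim=c(0, 1), ylim=c(0, 1),
     axes=FALSE, xlab="", ylab="", asp=1)

# rect(xleft, ybottom, xright, ytop): later calls draw over earlier ones
rect(0, 0, 1, 1, col="lightblue", border=NA)       # background square
rect(0.2, 0.2, 0.5, 0.8, col="orange", border=NA)  # smaller rectangle on top
```

Because each call paints over what is already there, the order of the rect() calls determines which shapes end up in front.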

Get word table(s) function

# get.wordtab: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page 
# - split: string, specifying what to split on. Default is the regex pattern
#   "[[:space:]]|[[:punct:]]"
# - tolower: boolean, TRUE if words should be converted to lower case before
#   the word table is computed. Default is TRUE
# - keep.numbers: boolean, TRUE if words containing numbers should be kept in
#   the word table. Default is FALSE
# - plot.wordtab: boolean, TRUE if word table should be plotted as a side 
#   effect. Default is FALSE
# Output: list, containing word table, and then some basic numeric summaries

get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE, plot.wordtab=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
    
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  
  # Get rid of words with numbers, if we're asked to
  if (!keep.numbers)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  
  # Compute the word table
  wordtab = table(words)
  
  # Plot the word table, if we're asked to
  if (plot.wordtab) plot(wordtab)
  
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              number.total.words=sum(wordtab)))
}

Hw4 Q5 (5 points). Modify the definition of get.wordtab() from the previous question, so that when ylab is NULL (and plot.wordtab is TRUE), a more clever y label is used as the default. For good practice, also modify the documentation as appropriate. In particular, when the URL string in str.url is of the form http://www.someurl.com/something/else/filename.txt, the default y label should be “filename.txt word counts”. Thus the main task is to extract the last bit, of the form “filename.txt”, from str.url (hint: find the last occurrence of the character “/”, and extract the text after this character until the end of the string); then we can just paste “word counts” after this bit.

To be clear: here we are just describing the default y label, for the case in which the user doesn’t specify their own y label; when ylab is not NULL, of course, the specified y label should be used. Once you have modified get.wordtab(), call this function on Trump’s speech with plot.wordtab=TRUE and ylab="Trump's word counts", and then again on Trump’s speech with plot.wordtab=TRUE and no specification for ylab.
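The hint in Q5 can be sketched as follows. This is one possible approach, not the required one, and the URL is the placeholder from the question text rather than a real address.

```r
# Placeholder URL from the question, used only for illustration
str.url = "http://www.someurl.com/something/else/filename.txt"

# Positions of every "/" in the string; fixed=TRUE treats "/" literally
slash.pos = gregexpr("/", str.url, fixed=TRUE)[[1]]
last.slash = max(slash.pos)

# Everything after the last "/" through the end of the string
file.name = substr(str.url, last.slash + 1, nchar(str.url))

default.ylab = paste(file.name, "word counts")
default.ylab  # "filename.txt word counts"
```

Base R's basename() should give the same final piece for a string like this one, though writing the extraction by hand is the point of the exercise.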

my.fun.1 = function() { return(1:10) }
my.fun.2 = function() { invisible(1:10) }

my.fun.1()
##  [1]  1  2  3  4  5  6  7  8  9 10
a = my.fun.1()
a
##  [1]  1  2  3  4  5  6  7  8  9 10
my.fun.2() # Note that nothing is printed to the console!
b = my.fun.2()
b
##  [1]  1  2  3  4  5  6  7  8  9 10

Hence you can see that, with invisible(), unless the user explicitly defines a variable to be the result of the function call, the returned object is not printed to the console. Modify get.wordtab() so that, when plot.wordtab is TRUE, the output is returned invisibly. (When plot.wordtab is FALSE, the output should be returned as usual.) Demonstrate that your modification worked by testing it out on Trump’s speeches in a few cases.
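The pattern of returning invisibly only in some cases can be sketched on a toy function. The function my.summary() and its argument plot.it are hypothetical names invented for this illustration; they are not part of the lab.

```r
# Hypothetical toy function: returns a small list of summaries,
# invisibly when it also draws a plot, visibly otherwise
my.summary = function(x, plot.it=FALSE) {
  out = list(mean=mean(x), n=length(x))
  if (plot.it) {
    hist(x)                  # side effect
    return(invisible(out))   # nothing printed unless assigned and displayed
  }
  return(out)
}

# withVisible() reports whether a value would be auto-printed
vis.plain = withVisible(my.summary(1:10))$visible
vis.plot = withVisible(my.summary(1:10, plot.it=TRUE))$visible
```

Here withVisible() is just a convenient way to check the behavior programmatically; calling the function at the console and watching what prints shows the same thing.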

Hw4 Q6 (4 points). Recall your function get.wordtabs() from Lab 5m, which was of similar form to get.wordtab(), but took a vector of strings as its first argument, str.urls, and extracted the word tables corresponding to each of the documents specified by the URL strings. The return value of get.wordtabs() was a list of word tables. Modify the definition of get.wordtabs() (and for good practice, the documentation too), so that it now returns a list with elements: wordtabs, the list of word tables, as before; number.unique.words, a vector of the number of unique words for each document; number.total.words, a vector of the number of total words for each document; and number.total.chars, a vector of the number of total characters for each document. (Hint: your modification of get.wordtabs() should still be quite short, and should call get.wordtab() to do most of the real work.) Call get.wordtabs() on a vector of four strings, which specify the appropriate URLs to the speeches from Trump, Clinton, Pence, Kaine, found in the files trump.txt, clinton.txt, pence.txt, kaine.txt, at the usual base link. Display the first 5 elements of each returned word table, and the returned numeric summaries.
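The idea of calling a single-document function once per document and then collecting each summary into a vector can be sketched as below. To keep the example self-contained, it uses a hypothetical stand-in, toy.wordtab(), that works on a string of text instead of a URL, and two made-up sentences as input; your get.wordtabs() would call get.wordtab() on the URLs instead.

```r
# Hypothetical stand-in for get.wordtab(): takes text directly, not a URL
toy.wordtab = function(text) {
  words = strsplit(tolower(text), split="[[:space:]]|[[:punct:]]")[[1]]
  words = words[words != ""]
  wordtab = table(words)
  list(wordtab=wordtab,
       number.unique.words=length(wordtab),
       number.total.words=sum(wordtab))
}

# Made-up inputs, for illustration only
texts = c("the cat sat on the mat", "a dog, a cat, and a bird")

# One result list per document
results = lapply(texts, toy.wordtab)

# Pull the same component out of every result
wordtabs = lapply(results, function(r) r$wordtab)
number.unique.words = sapply(results, function(r) r$number.unique.words)
number.total.words = sapply(results, function(r) r$number.total.words)

number.unique.words  # 5 5
number.total.words   # 6 7
```

Using lapply() for the word tables keeps them in a list, while sapply() simplifies the scalar summaries into plain vectors.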

Hw4 Q7 (4 points). Further modify get.wordtabs() (and for good practice, the documentation too) so that it takes the additional argument plot.wordtabs, which is FALSE as the default. When plot.wordtabs is TRUE, the function get.wordtabs() should produce as a side effect an m x n grid of plots of the word tables, where the dimensions m and n should be as small as possible and as close to equal as possible. Note that the y label for each word table plot should be set according to the default value you have already implemented in (the latest version of) get.wordtab(). (Hint: you should again let get.wordtab() do all the actual work here, i.e., the plotting of word tables. So, in get.wordtabs(), you shouldn’t have any actual plotting code, and you should just set the mfrow argument in par() appropriately.) Also, when plot.wordtabs is TRUE, ensure that the output of get.wordtabs() is returned invisibly (and normally, otherwise).

Run your function get.wordtabs() on the same vector of URL strings (pointing to the speeches from Trump, Clinton, Pence, Kaine) as in the last question, with plot.wordtabs set to TRUE, and do not define a variable to be the output of this function call. Nothing should print to the console, and a 2 x 2 grid of plots of the appropriate word tables should appear. Then run your function on a longer vector of URL strings, containing the URLs for these same four speeches, plus those from Obama and Sanders, found in the files obama.txt and sanders.txt at the usual base link. Again set plot.wordtabs to be TRUE, and this time define six.speeches to be the output of your function call. Now a 3 x 2 (or 2 x 3) grid of plots of the appropriate word tables should appear. Display the three vectors of numeric summaries contained in six.speeches.
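One reasonable reading of "as small as possible and as close to equal as possible" can be sketched with a small helper. The name grid.dims() is hypothetical, and other rules for choosing the dimensions could also satisfy the question.

```r
# Hypothetical helper: rows and columns of a near-square grid for k plots
grid.dims = function(k) {
  m = ceiling(sqrt(k))  # rows: smallest integer whose square is at least k
  n = ceiling(k / m)    # columns: fewest that still fit k plots in m rows
  c(m, n)
}

grid.dims(4)  # 2 2
grid.dims(6)  # 3 2

# Set the layout, remembering the old settings so they can be restored
old.par = par(mfrow=grid.dims(6))
# ... the plotting calls would go here ...
par(old.par)
```

Saving the value returned by par() and passing it back afterwards restores the previous layout, so later plots are not forced into the same grid.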