Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 4 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 4 questions from other labs. Your homework writeup must start like this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 2. This document contains 16 of the 45 total points for Homework 4.

Huber function

The Huber function from Lab 5m is copied below, for your convenience. Modify this function so that it returns a list with two elements: huber.vals and quad.vals, the first being the evaluations of the Huber function at the input argument x (which, recall, can be a vector), and the second being the evaluations of the simple quadratic function \(y=x^2\) at the input. Display the returned list when the input is x = c(0.5, 1, 3) (and the cutoff remains at its default value of 1).

huber = function(x, a=1) {
  ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}
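As a reminder of the general pattern (not the solution itself), a function can return several results at once by wrapping them in a named list. The sketch below uses a hypothetical function abs.and.sq(), unrelated to huber(), purely for illustration.

```r
# Hypothetical example: return two vectors of evaluations in one named list
abs.and.sq = function(x) {
  abs.vals = abs(x)   # absolute values of the input
  sq.vals = x^2       # squares of the input
  return(list(abs.vals=abs.vals, sq.vals=sq.vals))
}

res = abs.and.sq(c(-2, 0.5, 3))
res$abs.vals   # access one element by name
res$sq.vals
```

Naming the list elements lets the caller pick out each result with the $ operator, rather than relying on their positions.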

Modify your huber() function so that, as a side effect, it prints the string “Invented by the great Swiss statistician Peter Huber!” to the console. (Hint: recall cat().) Call your function on an input of your choosing, to demonstrate this side effect.
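For reference, here is a minimal sketch of a side effect produced with cat(). The function sq.with.msg() and its message are made up for this illustration; the point is only that the printed text and the return value are separate things.

```r
# Hypothetical example: print a message as a side effect, then return a value
sq.with.msg = function(x) {
  cat("Squaring the input!\n")  # side effect: text goes to the console
  return(x^2)                   # the return value is unaffected by cat()
}

y = sq.with.msg(1:3)  # the message prints even though the result is assigned
y
```

Note the explicit "\n" at the end of the string: unlike print(), cat() does not add a newline on its own.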

Hw4 Q4 (3 points). Further modify your huber() function so that, as another side effect, it produces a plot of Switzerland’s national flag. (Hint: look it up on Google; you should be able to do this using a few calls to rect().) Call your function on an input of your choosing, to demonstrate its side effects.

Get word table(s) function

The (latest version of the) function get.wordtab() from the “Return Values and Side Effects” mini-lecture is copied below for your convenience. Add an element to the returned list called number.total.chars, giving the total character count in the document. (Note: this total character count only concerns words that occur in the word table, i.e., it shouldn’t count spaces, and, if keep.numbers is FALSE, it shouldn’t count words containing numbers.) Run this function, using the default inputs, on the Trump and Clinton speeches, found in the files trump.txt and clinton.txt, up at the base link http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/. Check that Trump’s total character count is 20299, and Clinton’s total character count is 23204.

# get.wordtab: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page 
# - split: string, specifying what to split on. Default is the regex pattern
#   "[[:space:]]|[[:punct:]]"
# - tolower: boolean, TRUE if words should be converted to lower case before
#   the word table is computed. Default is TRUE
# - keep.numbers: boolean, TRUE if words containing numbers should be kept in
#   the word table. Default is FALSE
# - plot.wordtab: boolean, TRUE if word table should be plotted as a side 
#   effect. Default is FALSE
# Output: list, containing word table, and then some basic numeric summaries

get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE, plot.wordtab=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
    
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  
  # Get rid of words with numbers, if we're asked to
  if (!keep.numbers)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  
  # Compute the word table
  wordtab = table(words)
  
  # Plot the word table, if we're asked to
  if (plot.wordtab) plot(wordtab)
  
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              number.total.words=sum(wordtab)))
}
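To make the definition of the total character count concrete, here is a small sketch on a toy word vector (an assumption, standing in for the cleaned-up words vector inside get.wordtab()). It shows two equivalent ways to count characters: directly from the words, or from the word table alone.

```r
# Toy word vector, standing in for `words` inside get.wordtab()
words = c("the", "cat", "the", "hat")
wordtab = table(words)

# Option 1: add up the lengths of all the words directly
sum(nchar(words))

# Option 2: use only the word table (length of each unique word, times its count)
sum(nchar(names(wordtab)) * wordtab)
```

Both expressions give 12 here, since each of the four words has three characters.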

Add an argument to get.wordtab() that allows for a y label to be specified for the (potential) plot of the word table. Call this argument ylab and set its default value to NULL. Note that the value of ylab should have no effect unless plot.wordtab is TRUE. For good practice, update the documentation of your function to describe the new argument you are adding. Test out get.wordtab() on Trump’s speech, setting plot.wordtab=TRUE and ylab="Trump's word counts". Again test out this function, setting plot.wordtab=TRUE and without specifying ylab, so that it remains at its default value, NULL. What appears on the plot as the default, when ylab is NULL?

Hw4 Q5 (5 points). Modify the definition of get.wordtab() from the previous question, so that when ylab is NULL (and plot.wordtab is TRUE), a more clever y label is used as the default. For good practice, also modify the documentation as appropriate. In particular, when the URL string in str.url is of the form: http://www.someurl.com/something/else/filename.txt, we will want the default y label to be “filename.txt word counts”. Thus the main task is to extract the last bit, of the form “filename.txt”, from str.url (hint: to do so, find the last occurrence of the character “/”, and extract the text after this character until the end of the string), and then we can just paste “word counts” after this bit.

To be clear: here we are just describing the default y label, in the case that the user doesn’t specify their own y label; when ylab is not NULL, of course, the specified y label should be used. Once you have modified get.wordtab(), call this function on Trump’s speech with plot.wordtab=TRUE and ylab="Trump's word counts", and then again on Trump’s speech with plot.wordtab=TRUE and no specification for ylab.
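The string manipulation described in the hint can be sketched on its own, outside of get.wordtab(), as follows. The variable names (slash.pos, last.slash, file.name, default.ylab) are our own choices, and this is just one of several ways to do it.

```r
str.url = "http://www.someurl.com/something/else/filename.txt"

# Positions of every "/" in the string (fixed=TRUE: match literally, no regex)
slash.pos = gregexpr("/", str.url, fixed=TRUE)[[1]]
last.slash = max(slash.pos)

# Text after the last "/", through the end of the string
file.name = substr(str.url, last.slash + 1, nchar(str.url))
default.ylab = paste(file.name, "word counts")
default.ylab

# Cross-check against base R's basename(), which strips everything up to
# and including the last path separator
identical(file.name, basename(str.url))
```

Inside a function, a check like is.null(ylab) would decide whether this default label is needed at all.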

In the previous homework question and the lab question before that, we were primarily concerned with the plot being produced, which was a side effect of the function get.wordtab(). Note that you had to (or should have had to!) define a dummy variable to be the result of get.wordtab(), because otherwise the word table and numeric summaries would have printed out to the console (the printing of the word table being particularly annoying). R actually allows you to “gracefully” return an object from a function you define, by using invisible() in place of where you would usually write return(). The following is a demonstration of how invisible() works.

my.fun.1 = function() { return(1:10) }
my.fun.2 = function() { invisible(1:10) }

my.fun.1()

##  [1]  1  2  3  4  5  6  7  8  9 10

a = my.fun.1()
a

##  [1]  1  2  3  4  5  6  7  8  9 10

my.fun.2() # Note that nothing is printed to the console!
b = my.fun.2()
b

##  [1]  1  2  3  4  5  6  7  8  9 10

Hence you can see that, with invisible(), unless the user explicitly defines a variable to be the result of the function call, the returned object is not printed to the console. Modify get.wordtab() so that, when plot.wordtab is TRUE, the output is returned invisibly. (When plot.wordtab is FALSE, the output should be returned as usual.) Demonstrate that your modification worked by testing it out on Trump’s speech in a few cases.
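The conditional version of this idea, returning invisibly only when a side effect was requested, might look like the following sketch. The function vec.summary() and its print.msg argument are hypothetical, and a cat() message stands in for the plot as the side effect.

```r
# Hypothetical example: return invisibly only when a side effect is requested
vec.summary = function(x, print.msg=FALSE) {
  out = list(mean=mean(x), range=range(x))
  if (print.msg) {
    cat("Summary computed!\n")   # the side effect
    return(invisible(out))       # returned, but not auto-printed
  }
  return(out)                    # returned and auto-printed as usual
}

vec.summary(1:10)                      # prints the list
vec.summary(1:10, print.msg=TRUE)      # prints only the message
s = vec.summary(1:10, print.msg=TRUE)
s$mean                                 # the value is still there

# withVisible() reports whether a result would be auto-printed
withVisible(vec.summary(1:10))$visible
withVisible(vec.summary(1:10, print.msg=TRUE))$visible
```

The base R function withVisible() is a convenient way to check invisibility programmatically, instead of watching the console.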

Hw4 Q6 (4 points). Recall your function get.wordtabs() from Lab 5m, which was of similar form to get.wordtab(), but took a vector of strings as its first argument, str.urls, and extracted the word tables corresponding to each of the documents specified by the URL strings. The return value of get.wordtabs() was a list of word tables. Modify the definition of get.wordtabs() (and for good practice, the documentation too), so that it now returns a list with elements: wordtabs, the list of word tables, as before; number.unique.words, a vector of the number of unique words for each document; number.total.words, a vector of the number of total words for each document; and number.total.chars, a vector of the number of total characters for each document. (Hint: your modification of get.wordtabs() should still be quite short, and should call get.wordtab() to do most of the real work.) Call get.wordtabs() on a vector of four strings, which specify the appropriate URLs to the speeches from Trump, Clinton, Pence, Kaine, found in the files trump.txt, clinton.txt, pence.txt, kaine.txt, at the usual base link. Display the first 5 elements of each returned word table, and the returned numeric summaries.
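The general pattern in the hint, a multi-input wrapper that delegates to a single-input function and then regroups the results, can be sketched with toy functions. Both count.words() and count.words.all() below are hypothetical stand-ins that work on plain strings rather than URLs; they are not the get.wordtab() or get.wordtabs() functions themselves.

```r
# Hypothetical single-document function, standing in for get.wordtab()
count.words = function(str) {
  words = strsplit(str, split=" ")[[1]]
  wordtab = table(words)
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              number.total.words=sum(wordtab)))
}

# Multi-document wrapper: call the single-document function on each input,
# then pull the matching element out of every result
count.words.all = function(strs) {
  results = lapply(strs, count.words)
  return(list(
    wordtabs=lapply(results, function(r) r$wordtab),
    number.unique.words=sapply(results, function(r) r$number.unique.words),
    number.total.words=sapply(results, function(r) r$number.total.words)))
}

out = count.words.all(c("a b a", "c d e f"))
out$number.unique.words
out$number.total.words
```

Here lapply() keeps the word tables as a list (they have different lengths), while sapply() simplifies the single-number summaries into vectors.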

Hw4 Q7 (4 points). Further modify get.wordtabs() (and for good practice, the documentation too) so that it takes the additional argument plot.wordtabs, which is FALSE as the default. When plot.wordtabs is TRUE, the function get.wordtabs() should produce as a side effect an m x n grid of plots of the word tables, where the dimensions m and n should be as small as possible and as close to equal as possible. Note that the y label for each word table plot should be set according to the default value you have already implemented in (the latest version of) get.wordtab(). (Hint: you should again let get.wordtab() do all the actual work here, i.e., the plotting of word tables. So, in get.wordtabs(), you shouldn’t have any actual plotting code, and you should just set the mfrow argument in par() appropriately.) Also, when plot.wordtabs is TRUE, ensure that the output of get.wordtabs() is returned invisibly (and normally, otherwise).

Run your function get.wordtabs() on the same vector of URL strings (pointing to the speeches from Trump, Clinton, Pence, Kaine) as in the last question, with plot.wordtabs set to TRUE, and do not define a variable to be the output of this function call. Nothing should print to the console, and a 2 x 2 grid of plots should appear of the appropriate word tables. Then run your function on a longer vector of URL strings, containing the URLs for these same four speeches, plus those from Obama and Sanders, found in the files obama.txt and sanders.txt, up at the usual base link. Again set plot.wordtabs to be TRUE, and this time define six.speeches to be the output of your function call. Now a 3 x 2 (or 2 x 3) grid of plots should appear of the appropriate word tables. Display the 3 vectors of numeric summaries contained in six.speeches.
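One possible way to choose the grid dimensions is sketched below; the helper name grid.dims() is our own, and the rule it uses (take the number of columns as the ceiling of the square root, then use as few rows as needed) is just one reasonable reading of "as small as possible and as close to equal as possible".

```r
# Hypothetical helper: grid dimensions for k plots, as close to square as possible
grid.dims = function(k) {
  n.col = ceiling(sqrt(k))    # smallest column count for a near-square grid
  n.row = ceiling(k / n.col)  # fewest rows that still fit all k plots
  return(c(n.row, n.col))
}

grid.dims(4)  # 2 rows, 2 columns
grid.dims(6)  # 2 rows, 3 columns
```

The returned vector has the (rows, columns) order that the mfrow argument of par() expects, so it could be passed along as par(mfrow=grid.dims(k)).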

Lab 5w: Return Values and Side Effects

Statistical Computing, 36-350

Wednesday September 28, 2016
