Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 4 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 4 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 2. This document contains 12 of the 45 total points for Homework 4.

Important remark on compiling the homework: many homework questions depend on variables that are defined in the surrounding lab. It is easiest just to copy and paste the contents of all these labs into one big Rmd file, with your lab solutions and homework solutions filled in, knit it, and submit the HTML file.

Huber function

The Huber function is defined as: \[ \psi(x) = \begin{cases} x^2 & \text{if $|x| \leq 1$} \\ 2|x| - 1 & \text{if $|x| > 1$} \end{cases} \] This function is quadratic on the interval [-1,1], and linear outside of this interval. It transitions from quadratic to linear “smoothly”, and looks like this:

[Figure: plot of the Huber function, with red dotted lines at the values -1 and 1 and the regions labeled “Linear”, “Quadratic”, “Linear”.]

(You don’t have to do anything yet.)
Write a function huber() that takes as an input a number $x$, and returns the Huber value $\psi(x)$, as defined above. (Hint: the body of a function is just a block of R code, so you can use things you know like if() and else statements.) Check that huber(1) returns 1, and huber(4) returns 7.
The Huber function can be modified so that the transition from quadratic to linear happens at an arbitrary cutoff value $a$, as in: \[ \psi_a(x) = \begin{cases} x^2 & \text{if $|x| \leq a$} \\ 2a|x| - a^2 & \text{if $|x| > a$} \end{cases} \] Update your function huber() so that it takes two arguments: $x$, a number at which to evaluate the loss, and $a$, a number representing the cutoff value. It should now return $\psi_a(x)$, as defined above. Check that huber(3, 2) returns 8, and huber(3, 4) returns 9.
Update your function huber() so that the default value of the cutoff $a$ is 1. Check that huber(3) returns 5.
Check that huber(a=1, x=3) returns 5. Check that huber(1, 3) returns 1. Explain why these are different.
Vectorize your function huber(), so that the first input can actually be a vector of numbers, and what is returned is a vector whose elements give the Huber function evaluated at each of these numbers. (Hint: you might try using ifelse(), if you haven’t already, to vectorize nicely.) Check that huber(x=1:6, a=3) returns the vector of numbers (1, 4, 9, 15, 21, 27).
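
The R mechanics used in the exercises above (default argument values, matching arguments by name versus by position, and vectorizing a two-branch rule with ifelse()) can be tried out on a simpler function first. The sketch below uses a hypothetical function excess(), which is not part of the lab: it returns how far x lies above a threshold, and 0 if x does not exceed it.

```r
# excess: hypothetical helper (not part of the lab)
# Inputs:
# - x: numeric vector
# - thresh: number, the threshold. Default is 1
# Output: numeric vector, equal to x - thresh where x > thresh, and 0 elsewhere
excess = function(x, thresh=1) {
  # ifelse() evaluates the test elementwise, so x can be a vector
  ifelse(x > thresh, x - thresh, 0)
}

excess(3)                # thresh takes its default value of 1
excess(thresh=1, x=3)    # named arguments can be given in any order
excess(1, 3)             # positional matching: x=1, thresh=3
excess(x=1:6, thresh=3)  # vectorized over x
```

The second and third calls pass the same two numbers but return different values, because unnamed arguments are matched to the function's inputs in the order they are written.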

Hw4 Q1 (4 points). Reproduce the plot of the Huber function that you see above at the start of the lab. The axes and title should match exactly, as should the Huber curve (in black), the red dotted lines at the values -1 and 1, and the text labels “Linear”, “Quadratic”, “Linear”.

Hw4 Q2 (2 points). Your instructor computed the Huber function values $\psi_a(x)$ over a bunch of different $x$ values, stored in huber.vals and x.vals, respectively. However, the cutoff $a$ was, let’s say, lost. Using huber.vals, x.vals, and the definition of the Huber function, you should be able to figure out the cutoff value $a$, at least roughly. Estimate $a$ and explain how you got there. (Hint: draw in R or on a piece of paper the quadratic function $y=x^2$ on top of the Huber function; when are they different?)

x.vals = seq(0, 5, length=21)
huber.vals = c(0.0000, 0.0625, 0.2500, 0.5625, 1.0000, 1.5625, 2.2500,
               3.0625, 4.0000, 5.0625, 6.2500, 7.5625, 9.0000, 10.5000,
               12.0000, 13.5000, 15.0000, 16.5000, 18.0000, 19.5000, 
               21.0000)

Get word table function

The (latest version of the) function get.wordtab() from the “Function Basics” mini-lecture is copied below for your convenience. Run this function on the speeches from Trump, Clinton, Pence, Kaine, stored in the files trump.txt, clinton.txt, pence.txt, kaine.txt, available at the usual base link, and save the results as trump.wordtab, clinton.wordtab, pence.wordtab, kaine.wordtab, respectively. Don’t specify split and tolower, keeping them at their default values. (See how easy that was! Just four lines of code!) Report how many times Trump said “her”, Clinton said “him”, and Pence and Kaine each said “but”. Then plot the word tables in a 2 x 2 grid, with Trump’s on the top left, Pence’s on the top right, Clinton’s on the bottom left, and Kaine’s on the bottom right.

# get.wordtab: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page 
# - split: string, specifying what to split on. Default is the regex pattern
#   "[[:space:]]|[[:punct:]]"
# - tolower: boolean, TRUE if words should be converted to lower case before
#   the word table is computed. Default is TRUE
# Output: word table, i.e., vector with counts as entries and associated
#   words as names

get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
    
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  
  table(words)
}
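
To see what the body of get.wordtab() does, it can help to run the same steps one at a time on a short string held in memory, rather than on a web page. The sketch below does this; the text is made up for illustration and does not come from any of the speeches.

```r
# Toy text, made up for illustration (not from any of the speeches)
text = "The cat saw the dog. The dog saw 2 cats!"

# Split on whitespace or punctuation, as in get.wordtab()'s default
words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
words = words[words != ""]   # drop empty strings left by adjacent separators
words = tolower(words)

wordtab = table(words)
wordtab          # counts as entries, words as names
wordtab["the"]   # look up the count for a single word by name
```

Indexing a word table by a word, as in the last line, is one way to report how often a given word occurs.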

Redefine the function get.wordtab() to add the input keep.numbers, which takes a boolean (TRUE or FALSE), and if set to FALSE, discards the words in the word table that contain numbers, before returning this word table. Its default value should be FALSE. For good practice, properly document your new function with comments (most of this can just be copied straight from the above, but you will want to add an appropriate description for your new input). After redefining get.wordtab(), recompute the word tables for Trump, Clinton, Pence, and Kaine with keep.numbers at its default value, and replot them in a 2 x 2 grid. (See how easy this was! You didn’t need to edit four separate code chunks to remove numbers from the word tables. This is the beauty of functions.) As a sanity check, you shouldn’t see any numbers at the very left end of the x-axes on these plots.
Interlude: a quick demo/reminder about loops and lists (this will help with the next question). Remember that a for() loop in R takes the form:

sum = 0
for (i in 1:10) {
  sum = sum + i
}
sum

## [1] 55

Here, we’ve added the numbers between 1 and 10, stored in the variable sum. Below, we demonstrate how to use a for() loop to populate a list of length 5, where the 1st element contains the numeric 1, the 2nd element contains the numeric 2, etc.

my.list = list()
for (i in 1:5) {
  my.list[[i]] = i
}
my.list

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5

(You don’t have to do anything for this part; just study this code, so that you can solve the next question.)

Create a function called get.wordtabs() that takes the same inputs as (your latest version of) get.wordtab(), except that its first input should be str.urls, which is now a vector of strings, each element specifying a URL. For good practice, properly document the function with comments (most of this can just be copied straight from the above, but you will want to modify as appropriate). You will want your function get.wordtabs() to compute a word table for each web page in str.urls, and return a list of these word tables. (Hint: create an empty list, and populate it with a for() loop, where in every iteration you just call get.wordtab(). That is, do not copy the body of get.wordtab() over here, just call this function as appropriate. The body of get.wordtabs() therefore shouldn’t be very long … and again, this is the beauty of working with functions!)
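
Two building blocks behind the exercises above can be tried on toy data first. The sketch below is only an illustration under stated assumptions: the word vector is made up, grepl() with the pattern "[0-9]" is one possible way to flag names containing a digit, and count.letters() is a hypothetical helper standing in for any function of a single string.

```r
# Part 1: flag the entries of a word table whose names contain a digit
words = c("2016", "a", "b2b", "the", "the", "vote")   # made-up words
wordtab = table(words)
has.digit = grepl("[0-9]", names(wordtab))
wordtab[!has.digit]   # keeps only the entries "a", "the", "vote"

# Part 2: call an existing function once per element of a character vector,
# collecting the results in a list
# count.letters: hypothetical helper, returns a table of letter counts
count.letters = function(str) {
  table(strsplit(str, split="")[[1]])
}

strs = c("banana", "apple")
letter.tabs = list()
for (i in 1:length(strs)) {
  letter.tabs[[i]] = count.letters(strs[i])
}
length(letter.tabs)     # one table per input string
letter.tabs[[1]]["a"]   # count of "a" in "banana"
```

The loop body is a single function call, which is why a function built this way stays short.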

Hw4 Q3 (6 points). Run your function get.wordtabs() on a vector of the four strings which specify the appropriate URLs to the speeches from Trump, Clinton, Pence, Kaine. Save the result as four.wordtabs. Check that its entries (which are word tables) are equal to trump.wordtab, clinton.wordtab, pence.wordtab, kaine.wordtab, respectively (which you computed previously). (Hint: use all(), as demonstrated in the “Function Basics” mini-lecture.)
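
As a reminder of how all() reduces an elementwise comparison to a single TRUE or FALSE, here is a small sketch on two toy word tables. The tables are made up for illustration; checking both the names and the counts is one reasonable notion of equality, not necessarily the one used in the mini-lecture.

```r
# Two toy word tables built from the same words, given in a different order
tab1 = table(c("the", "cat", "the", "dog"))
tab2 = table(c("dog", "the", "the", "cat"))

# Elementwise comparisons give logical vectors; all() reduces them to one value
all(names(tab1) == names(tab2))   # same words, in the same (sorted) order
all(tab1 == tab2)                 # same counts
```

If the two tables could have different lengths, comparing length(tab1) and length(tab2) first is a sensible precaution, since an elementwise comparison is only meaningful when the entries line up.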

Then use get.wordtabs() to get the word tables for the Gingrich, Melania Trump, Obama, and Sanders speeches, which are stored in the files gingrich.txt, melania.txt, obama.txt, sanders.txt, at the usual base link http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/. Save the result as four.more.wordtabs, and plot these four word tables in a 2 x 2 grid, with Gingrich’s and Melania Trump’s in the first row, and Obama’s and Sanders’ in the bottom row.
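
Plotting the entries of a list in a 2 x 2 grid can be sketched as follows. The list toy.wordtabs and the titles in toy.names are made up for illustration; with par(mfrow=c(2, 2)), panels are filled row by row, so the first two plots land in the top row and the last two in the bottom row.

```r
# Toy list of four word tables, made up for illustration
toy.wordtabs = list(table(c("a", "b", "b")), table(c("c", "c", "d")),
                    table(c("e", "f", "f")), table(c("g", "h", "h")))
toy.names = c("First", "Second", "Third", "Fourth")

par(mfrow=c(2, 2))   # 2 x 2 grid, filled row by row
for (i in 1:4) {
  plot(toy.wordtabs[[i]], main=toy.names[i], xlab="Word", ylab="Count")
}
par(mfrow=c(1, 1))   # restore the default single-panel layout
```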

Lab 5m: Function Basics

Statistical Computing, 36-350

Monday September 26, 2016
