Name:
Andrew ID:
Collaborated with:
This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.
There are Homework 4 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 4 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 2. This document contains 17 of the 45 total points for Homework 4.
The function huber() below defines the variable x.squared in the body of the function to be the square of the input argument x. Now, in a new line of code (outside of the function definition), define the variable x.squared to be equal to 999. Then call huber(x=3), and display the value of x.squared. What is its value? Is this affected by the function call huber(x=3)? It shouldn’t be! Reiterate this point with several more lines of code, in which you repeatedly define x.squared to be something different (even something nonnumeric, like a string), and then call huber(x=3), and demonstrate afterwards that the value of x.squared hasn’t changed.
huber = function(x, a=1) {
  x.squared = x^2
  ifelse(abs(x) <= a, x.squared, 2*a*abs(x)-a^2)
}
Similar to the above question, now define the variable a to be equal to -59.6, then call huber(x=3, a=2), and show that the value of a after this function call is unchanged. And repeat a few times with different assignments for the variable a, to reiterate this point.
The previous two questions showed you that a function’s body has its own environment, in which locally defined variables, like those defined in the body itself or those defined through inputs to the function, take priority over those defined outside of the function. However, when a variable referred to in the body of a function is not defined in the local environment, the default is to look for it in the global environment (outside of the function). Below is a “sloppy” implementation of the Huber function, in which the cutoff a is not passed as an argument to the function. In a new line of code (outside of the function definition), define a to be equal to 1.5 and then call huber.sloppy(x=3). What is the output? Explain why this gives you the same result as huber(x=3, a=1.5). Repeat this a few times, by defining a and then calling huber.sloppy(x=3), to show that the value of a does indeed affect the function’s output as expected. (Also note: nonnumeric assignments for a here will cause an error. You should understand why!)
huber.sloppy = function(x) {
  ifelse(abs(x) <= a, x^2, 2*a*abs(x)-a^2)
}
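The lookup rule described above can be seen in a small sketch that is separate from the Huber function. The names add.shift(), add.shift.sloppy() and shift.by are hypothetical, made up only for this illustration.

```r
# Hypothetical function: shift.by is an argument, so the local value is used
add.shift = function(x, shift.by=1) {
  x + shift.by
}

# Hypothetical function: shift.by is not defined locally, so R looks for it
# outside of the function (here, in the environment where it was defined)
add.shift.sloppy = function(x) {
  x + shift.by
}

shift.by = 100
res.local = add.shift(x=3, shift.by=2)   # uses the argument: 3 + 2
res.outer = add.shift.sloppy(x=3)        # uses the outer shift.by: 3 + 100
shift.by = -1
res.outer.2 = add.shift.sloppy(x=3)      # outer value changed: 3 + (-1)
c(res.local, res.outer, res.outer.2)
```

Note that the first call leaves the outer shift.by untouched, while the sloppy version silently depends on whatever shift.by happens to be at the time of the call.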
Hw4 Q8 (1 point). At last, a difference between = and <-, explained! Many of you have asked about this. The equal sign = and assignment operator <- are often used interchangeably in R, and some people will say that the choice between the two is mostly a matter of stylistic taste. This is not the full story. Indeed, = and <- behave very differently when used to set input arguments in a function call. As we showed above, setting, say, a=5 as the input to huber() has no effect on the global assignment for a. Just to demonstrate this once more:
a = 10
huber(x=3, a=5)
## [1] 9
a
## [1] 10
However, replacing a=5 with a<-5 in the call to huber() leads to a very different result, in that it does affect the global assignment for a. Do so, and show that this is indeed the case.
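To see the same contrast on something other than huber(), here is a minimal sketch. The function raise() and the variable p are hypothetical, made up only for this illustration.

```r
# Hypothetical function, for illustration only
raise = function(x, p=2) {
  x^p
}

p = 10
out.eq = raise(x=3, p=2)      # '=' only names the argument inside the call
p.after.eq = p                # the outer p is still 10

out.arrow = raise(x=3, p<-2)  # '<-' assigns p where the call is made, and the
                              # value 2 is passed as the second argument
p.after.arrow = p             # the outer p is now 2
c(out.eq, p.after.eq, out.arrow, p.after.arrow)
```

Both calls return the same value; the only difference is what happens to p outside of the function.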
Hw4 Q9 (1 point). The story now gets even more subtle. It turns out that the assignment operator <- allows us to define global variables even when we are specifying inputs to a function. Pick a variable name that has not been defined yet in your workspace, say b (or something else, if this has already been used in your R Markdown document). Call huber(x=3, b<-20). Then display the value of b: this variable should now exist in the global environment, and it should be equal to 20!
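One caveat worth keeping in mind: R evaluates function arguments lazily, so an assignment written inside a call only happens if the function actually uses that argument. The sketch below illustrates this with two hypothetical functions, uses.arg() and ignores.arg(), and assumes the names lab.demo.v1 and lab.demo.v2 are not already defined in your workspace.

```r
# Hypothetical functions, for illustration only
uses.arg = function(x, p=2) x^p      # p is evaluated in the body
ignores.arg = function(x, p=2) x     # p is never evaluated

out.1 = uses.arg(3, lab.demo.v1 <- 4)     # assignment runs when p is used
v1.exists = exists("lab.demo.v1")         # TRUE: the variable was created

out.2 = ignores.arg(3, lab.demo.v2 <- 4)  # argument is never evaluated
v2.exists = exists("lab.demo.v2")         # FALSE: no assignment took place
c(out.1, out.2, v1.exists, v2.exists)
```

With huber(), the second argument is always used in the body, which is why the assignment in the question above does take effect.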
Hw4 Q10 (2 points). The property of the assignment operator <- demonstrated in the last question, although tricky, can also be pretty useful. Leverage this property to plot the function \(y=0.05x^2 - \sin(x)\cos(x) + 0.1\exp(1+\log(x))\) over 50 x values between 0 and 2, using only one line of R code and one call to the function seq() (or one use of the colon operator :).
Hw4 Q11 (2 points). Finally, give an example to show that the property of the assignment operator <- demonstrated in the last two questions does not hold in the body of a function. That is, give an example in which <- is used in the body of a function to define a variable, but this doesn’t translate into global assignment.
Recall the code sketch for the function get.dt.mat(). This function takes a vector of strings (each of which is a URL), and builds a document-term matrix from the corresponding documents. This code sketch was actually perfectly valid R code (which doesn’t need to be the case for a code sketch in general) and for your convenience, it is defined below. Also defined are the functions get.wordtabs() and get.wordtab(), from Lab 5m, which will be used here. You don’t have to do anything yet, but it will be your job in the rest of the lab and homework questions to define the function dt.mat.from.wordtabs(), so that you can eventually run get.dt.mat(). (Also, in case you’re wondering, we’re giving you a break from documenting these functions so that your head doesn’t spin too much … but remember, in general, consistent, proper documentation is extremely important!)
get.dt.mat = function(str.urls, split="[[:space:]]|[[:punct:]]",
                      tolower=TRUE, keep.numbers=FALSE) {
  # First, compute all the individual word tables
  wordtabs = get.wordtabs(str.urls, split, tolower, keep.numbers)
  # Then, build the document-term matrix from these, and return it
  return(dt.mat.from.wordtabs(wordtabs))
}
get.wordtabs = function(str.urls, split="[[:space:]]|[[:punct:]]",
                        tolower=TRUE, keep.numbers=FALSE) {
  wordtabs = list()
  # seq_along() also behaves sensibly when str.urls is empty
  for (i in seq_along(str.urls)) {
    wordtabs[[i]] = get.wordtab(str.urls[i], split, tolower, keep.numbers)
  }
  return(wordtabs)
}
get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  # Get rid of words with numbers, if we're asked to
  if (!keep.numbers)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  table(words)
}
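To get a feel for what a single word table looks like, the steps inside get.wordtab() can be traced by hand on a short string instead of a URL. The sentence below is made up for illustration, and the sketch assumes the default settings (lower case, numbers dropped).

```r
# A made-up line of text, standing in for the contents of a document
text = "The cat sat. The cat, aged 3, ran!"

# Same steps as in the body of get.wordtab(), with the default arguments
words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
words = words[words != ""]
words = tolower(words)
words = grep("[0-9]", words, invert=TRUE, value=TRUE)
wordtab = table(words)

wordtab          # counts for each word
names(wordtab)   # the words themselves, in alphabetical order
```

The words are stored as the names of the table, which is what makes it possible to look up a count by word, as in wordtab["cat"].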
Write a code sketch for the function dt.mat.from.wordtabs(). This doesn’t have to be runnable R code, just something to get you started. Recall that there are two steps to building a document-term matrix (from the “Bag-of-Words” mini-lecture and/or Lab 2f): getting all the unique words across all the documents, and then filling out the document-term matrix row by row. Therefore your code sketch should highlight these two steps.
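As a reminder of what the two steps look like in practice, here is one possible way to carry them out on a tiny made-up example. The objects toy.wordtabs, all.words and toy.mat are hypothetical names used only here, and this is just one of several reasonable approaches, not the required one.

```r
# Toy word tables (made up), standing in for the output of get.wordtabs()
toy.wordtabs = list(table(c("a", "b", "a")), table(c("b", "c")))

# Step 1: collect the unique words across all documents, sorted alphabetically
all.words = c()
for (i in seq_along(toy.wordtabs)) {
  all.words = c(all.words, names(toy.wordtabs[[i]]))
}
all.words = sort(unique(all.words))

# Step 2: fill out the matrix row by row, matching counts to columns by name
toy.mat = matrix(0, nrow=length(toy.wordtabs), ncol=length(all.words))
colnames(toy.mat) = all.words
for (i in seq_along(toy.wordtabs)) {
  toy.mat[i, names(toy.wordtabs[[i]])] = toy.wordtabs[[i]]
}
toy.mat
```

Words that do not appear in a document keep their initial count of 0, since only the columns named in that document's table are overwritten.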
Let’s complete the first task: getting all the unique words. Using the example list of word tables wordtabs below, write code to extract all the unique words appearing in the documents. Then sort the unique words in alphabetical order. (Hint: it will help to use a for() loop. For a reminder of how this works, look back at the way get.wordtabs() is defined.) Show the first 10 unique words in alphabetical order.
wordtabs = get.wordtabs(c(
"http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
"http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt",
"http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt",
"http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt"))
Hw4 Q12 (6 points). Now let’s complete the second task: building the document-term matrix row by row. Again, using the example list of word tables wordtabs from the previous question, write code to build the document-term matrix. (Hint: it will again help to use a for() loop.) Name the rows of the document-term matrix “Candidate 1” through “Candidate 4”. The reason for this will be clear shortly. Show all 4 rows and the first 10 columns of the document-term matrix.
Hw4 Q13 (5 points). Finally, define the function dt.mat.from.wordtabs(), by substituting the code you wrote in the last two questions into the initial code sketch you wrote. This function should produce a document-term matrix with row names being “Candidate 1” through “Candidate N”, where N is the total number of documents being considered. With this, you should now be able to run get.dt.mat(). Run this function on the speeches from Trump, Clinton, and Pence, and show all 3 rows and the first 25 columns of the resulting document-term matrix.
Hw4 Bonus. Modify the above functions as appropriate so that the row names of the document-term matrix, of the form “Candidate 1” through “Candidate N”, are replaced by something more descriptive, that better reflects the identity of the documents. For example, look back at Hw4 Q5, where you set a clever default y label—the same strategy could be used here. Run your modification on an example to show what it produces.