Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 5 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 5 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 9. This document contains 16 of the 45 total points for Homework 5.

Practice with for()

x = c(1.2, -1.5, 2, 2.7, 0)
y = c(0.8, -0.5, 0, -0.7, 2)
z = c(-5.3, 1.4, 0.1, 2.9, -0.5, 10.8, -0.7)
a = matrix(c(-0.15, -0.33, 0.72, -0.64, -0.01, -0.14), 3, 2)
b = matrix(c(0.66, -0.77, 0.92, -0.70, -0.71, 0.85, 0.01, -0.69), 2, 4)

Hw5 Q1 (3 points). Below we generate two matrices a.big, b.big of dimensions 500 x 200 and 200 x 300, respectively, filled with standard normals. Time how long it takes your function mat.mult() to multiply these two. Also time how long it takes with the usual matrix multiplication operator. For the timings, you can use proc.time(), as demonstrated below, with some dummy code. (The third element of the returned vector of timings, “elapsed”, is what you can report.) Is there a big difference? Also, to verify that your function mat.mult() is still doing the right thing, report the maximum absolute difference between its output and the result of standard matrix multiplication here.

a.big = matrix(rnorm(500*200), 500, 200)
b.big = matrix(rnorm(200*300), 200, 300)

t0 = proc.time()
x = sqrt(log(1+exp(rnorm(100)*1:100)))
Sys.sleep(1.5) # Really, any code can go here
proc.time() - t0
##    user  system elapsed 
##   0.007   0.004   1.503
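
As a minimal sketch of the timing pattern above, the snippet below wraps proc.time() in a small helper and checks two ways of computing the same matrix product against each other. The names time.elapsed(), m, res.a, res.b, time.a, time.b, and max.diff are hypothetical and not part of the lab. crossprod() only stands in for a second implementation to compare, so this illustrates the bookkeeping and is not a solution to Q1.

```r
# Hypothetical helper (not from the lab): run a function of no arguments
# and return the "elapsed" wall-clock time in seconds
time.elapsed = function(f) {
  t0 = proc.time()
  f()
  (proc.time() - t0)[["elapsed"]]
}

m = matrix(rnorm(300*100), 300, 100)

# Two ways of computing the same product, t(m) times m
res.a = t(m) %*% m
res.b = crossprod(m)

# Elapsed time for each approach
time.a = time.elapsed(function() t(m) %*% m)
time.b = time.elapsed(function() crossprod(m))

# Maximum absolute difference between the two results; should be near zero
max.diff = max(abs(res.a - res.b))
```

Printing time.a, time.b, and max.diff then gives the two timings and the agreement check in one place.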

Practice with while()

n = 10
log.vec = vector(length=n, mode="numeric")
for (i in 1:n) {
  log.vec[i] = log(i)
}
log.vec
##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

let.to.num = function(char) { match(char[1], letters, nomatch=0) }
let.to.num("c"); let.to.num("")
## [1] 3
## [1] 0

str = "When are you going to stop iterating over me this is getting tiring"

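Hw5 Q2 (7 points). The Babylonians were apparently pretty smart. They devised the following algorithm to compute \(\sqrt{x}\), using only the basic arithmetic operations (addition, subtraction, multiplication, division). We take a first guess \(r\), let’s say \(r=x/2\) for concreteness. Either \(r^2 = x\), \(r^2 > x\), or \(r^2 < x\). Now: if \(r^2 = x\), then we are done. If \(r^2 > x\), then \(r > \sqrt{x}\) and \(x/r < \sqrt{x}\), so \(\sqrt{x}\) lies between \(x/r\) and \(r\). If \(r^2 < x\), then \(r < \sqrt{x}\) and \(x/r > \sqrt{x}\), so \(\sqrt{x}\) lies between \(r\) and \(x/r\).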

Therefore, in the latter two cases, we can replace \(r\) with the average of \(r\) and \(x/r\), and repeat. Write a function baby.root() that takes as inputs x, the numeric variable whose square root we want to compute; and tol, a numeric variable whose default value is 1e-6. Your function should implement the Babylonian method of root finding described above, and stop when \(|r^2 - x|\) is less than tol (which stands for “tolerance”). Your function should use a while() loop. Finally, it should output a list with two elements: x.sqrt, the value of \(r\) in the above description at convergence (once the loop has terminated); and n.iter, the number of iterations taken in the while() loop before convergence. Run your function when x is equal (separately) to 2, 4, 10, 99, and tol is kept at the default value. Compare the results to the actual square roots as computed by sqrt(), by computing the absolute difference between the approximated and actual square root in each case. Also display the number of iterations needed in each case.
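
To make the update rule concrete, here is a small sketch that carries out the first two averaging steps by hand, with no loop and no function. The names x.demo, r0, r1, r2, and errs are hypothetical, and the value 9 is chosen only for illustration. It shows how \(|r^2 - x|\) shrinks from step to step, and is not a substitute for baby.root().

```r
x.demo = 9
r0 = x.demo / 2              # first guess: 4.5
r1 = (r0 + x.demo / r0) / 2  # average of r0 and x/r0: 3.25
r2 = (r1 + x.demo / r1) / 2  # average of r1 and x/r1

# Distance |r^2 - x| after each step; this is the quantity compared to tol
errs = abs(c(r0, r1, r2)^2 - x.demo)
errs
```

Each entry of errs is smaller than the one before it, which is why repeating the step inside a while() loop eventually brings the error below any positive tolerance.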

Compare documents function

compare.docs = function(str.urls, split="[[:space:]]|[[:punct:]]",
                        tolower=TRUE, keep.numbers=FALSE, print.summary=TRUE) {
  # Compute the document-term matrix
  dt.mat = get.dt.mat(str.urls, split, tolower, keep.numbers)
  # Print a summary, if we're asked to
  if (print.summary) print.dt.mat(dt.mat)
  # Compute correlations
  cor.mat = cor(t(dt.mat))
  # Return a list with document-term matrix and correlations
  return(list(dt.mat=dt.mat, cor.mat=cor.mat))
}

get.dt.mat = function(str.urls, split="[[:space:]]|[[:punct:]]",
                      tolower=TRUE, keep.numbers=FALSE) {
  # First, compute all the individual word tables
  wordtabs = get.wordtabs(str.urls, split, tolower, keep.numbers)
  # Then, build the document-term matrix from these, and return it
  return(dt.mat.from.wordtabs(wordtabs))
}

get.wordtabs = function(str.urls, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE) {
  wordtabs = list()
  for (i in seq_along(str.urls)) {
    wordtabs[[i]] = get.wordtab(str.urls[i], split, tolower, keep.numbers)
    # Position of the last "/" in the URL, or 0 if there is none
    k = max(gregexpr("/", str.urls[i])[[1]], 0)
    names(wordtabs)[i] = substr(str.urls[i], k+1, nchar(str.urls[i]))
  }
  return(wordtabs)
}

get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
    
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  
  # Get rid of words with numbers, if we're asked to
  if (!keep.numbers)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  
  table(words)
}

dt.mat.from.wordtabs = function(wordtabs) {
  # First get all the unique words
  all.words = c()
  for (i in seq_along(wordtabs)) {
    all.words = c(all.words, names(wordtabs[[i]]))
  }
  all.words.unique = sort(unique(all.words))
  
  # Then build the document-term matrix
  dt.mat = matrix(0, length(wordtabs), length(all.words.unique))
  colnames(dt.mat) = all.words.unique
  rownames(dt.mat) = names(wordtabs)
  for (i in seq_along(wordtabs)) {
    dt.mat[i, names(wordtabs[[i]])] = wordtabs[[i]] 
  }
  
  return(dt.mat)
}

dt.mat = get.dt.mat(c(
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt"))

words = colnames(dt.mat)
wlens = nchar(words)

Hw5 Q3 (6 points). Write a function print.dt.mat() that takes a document-term matrix dt.mat as an input, and returns nothing, but as a side effect, prints numeric summaries of the documents. Specifically, for each row of dt.mat, this function should print a summary on a new line in the console, reporting for that particular row: the total number of words, the total number of characters, the most common word, and the longest word. You should use a for() loop over the rows of dt.mat, and for the body of this loop, use appropriate indexing and vectorization as you did in the last question to compute each of these quantities. To reiterate, you should not have a nested for() loop; i.e., the four quantities here should be computable without looping, just as in the last question. (Hint: to print to the console, use cat() and paste(). Also take a look at the examples in the mini-lecture “Iteration Basics”.)

The format of each printed summary line should be exactly as follows:

“<name> – total words: <num1>, total chars: <num2>, most common: <wrd1>, longest: <wrd2>”

where <num1>, <num2>, <wrd1>, <wrd2> are placeholders for quantities that you will compute. Also, <name> is a placeholder for the name of the current row of the document-term matrix. So, for example, when dt.mat is as defined in the last question, the printed out summary for its first row, Trump’s speech, should be exactly as follows:

“trump.txt – total words: 4431, total chars: 20299, most common: the, longest: representative”

Run your function print.dt.mat() on dt.mat, the document-term matrix from the Trump, Clinton, Pence, Kaine speeches. Then, run the function compare.docs() (finally finished!) on the appropriate vector of strings, containing URLs to the speeches from Trump, Clinton, Pence, Kaine, with the rest of the inputs set to their default values. Show all 4 rows and the first 10 columns of the resulting document-term matrix. Also display the correlation matrix.
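
As a sketch of the cat() and paste() hint in Q3, the snippet below loops over the rows of a tiny made-up matrix and prints one line per row. The names toy.mat, toy.lines, and n.distinct are hypothetical, the data are invented, and the printed quantity (the number of distinct words in each row) is deliberately different from the four summaries that Q3 asks for.

```r
# Toy document-term matrix: 2 documents, 3 words (made-up counts)
toy.mat = matrix(c(2, 0, 1,
                   0, 3, 1), nrow=2, byrow=TRUE)
rownames(toy.mat) = c("doc1.txt", "doc2.txt")
colnames(toy.mat) = c("apple", "banana", "cherry")

toy.lines = character(nrow(toy.mat))
for (i in seq_len(nrow(toy.mat))) {
  # Vectorized over the columns: count words that appear at least once
  n.distinct = sum(toy.mat[i, ] > 0)
  toy.lines[i] = paste(rownames(toy.mat)[i], "- distinct words:", n.distinct)
  # sep="" avoids a stray space before the newline
  cat(toy.lines[i], "\n", sep="")
}
```

The body of the loop uses only indexing and a vectorized sum(), so there is a single for() loop over rows and no nested loop over columns.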