Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 5 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 5 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 9. This document contains 16 of the 45 total points for Homework 5.

Practice with `for()`

Write a for() loop to add, elementwise, the two vectors x, y defined below. Check that the result is exactly the same as what you get with the usual (vectorized) addition operation in R.

x = c(1.2, -1.5, 2, 2.7, 0)
y = c(0.8, -0.5, 0, -0.7, 2)

Write a for() loop to sum up the logs of all elements of the vector z below that are positive. Show that you can perform the same task without a for loop, just with careful vector subsetting and one call to sum(). Check that the results are the same, either way you do it.

z = c(-5.3, 1.4, 0.1, 2.9, -0.5, 10.8, -0.7)

Write a for() loop to multiply the two matrices a, b defined below. Note: here we’re asking for a proper matrix multiplication. Hence, because a is 3 x 2 and b is 2 x 4, the resulting matrix, call it c, should be 3 x 4. (Hint: you’ll need a nested for() loop. If you need to remind yourself about matrix multiplication, sketch it out on a piece of paper!) Check that c is equal to the result of using multiplying a, b using the usual matrix multiplication operator.

a = matrix(c(-0.15, -0.33, 0.72, -0.64, -0.01, -0.14), 3, 2)
b = matrix(c(0.66, -0.77, 0.92, -0.70, -0.71, 0.85, 0.01, -0.69), 2, 4)

Write a function mat.mult() to perform manual matrix multiplication, via a nested for() loop, as you implemented in the last question. Its inputs should be: a and b, two matrices (whose dimensions we will assume are compatible). Its output should be: the matrix equal to the product of a and b. Demonstrate that your function works by calling it on two matrices and showing that the output equals the result given by the usual matrix multiplication operator. (These can be anything, but make sure they have different dimensions than the ones defined in the previous question—that way, this is a “real” test for your function.)

Hw5 Q1 (3 points). Below we generate two matrices a.big, b.big of dimensions 500 x 200 and 200 x 300, respectively, filled with standard normals. Time how long it takes your function mat.mult() to multiply these two. Also time how long it takes with the usual matrix multiplication operator. For the timings, you can use proc.time(), as demonstrated below, with some dummy code. (The third element of the returned vector of timings, “elapsed”, is what you can report.) Is there a big difference? Also, to verify that your function mat.mult() is still doing the right thing, report the maximum absolute difference between its output and the result of standard matrix multiplication here.

a.big = matrix(rnorm(500*200), 500, 200)
b.big = matrix(rnorm(200*300), 200, 300)

t0 = proc.time()
x = sqrt(log(1+exp(rnorm(100)*1:100)))
Sys.sleep(1.5) # Really, any code can go here
proc.time() - t0

##    user  system elapsed 
##   0.007   0.004   1.503

Practice with `while()`

The code below uses a for() loop to populate the entries of a numeric vector log.vec with the values log(1) through log(10). Modify it so that it uses a while() loop, rather than a for() loop, to accomplish the same purpose.

n = 10
log.vec = vector(length=n, mode="numeric")
for (i in 1:n) {
  log.vec[i] = log(i)
}
log.vec

##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

The function below let.to.num() takes a lower case letter and converts it to a number according to its order in the alphabet. Also, it counts spaces as 0. (For the purpose of this question, it is not vectorized—i.e., it only expects a single character.) Write a while() loop to complete the following task. Iterate over the characters in the string str below, convert each to a number using let.to.num(), and keep their running sum. Stop as soon as this sum exceeds 300, and display the resulting substring. To be clear, this substring should contain all the characters in str up until the last point where their sum is less or equal to (does not exceed) 300. (Hint: recall substr().)

let.to.num = function(char) { match(char[1], letters, no=0) }
let.to.num("c"); let.to.num("")

## [1] 3

## [1] 0

str = "When are you going to stop iterating over me this is getting tiring"

Hw5 Q2 (7 points). The Babylonians were apparently pretty smart. They devised the following algorithm to compute \(\sqrt{x}\), using only the basic arithmetic operations (addition, subtraction, multiplication, division). We take a first guess \(r\), let’s say \(r=x/2\) for concreteness. Either \(r^2 > x\), \(r^2 < x\), or \(r^2 = x\). Now:

If \(r^2=x\), we can stop
If \(r^2 > x\), then \(r > \sqrt{x}\), and \(x/r < \sqrt{x}\)
If \(r^2 < x\), then \(r < \sqrt{x}\), and \(x/r > \sqrt{x}\)

Therefore, in the latter two cases, we can replace \(r\) with average of \(r\) and \(x/r\), and repeat. Write a function baby.root() that takes as inputs x, the numeric variable whose square root we want to compute; and tol, a numeric variable whose default value is 1e-6. Your function should implement the Babylonian method of root finding described above, and stop when \(|r^2 - x|\) is less than tol (which stands for “tolerance”). Your function should use a while() loop. Finally, it should output a list with two elements: x.sqrt, the value of \(r\) in the above description at convergence (once the loop has terminated); and n.iter the number of iterations taken in the while() loop before convergence. Run your function when x is equal (separately) to 2, 4, 10, 99, and tol is kept at the default value. Compare the results to the actual square roots as computed by sqrt(), by computing the absolute difference between the approximated and actual square root in each case. Also display the display the numbers of iterations needed in each case.

Compare documents function

In the “Function Design” mini-lecture, we sketched a function compare.docs() that computed a document-term matrix from a vector of strings (each being a URL containing text on the web), and then optionally printed out summaries. The code is given below, for your convenience. In Lab 5f, we completed the first step, and wrote the get.dt.mat() function. This is again given below, for your convenience. Note: you don’t have to do anything yet, this is just a setup for the questions to come.

compare.docs = function(str.urls, split="[[:space:]]|[[:punct:]]",
                        tolower=TRUE, keep.numbers=FALSE, print.summary=TRUE) {
  # Compute the document-term matrix
  dt.mat = get.dt.mat(str.urls, split, tolower, keep.numbers)
  # Print a summary, if we're asked to
  if (print.summary) print.dt.mat(dt.mat)
  # Compute correlations
  cor.mat = cor(t(dt.mat))
  # Return a list with document-term matrix and correlations
  return(list(dt.mat=dt.mat, cor.mat=cor.mat))
}

get.dt.mat = function(str.urls, split="[[:space:]]|[[:punct:]]",
                      tolower=TRUE, keep.numbers=FALSE) {
  # First, compute all the individual word tables
  wordtabs = get.wordtabs(str.urls, split, tolower, keep.numbers)
  # Then, build the document-term matrix from these, and return it
  return(dt.mat.from.wordtabs(wordtabs))
}

get.wordtabs = function(str.urls, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE) {
  wordtabs = list()
  for (i in 1:length(str.urls)) {
    wordtabs[[i]] = get.wordtab(str.urls[i], split, tolower, keep.numbers)
    k = max(max(gregexpr("/", str.urls[i])[[1]]), 1)
    names(wordtabs)[i] = substr(str.urls[i], k+1, nchar(str.urls[i]))
  }
  return(wordtabs)
}

get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE, keep.numbers=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
    
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  
  # Get rid of words with numbers, if we're asked to
  if (!keep.numbers)
    words = grep("[0-9]", words, inv=TRUE, val=TRUE)
  
  table(words)
}

dt.mat.from.wordtabs = function(wordtabs) {
  # First get all the unique words
  all.words = c()
  for (i in 1:length(wordtabs)) {
    all.words = c(all.words, names(wordtabs[[i]]))
  }
  all.words.unique = sort(unique(all.words))
  
  # Then build the document-term matrix
  dt.mat = matrix(0, length(wordtabs), length(all.words.unique))
  colnames(dt.mat) = all.words.unique
  rownames(dt.mat) = names(wordtabs)
  for (i in 1:length(wordtabs)) {
    dt.mat[i, names(wordtabs[[i]])] = wordtabs[[i]] 
  }
  
  return(dt.mat)
}

Below we run get.dt.mat() to produce a document-term matrix dt.mat from the speeches given by Trump, Clinton, Pence, Kaine at the 2016 Republican and Democratic National Conventions. On just the first row of dt.mat, Trump’s document, show how to use appropriate indexing and vectorization to compute the following quantities: the total number of words used by Trump, the total number of characters used by Trump, the most common word used by the Trump, and the longest word used by Trump. To reiterate, you should not use for() or while() loops here; but instead stick to appropriate indexing and vectorization. You might find it useful to use the variables words and wlens, computed for your convenience below. (Hint: the last quantity, the longest word used by Trump, is the trickiest to compute. But consider modifying wlens so that we “zero out”, i.e., assign a zero value, to all words that were not used by Trump, and then find the maximum value of this modified word lengths vector.)

dt.mat = get.dt.mat(c(
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/clinton.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/pence.txt",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/kaine.txt"))

words = colnames(dt.mat)
wlens = nchar(words)

Hw5 Q3 (6 points). Write a function print.dt.mat() that takes a document-term matrix dt.mat as an input, and returns nothing, but as a side effect, prints numeric summaries of the documents. Specifically, for each row of dt.mat, this function should print a summary on a new line in the console, reporting for that particular row: the total number of words, the total number of characters, the most common word, and the longest word. You should use a for() loop over the rows of dt.mat, and for the body of this loop, use appropriate indexing and vectorization as you did in the last quantity to compute each of these quantities. To reiterate, you should not have a nested for() loop; i.e., the four quantities here should be computable without looping, just as in the last question. (Hint: to print to the console, use cat() and paste(). Also take a look at the examples in the mini-lecture “Iteration Basics”.)

The format of each line printed summary should be exactly as follows:

“<name> – total words: <num1>, total chars: <num2>, most common: <wrd1>, longest: <wrd2>”

where <num1>, <num2>, <wrd1>, <wrd2> are placeholders for quantities that you will compute. Also, <name> is a placeholder for the name of the current row of the document-term matrix. So, for example, when dt.mat is as defined in the last question, the printed out summary for its first row, Trump’s speech, should be exactly as follows:

“trump.txt – total words: 4431, total chars: 20299, most common: the, longest: representative”

Run your function print.dt.mat() on dt.mat, the document-term matrix from the Trump, Clinton, Pence, Kaine speeches. Then, run the function compare.docs() (finally finished!) on the appropriate vector of strings, containing URLs to the speeches from Trump, Clinton, Pence, Kaine, with the rest of the inputs set to their default values. Show all 4 rows and the first 10 columns of the resulting document-term matrix. Also display the correlation matrix.

Lab 6m: Iteration Basics

Statistical Computing, 36-350

Monday October 2, 2016

Practice with `for()`

Practice with `while()`

Compare documents function

Lab 6m: Iteration Basics

Statistical Computing, 36-350

Monday October 2, 2016

Practice with for()

Practice with while()

Compare documents function

Practice with `for()`

Practice with `while()`