Name:
Andrew ID:
You must work alone.
You will be submitting a knitted HTML file to Canvas, as usual, by Sunday May 6 at 10pm.
For full credit on each question, make sure you follow exactly what is asked, and answer each prompt. The total is 76 points (+ 14 challenge points).
You may only ask clarification questions on Piazza; you may not ask questions for help. This is a final exam, not a homework assignment.
This should go without saying (but we have had several problems in past years): do not cheat. It ends poorly for everyone.
You may want to turn on caching, which will make your R markdown document knit more quickly. You can do so by uncommenting the line in the R chunk below.
1a (3 points). We’re going to start on the easy side. Recall the data set from the 2016 Summer Olympics in Rio de Janeiro (taken from https://github.com/flother/rio2016). It is up on the course website at http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/rio.csv. Read it into your R session as a data frame called rio. Display its dimensions and its column names. Report the total number of observations containing at least one NA in any of the variables. What percentage of observations is this, of the total number of observations?
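A minimal sketch of one way to answer 1a, assuming the read.csv() defaults suffice for this file:
rio = read.csv("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/rio.csv")
dim(rio)
colnames(rio)
n.incomplete = sum(!complete.cases(rio))  # observations with at least one NA
n.incomplete
100 * n.incomplete / nrow(rio)            # as a percentage of all observations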
1b (3 points). Print the observations containing the top 5 longest names, separately for each gender (the female and male levels of the sex variable).
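One possible approach, sketched below: nchar() measures the name lengths, and order() sorts within each sex level.
for (s in c("female", "male")) {
  rio.s = rio[rio$sex == s, ]
  len = nchar(as.character(rio.s$name))
  print(head(rio.s[order(len, decreasing = TRUE), ], 5))
}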
1c (3 points). Report the number of names with at least one hyphen. Hint: recall grep(). Then compute the counts of hyphenated names per country (only for countries where at least one athlete has a hyphenated name). Display the top 11 countries and corresponding counts, in decreasing order of the counts.
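A sketch using grep(), as the hint suggests; table() then tallies hyphenated names by country (countries with no hyphenated names simply never appear in the table).
hyphen.ind = grep("-", rio$name)
length(hyphen.ind)  # number of names with at least one hyphen
hyphen.counts = sort(table(as.character(rio$nationality[hyphen.ind])),
                     decreasing = TRUE)
head(hyphen.counts, 11)  # top 11 countries and their counts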
1d (6 points). Your manager wants you to produce some data summaries. You like using plyr, but your manager is new to this package and needs some convincing that it is producing what is expected. He asks you to produce a data frame that counts the number of athletes per country, where only athletes with complete data—no missing values in their variables—are counted. The output should be sorted in decreasing order of the counts. He asks you to do this both with and without plyr, and then check that the results produced are exactly the same, using identical(). Hint: this might require you to reformat your result produced without plyr, and alter its rownames/colnames. Carry out his requests!
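A sketch of both computations; depending on how read.csv() typed the columns (factor versus character, integer versus double), and on how ties in the counts are ordered, some extra coercion beyond what is shown may be needed before identical() returns TRUE.
library(plyr)
rio.complete = rio[complete.cases(rio), ]

# With plyr: one row per country, counting athletes with complete data
counts.plyr = ddply(rio.complete, "nationality", summarize, count = length(name))
counts.plyr = counts.plyr[order(counts.plyr$count, decreasing = TRUE), ]

# Without plyr: table() gives the same counts
tab = sort(table(as.character(rio.complete$nationality)), decreasing = TRUE)
counts.base = data.frame(nationality = names(tab), count = as.integer(tab),
                         stringsAsFactors = FALSE)

# Reformat so the two results can match exactly
counts.plyr$nationality = as.character(counts.plyr$nationality)
rownames(counts.plyr) = rownames(counts.base) = NULL
identical(counts.plyr, counts.base)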
2a. Below is a function, medals.comparison(), and it has a few bugs. After fixing the bugs, uncomment the lines below that call all.equal(), and make sure that each line returns TRUE.
# Function: medals.comparison, to compare the medal count from one country to
# the count from all others
# Inputs:
# - df: data frame (assumed to have the same column structure as the rio data
# frame)
# - country: string, the country to be examined (e.g., "USA")
# - medal: string, the medal type to be examined (e.g., "gold")
# Output: numeric vector of length 2, giving the medal count from the given
# country and all other countries
medals.comparison = function(df, country, medal) {
  ind = which(rio$nationality == country)
  country.df = df[ind, ]
  others = df[!ind, ]
  country.sum = sum(country.df$medal)
  others.sum = sum(others$medal)
  return(country.sum, others.sum)
}
#all.equal(medals.comparison(rio, "USA", "gold"), c(139, 527))
#all.equal(medals.comparison(rio, "USA", "silver"), c(54, 601))
#all.equal(medals.comparison(rio, "USA", "bronze"), c(71, 633))
#all.equal(medals.comparison(rio[rio$sport=="rowing",], "CAN", "silver"),
# c(2, 46))
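For reference, a corrected version might look like the following (the bugs: the function indexed the global rio instead of df, mixed which() with ! negation, used $medal where the medal type is stored in a string, and passed two arguments to return()).
medals.comparison = function(df, country, medal) {
  ind = df$nationality == country         # index df, not the global rio; keep it logical
  country.df = df[ind, ]
  others = df[!ind, ]                     # logical negation now works as intended
  country.sum = sum(country.df[[medal]])  # [[ ]] since medal is a string
  others.sum = sum(others[[medal]])
  return(c(country.sum, others.sum))      # return() takes a single object
}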
2b. Below is a function, date.converter(), and it has bugs. As before, after fixing the bugs, uncomment the lines below that perform comparisons, and make sure that each line returns TRUE.
# Function: date.converter, to convert dates of the form DD.MM.YY to
# YYYY-MM-DD (assuming the year is before 2000)
# Inputs:
# - date: factor of dates of the form DD.MM.YY
# Output: string vector of dates of the form YYYY-MM-DD
date.converter = function(date) {
  date = as.character(date)
  date.split = strsplit(date, ".")
  date.new = lapply(date.split, function(x) {
    paste("19", x[3], "-", x[2], "-", x[1])
  })
  return(date.new)
}
#all.equal(date.converter("15.12.85"), "1985-12-15")
#all.equal(date.converter(c("08.04.82", "08.07.84")),
# c("1982-04-08", "1984-07-08"))
2c (3 points). Recall the data set from the fastest men’s 100m sprint times (taken from http://www.alltime-athletics.com/m_100ok.htm). It is up on the course website at http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/sprint.dat. Read it into your R session, and display its dimensions and column names. Then define a new data frame sprint.dat.best, which has only 3 columns called Name, Birthdate, and Best time, containing the name of an athlete, his birthdate, and his best (fastest) time across all appearances in the data set. Display the dimensions and first 5 rows of your new data frame.
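A sketch of one approach; the read.table() arguments and the column names Name, Birthdate, and Time are assumptions about how the sprint.dat file is laid out, and may need adjusting.
sprint.dat = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/sprint.dat",
                        sep = "\t", quote = "", header = TRUE)
dim(sprint.dat)
colnames(sprint.dat)

# For each athlete, find the row with his fastest (minimum) time
best.ind = tapply(1:nrow(sprint.dat), sprint.dat$Name,
                  function(i) i[which.min(sprint.dat$Time[i])])
sprint.dat.best = data.frame(Name = sprint.dat$Name[best.ind],
                             Birthdate = sprint.dat$Birthdate[best.ind],
                             "Best time" = sprint.dat$Time[best.ind],
                             check.names = FALSE)
dim(sprint.dat.best)
head(sprint.dat.best, 5)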
2d (6 points). Merge the data frames rio and sprint.dat.best by matching on athlete names. Your merged data frame should have only the rows that correspond to the matched names, and the rows should be sorted in alphabetically increasing order by these names. Also, your merged data frame should have only the columns name, gold, silver, bronze, and Best time. You may merge the data frames either manually, or using merge(). Call the result rio.merged, and display its dimensions and its first 5 rows. Now, use your merged data frame to answer the following: what is the average Best time for athletes that earned a gold medal at the Rio 2016 Olympics? A silver medal? A bronze medal? No medal? Are these numbers all increasing?
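A sketch using merge(); it assumes the gold, silver, and bronze columns hold medal counts per athlete, so "no medal" means all three are zero.
rio.merged = merge(rio[, c("name", "gold", "silver", "bronze")],
                   sprint.dat.best[, c("Name", "Best time")],
                   by.x = "name", by.y = "Name")
rio.merged = rio.merged[order(as.character(rio.merged$name)), ]
dim(rio.merged)
head(rio.merged, 5)

# Average best time by medal status at Rio 2016
best = rio.merged[["Best time"]]
mean(best[rio.merged$gold > 0])
mean(best[rio.merged$silver > 0])
mean(best[rio.merged$bronze > 0])
mean(best[rio.merged$gold == 0 & rio.merged$silver == 0 & rio.merged$bronze == 0])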
Challenge (4 points). Using your date.converter() function from Q2b, convert the Birthdate column of sprint.dat.best into YYYY-MM-DD format. Then match these birthdates to the date_of_birth column of rio. How many matches do you find? Is it more than the number of rows in your merged data set from Q2d, and if so, why would this be? What do you conclude about trying to merge sprint.dat.best and rio on athlete birthdates?
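A small sketch, assuming the date_of_birth column of rio is already stored in YYYY-MM-DD format (which the question implies):
birthdates = date.converter(sprint.dat.best$Birthdate)
sum(birthdates %in% as.character(rio$date_of_birth))  # number of matches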
3a (3 points). Set the random number generator seed, using set.seed(36350). Simulate 1000 predictors \(X\) from the uniform distribution on the interval \([-1,1]\), and 1000 responses \(Y\) from the model: \[
Y \sim N(2.0 + 5.4 \cdot X, 1),
\] where \(N(\mu,\sigma^2)\) denotes the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Compute the mean of your resulting vector of responses. Is this close to what you would expect?
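A minimal sketch of the simulation:
set.seed(36350)
n = 1000
x = runif(n, min = -1, max = 1)
y = rnorm(n, mean = 2.0 + 5.4 * x, sd = 1)
mean(y)  # should be near 2.0, since E(Y) = 2.0 + 5.4 * E(X) and E(X) = 0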
3b (3 points). Run a linear regression of \(Y\) on \(X\). Report the coefficients and standard errors. Are the coefficients close to the true values?
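A sketch, assuming x and y as simulated above:
lm.fit = lm(y ~ x)
summary(lm.fit)$coefficients  # estimates and standard errors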
3c. Perform this iterative strategy (note: this is also known as bootstrapping) on your data set from Q3a. Hint: a for() loop is probably simplest. What are the standard deviations you get for the intercept and slope? Are they close to the standard errors you found in Q3b?
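A sketch of the pairs bootstrap with a for() loop, as the hint suggests; the number of resamples B is an assumed value, since the original count of repetitions did not survive in the text above.
B = 500  # number of bootstrap resamples (an assumed value)
coef.mat = matrix(0, B, 2)
for (b in 1:B) {
  i = sample(n, replace = TRUE)          # resample observations with replacement
  coef.mat[b, ] = coef(lm(y[i] ~ x[i]))  # refit and record intercept and slope
}
apply(coef.mat, 2, sd)  # compare to the standard errors from Q3b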
4a. Uncomment the line below to load (from the course website) a data frame called wiki, of dimension 800 x 2. Each row represents a Wikipedia entry for a different individual. The first column, name, has the individual’s name, and the second column, text, has the text from the Wikipedia entry stored as one big long string (already stripped of punctuation marks). Display the names of the individuals in rows 5, 150, and 800.
#load(url("http://www.stat.cmu.edu/~ryantibs/statcomp-S18/data/wiki.rdata"))
4b (3 points). Write a function, find.names(), that takes two arguments: df, a data frame with columns name and text; and str, a string. Your function should find all the Wikipedia entries (in df$text) that contain the word str, ignoring the case of characters, and return the corresponding names of individuals (in df$name), as a string vector sorted in alphabetical order. For example, find.names(wiki, "Carnegie Mellon") should return c("Alan Fletcher (composer)", "John Tarnoff", "Joshua Bloch"). Display the outputs of find.names(wiki, "Steelers") and find.names(wiki, "machine learning").
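A sketch of find.names(); grepl() with ignore.case = TRUE handles the case-insensitive matching.
find.names = function(df, str) {
  hits = grepl(str, df$text, ignore.case = TRUE)
  sort(as.character(df$name[hits]))
}
find.names(wiki, "Steelers")
find.names(wiki, "machine learning")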
4c. Write a function, create.word.list(), that takes just one argument: df, a data frame with columns name and text. Your function should create a list of word vectors, with one element per row in df. That is, the first word vector should be formed from df$text[1], the second word vector should be formed from df$text[2], and so on. Follow this workflow for creating each word vector:
Implement this function, run it on wiki, and save the result as word.list. Display the first 10 words for the 400th Wikipedia entry.
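Since the workflow steps themselves did not survive above, the sketch below assumes a typical pipeline: lowercase the text, split on spaces, and drop empty strings. Adjust to whatever the original workflow specifies.
create.word.list = function(df) {
  lapply(as.character(df$text), function(txt) {
    words = strsplit(tolower(txt), split = " ")[[1]]  # lowercase, then split on spaces
    words[words != ""]                                # drop any empty strings
  })
}

word.list = create.word.list(wiki)
head(word.list[[400]], 10)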
4d (10 points). Write a function, create.dt.mat(), that takes just one argument: df, a data frame with columns name and text. Your function should first call create.word.list() on df, to get the list of word vectors for each row in df. Then your function should create and return a document-term matrix from these word vectors. Recall, a document-term matrix has dimensions: \[
\text{(number of documents)} \times \text{(number of unique words)},
\] where the columns are sorted in alphabetical order of the words. (And to be clear, here each row of df is a document.) There is one catch, however: for the unique words, we are only going to consider words that appear in at least 5 separate documents. That is, if a word appears in fewer than 5 separate documents, then it gets excluded from the document-term matrix. Hint: you’ll probably want to revisit how we computed document-term matrices previously in the course, on Homework 5 in particular.
Implement this function, run it on wiki, and save the result as dt.mat. Display the dimensions of dt.mat, the sum of all of its entries, and its first 5 rows and columns.
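A sketch of create.dt.mat(), built on top of create.word.list() as the question requires; the at-least-5-documents threshold is enforced by counting, for each word, the number of distinct documents it appears in.
create.dt.mat = function(df) {
  word.list = create.word.list(df)
  # Number of distinct documents each word appears in
  doc.freq = table(unlist(lapply(word.list, unique)))
  keep.words = sort(names(doc.freq)[doc.freq >= 5])  # alphabetical column order
  # Fill in the document-term matrix of counts
  dt.mat = matrix(0, length(word.list), length(keep.words),
                  dimnames = list(as.character(df$name), keep.words))
  for (i in 1:length(word.list)) {
    counts = table(word.list[[i]])
    counts = counts[names(counts) %in% keep.words]
    dt.mat[i, names(counts)] = counts
  }
  return(dt.mat)
}

dt.mat = create.dt.mat(wiki)
dim(dt.mat)
sum(dt.mat)
dt.mat[1:5, 1:5]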
4e (6 points). When you loaded wiki.rdata, you actually brought two objects into your R session: the wiki data frame, which you’ve been working with so far, but also centers.mat, a numeric matrix of dimension 4 x 2836. You can think of each row of centers.mat as the word counts for specially-crafted “pseudodocuments” that are prototypical of a certain kind of Wikipedia entry. Compute the squared Euclidean distance between every row of dt.mat and every row of centers.mat. As a reminder, the squared Euclidean distance between vectors \((X_1,\ldots,X_n)\) and \((Y_1,\ldots,Y_n)\) is: \[
(X_1-Y_1)^2 + (X_2-Y_2)^2 + \cdots + (X_n-Y_n)^2.
\] For each Wikipedia entry (every row of dt.mat), figure out which “pseudodocument” (which row of centers.mat) it is closest to in squared Euclidean distance—hence, most similar to, in a certain sense. Your result should be a numeric vector called assignment.vec whose elements take on values 1, 2, 3, or 4. That is, assignment.vec[1] will be equal to 2 if the 1st Wikipedia document is closest to the 2nd pseudodocument, assignment.vec[2] will be equal to 4 if the 2nd Wikipedia document is closest to the 4th pseudodocument, and so on. Display the counts for the number of documents closest to each of the 4 pseudodocuments.
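A sketch of the distance computation; it assumes the columns of dt.mat and centers.mat refer to the same words in the same order (their common dimension of 2836 suggests so).
# dist.mat[i, j] = squared Euclidean distance from row i of dt.mat to row j of centers.mat
dist.mat = matrix(0, nrow(dt.mat), nrow(centers.mat))
for (j in 1:nrow(centers.mat)) {
  dist.mat[, j] = colSums((t(dt.mat) - centers.mat[j, ])^2)
}
assignment.vec = apply(dist.mat, 1, which.min)
table(assignment.vec)  # counts of documents closest to each pseudodocument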
4f (4 points). For each of the 4 pseudodocuments, print out the 15 words with the highest total word counts among only the documents that were closest to that pseudodocument. That is, you should print out 4 vectors, each with 15 words. If there are any ties in the total word counts, order the words alphabetically. Do you notice a trend in each group of 15 words?
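A sketch for 4f; order() with a second key breaks ties in the counts alphabetically.
for (k in 1:4) {
  counts = colSums(dt.mat[assignment.vec == k, , drop = FALSE])
  top.15 = names(counts)[order(-counts, names(counts))][1:15]
  print(top.15)
}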
Challenge. The rows of centers.mat didn’t come from just anywhere: they came from an algorithm called \(k\)-means clustering. Given an input document-term matrix, let’s call it mat, this algorithm works as follows (for the choice \(k=4\)):
1. Choose 4 rows at random from mat, and define centers.mat to be the submatrix defined by these rows.
2. For each row of mat, compute which row of centers.mat it is closest to in squared Euclidean distance. Create a numeric vector assignment.vec with entries 1, 2, 3, or 4 accordingly.
3. Recompute the first row of centers.mat by averaging the word counts for all documents that are closest to the first row of centers.mat (as given by assignment.vec). Do the same for the other rows of centers.mat.
4. Repeat steps 2 and 3, either for a fixed number of iterations or until the assignments stop changing.
Implement this algorithm and run it on dt.mat. With your new resulting centers.mat, answer the previous question. Do the qualitative trends you found still hold?
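A sketch of the algorithm as described above; the seed and the number of iterations are assumptions (the original may specify different values), and for simplicity the sketch assumes no cluster ever becomes empty.
kmeans.by.hand = function(mat, k = 4, n.iter = 20) {
  set.seed(36350)                            # assumed seed, for reproducibility
  centers.mat = mat[sample(nrow(mat), k), ]  # step 1: k random rows as initial centers
  for (iter in 1:n.iter) {
    # Step 2: assign each document to its closest center (squared Euclidean distance)
    dist.mat = matrix(0, nrow(mat), k)
    for (j in 1:k) dist.mat[, j] = colSums((t(mat) - centers.mat[j, ])^2)
    assignment.vec = apply(dist.mat, 1, which.min)
    # Step 3: recompute each center as the average of its assigned documents
    for (j in 1:k) {
      centers.mat[j, ] = colMeans(mat[assignment.vec == j, , drop = FALSE])
    }
  }
  list(centers.mat = centers.mat, assignment.vec = assignment.vec)
}

km = kmeans.by.hand(dt.mat)
centers.mat = km$centers.mat
assignment.vec = km$assignment.vec
# Now rerun the 4f code above with the new centers and assignments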