Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 5 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 5 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 9. This document contains 15 of the 45 total points for Homework 5.

Practice with random numbers

Random walking and seeds

Write a while() loop to implement this procedure. Importantly, save all the positive values of \(x\) that were visited in this procedure in a vector called x.vals, and display its entries. (Hint: it is unclear a priori how many iterations will be needed in this while() loop, so you should start with x.vals = 5, a vector of length 1 with just the first known value of 5. Then append to it as the iterations proceed, by using code of the form x.vals = c(x.vals, x.new), where x.new is the newest value you want to append.)
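The append pattern from the hint can be sketched on a different, simpler process. The example below is purely illustrative and is not the lab's random walk: it rolls a six-sided die until a 6 appears, and the names rolls and roll.new are hypothetical.

```r
# Illustrative only: roll a six-sided die until a 6 appears.
# This is NOT the lab's random walk; it only shows the append pattern.
set.seed(1)                       # fixed seed so the sketch is reproducible
rolls = sample(1:6, size=1)       # vector of length 1 holding the first roll
while (rolls[length(rolls)] != 6) {
  roll.new = sample(1:6, size=1)  # newest value
  rolls = c(rolls, roll.new)      # append it to the end of the vector
}
rolls                             # every roll made, ending with the first 6
```

Because the number of iterations is not known in advance, the vector starts at length 1 and grows by one entry per pass through the loop.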

Hw5 Bonus. If we start our random walk process, as defined above, at \(x=5\), what is the expected number of iterations we need until it terminates? What is the expected number of iterations if we started it at an arbitrary value \(x\)?

Hw5 Q4 (3 points). Modify your function random.walk() defined previously so that it takes an additional argument seed: this is an integer that should be used to set the seed of the random number generator, before the random walk begins, with set.seed(). But, if seed is NULL, the default, then no seed should be set. Run your modified random.walk() function several times with the default inputs. Each time, a different random walk trajectory should be produced. Then run it several times with the input seed equal to (say) 33. Each time, the same random walk trajectory should be produced.
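The optional-seed pattern described here can be sketched on a toy function. The helper draw.unif() below is hypothetical and is not the lab's random.walk(); it only shows one way to have a seed argument that defaults to NULL.

```r
# Hypothetical helper (not part of the lab): draws n uniform random numbers,
# optionally setting the seed of the random number generator first.
draw.unif = function(n=3, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)  # only set the seed if one was supplied
  return(runif(n))
}

draw.unif()          # no seed set: draws will generally differ between calls
a = draw.unif(seed=33)
b = draw.unif(seed=33)
identical(a, b)      # same seed, so the same draws
```

Checking is.null(seed) before calling set.seed() is what leaves the generator untouched in the default case.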

Lots of coin tosses

Hw5 Q5 (8 points). A famous theorem in statistics (arguably the most famous), the Central Limit Theorem, tells us that the standardized binomial random variables in b.big.std, computed in the last question, should have a distribution very close to the standard normal distribution (i.e., normal with mean 0 and variance 1). Let’s investigate this. Replot the histogram from the previous question, and then plot the standard normal density on top as a thick black line. Does it look like it matches closely? Next, compute the proportion of entries in b.big.std that are larger than or equal to a threshold value of 1.644854. Compare this to the theoretical probability that a standard normal random variable is larger than 1.644854; are the two values close? Repeat this calculation for the threshold being 1.281552, 1.036433, and 0.8416212; are the empirical proportions and theoretical probabilities close?
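A minimal sketch of the tail comparison, for one threshold only. Since b.big.std comes from an earlier question not shown here, the code simulates its own stand-in, b.sim.std; the values of n.sims, n.tosses, and p are illustrative assumptions, not the lab's settings.

```r
# Assumed stand-in for b.big.std: standardized binomial draws.
# n.sims, n.tosses, and p are illustrative choices, not the lab's values.
set.seed(0)
n.sims = 10000
n.tosses = 1000
p = 0.5
b.sim = rbinom(n.sims, size=n.tosses, prob=p)
b.sim.std = (b.sim - n.tosses*p) / sqrt(n.tosses*p*(1-p))  # subtract mean, divide by sd

thresh = 1.644854
emp.prop = mean(b.sim.std >= thresh)          # empirical proportion at or above thresh
theo.prob = pnorm(thresh, lower.tail=FALSE)   # P(Z > thresh) for a standard normal Z
c(empirical=emp.prop, theoretical=theo.prob)
```

Taking the mean() of a logical vector gives the proportion of TRUE entries, and pnorm() with lower.tail=FALSE gives the upper tail probability directly.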

By comparing the empirical proportion above a threshold and the theoretical probability of a standard normal being above the same threshold, we are comparing tail areas between the observed histogram and the standard normal density. This is another way (rather than just visually comparing the histogram and the density) to check that the distribution of b.big.std is close to the standard normal distribution. In fact, an even more rigorous strategy is to compare the whole spectrum of quantiles of b.big.std to those of a standard normal random variable, in what is called a QQ plot. Compute the sample quantiles of b.big.std, using the quantile() function, at all probability levels from 0.01 to 0.99 in increments of 0.01. Call these b.big.std.quantiles (a vector of length 99). Then compute the theoretical quantiles of a standard normal distribution, at these same probability levels. Call these normal.quantiles (again a vector of length 99). Plot b.big.std.quantiles versus normal.quantiles; label the x-axis “Theoretical quantiles”, the y-axis “Sample quantiles”, and title the plot “QQ plot”. If the sample and theoretical quantiles line up—meaning that they seem to follow the line \(y=x\)—then we would say that the theoretical distribution (here the standard normal) is probably a good match for our data. Is this the case for your data?
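The QQ plot construction described above can be sketched as follows. The data here are an assumed stand-in (draws from a standard normal, stored in the hypothetical vector z.sim), so the names z.sim, sample.q, and normal.q are not the ones the lab asks for.

```r
# Assumed stand-in data: draws from a standard normal, named z.sim here.
set.seed(0)
z.sim = rnorm(10000)

probs = seq(0.01, 0.99, by=0.01)          # 99 probability levels
sample.q = quantile(z.sim, probs=probs)   # sample quantiles of the data
normal.q = qnorm(probs)                   # theoretical standard normal quantiles

plot(normal.q, sample.q, xlab="Theoretical quantiles",
     ylab="Sample quantiles", main="QQ plot")
abline(a=0, b=1)                          # reference line y = x
```

Adding the reference line with abline() makes it easier to judge by eye whether the points follow \(y=x\).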

Generating a random word table

sample(c(0,1), size=1, replace=TRUE) # Sampling once, from 0 or 1
## [1] 0
sample(c(0,1), size=5, replace=TRUE) # Sampling 5 times, from 0 or 1
## [1] 0 1 0 1 0
sample(1:10, size=5, replace=TRUE) # Sampling 5 times, between 1 and 10
## [1] 2 1 9 1 3
sample(c("Prof","Ryan","Tibs"), size=10, replace=TRUE) # Sampling 10 times,
 # from the 3 options (strings): "Prof", "Ryan", "Tibs"
##  [1] "Tibs" "Ryan" "Tibs" "Ryan" "Prof" "Prof" "Tibs" "Ryan" "Ryan" "Tibs"

Implement the above strategy, and display your word table. (Hint: you shouldn’t use a for() or while() loop here. No explicit looping is needed, and you can just use one call to sample(). You’ll also find the built-in vector letters useful, which is a vector of length 26 containing the lower case letters “a” through “z”.) How many different random words did you generate? What is the largest word count? What is the largest character count of any word, and what is this word (i.e., the longest word)?
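One possible sketch of a word table built from a single call to sample(). The number of characters drawn (1000) and the names chars, long.string, words, and word.tab are assumptions made for illustration; the steps of pasting, splitting at spaces, and tabulating are one reasonable reading of the strategy, not necessarily the intended one.

```r
set.seed(0)
# Draw characters uniformly from the 26 letters plus a space (27 options).
# The size of 1000 is an illustrative assumption.
chars = sample(c(letters, " "), size=1000, replace=TRUE)
long.string = paste(chars, collapse="")          # glue into one long string
words = strsplit(long.string, split=" ")[[1]]    # split at the spaces
words = words[words != ""]                       # drop empty strings from adjacent spaces
word.tab = table(words)                          # word counts

length(word.tab)                  # number of different words
max(word.tab)                     # largest word count
max(nchar(words))                 # largest character count
words[which.max(nchar(words))]    # a longest word (the first one, if tied)
```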

Hw5 Q6 (4 points). The words that you generated in the last question are unrealistically long. This is because our sampling distribution chooses a character with equal probability (probability 1/27) among “a” through “z” and a space " ", i.e., it doesn’t place high enough probability on the space " " option. Let’s change our sampling distribution to the following: we place probability 1/52 on choosing any one of the letters “a” through “z”, and probability 1/2 on choosing a space " ". This makes a space as likely as all of the letters combined. Repeat the task in the previous question, but with this new sampling distribution. (Hint: you can still just use one call to sample(); think about changing the vector you pass in as the first input, i.e., the vector from which this function decides to pick things uniformly at random. You will want to change it so that you "stack the odds" of getting a space; you will find the function rep() useful.) Then answer the same questions as before: how many unique words? largest word count? largest character count, and corresponding word?
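A minimal sketch of how rep() can stack the odds while sample() still picks uniformly from its first input. The name char.pool and the number of draws are hypothetical; the check at the end only confirms that the empirical frequencies are near 1/2 and 1/52.

```r
# 26 letters plus 26 copies of a space: 52 equally likely entries,
# so a space has probability 26/52 = 1/2 and each letter has 1/52.
char.pool = c(letters, rep(" ", 26))

set.seed(0)
draws = sample(char.pool, size=10000, replace=TRUE)
mean(draws == " ")   # should be near 1/2
mean(draws == "a")   # should be near 1/52
```

Repeating an entry in the vector passed to sample() raises its probability in proportion to how many copies appear.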