Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 5 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 5 questions from other labs. Your homework writeup must start as this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 9. This document contains 15 of the 45 total points for Homework 5.

Practice with random numbers

Draw the following random variables (and do not set the seed before you generate them). In each case, don’t display the actual draws, but display their sample mean, sample variance, and range (max minus min). (Hint: you’ll have to use functions like rnorm(), but with “norm” replaced by the appropriate distribution; to find out which functions exactly, you can search around on Google or in the R help files.) Are the sample statistics (mean, variance, range) what you’d expect?
- 5000 normal random variables, with mean 1 and variance 8
- 4000 t random variables, with 5 degrees of freedom
- 3500 Poisson random variables, with mean 4
- 999 chi-squared random variables, with 11 degrees of freedom
- 2000 uniform random variables, between -sqrt(12)/2 and sqrt(12)/2
Repeat the last question (and again do not set the seed before you generate the random variables). You can simply copy and paste your code below again. This is just to emphasize the (obvious!) point: each time you generate random numbers in R, you get different results.
Plot a (separate) histogram for each of the 5 sets of random variables you’ve generated. The histograms should be on the probability scale (rather than frequency scale), and labeled appropriately. You can color them to your liking. Also, draw the appropriate density function as a thick black line on top of each histogram. This should come from the actual known distribution, and not from a call to density() (which would estimate the density empirically). For example, for the normal random variables, you should draw the density using dnorm(). Also, note: for the discrete Poisson distribution, you can only evaluate its density at integer values—this is because it has a probability mass function rather than a continuous density function.
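
The naming convention in the hints above (an "r" prefix for random draws, a "d" prefix for the density) can be sketched on a distribution that is not one of the five listed. The exponential distribution, the rate of 2, and all variable names below are choices made only for this illustration, not part of the lab.

```r
# Illustration only: exponential distribution with rate 2 (not one of the lab's five cases)
n = 3000
rate = 2
x = rexp(n, rate=rate)          # "r" prefix: random draws

# Sample statistics; in theory the mean is 1/rate = 0.5 and the variance is 1/rate^2 = 0.25
x.mean = mean(x)
x.var = var(x)
x.range = max(x) - min(x)
c(mean=x.mean, var=x.var, range=x.range)

# Histogram on the probability scale, with the known density from dexp() drawn on top
hist(x, probability=TRUE, breaks=40, col="lightblue",
     main="Exponential draws, rate 2", xlab="Value")
x.grid = seq(0, max(x), length.out=200)
lines(x.grid, dexp(x.grid, rate=rate), lwd=3, col="black")
```

For a discrete distribution, the grid passed to the "d" function would need to consist of integers, as noted above for the Poisson.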

Random walking and seeds

Consider the following “random walk” procedure:
- Start with \(x=5\)
- Draw a random number \(r\) uniformly between -2 and 1
- Replace \(x\) by \(x+r\)
- Stop if \(x \leq 0\)
- Else repeat

Write a while() loop to implement this procedure. Importantly, save all the positive values of \(x\) that were visited in this procedure in a vector called x.vals, and display its entries. (Hint: it is unclear a priori how many iterations will be needed in this while() loop, so you should start with x.vals = 5, a vector of length 1 with just the first known value of 5. Then append to it as the iterations proceed, by using code of the form x.vals = c(x.vals, x.new), where x.new is the newest value you want to append.)
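
The append pattern from the hint can be seen in isolation on a deterministic loop. This is only a sketch: the halving rule and the names y and y.vals are made up for the illustration and are not the random walk itself.

```r
# Illustration only: a deterministic halving loop, not the lab's random walk
y = 40
y.vals = y                      # Start with a length-1 vector holding the first value
while (y >= 1) {
  y = y / 2                     # Update the current value
  if (y >= 1) {
    y.vals = c(y.vals, y)       # Append only the values that still satisfy the condition
  }
}
y.vals                          # Holds 40, 20, 10, 5, 2.5, 1.25
```

The final value (0.625) ends the loop but is never appended, which mirrors the requirement to save only the positive values visited.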

Write a function random.walk() to perform the random walk procedure that you implemented in the last question. Its inputs should be: x.start, a numeric value at which we will start the random walk, which takes a default value of 5; and plot.walk, a boolean value, indicating whether or not we want to produce a plot of the random walk values x.vals versus the iteration number as a side effect, which takes a default value of TRUE. Make sure the plot has an appropriately labeled x-axis and y-axis. Also use type="o" so that we can see both points and lines. The output of your function should be a list with elements: x.vals, a vector of the random walk values as computed above; and num.steps, the number of steps taken by the random walk before terminating. Run your function twice with the default inputs, and then twice with x.start equal to 10. Each time it is called, you should see a different random walk trajectory.

Hw5 Bonus. If we start our random walk process, as defined above, at \(x=5\), what is the expected number of iterations we need until it terminates? What is the expected number of iterations if we started it at an arbitrary value \(x\)?

Hw5 Q4 (3 points). Modify your function random.walk() defined previously so that it takes an additional argument seed: this is an integer that should be used to set the seed of the random number generator, before the random walk begins, with set.seed(). But, if seed is NULL, the default, then no seed should be set. Run your modified random.walk() function several times with the default inputs. Each time, a different random walk trajectory should be produced. Then run it several times with the input seed equal to (say) 33. Each time, the same random walk trajectory should be produced.
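
The general pattern of an optional seed argument can be sketched on a much simpler function. The helper draw.uniforms() below is hypothetical, invented only for this illustration; it is not random.walk() and is not meant as the expected solution.

```r
# Hypothetical helper, for illustration only: draws n uniforms, optionally seeded
draw.uniforms = function(n=3, seed=NULL) {
  if (!is.null(seed)) set.seed(seed)   # Only touch the generator when a seed is given
  return(runif(n))
}

a = draw.uniforms(seed=33)
b = draw.uniforms(seed=33)
identical(a, b)                 # Same seed, so the same draws

u = draw.uniforms()             # No seed given: the generator simply continues
v = draw.uniforms()
identical(u, v)                 # Different draws
```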

Lots of coin tosses

The binomial distribution with size \(m\) and success probability \(p\) describes the number of successes (say, heads) in \(m\) independent coin tosses, each having probability \(p\) of success (probability \(p\) of turning up heads). Use rbinom() to generate 5000 binomial random variables with size 16 and success probability 0.5, saving the result as b. Compute the sample mean and sample standard deviation of these random variables. Are they close to what you would expect? Then plot a histogram of the random variables in b, with a large value of the breaks input (say breaks=100), on the probability scale. What do you notice about its support (meaning, what do you notice about where the histogram bars occur)? Is this surprising?
R’s capacity for simulation and computation is impressive, having grown a lot even compared to what was available just 10 years ago. To demonstrate this, generate 5 million binomial random variables with size 1 million and success probability 0.5, saving the result as b.big. Standardize these random variables, saving the result as b.big.std: that is, subtract off the sample mean, and then divide by the sample standard deviation. As a sanity check, compute the sample mean and sample standard deviation of b.big.std—these should be (very close to, if not exactly) 0 and 1, respectively (if not, you’ve made a mistake somewhere). Then plot a histogram of the random variables in b.big.std, again with a large value of the breaks input (say breaks=100), on the probability scale. What do you notice about its shape, i.e., what distribution does this look like?

Hw5 Q5 (8 points). A famous theorem in statistics (arguably the most famous), the Central Limit Theorem, tells us that the standardized binomial random variables in b.big.std, computed in the last question, should have a distribution very close to the standard normal distribution (i.e., normal with mean 0 and variance 1). Let’s investigate this. Replot the histogram from the previous question, and then plot the standard normal density on top as a thick black line. Does it look like it matches closely? Next, compute the proportion of entries in b.big.std that are larger than or equal to a threshold value of 1.644854. Compare this to the theoretical probability that a standard normal random variable is larger than 1.644854; are the two values close? Repeat this calculation for the threshold being 1.281552, 1.036433, and 0.8416212; are the empirical proportions and theoretical probabilities close?

By comparing the empirical proportion above a threshold and the theoretical probability of a standard normal being above the same threshold, we are comparing tail areas between the observed histogram and the standard normal density. This is another way (rather than just visually comparing the histogram and the density) to check that the distribution of b.big.std is close to the standard normal distribution. In fact, an even more rigorous strategy is to compare the whole spectrum of quantiles of b.big.std to those of a standard normal random variable, in what is called a QQ plot. Compute the sample quantiles of b.big.std, using the quantile() function, at all probability levels from 0.01 to 0.99 in increments of 0.01. Call these b.big.std.quantiles (a vector of length 99). Then compute the theoretical quantiles of a standard normal distribution, at these same probability levels. Call these normal.quantiles (again a vector of length 99). Plot b.big.std.quantiles versus normal.quantiles; label the x-axis “Theoretical quantiles”, the y-axis “Sample quantiles”, and title the plot “QQ plot”. If the sample and theoretical quantiles line up—meaning that they seem to follow the line \(y=x\)—then we would say that the theoretical distribution (here the standard normal) is probably a good match for our data. Is this the case for your data?
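
Both kinds of comparison described above (tail areas and quantiles) rely on the "p" and "q" members of a distribution's function family. A minimal sketch on a different distribution follows; the exponential distribution, the threshold of 2, the coarser probability levels, and all variable names are assumptions made for the illustration only.

```r
# Illustration only: exponential(1) draws, not the lab's b.big.std
x = rexp(20000, rate=1)

# Tail area: empirical proportion vs theoretical probability above a threshold
thresh = 2
emp.tail = mean(x >= thresh)
theo.tail = pexp(thresh, rate=1, lower.tail=FALSE)   # Equals exp(-2), about 0.135
c(empirical=emp.tail, theoretical=theo.tail)

# Quantiles: sample quantiles vs theoretical ones at the same probability levels
probs = seq(0.1, 0.9, by=0.1)
x.quantiles = quantile(x, probs=probs)
exp.quantiles = qexp(probs, rate=1)
plot(exp.quantiles, x.quantiles, xlab="Theoretical quantiles",
     ylab="Sample quantiles", main="QQ plot (exponential example)")
abline(a=0, b=1, lty=2)         # Reference line y = x
```

When the theoretical distribution fits, the plotted points should fall close to the dashed reference line.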

Generating a random word table

The sample() function in R, with the input argument replace=TRUE, gives us access to the discrete uniform distribution (as opposed to runif(), which gives us access to the continuous uniform distribution). Examples are given below. (You don’t have to do anything here; this is just a demo of how sample() works.)

sample(c(0,1), size=1, replace=TRUE) # Sampling once, from 0 or 1

## [1] 0

sample(c(0,1), size=5, replace=TRUE) # Sampling 5 times, from 0 or 1

## [1] 0 1 0 1 0

sample(1:10, size=5, replace=TRUE) # Sampling 5 times, between 1 and 10

## [1] 2 1 9 1 3

sample(c("Prof","Ryan","Tibs"), size=10, replace=TRUE) # Sampling 10 times,
 # from the 3 options (strings): "Prof", "Ryan", "Tibs"

##  [1] "Tibs" "Ryan" "Tibs" "Ryan" "Prof" "Prof" "Tibs" "Ryan" "Ryan" "Tibs"

Recall that a word table is a vector of counts that we can generate from a document of text—the name of each entry is a unique word in the document, and the entry itself is the number of times the corresponding word appears. Here we will consider generating a random word table, computed from a random document of text. For this random document, we’ll limit ourselves to lower case letters and spaces. Consider the following strategy.
- Generate, 1000 times total, a lower case character from “a” to “z”, or a space " ", with each of these 27 options occurring with equal probability
- Collapse this vector of random characters into one big string
- Split according to spaces
- Remove empty strings
- Compute a word table

Implement the above strategy, and display your word table. (Hint: you shouldn’t use a for() or while() loop here. No explicit looping is needed, and you can just use one call to sample(). You’ll also find the built-in vector letters useful, which is a vector of length 26 containing the lower case letters “a” through “z”.) How many different random words did you generate? What is the largest word count? What is the largest character count of any word, and what is this word (i.e., the longest word)?
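
The string-handling steps of the strategy (collapse, split, remove empty strings, tabulate) can be sketched on a small fixed character vector. The vector chars and the other names below are made up for the illustration; the random sampling step is deliberately left out.

```r
# Illustration only: a small fixed character vector rather than random draws
chars = c("t","o"," "," ","b","e"," ","o","r"," ","t","o"," ")
big.string = paste(chars, collapse="")          # One big string: "to  be or to "
words = strsplit(big.string, split=" ")[[1]]    # Consecutive spaces leave empty strings
words = words[words != ""]                      # Drop the empty strings
word.tab = table(words)                         # Counts, named by unique word
word.tab

length(word.tab)                    # Number of unique words
max(word.tab)                       # Largest word count
words[which.max(nchar(words))]      # A longest word (the first one, if there are ties)
```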

Hw5 Q6 (4 points). The words that you generated in the last question are unrealistically long. This is because our sampling distribution chooses a character with equal probability (probability 1/27) between “a” through “z” and a space " ", i.e., it doesn’t place high enough probability on the space " " option. Let’s change our sampling distribution to the following: we place probability 1/52 on choosing any one of the letters “a” through “z”, and probability 1/2 on choosing a space " ". This makes drawing a space exactly as likely as drawing a letter (any of the 26). Repeat the task in the previous question, but with this new sampling distribution. (Hint: you can still just use one call to sample(); think about changing the vector you pass in as the first input, i.e., the vector from which this function decides to pick things uniformly at random. You will want to change it so that you "stack the odds" of getting a space; you will find the function rep() useful.) Then answer the same questions as before: how many unique words? largest word count? largest character count, and corresponding word?
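
The idea of stacking the odds with rep() can be sketched on a two-outcome example. The biased coin and the names pool and flips are assumptions made for this illustration; they are unrelated to the letter-and-space pool the question asks for.

```r
# Illustration only: a coin that lands "H" with probability 3/4
pool = c(rep("H", 3), "T")      # "H" fills 3 of the 4 entries
flips = sample(pool, size=10000, replace=TRUE)   # Uniform over entries, so "H" is favored
mean(flips == "H")              # Should be near 0.75
```

Because sample() picks uniformly among the entries of its first input, repeating an entry raises its probability in proportion to how often it appears.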

Lab 6w: Simulation Basics

Statistical Computing, 36-350

Wednesday October 5, 2016
