Name:
Andrew ID:
Collaborated with:
On this homework, you can collaborate with your classmates, but you must identify their names above, and you must submit your own homework as a knitted HTML file on Canvas by Sunday 10pm this week.
1a. Generate the following objects, save them to variables, and call head() on those variables: a variable x which contains 100 draws from a \(\mathrm{Uniform}(0, 1)\) distribution, and a variable y which contains 100 draws from the corresponding \(\mathrm{Uniform}(0, x[i])\) distribution (thus if x[i] = 0.3, then y[i] comes from a \(\mathrm{Uniform}(0, 0.3)\)). Do this using iteration.
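A minimal sketch of one way to do this with a for() loop (variable names x and y as in the problem statement):

# Draw x from Uniform(0, 1), then each y[i] from Uniform(0, x[i]) one at a time
x = runif(100)
y = numeric(100)
for (i in seq_along(x)) {
  y[i] = runif(1, min = 0, max = x[i])
}
head(x)
head(y)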
1b. We've written a function plot_cumulative_means() below which plots the cumulative sample mean as the sample size increases. The argument rfun takes a single-argument function which generates n random numbers when called as rfun(n). The argument n_max is an integer which tells us how many samples to draw. The function plots the cumulative mean against the number of samples.
# plot_cumulative_means
# inputs:
# rfun: a function which generates n random draws when called as rfun(n)
# n_max: the number of samples to draw
# side-effects:
# Plots the cumulative means against sample size
plot_cumulative_means = function(rfun, n_max) {
  samples = rfun(n_max)
  plot(seq_len(n_max), cumsum(samples) / seq_len(n_max), type = "l")
}
Use this function to make plots for the following distributions. Hint: you can construct a new one-argument random number generating function using function(n) rfun(n, parameters, go, here, ...).
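For example, assuming we wanted to look at a standard normal and an Exponential(1) (stand-ins here for whichever distributions the list specifies), the calls could look like:

# Wrap built-in generators so they take only the sample size n
plot_cumulative_means(function(n) rnorm(n, mean = 0, sd = 1), n_max = 1000)
plot_cumulative_means(function(n) rexp(n, rate = 1), n_max = 1000)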
Do the sample means start concentrating around a single value as the sample size increases?
Challenge: Create a random number generating function whose sample mean will not converge and show the corresponding plot.
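One standard example (a hint, not the only answer) is the Cauchy distribution, which has no finite mean, so the running mean never settles down:

# The Cauchy distribution has no mean; the cumulative sample mean keeps jumping around
plot_cumulative_means(rcauchy, n_max = 10000)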
1c. For the same distributions as in 1b, do the following:

To avoid needless repetition we'll write a function plot_ecdf(rfun, pfun, sizes) which takes as its arguments the single-argument random number generating function rfun, the corresponding single-argument cumulative distribution function pfun, and a vector of sample sizes sizes for which to plot the ECDF.
Fill in the function below by editing the lines with “##” and “??” and then run it on the above distributions.
# plots empirical cdfs with true cdf
# inputs:
# rfun: a function which generates n random draws when called as rfun(n)
# pfun: a function which calculates the true cdf at x when called as pfun(x)
# sizes: a vector of sample sizes
# side-effects:
# Plots the true CDF and the empirical CDFs for each sample size in sizes.
plot_ecdf = function(rfun, pfun, sizes) {
  # Draw the random numbers
  ## samples = lapply(sizes, ??)
  # Calculate the grid for the CDF
  grid_min = min(sapply(samples, min))
  grid_max = max(sapply(samples, max))
  grid = seq(grid_min, grid_max, length.out = 1000)
  # Calculate the empirical cdfs
  ## ecdfs = lapply(samples, ??)
  evals = lapply(ecdfs, function(x) x(grid))
  # Plot the true CDF
  ## plot(grid, ??, type = "l", col = "black", xlab = "x", ylab = "P(X <= x)")
  # Plot the empirical cdfs
  n_sizes = length(sizes)
  cols = rainbow(n_sizes)
  lapply(1:n_sizes, function(ii) {
    lines(grid, evals[[ii]], col = cols[ii])
  })
  legend("bottomright", legend = sizes, col = cols, lwd = 1)
}
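Once the blanks are filled in, a call might look like the following (the Exponential(2) distribution is just an illustrative choice):

# Compare empirical CDFs at several sample sizes against the true CDF
plot_ecdf(function(n) rexp(n, rate = 2), function(x) pexp(x, rate = 2), sizes = c(10, 100, 1000))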
Finally, examine the plots and discuss how the empirical estimates perform as the size of the data increases.
1d. Suppose you are collecting data from visitors to your website and wish to randomly show them version A or B (more on this later). However, visitors might become confused if the version changes every time they visit the page. So if they're assigned to condition A on their first visit, you would like for them to be assigned to condition A for every subsequent visit.
Seeds are useful to ensure the replicability of random assignments. If we used a unique identifier of the visitors, such as an integer id, we could use that as the seed for the random assignment. Since the same visitor will always have the same id, they will always have the same seed and thus receive the same random assignment.
Write a function assign_ab(id) which returns "A" or "B" with equal probability, but with the constraint that it must return the same result for each call with the same id. So repeated calls of assign_ab(id = 1) will always return the same assignment.

Finally, verify that sapply(c(1:100, 1:100), assign_ab) fulfills our constraint and that the assignments are roughly even.
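A minimal sketch of the seeding approach (note that calling set.seed() inside the function resets the global random seed, which is fine for this exercise but worth knowing):

# Return "A" or "B" deterministically for a given id by seeding the RNG with the id
assign_ab = function(id) {
  set.seed(id)
  sample(c("A", "B"), size = 1)
}

# The two passes over 1:100 should agree, and the A/B split should be roughly even
assignments = sapply(c(1:100, 1:100), assign_ab)
all(assignments[1:100] == assignments[101:200])
table(assignments)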
Challenge: Can you write a function which solves the same problem without using the seeding trick, but instead looks up whether or not the id has been seen before? If it has been seen, the previous result is returned. If it hasn't, then a new random number is generated. However, there are some restrictions: the ids can only be processed one at a time and you cannot access global variables.
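One possible angle of attack, sketched under the assumption that a closure's enclosing environment does not count as a global variable, is to keep a private lookup table inside a function factory:

# Factory returning an assignment function that carries its own lookup table
make_assign_ab = function() {
  seen = list()  # maps as.character(id) to a previously drawn assignment
  function(id) {
    key = as.character(id)
    if (is.null(seen[[key]])) {
      seen[[key]] <<- sample(c("A", "B"), size = 1)
    }
    seen[[key]]
  }
}

assign_ab2 = make_assign_ab()
assign_ab2(1)  # same result every time it is called with id 1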
A common data science task is to analyze the results of an AB test. AB tests are essentially controlled experiments: we obtain data from two different conditions, such as two different versions of your website, to try to determine which condition gives better results.
2a. Write a function to simulate collecting data from an AB test where responses from the A condition follow a Gaussian distribution with mean a_mean and standard deviation a_sd, while responses from the B condition follow a Gaussian distribution with mean b_mean and standard deviation b_sd.

Your function signature should be ab_collect(n, a_mean, a_sd, b_mean, b_sd) where n is the number of samples to collect from each condition and the other arguments are as described above. Your function should return a list with two named components, a_responses and b_responses, which contain the responses for each condition respectively. Try your function out for several values of a_mean, a_sd, b_mean, and b_sd and check that the sample means and standard deviations approximately match the theoretical values.
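A minimal sketch of such a simulator (the check below uses the example values from 2b):

# Simulate n responses from each arm of an AB test
ab_collect = function(n, a_mean, a_sd, b_mean, b_sd) {
  list(a_responses = rnorm(n, mean = a_mean, sd = a_sd),
       b_responses = rnorm(n, mean = b_mean, sd = b_sd))
}

# Sample means and standard deviations should be close to the theoretical values
out = ab_collect(10000, a_mean = 100, a_sd = 20, b_mean = 140, b_sd = 10)
c(mean(out$a_responses), sd(out$a_responses), mean(out$b_responses), sd(out$b_responses))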
2b. Write a function test_at_end(n, a_mean, a_sd, b_mean, b_sd) which uses your simulation function from 2a to draw samples of size n and then runs a t-test to determine whether there is a significant difference (we'll define this as having a p-value <= 0.05). If there is a significant difference, return either "A" or "B" for whichever condition has the higher mean. If there is no significant difference, return "Inconclusive".
Run your code on the example where n = 2000, a_mean = 100, a_sd = 20, b_mean = 140, b_sd = 10.
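One possible implementation, assuming the ab_collect() sketch above (or your own 2a function):

# Draw n samples per condition, then run a two-sample t-test at the 0.05 level
test_at_end = function(n, a_mean, a_sd, b_mean, b_sd) {
  data = ab_collect(n, a_mean, a_sd, b_mean, b_sd)
  test = t.test(data$a_responses, data$b_responses)
  if (test$p.value > 0.05) return("Inconclusive")
  if (mean(data$a_responses) > mean(data$b_responses)) "A" else "B"
}

test_at_end(2000, a_mean = 100, a_sd = 20, b_mean = 140, b_sd = 10)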
2c. Waiting until you collect all of the samples will take a while. So you instead decide to take the following approach. Every day you collect 100 new observations from each condition. At the end of the day you check whether or not the difference is significant. If the difference is significant, you declare the higher response to be the winner. If the difference is not significant, you continue on to the next day. As before, if you collect all of the samples without finding a significant difference, you'll declare the result "Inconclusive". Your function should return a list with the winner (or "Inconclusive") and the amount of data you needed to collect.
Write a function test_as_you_go(n_per_day, n_days, a_mean, a_sd, b_mean, b_sd) to implement this procedure. Hint: use a for() loop which goes through each day, adding more data and checking the results of the t-test.
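A sketch of the day-by-day procedure, again building on the ab_collect() sketch; the name n_collected for the second element of the returned list is my own choice since the problem doesn't fix one:

# Collect n_per_day samples per condition each day; stop at the first significant t-test
test_as_you_go = function(n_per_day, n_days, a_mean, a_sd, b_mean, b_sd) {
  a_all = c()
  b_all = c()
  for (day in seq_len(n_days)) {
    new_data = ab_collect(n_per_day, a_mean, a_sd, b_mean, b_sd)
    a_all = c(a_all, new_data$a_responses)
    b_all = c(b_all, new_data$b_responses)
    test = t.test(a_all, b_all)
    if (test$p.value <= 0.05) {
      winner = if (mean(a_all) > mean(b_all)) "A" else "B"
      return(list(winner = winner, n_collected = 2 * length(a_all)))
    }
  }
  list(winner = "Inconclusive", n_collected = 2 * length(a_all))
}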
Run this function on the same example as before with n_per_day = 100 and n_days = 20 (to match the final sample sizes). Do you get the same result? Do you save time collecting data?
2d. Most of your AB tests won’t have a clear winner; instead both conditions A and B will be roughly equivalent. In this case you want to avoid false positives: saying there’s a difference when there isn’t really a difference (with respect to the true distributions). Let’s run a simulation that checks the false positive rate of the two testing regimes.
Setting the significance level of a t-test to \(\alpha = 0.05\) will set the proportion of false positives, in the case where there is no true difference, to \(0.05\). Setting a_mean = b_mean = 100, a_sd = b_sd = 20, and retaining the number of samples as in the previous examples, conduct 1000 AB experiments using the previous two setups. Calculate the number of "Inconclusive" results. Is this what you would expect to see for \(\alpha = 0.05\)? Does either method perform better than the other?
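A sketch of one way to run this comparison, assuming the test_at_end() and test_as_you_go() sketches above:

# Run both procedures 1000 times under no true difference and count inconclusive results
n_reps = 1000
end_results = replicate(n_reps, test_at_end(2000, a_mean = 100, a_sd = 20, b_mean = 100, b_sd = 20))
go_results = replicate(n_reps, test_as_you_go(100, 20, a_mean = 100, a_sd = 20, b_mean = 100, b_sd = 20)$winner)
sum(end_results == "Inconclusive")
sum(go_results == "Inconclusive")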
It's almost March Madness time, so let's simulate some tournaments. One model of competition is the Bradley-Terry model. Each team is assigned an ability score (team \(i\) has score \(z_{i}\), team \(j\) has score \(z_{j}\), etc.). Then for each game the probability that team \(i\) defeats team \(j\) is \(\frac{z_{i}}{z_{i} + z_{j}}\).
3a. Write a function bt_compete(z1, z2) which takes as its arguments the scores z1 and z2 of two teams. Your function then simulates a game between the two teams (with probabilities of victory as described above) and returns 1 if the first team won and 2 if the second team won. Hint: look at the documentation for the prob argument of sample().

To test your function, run it 10,000 times for several values of z1 and z2 and check that the proportion of times each team wins is roughly equal to the probability (as per the Law of Large Numbers).
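A minimal sketch following the hint:

# Simulate one Bradley-Terry game; returns 1 if the first team wins, 2 otherwise
bt_compete = function(z1, z2) {
  sample(1:2, size = 1, prob = c(z1, z2))
}

# Team 1 should win about z1 / (z1 + z2) of the time; with z1 = 2, z2 = 1 that's about 2/3
mean(replicate(10000, bt_compete(2, 1)) == 1)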
3b. Let's now simulate the NCAA tournament with a function bt_tournament(scores, names) where scores is a vector of 2^m scores and names is a vector of team names. We start by simulating a round of games. Team 2k-1 will play team 2k, with the loser being eliminated. Thus if we started with 2n teams, after a single round we have n teams. We then take all of the winners (in the same order as they started) and continue this process until there's only a single team left. Our function returns the name of the winning team. Fill in the following function skeleton to finish this function.
# Simulates a single-elimination tournament
# inputs:
# scores: a vector of Bradley-Terry scores. Length must be a power of two.
# names: a vector of identifiers for teams. Length must be the same as scores
# outputs:
# The name of the victorious team.
bt_tournament = function(scores, names = seq_along(scores)) {
  n_left = length(scores)
  # Make sure it has length equal to a power of two
  stopifnot((n_left != 0) && !bitwAnd(n_left, (n_left - 1)))
  stopifnot(length(scores) == length(names))
  while (n_left != 1) { # iterate until we only have one team remaining
    for (ii in seq_len(n_left / 2)) {
      winner = bt_compete(scores[2 * ii - 1], scores[2 * ii]) # returns 1 or 2
      # Move winner to ii-th place for next round; we can reuse the
      # array since we don't access element ii until the next round.
      ## winner_index = ??
      scores[ii] = scores[winner_index]
      names[ii] = names[winner_index] # keep names in sync with scores
    }
    n_left = n_left / 2
  }
  ## return(names[??])
}
Now create 32 teams with scores initialized as Gamma(5, 10) random variables and then run 10,000 tournaments (with the scores held constant). Report how many times the best team won. Plot a scatterplot of score versus the number of times each team wins the tournament.
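Once the skeleton above is completed, the simulation might look like the following (reading Gamma(5, 10) as shape 5 and rate 10, and with an arbitrary seed only for reproducibility):

set.seed(5)  # arbitrary choice, just so the scores are reproducible
scores = rgamma(32, shape = 5, rate = 10)
winners = replicate(10000, bt_tournament(scores))
sum(winners == which.max(scores))  # number of times the best team won
plot(scores, tabulate(winners, nbins = 32), xlab = "score", ylab = "tournament wins")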
Challenge: A common complaint heard during the tournament is that "my team got a bad seeding" (and notice how you never hear someone discuss their good seeding). We might be curious: what is the effect of seeding on one's probability of winning the tournament?
The following function will provide the usual seeding for a single-elimination tournament (the NCAA is a bit different).
tournament_order = function(scores) {
  n = length(scores)
  stopifnot((n != 0) && !bitwAnd(n, (n - 1)))
  seeding = c(1, 2)
  while (length(seeding) != n) {
    new_seeding = rep(NA, length(seeding) * 2)
    new_seeding[c(TRUE, FALSE)] = seeding
    new_seeding[c(FALSE, TRUE)] = length(new_seeding) + 1 - seeding
    seeding = new_seeding
  }
  return(order(scores, decreasing = TRUE)[seeding])
}
We’ll then compare against a “worst” case seeding determined by the following function.
worst_order = function(scores) {
  return(order(scores, decreasing = TRUE))
}
First, explain how these functions work and why they're named as they are. Next, using the same scores as before, run 10,000 tournaments with these two seedings. Record the number of times the best team wins under each seeding. Is there a difference between the two? Should we conclude anything about whether seeding matters?
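A sketch of the comparison, reusing scores and bt_tournament() from above:

# Reorder the teams by each seeding, then count how often the top-scored team wins
best_seeding = tournament_order(scores)
worst_seeding = worst_order(scores)
wins_seeded = replicate(10000, bt_tournament(scores[best_seeding], names = best_seeding))
wins_worst = replicate(10000, bt_tournament(scores[worst_seeding], names = worst_seeding))
c(seeded = sum(wins_seeded == which.max(scores)), worst = sum(wins_worst == which.max(scores)))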