n Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an knitted HTML file on Canvas, by Thursday 10pm, this week.
This week’s agenda: practice writing functions, creating simulations, using ``replicate’’
We are going to continue the drug effect model that was discussed in the “Simulation” lecture. That is, we will simulate the effects of using a drug and not using a drug to see hypothetically. This will allow us to investigate how different parameters of our model affect the number of subjects needed to observe a significant difference without calculating complicated math.
Suppose that there is a new drug that can be optionally given before chemotherapy. We follow the setup given in the “Simulation” lecture. We believe those who aren’t given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1), \] whereas those who were given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2). \] Here \(\mathrm{Exp}\) denotes the exponential distribution, and \(\mathrm{Unif}\) the uniform distribution. Now consider the following scenario. In the following questions, we will set up a way to simulate this model.
1a. The first function we write will generate the simulation data. Write a function called simulate.data(n, mu.drug)
that produces measurements in the drug and no drug groups. Your function should take two inputs: n
, the sample size (i.e., number of subjects in each group), with a default value of 60; and mu.drug
, the mean for the exponential distribution that defines the drug tumor reduction measurements, with a default value of 2. Your function should return a list with two vectors called no.drug
and drug
. Each of these two vectors should have length n
, containing the percentage reduction in tumor size under the appropriate condition (not taking the drug or taking the drug). (Hint: This function should use rexp
appropriately. You can use code snippets from the slides to help you out. You’ll need to recall some properties of the Exponential distribution to make sense of the code snippets in the slides.)
1b. We will now use simulate.data()
for different seed settings to see if we are properly generating data. Run simulate.data()
without any arguments (hence, relying on the default values of n
and mu.drug
), and store the results in results.1
. Print out the first 6 values in both the results.1$no.drug
and results.1$drug
vectors. Now, run simulate.data()
again, and store its value in results.2
. Again, print out the first 6 values in both the results.2$no.drug
and results.2$drug
vectors. We have effectively simulated two hypothetical datasets. Hence, the values in each of these 4 vectors should all be different.
1c. Even though we simulated two datasets (each with completely different values), we would still like to ensure that in both datasets, the mean value of no.drug
is roughly the same across all datasets, and the value of drug
is value the same across all datasets. Compute the following three numbers: the absolute difference in the mean values of no.drug
between response.1
and response.2
, the absolute difference in the mean values of drug
between response.1
and response.2
, and the absolute difference in mean values of no.drug
and drug
in response.1
. Of these three numbers, which one is the largest, and does this make sense? (Hint: By absolute difference, we mean you should compute the difference and then take the absolute value.)
1d. Now, we want to visualize the dataset. Fortunately, the code to visualize the dataset is already provided for you in the “Simulation” lecture, but it is not written as a function. We will write a function plot.data(data)
that does this. This function has one input: data
with no default value. data
represents the dataset that simulate.data()
generates. Follow the code in Slide 22 (of 28) in the “Simulation” lecture (where you will only need to make minor modifications) so plot.data(data)
visualizes data
. Specifically, your function will plot two histograms, one for data$no.drug
(in gray) and data$drug
(in red) with the same 20 bins, overlay a density curve for each histogram in the appropriate colors, and plot a legend.
1e. We will now use plot.data()
to plot three different datasets (and thus make three different plots). Run plot.data()
on results.1
. Then, plot.data()
on results.2
. As mentioned in Question 1b and 1c, these datasets should be different. Hence, their respective plots should not be exactly the same, but they should look “similar”. Then, in one line, generate a new dataset using simulate.data()
where n
is 1000 and mu.drug
is 1.1, and plot the data using plot.data()
. In one or two sentences, describe some major differences that you see between this third plot and the first two plots.
1f. In the next section to come, we will be generating many hypothetical datasets to see how many subjects we need to observe a difference between taking the drug and not taking the drug. To do this, we will write a function called simulate.difference()
, which takes in the same two arguments as simulate.data()
, which are n
and mu.drug
, both of which use the same default parameters as simulate.data()
. This function will generate a new dataset using simulate.data()
using the appropriate inputs, and compute the difference in means between drug
and no.drug
(no absolute value). That is, the mean value of drug
minus the mean value of no.drug
. Your function should return this value. Run this function twice with no arguments (hence, using the default parameters) to see that your function is returning different numbers, and run the function once with n=1000
and mu.drug=10
. Print out all 3 return values. This last value should be substantially larger than the first two.
With your simulation set up, we can now investigate how the parameters of our simulation (namely, n
and mu.drug
) affect the outcomes. While the relationship between n
, mu.drug
and the outcome of simulate.difference()
are not too hard to mathematically derive in this particular lab, you can imagine much more complicated models where it’s easier to simulate the model instead of mathematically deriving the answer.
The next few questions will work with this hypothetical: suppose we work for a drug company that wants to put this new drug out on the market. In order to get FDA approval, your company must demonstrate that the patients who had the drug had on average a reduction in tumor size at least 100 percent greater than those who didn’t receive the drug, or in math: \[ \overline{X}_{\mathrm{drug}} - \overline{X}_{\mathrm{no\,drug}} \geq 100. \] Your drug company wants to spend as little money as possible. They want the smallest number n such that, if they were to run a clinical trial with n patients in each of the drug / no drug groups, they would likely succeed in demonstrating that the effect size (as above) is at least 100. Of course, the result of a clinical trial is random; your drug company is willing to take “likely” to mean successful with probability 0.95, i.e., successful in 190 of 200 hypothetical clinical trials (though only 1 will be run in reality).
2a. Following the code sketch provided in the “Simulation” lecture (Slide 25), write a function called rep.sim
. This function takes in 4 arguments, nreps
(the number of repetitions, with default value of 200), n
and mu.drug
(the values needed for simulate.difference()
, with the same default values) and seed
(with default value NULL
). This function should run simulate.differences()
nreps
number of times, and then return the number of success, i.e., the number of times that the output of simulate.difference()
exceeds 100. Demonstrate your function works by using it on mu.drug = 1.5
. (Note: While you could use a for-loop (shown in the slides) or one of the *apply functions, for this question, you can also use the replicate
function. Be sure to check the documentation for replicate
if you are unfamiliar with it. Essentially, replicate
takes in two arguments, the number of replications you want to perform and the expression you are replicating.)
2b. We will now investigate the effect of n
, where mu.drug
is fixed to be 2. For each value of the input n
(the sample size) in between 5 and 100 (inclusive), run your function rep.sim
. You can do this using a for-loop or one of the *apply functions, and store the number of success in a vector. So to be clear, for each sample size in between 5 and 100, you should have a corresponding number of successes. Plot the number of successes versus the sample size, and label the axes appropriately. Based on your simulation, what is the smallest sample size for which the number of successes is 190 or more?
2c. Now suppose your drug company told you they only had enough money to enlist 20 subjects in each of the drug / no drug groups, in their clinical trial. They then asked you the following question: how large would mu.drug
have to be, the mean proportion of tumor reduction in the drug group, in order to have probability 0.95 of a successful drug trial? Run a simulation, much like your simulation in the last problem, to answer this question. Specifically, similar to before, for each value of the input mu.drug
in between 0 and 5, in increments of 0.25, run your function rep.sim()
, with n=20
and nreps = 200
. Plot the number of successes versus the value of mu.drug
, and label the axes appropriately. What is the smallest value of mu.drug
for which the number of successes exceeds 190?
2d. In this last question, we will be modifying the simulation setup and see how it changes the results we observe. Here, we will not provide you with the step-by-step details of how to explicitly change the setup, and this question is a prelude to the homework questions. Here is the new setup: suppose we start with n=5
subjects (as before, this means 5 subjects with the drug, 5 subjects without the drug). We compute the difference in means between using the drug and not using the drug (just like before). If this difference is equal to or larger than 100, we declare success and stop. Here is the change: if the difference is smaller than 100, we collect 5 new subjects with the drug and 5 new subjects without the drug. Then, once again, we compute the difference in means between the subjects with the drug and the subjects without the drug, and we declare success if this difference is equal to or larger than 100. We keep incrementing by 5 new subjects with the drug and without the drug until we have a total of 60 subjects with the drug and 60 subjects without the drug. If we still do not observe a difference in means larger than 100 at this point, then we declare the a failure. Change the functions simulate.data()
, simulate.difference()
and rep.sim()
if necessary to accommodate this new scheme. Then, similar to Question 2a, run this simulation with 200 repetitions with mu.drug = 1.5
, and print out how many success there were. How does this number compare with the result in Question 2a?