Statistical Computing, 36-350
Wednesday December 2, 2015
t random variables, with 5 df:
n = 1000
t.draws = rt(n,df=5) # t5 random variables
mean(t.draws) # Check: mean approx 0
## [1] 0.0524523
var(t.draws) # Check: variance approx 5/3
## [1] 1.835184
Empirical distribution function:
# Let's plot the empirical distribution function
t.ecdf = ecdf(t.draws)
plot(t.ecdf, lwd=2, col="blue")
# Let's also plot the actual distribution function on top
x = seq(-6,6,length=100)
lines(x, pt(x,df=5), lwd=2, lty=2)
legend("topleft", lty=c(2,1), lwd=3,
col=c("black","blue"), legend=c("Actual","Estimated"))
# Note: t.ecdf is an actual function! We can evaluate it
t.ecdf(1)
## [1] 0.8
pt(1,df=5)
## [1] 0.8183913
Histogram and empirical density functions:
# Let's plot a histogram of our t5 random variables
hist(t.draws, breaks=40, probability=TRUE, col="lightgray")
# Let's also draw the t5 density on top
x = seq(-6,6,length=100)
lines(x, dt(x,df=5), lwd=3, lty=2)
# Let's also draw an empirical (estimated) density on top
lines(density(t.draws), col="red", lwd=3)
legend("topright", lty=c(2,1), lwd=3,
col=c("black","red"), legend=c("Actual","Estimated"))
Advanced fact for you to remember (just if you are curious): it’s a lot easier to estimate a distribution function, than to estimate a density function!
R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:
# Simulate from model, supposing 50 subjects in each group
set.seed(0)
n = 50
mu.drug = 2
mu.nodrug = runif(n,min=0,max=1)
x.drug = 100*rexp(n,rate=1/mu.drug)
x.nodrug = 100*rexp(n,rate=1/mu.nodrug)
summary(x.drug)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.454 72.050 165.000 207.400 258.600 966.600
summary(x.nodrug)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4944 10.1800 33.7700 46.8000 73.2100 205.2000
# Find the range of all the measurements together
x.range = range(c(x.nodrug,x.drug))
# Here we are manually defining breaks for the hist() function
breaks = seq(min(x.range),max(x.range),length=20)
# Produce a histogram of the non drug measurements
hist(x.nodrug, breaks=breaks, probability=TRUE, xlim=x.range,
xlab="Percentage reduction in tumor size",
col="lightgray", main="Comparison of tumor reduction")
# Plot a histogram of the drug measurements, on top
hist(x.drug, breaks=breaks, probability=TRUE, add=TRUE,
col=rgb(1,0,0,0.2)) # Transparent color! r, g, b, alpha
lines(density(x.nodrug),lwd=3)
lines(density(x.drug),lwd=3,col="red")
legend("topright", lty=1, lwd=3, col=c("black","red"),
legend=c("No drug","Drug"))
Consider the following:
A thought experiment, first. Given the model for tumor reduction with and without drugs, is it even possible to see an average difference of 100 percent between the two groups?
The larger \(n\) gets, the closer the averages get to the (population) means under each model. Well \[ E[X_{\mathrm{drug}}] = 100 \cdot E[\mathrm{Exp}(\mathrm{mean}=2)] = 200 \] And \[ E[X_{\mathrm{no drug}}] = 100 \cdot E[\mathrm{Exp}(\mathrm{mean}=R)] < 100 \cdot E[\mathrm{Exp}(\mathrm{mean}=1)] = 100 \] since, recall, \(R \sim \mathrm{Unif}(0,1)\). So the difference in means is at least 100. This means that it’s certainly possible for the difference in averages to be at least 100
Enter your unique ID here:
Work through the following problems (go ahead and fill in the code below)
# 1. Write a function around the simulation code above, that produces measurements in
# the drug and no drug groups. Your function should be called sim.drug.effect(), and
# it should take two arguments:
# - n, the number of subjects in each group, with a default value of 50
# - mu.drug, the mean for the exponential distribution that defines the drug
# tumor reduction measurements, with a default value of 2
# Your function should return the average difference in tumor reduction between the
# subjects who received the drug, and those who didn't.
# Note: this function should NOT call set.seed(), you want its results to be random
# 2. For each value of n in between 5 and 200, perform the following. Run
# sim.drug.effect() a total of 100 times, with the given value of n; record
# the average difference in tumor reduction; and then count the number of
# successes, i.e., the number of times (out of 100) that this difference
# exceeds 100. Make a line plot which shows the number of successes as a
# function of n. Label the axes appropriately. What is the smallest n for
# which the number of successes exceeds 95?
# (Hint: for this task, a double for() loop would work just fine ...)
# 3. Now suppose your drug company told you they only had enough money to
# enlist n=20 subjects in each of the drug / no drug groups, in their clinical
# trial. They then asked you the following question: how large would mu.drug
# have to be, the mean proportion of tumor reduction in the drug group, in order
# to have probability 0.95 of a successful drug trial? Run a simulation,
# much like your simulation in problem 2, to answer this question.
# (Hint: now you will want to let mu.drug vary over a wide range, say 1.5 to 5,
# and for each value, simulate data with n=20 from the drug / no drug groups,
# compute difference of average tumor reduction percentages, etc.)
# 4. So, it turns out that the drug company can actually control mu.drug, the
# mean proportion of tumor reduction among the drug subject, by adjusting the
# dose concentration of some secret special chemical. But there is no free
# lunch: the higher concentration of this secret chemical, the more likely
# a subject is to have liver failure. In particular, suppose that:
# - people on the drug die with probability sqrt(mu/4000)
# - the FDA has a policy that if 2 subjects die in a clinical trial, then
# the trial is shut down
# - in this case, the trial is clearly not counted as a success (even if
# the average difference in tumor reduction percentage was huge, between
# surviving members of the two groups)
# As in problem 3, suppose that the drug company only has enough money to enlist
# n=30 people in each of the drug / no drug groups in their clinical trial. Adapt
# your simulation from problem 3 to incorporate the fact that patients can die
# from liver failure, in the drug group. Count the number of successes (out of
# 100) as a function of mu.drug. Is there any hope here? I.e., is there a value
# for which we have at least 95 successes?
# Bonus. Suppose that the model for tumor reduction for non drug subjects is pretty
# accurate, but that for the drug subjects can be refined. Suppose in particular
# that there are three types of patients:
# - non-responders, making up 25% of the general population, for whom mu.drug is
# distributed as Unif(0,1)
# - regular-responders, making up 65% of the population, for whom mu.drug = 2
# - super-responders, making up 10% of the population, for whom mu.drug = 4
# Update your function sim.drug.effect() from problem 1 to accommodate this more
# accurate model the drug group of patients. Rerun your simulation from problem 2
# and answer the same questions as before. Does the required sample size n get
# smaller or bigger?
# (Hint: for simulating from this more accurate model, you'll want to flip a coin
# each time to determine what type of patient you are looking at: a non-responder,
# regular-responder, or super-response. The coin should be three-sided and have
# probabilities 0.24, 0.85, and 0.1 of landing on each side; you can use runif()
# to build such a coin ...)