Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 5 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 5 questions from other labs. Your homework writeup must start the same way as this one, by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 6pm on Sunday October 9. This document contains 14 of the 45 total points for Homework 5.

Generating a random word table

Recall the following strategy, from Lab 6w and Hw5 Q6, for generating a random word table.
- Generate, 1000 times total, a lower case character between “a” and “z”, or a space “ ”, with each letter occurring with probability 1/52 and the space occurring with probability 1/2
- Collapse this vector of random characters into one big string
- Split according to spaces
- Remove empty strings
- Compute a word table

Write a function called random.wordtab() to generate a random word table according to the above recipe, essentially just function-izing your code from Hw5 Q6. The inputs should be nchar, the number of total characters generated in the initial random generation of characters (letters or spaces), with a default of 1000; and seed, an integer to use in set.seed(), with a default of NULL, meaning that no seed should be set. The return value should be a list, with the following named elements: wordtab, the random word table that was generated; number.unique.words, the number of unique words in the word table; max.word.count, the largest word count in the word table; and max.word.length, the largest character count of any word in the word table (i.e., length of the longest word). Run your function with the default inputs and display the results.
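
As a rough illustration of the recipe and the requested interface, here is one possible sketch. It is not the official solution: the internal variable names are made up for this example, and the handling of seed (calling set.seed() only when seed is not NULL) is one reasonable reading of the description above.

```r
# Sketch only: wraps the word table recipe in a function with the requested inputs
random.wordtab = function(nchar=1000, seed=NULL) {
  # Only fix the random number generator when a seed is supplied
  if (!is.null(seed)) set.seed(seed)
  # 26 letters with probability 1/52 each, plus a space with probability 1/2
  chars = sample(c(letters, " "), size=nchar, replace=TRUE,
                 prob=c(rep(1/52, 26), 1/2))
  # Collapse into one big string, split on spaces, drop empty strings
  words = strsplit(paste(chars, collapse=""), split=" ")[[1]]
  words = words[words != ""]
  wordtab = table(words)
  # The argument nchar is a number here, so we call base::nchar() explicitly
  # to make it clear that the string-length function is meant
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              max.word.count=max(wordtab),
              max.word.length=max(base::nchar(names(wordtab)))))
}

res = random.wordtab()
res$number.unique.words
res$max.word.count
res$max.word.length
```

Returning a named list lets later code pull out a single summary, such as res$max.word.length, without recomputing the table.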

Run your function random.wordtab() created in the previous question 1000 times, to generate 1000 random word tables. From these 1000 word tables, save the maximum word lengths into a vector called max.word.lengths. (Hint: use a for() loop.) Plot a histogram of max.word.lengths, with an appropriate setting for breaks, appropriately labeled axes, and an appropriate title. Briefly describe the distribution that you see.
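
The loop-and-histogram pattern might look like the sketch below. To keep the example runnable on its own, it uses a hypothetical stand-in function, one.max.word.length(), which returns only the longest word length from one random draw; in the lab you would call your own random.wordtab() and extract its max.word.length element instead.

```r
# Hypothetical stand-in so this sketch runs on its own
one.max.word.length = function(nchar=1000) {
  chars = sample(c(letters, " "), size=nchar, replace=TRUE,
                 prob=c(rep(1/52, 26), 1/2))
  words = strsplit(paste(chars, collapse=""), split=" ")[[1]]
  # Empty strings have length 0, so they do not affect the maximum
  return(max(base::nchar(words)))
}

n.reps = 1000
max.word.lengths = numeric(n.reps)   # preallocate the result vector
for (i in 1:n.reps) {
  max.word.lengths[i] = one.max.word.length()
}

# Word lengths are integers, so center one histogram bin on each integer
breaks = seq(min(max.word.lengths) - 0.5, max(max.word.lengths) + 0.5, by=1)
hist(max.word.lengths, breaks=breaks, xlab="Maximum word length",
     ylab="Frequency", main="Maximum word lengths in 1000 random word tables")
```

Preallocating max.word.lengths with numeric(n.reps) avoids growing the vector inside the loop.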

Hw5 Q7 (2 points). Call your function random.wordtab() with nchar set to 1e7 (and seed remaining NULL). Report the number of unique words, the max word count, and the max word length. Then save the word table (not the whole returned list, just the word table), using saveRDS(), to a file called “<myandrewID>_wordtab.rds”, where <myandrewID> is your Andrew ID. Submit this file along with your knitted HTML file, when you submit the homework.
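
For reference, saving and reloading a table with saveRDS() and readRDS() might look like the following sketch. The tiny table and the file path in a temporary directory are placeholders for illustration only; in the homework you would save your own word table under the file name requested above.

```r
# Small stand-in table (made-up data) to show the save/load round trip
wordtab = table(c("ab", "c", "ab", "xyz"))

# Placeholder file name in a temporary directory; substitute your own Andrew ID
file.name = file.path(tempdir(), "myandrewID_wordtab.rds")
saveRDS(wordtab, file=file.name)

# Read the file back in and check that the table survived the round trip
wordtab.check = readRDS(file.name)
all(wordtab.check == wordtab)
```

Unlike save(), saveRDS() stores a single object without its name, so readRDS() returns the object and you choose what to call it.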

Simulating the effect of a drug on tumor reduction

Recall from the “Iteration and Simulation” mini-lecture our thought experiment on drug effects. We suppose that there is a new drug that can be optionally given before chemotherapy. We believe those who aren’t given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{no\,drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=R), \;\;\; R \sim \mathrm{Unif}(0,1), \] whereas those who were given the drug experience a reduction in tumor size of percentage \[ X_{\mathrm{drug}} \sim 100 \cdot \mathrm{Exp}(\mathrm{mean}=2). \] Here \(\mathrm{Exp}\) denotes the exponential distribution, and \(\mathrm{Unif}\) the uniform distribution. Now consider the following scenario. (You don’t have to do anything, just read this carefully as a setup for what’s to come.)
- You work for a drug company that wants to put this new drug out on the market.
- But in order to get FDA approval, your company must demonstrate that the patients who had the drug had on average a reduction in tumor size at least 100 percent greater than those who didn’t receive the drug, or in math: \[ \overline{X}_{\mathrm{drug}} - \overline{X}_{\mathrm{no\,drug}} \geq 100. \]
- Your drug company wants to spend as little money as possible. They want the smallest number n such that, if they were to run a clinical trial with n patients in each of the drug / no drug groups, they would likely succeed in demonstrating that the effect size (as above) is at least 100.
- Of course, the result of a clinical trial is random; your drug company is willing to take “likely” to mean successful with probability 0.95, i.e., successful in 950 of 1000 hypothetical clinical trials (though only 1 will be run in reality).
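
As a quick sanity check on the setup above, the expected effect size under the stated model can be worked out by hand, using the facts that an exponential random variable has expectation equal to its mean parameter and that \(\mathrm{Unif}(0,1)\) has mean \(1/2\): \[ \mathbb{E}[X_{\mathrm{no\,drug}}] = 100 \cdot \mathbb{E}[R] = 100 \cdot \tfrac{1}{2} = 50, \;\;\; \mathbb{E}[X_{\mathrm{drug}}] = 100 \cdot 2 = 200, \] so that \[ \mathbb{E}\big[\overline{X}_{\mathrm{drug}} - \overline{X}_{\mathrm{no\,drug}}\big] = 200 - 50 = 150. \] The expected difference is above the threshold of 100, but the observed difference in any single trial varies around this value, and varies more when n is small. This is why the sample size matters.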

Write a function around the simulation code from the “Iteration and Simulation” mini-lecture that produces measurements in the drug and no drug groups. Your function should be called sim.drug.effect(), and it should take two inputs: n, the sample size (i.e., number of subjects in each group), with a default value of 60; and mu.drug, the mean for the exponential distribution that defines the drug tumor reduction measurements, with a default value of 2. Your function should return the average difference in tumor reduction between the subjects who received the drug, and those who didn’t. (Note: this function shouldn’t call set.seed(); you want its results to be random.)
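
One way such a function could look is sketched below. This is an illustration, not the mini-lecture code itself: it assumes that each subject in the no drug group gets their own mean \(R\) drawn from \(\mathrm{Unif}(0,1)\), and the internal variable names are made up for this example.

```r
# Sketch of sim.drug.effect(), under the assumptions stated above
sim.drug.effect = function(n=60, mu.drug=2) {
  # Assumption: one mean R ~ Unif(0,1) per subject in the no drug group
  mu.nodrug = runif(n, min=0, max=1)
  # rexp() is parametrized by the rate, which is 1/mean
  x.drug = 100 * rexp(n, rate=1/mu.drug)
  x.nodrug = 100 * rexp(n, rate=1/mu.nodrug)
  # Average difference in tumor reduction, drug minus no drug
  return(mean(x.drug) - mean(x.nodrug))
}

sim.drug.effect()
```

Because rexp() takes a rate rather than a mean, passing the mean directly is an easy mistake to make; the rate=1/mean conversion is the main detail to check.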

Run your function sim.drug.effect(), with the default inputs, a total of 1000 times. For each of these 1000 trials, record the average difference in tumor reduction. Then report the number of successes, i.e., the number of times (out of 1000) that this difference exceeds 100. (Hint: use a for() loop.)

For each value of the input n (the sample size) in between 5 and 100, run your function sim.drug.effect() a total of 1000 times. For each of these 1000 trials, record the average difference in tumor reduction from each run; then count the number of successes, i.e., the number of times (out of 1000) that this difference exceeds 100. (Hint: use a double for() loop.) So to be clear, for each sample size in between 5 and 100, you should have a corresponding number of successes. Plot the number of successes versus the sample size, and label the axes appropriately. What is the smallest sample size for which the number of successes exceeds 950?
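
The double for() loop described here could be organized as in the following sketch. So that the example runs on its own, it includes a compact illustrative version of sim.drug.effect() (assuming one \(R\) per no drug subject); the names n.values, n.reps, diffs, and num.successes are made up for this example.

```r
# Compact illustrative version of sim.drug.effect(); substitute your own
sim.drug.effect = function(n=60, mu.drug=2) {
  x.drug = 100 * rexp(n, rate=1/mu.drug)
  x.nodrug = 100 * rexp(n, rate=1/runif(n, min=0, max=1))
  return(mean(x.drug) - mean(x.nodrug))
}

n.values = 5:100
n.reps = 1000
num.successes = numeric(length(n.values))

for (j in seq_along(n.values)) {     # outer loop: over sample sizes
  diffs = numeric(n.reps)
  for (i in 1:n.reps) {              # inner loop: over simulated trials
    diffs[i] = sim.drug.effect(n=n.values[j])
  }
  num.successes[j] = sum(diffs > 100)
}

plot(n.values, num.successes, xlab="Sample size per group (n)",
     ylab="Number of successes (out of 1000)")
abline(h=950, lty=2)   # reference line at the 950 target

# Smallest sample size with more than 950 successes (NA if there is none)
n.values[which(num.successes > 950)[1]]
```

Since the results are random, the smallest qualifying sample size can change a little from run to run.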

Hw5 Q8 (6 points). Now suppose your drug company told you they only had enough money to enlist 20 subjects in each of the drug / no drug groups, in their clinical trial. They then asked you the following question: how large would mu.drug have to be, the mean proportion of tumor reduction in the drug group, in order to have probability 0.95 of a successful drug trial? Run a simulation, much like your simulation in the last problem, to answer this question. Specifically, for each value of the input mu.drug in between 1.5 and 5, in increments of 0.1, run your function sim.drug.effect(), with n=20, a total of 1000 times. As before, for each of these 1000 trials, record the average difference in tumor reduction from each run; then count the number of successes, i.e., the number of times (out of 1000) that this difference exceeds 100. (Hint: use a double for() loop again.) Plot the number of successes versus the value of mu.drug, and label the axes appropriately. What is the smallest value of mu.drug for which the number of successes exceeds 950?

Hw5 Q9 (6 points). It turns out that the drug company can actually control mu.drug, the mean proportion of tumor reduction among the drug subjects, by adjusting the dose concentration of some secret special chemical. But there is no free lunch: the higher the concentration of this secret chemical, the more likely a subject is to have liver failure. In particular, suppose that each patient who is on the drug dies with probability mu.drug/100. The FDA has a policy that if one or more subjects die in a clinical trial, then the trial is shut down. In this case, the trial is clearly not counted as a success (even if the average difference in tumor reduction percentage was huge, between surviving members of the two groups).

As in the last question, suppose that the drug company only has enough money to enlist n=20 people in each of the drug / no drug groups in their clinical trial. Adapt your simulations from the last question to incorporate the fact that patients can die from liver failure, in the drug group, as described. (Hint: you can do this with only a careful but minor modification to the code. After computing the average difference in tumor reduction between the drug / no drug groups from a simulation run, add some code to flip a coin 20 times with the “right” probability for heads, using rbinom(). If the number of heads is greater than or equal to 1, then reset the average difference in tumor reduction to be 0, because this trial cannot be counted as a success, according to the FDA rules.) As before, plot the number of successes (out of 1000) as a function of mu.drug. Is there any hope here, i.e., is there a value of mu.drug for which we have at least 950 successes?
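
The coin-flipping step in the hint might be sketched as follows for a single simulated trial. The value of avg.diff here is a made-up placeholder standing in for one simulated difference, and the names num.deaths and success are chosen only for this example.

```r
n = 20
mu.drug = 2
avg.diff = 120   # placeholder for the difference from one simulated trial

# Flip n coins at once: the number of drug subjects who die, when each
# dies independently with probability mu.drug/100
num.deaths = rbinom(1, size=n, prob=mu.drug/100)

# One or more deaths shuts the trial down, so it cannot count as a success
if (num.deaths >= 1) avg.diff = 0
success = (avg.diff > 100)
success
```

A single call rbinom(1, size=n, prob=p) returns the number of heads in n flips, so no loop over subjects is needed.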

Lab 6f: Iteration and Simulation

Statistical Computing, 36-350

Friday October 7, 2016
