Simulation basics
R gives us unique access to great simulation tools (unique compared to other languages). Why simulate? Welcome to the 21st century! Two reasons:
To sample from a given vector, use sample()
sample(x=letters, size=10) # Without replacement, the default
## [1] "n" "h" "k" "p" "b" "e" "f" "t" "l" "y"
sample(x=c(0,1), size=10, replace=TRUE) # With replacement
## [1] 1 1 0 1 0 0 0 0 0 1
sample(x=10) # Arguments set as x=1:10, size=10, replace=FALSE
## [1] 9 6 7 3 8 4 5 1 2 10
To sample from a normal distribution, we have the utility functions:
rnorm()
: generate normal random variablespnorm()
: normal distribution function, \(\Phi(x)=P(Z \leq x)\)dnorm()
: normal density function, \(\phi(x)= \Phi'(x)\)qnorm()
: normal quantile function, \(q(y)=\Phi^{-1}(y)\), i.e., \(\Phi(q(y))=y\)Replace “norm” with the name of another distribution, all the same functions apply. E.g., “t”, “exp”, “gamma”, “chisq”, “binom”, “pois”, etc.
Standard normal random variables (mean 0 and variance 1)
n = 1000
z = rnorm(n, mean=0, sd=1) # These are the defaults for mean, sd
mean(z) # Check: sample mean is approximately 0
## [1] 0.01749316
var(z) # Check: sample variance is approximately 1
## [1] 0.9824088
To compute empirical cumulative distribution function (ECDF)—the standard estimator of the cumulative distribution function—use ecdf()
x = seq(-3,3,length=100)
ecdf.fun = ecdf(z) # Create the ECDF
class(ecdf.fun) # It's a function!
## [1] "ecdf" "stepfun" "function"
ecdf.fun(0)
## [1] 0.499
# We can plot it
plot(ecdf.fun, lwd=2, col="red", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))
To compute histogram—a basic estimator of the density based on binning—use hist()
hist.obj = hist(z, breaks=30, plot=FALSE)
class(hist.obj) # It's a list
## [1] "histogram"
hist.obj$breaks # These are the break points that were used
## [1] -3.2 -3.0 -2.8 -2.6 -2.4 -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6
## [15] -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
## [29] 2.4 2.6 2.8 3.0 3.2 3.4 3.6
hist.obj$density # These are the estimated probabilities
## [1] 0.010 0.000 0.005 0.015 0.035 0.035 0.055 0.115 0.135 0.195 0.195
## [12] 0.215 0.345 0.345 0.440 0.355 0.300 0.370 0.350 0.355 0.285 0.300
## [23] 0.220 0.110 0.060 0.020 0.070 0.025 0.015 0.015 0.000 0.005 0.000
## [34] 0.005
# We can plot it
plot(hist.obj, col="pink", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2,
col=c("pink","black"))
Interesting statistical facts:
density()
uses a kernel density estimatorPseudorandomness and seeds
Not surprisingly, we get different draws each time we call rnorm()
mean(rnorm(n))
## [1] 0.01539173
mean(rnorm(n))
## [1] -0.02792472
mean(rnorm(n))
## [1] -0.01196745
mean(rnorm(n))
## [1] -0.007461068
Random numbers generated in R (in any language) are not “truly” random; they are what we call pseudorandom
?Random
in your R console to read more about this (and to read how to change the algorithm used for pseudorandom number generation, which you should never really have to do, by the way)All pseudorandom number generators depend on what is called a seed value
set.seed()
# Getting the same 5 random normals over and over
set.seed(0); rnorm(5)
## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414
set.seed(0); rnorm(5)
## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414
set.seed(0); rnorm(5)
## [1] 1.2629543 -0.3262334 1.3297993 1.2724293 0.4146414
# Different seeds, different numbers
set.seed(1); rnorm(5)
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
set.seed(2); rnorm(5)
## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176
set.seed(3); rnorm(5)
## [1] -0.9619334 -0.2925257 0.2587882 -1.1521319 0.1957828
# Each time the seed is set, the same sequence follows (indefinitely)
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1] 1.2629543 -0.3262334 1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1] 1.2629543 -0.3262334 1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
set.seed(0); rnorm(3); rnorm(2); rnorm(1)
## [1] 1.2629543 -0.3262334 1.3297993
## [1] 1.2724293 0.4146414
## [1] -1.53995
Iteration and simulation
What would you do if you had such a model, and your scientist collaborators asked you: how many patients would we need to have in each group (drug, no drug), in order to reliably see that the average reduction in tumor size is large?
# Simulate, supposing 60 subjects in each group
set.seed(0)
n = 60
mu.drug = 2
mu.nodrug = runif(n, min=0, max=1)
x.drug = 100*rexp(n, rate=1/mu.drug)
x.nodrug = 100*rexp(n, rate=1/mu.nodrug)
# Find the range of all the measurements together, and define breaks
x.range = range(c(x.nodrug,x.drug))
breaks = seq(min(x.range),max(x.range),length=20)
# Produce hist of the non drug measurements, then drug measurements on top
hist(x.nodrug, breaks=breaks, probability=TRUE, xlim=x.range,
col="lightgray", xlab="Percentage reduction in tumor size",
main="Comparison of tumor reduction")
# Plot a histogram of the drug measurements, on top
hist(x.drug, breaks=breaks, probability=TRUE, col=rgb(1,0,0,0.2), add=TRUE)
# Draw estimated densities on top, for each dist
lines(density(x.nodrug), lwd=3, col=1)
lines(density(x.drug), lwd=3, col=2)
legend("topright", legend=c("No drug","Drug"), lty=1, lwd=3, col=1:2)
set.seed()
Consider the code below for a generic simulation. Think about how you would frame this for the drug effect example, which you’ll revisit in lab
# Function to do one simulation run
one.sim = function(param1, param2=value2, param3=value3) {
# Possibly intricate simulation code goes here
}
# Function to do repeated simulation runs
rep.sim = function(nreps, param1, param2=value2, param3=value3, seed=NULL) {
# Set the seed, if we need to
if(!is.null(seed)) set.seed(seed)
# Run the simulation over and over
sim.objs = vector(length=nreps, mode="list")
for (r in 1:nreps) {
sim.objs[r] = one.sim(param1, param2, param3)
}
# Aggregate the results somehow, and then return something
}
Sometimes simulations take a long time to run, and we want to save intermediate or final output, for quick reference later
There two different ways of saving things from R (there are more than two, but here are two useful ones):
saveRDS()
: allows us to save single R objects (like a vector, matrix, list, etc.), in (say) .rds format. E.g.,
saveRDS(my.mat, file="my.matrix.rds")
save()
: allows us to save any number of R objects in (say) .rdata format. E.g.,
save(mat.x, mat.y, list.z, file="my.objects.rdata")
Note: there is a big difference between how these two treat variable names
Corresponding to the two different ways of saving, we have two ways of loading things into R:
readRDS()
: allows us to load an object that has been saved by savedRDS()
, and assign a new variable name. E.g.,
my.new.mat = readRDS("my.matrix.rds")
load()
: allows us to load all objects that have been saved through save()
, according to their original variables names. E.g.,
load("my.objects.rdata")
set.seed()