You will indent code blocks (Control-I)
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: the pth quantile
qpareto.1 <- function(p, exponent, threshold) {
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
data
#Excerpt from HW1 solutions
data <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/16/hw/01/CAPA.csv")
value.median.mean <- lm(Median_house_value ~ Median_household_income
+ Mean_household_income, data=data)
A treat: The following is code from a prominent statistician:
### CENSORED TO PROTECT THE INNOCENT ###
Use functions to tie together related commands so they can be repeated
Example: Suppose that we want to easily compute the quantiles of the Pareto distribution. The density function is: \[f(x;\alpha,x_0) = \begin{cases} \frac{\alpha-1}{x_0}\left(\frac{x}{x_0}\right)^{-\alpha} & x \ge x_0\\ 0 & x < x_0 \end{cases}\] and the quantile function is \[Q(p; \alpha,x_0) = x_0(1-p)^{-\frac{1}{\alpha-1}}\]
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: the pth quantile
qpareto.1 <- function(p, exponent, threshold) {
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
Then instead of
# alpha = 2.33, x_0 = 6e8, p = 0.5
6e8 * (1-0.5)^(-1/(2.33-1))
## [1] 1010391288
# alpha = 2.33, x_0 = 6e8, p = 0.4
6e8 * (1-0.4)^(-1/(2.33-1))
## [1] 880957225
# alpha = 2.5, x_0 = 1e6, p = 0.92
1e6 * (1-0.92)^(-1/(2.5-1))
## [1] 5386087
we can say
qpareto.1(p=0.5, exponent=2.33, threshold=6e8)
## [1] 1010391288
qpareto.1(p=0.4, exponent=2.33, threshold=6e8)
## [1] 880957225
qpareto.1(p=0.92, exponent=2.5, threshold=1e6)
## [1] 5386087
Much harder to make mistakes!
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.2 <- function(p, exponent, threshold, lower.tail=TRUE) {
if(lower.tail==FALSE) {
p <- 1-p
}
q <- threshold*((1-p)^(-1/(exponent-1)))
return(q)
}
qpareto.2(p=0.92, exponent=2.5, threshold=1e6)
## [1] 5386087
qpareto.2(p=0.92, exponent=2.5, threshold=1e6, lower.tail=FALSE)
## [1] 1057162
# (Unrelated to the quantile functions)
# Function to return the biggest correlation of 5 length-10 Gaussian vectors
# Takes no arguments, so it can be used in a simulation
random.correlation.max = function(){
#Draw a 10x5 matrix of Gaussians
X = matrix(rnorm(50),nrow=10,ncol=5)
#Compute the correlation matrix
cor.matrix = cor(X)
#Find the biggest entry, making sure to avoid the diagonal (which is all 1)
return(max(cor.matrix[upper.tri(cor.matrix)]))
}
#Draw 100 samples and plot them
hist(replicate(100,random.correlation.max()),20,col='grey', main="Max. Correlation",xlab='correlation')
...
indicates pass-through arguments. You won’t use this very often, but many R
functions do.c()
or list()
to return more than one thing (We will see some examples of this)Example: From Solutions to Homework 1. You were asked to make several maps of your residuals. The following function was modified from last term.
# Set up a function to make maps
# "Terrain" color-levels set based on quantiles of the variable being plotted
# Inputs: vector to be mapped over the data frame; number of levels to
# use for colors; name of the longitude coordinate; name of the lattitude
# coordinate; data frame to take coordinates from; other plotting arguments
# Outputs: invisibly, list giving cut-points and the level each observation
# was assigned
mapper <- function(z, levels, x="LONGITUDE", y="LATITUDE", data,
legend.loc="topright", ncol=1, ...) {
# Which quantiles do we need?
probs <- seq(from=0, to=1, length.out=(levels+1))
# What are those quantiles?
z.quantiles <- quantile(z, probs)
# Assign each observation to its quantile
z.categories <- cut(z, z.quantiles, include.lowest=TRUE)
# Make up a color scale
shades <- terrain.colors(levels)
plot(x=data[,x], y=data[,y], col=shades[z.categories], ...)
legend(x=legend.loc, legend=levels(z.categories), col=shades,
ncol=ncol, pch=19)
invisible(list(quantiles=z.quantiles, categories=z.categories))
}
resample()
)Example I work on a statistics problem in inference for networks/graphs. Suppose that I want to run a hypothesis test about an edge in the graph. I might need:
This is helpful for many reasons:
It is often convenient to call functions from other functions, as we build functionality. As a simple example, we could have avoided replicating the quantile code from qpareto.1
in qpareto.2
as follows.
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.3 <- function(p, exponent, threshold, lower.tail=TRUE) {
if(lower.tail==FALSE) {
p <- 1-p
}
q <- qpareto.1(p, exponent, threshold)
return(q)
}
This calls qpareto.1
to do the actual calculation, after making corrections for the lower.tail
argument.
This is very useful! You will often be working with several intertwined functions at the same time. We should be very careful to understand what happens.
A very simple example:
# A very simple function that adds one to its argument
add.one = function(x){
x+1
}
# A function that calls add.one to do its addition
add.stuff = function(x){
return(add.one(x))
}
# See? We can add!
add.stuff(3)
## [1] 4
# A terrible job rewriting add.one
add.one = function(x){
x+2
}
#Now what?
add.stuff(3)
## [1] 5
The add.stuff
function looks for add.one
each time it is called. If add.one
has changed, it uses the new version of the function. It does not save a snapshot at the time add.stuff
is defined.
This is great, since it lets you easily edit all your functions and have the changes propagate. However, you do need to be aware and careful.
Eventually, you will break something.
traceback()
: Show all the calls leading to the last error
browser()
: A convenient way to mess around inside a function -When called within a function, gives you an R shell there -Avoids problems with local variables disappearingA favorite line: Executes [Dangerous Expression]
and calls browser()
if there is an error. Great for seeing problems that only happen sometimes inside of for loops inside of functions.
tryCatch({[Dangerous Expression]},warning=function(w){browser()},
error=function(e){browser()})
Example: Suppose that I want to sample a Pareto random variable. I know that if \(F(x)\) is the CDF for a distribution and \(U\) is a uniform random variable, then \(F^{-1}(U)\) has the F distribution.
First, for good practice, define a version of the quantile function that validates inputs
# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.4 <- function(p, exponent, threshold, lower.tail=TRUE) {
#browser()
stopifnot(p >= 0, p <= 1, exponent > 1, threshold > 0)
q <- qpareto.3(p,exponent,threshold,lower.tail)
return(q)
}
#Try it with a bad input
qpareto.4(p=0.92,exponent=2.5,threshold=-1,lower.tail=FALSE)
Now we can define a random pareto function. This version is (intentionally) buggy!
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto <- function(n,exponent,threshold) {
x <- vector(length=n)
for (i in 1:n) {
x[i] <- qpareto.4(p=runif(1),exponent=exponent,threshold=threshold)
}
return(x)
}
We can run it and get an error
# If we run this block in Rmarkdown, it will break our document
# Run this in an R session.
rpareto(10)
rpareto(n=10,exponent=2.5,threshold=1)
Compare:
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.1 <- function(n,exponent,threshold) {
x <- vector(length=n)
for (i in 1:n) {
x[i] <- qpareto.4(p=runif(1),exponent=exponent,threshold=threshold)
}
return(x)
}
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.2 <- function(n,exponent,threshold) {
x = replicate(n, qpareto.4(p=runif(1),exponent=exponent,threshold=threshold))
return(x)
}
system.time(rpareto.1(10000,exponent=2.5,threshold=1))
## user system elapsed
## 0.243 0.000 0.243
system.time(rpareto.2(10000,exponent=2.5,threshold=1))
## user system elapsed
## 0.377 0.000 0.409
We could try vectorizing our function better:
#Our `qpareto1` functions can handle vector `p` arguments just fine.
qpareto.4(c(.1,.2,.3), 2.5, 1)
## [1] 1.072766 1.160397 1.268434
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.3 <- function(n,exponent,threshold) {
x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
return(x)
}
system.time(rpareto.3(10000,exponent=2.5,threshold=1))
## user system elapsed
## 0.003 0.000 0.002
Other approaches to removing iteration
use ifelse
to do conditional things to entire vectors
# Calculate Huber's loss function
# Input: vector of numbers x
# Return: x^2 for |x|<1, 2|x|-1 otherwise
huber <- function(x) {
return(ifelse(abs(x) <= 1, x^2, 2*abs(x)-1))
}
x = seq(from=-3,to=3,length.out=100)
plot(x,huber(x))
use matrix operations
# Get some long vectors
n = 100000
a = rnorm(n)
b = rnorm(n)
# Find the inner product with a for loop
system.time({
total = 0
for(i in 1:length(a)){total = total + a[i]*b[i]}
})
## user system elapsed
## 0.140 0.000 0.153
# Find the inner product with matrix operations
system.time({
a %*% b #R handles vector dimensions in a funny way here. Read ?matmult
})
## user system elapsed
## 0.001 0.000 0.001
use vector arithmetic
# Get some long vectors
n = 100000
a = rnorm(n)
b = rnorm(n)
# Find the sum with a for loop, don't preallocate memory
system.time({
total = numeric(0) #Don't actually allocate any space
for(i in 1:length(a)){total[i] = a[i]+b[i]}
})
## user system elapsed
## 21.871 0.016 21.983
# Find the sum with a for loop, allocate memory all at once
system.time({
total = numeric(n) #Allocate space ahead of time
for(i in 1:length(a)){total[i] = a[i]+b[i]}
})
## user system elapsed
## 0.183 0.000 0.183
# Find the inner product with matrix operations
system.time({
a + b #R handles vector dimensions in a funny way here. Read ?matmult
})
## user system elapsed
## 0.000 0.000 0.001
use apply
to apply a function to rows or columns of a matrix. Use sapply
to apply a function to every element of a vector.
#Make a 100x100 matrix of uniform [0,1] variables
X = matrix(runif(10000),nrow=100,ncol=100)
#Find the maximum of each column
column.maxs = apply(X, MARGIN = 2, FUN = max)
#Plot a histogram. Do you know the distribution?
hist(column.maxs,20)
A sidenote on matrix arithmetic and subsetting
When subsetting a matrix, we expect R to return another matrix.
#Make a matrix
X = matrix(rnorm(12),nrow=4,ncol=3)
X
## [,1] [,2] [,3]
## [1,] -1.5156656 -0.7527156 0.27717781
## [2,] 0.3802144 0.8397816 -0.90298396
## [3,] 0.5414839 -0.7360646 0.06724232
## [4,] 0.2960943 -0.7469858 0.45159979
#We would expect the following to all return 3x3 matrices
t(X) %*% X
## [,1] [,2] [,3]
## [1,] 2.8226820 0.8404169 -0.5933096
## [2,] 0.8404169 2.3715927 -1.3537787
## [3,] -0.5933096 -1.3537787 1.1006715
t(X[1:3,]) %*% X[1:3,]
## [,1] [,2] [,3]
## [1,] 2.7350101 1.061595 -0.7270257
## [2,] 1.0615951 1.813605 -1.0164401
## [3,] -0.7270257 -1.016440 0.8967291
t(X[1:2,]) %*% X[1:2,]
## [,1] [,2] [,3]
## [1,] 2.4418053 1.4601622 -0.7634363
## [2,] 1.4601622 1.2718139 -0.9669454
## [3,] -0.7634363 -0.9669454 0.8922076
t(X[1,]) %*% X[1,]
## [,1]
## [1,] 2.940651
What went wrong?
dim(X)
## [1] 4 3
dim(X[1:2,])
## [1] 2 3
dim(X[1,])
## NULL
When we choose a subset with at least one dimension 1, R drops those dimensions! For a 2-D matrix, this becomes a 1-D vector! Many functions treat a vector very differently.
dim(X)
## [1] 4 3
length(X)
## [1] 12
dim(X[,1])
## NULL
length(X[,1])
## [1] 4
To fix this, we can specify that the additional dimension should not be dropped.
dim(X[,1,drop=FALSE])
## [1] 4 1
length(X[,1,drop=FALSE])
## [1] 4
t(X[1,,drop=FALSE]) %*% X[1,,drop=FALSE]
## [,1] [,2] [,3]
## [1,] 2.2972423 1.1408652 -0.42010889
## [2,] 1.1408652 0.5665808 -0.20863607
## [3,] -0.4201089 -0.2086361 0.07682754
sim.ar <- function(n,x1,phi=0,sd=1) {
x <- vector(length=n)
x[1] <- x1
for (t in 2:n) {
x[t] <- phi*x[t-1]+rnorm(1,mean=0,sd=sd)
}
return(x)
}
A function can return one thing. Often you want to return more than one value. - For several values of the same type, you can return a vector - For several different sorts of values, you can return a list. You can even assign names!
For an example of returning a list, we return to the random correlation matrix example
# Function to return the correlation matrix of 5 length-10 Gaussian vectors
random.correlation.matrix = function(){
#Draw a 10x5 matrix of Gaussians
X = matrix(rnorm(50),nrow=10,ncol=5)
#Compute the correlation matrix
cor.matrix = cor(X)
cor.matrix
}
#Function to compute summaries of a correlation matrix
matrix.extremes = function(M){
#Extract the off-diagonal entries
off.diagonal = M[upper.tri(M)]
#Return the largest and smallest off-diagonal entries
return(list(max = max(off.diagonal), min = min(off.diagonal)))
}
matrix.extremes(random.correlation.matrix())
## $max
## [1] 0.6462206
##
## $min
## [1] -0.6484274
replicate(5,matrix.extremes(random.correlation.matrix()))
## [,1] [,2] [,3] [,4] [,5]
## max 0.5685958 0.5498803 0.4050465 0.4095796 0.729471
## min -0.3054157 -0.2361346 -0.5031285 -0.6958686 -0.3911191
For an example of returning a vector, we look at our rpareto function from earlier:
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
# exponent of the distribution (exponent)
# lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.3 <- function(n,exponent,threshold) {
x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
return(x) #If n>1, this is a vector!
}
#This will return a vector of 10 things:
rpareto.3(10,2.5,1)
## [1] 1.022777 1.287552 1.280614 2.908297 2.329300 2.340561 2.386480
## [8] 2.973543 1.347765 2.358582