Style

A treat: The following is code from a prominent statistician:

### CENSORED TO PROTECT THE INNOCENT ###

Functions

When should we write functions?

Example: From Solutions to Homework 1. You were asked to make several maps of your residuals. The following function was modified from last term.

# Set up a function to make maps
  # "Terrain" color-levels set based on quantiles of the variable being plotted
# Inputs: vector to be mapped over the data frame; number of levels to
  # use for colors; name of the longitude coordinate; name of the lattitude
  # coordinate; data frame to take coordinates from; other plotting arguments
# Outputs: invisibly, list giving cut-points and the level each observation
  # was assigned
mapper <- function(z, levels, x="LONGITUDE", y="LATITUDE", data,
                   legend.loc="topright", ncol=1, ...) {
    # Which quantiles do we need?
    probs <- seq(from=0, to=1, length.out=(levels+1))
    # What are those quantiles?
    z.quantiles <- quantile(z, probs)
    # Assign each observation to its quantile
    z.categories <- cut(z, z.quantiles, include.lowest=TRUE)
    # Make up a color scale
    shades <- terrain.colors(levels)
    plot(x=data[,x], y=data[,y], col=shades[z.categories], ...)
    legend(x=legend.loc, legend=levels(z.categories), col=shades,
           ncol=ncol, pch=19)
    invisible(list(quantiles=z.quantiles, categories=z.categories))
}

Chaining functions together

Example I work on a statistics problem in inference for networks/graphs. Suppose that I want to run a hypothesis test about an edge in the graph. I might need:

This is helpful for many reasons:

Calling functions from other functions

It is often convenient to call functions from other functions, as we build functionality. As a simple example, we could have avoided replicating the quantile code from qpareto.1 in qpareto.2 as follows.

# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
  # flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.3 <- function(p, exponent, threshold, lower.tail=TRUE) {
  if(lower.tail==FALSE) {
    p <- 1-p
  }
  q <- qpareto.1(p, exponent, threshold)
  return(q)
}

This calls qpareto.1 to do the actual calculation, after making corrections for the lower.tail argument.

This is very useful! You will often be working with several intertwined functions at the same time. We should be very careful to understand what happens.

A very simple example:

# A very simple function that adds one to its argument
add.one = function(x){
  x+1
}

# A function that calls add.one to do its addition
add.stuff = function(x){
  return(add.one(x))
}

# See? We can add!
add.stuff(3)
## [1] 4
# A terrible job rewriting add.one
add.one = function(x){
  x+2
}

#Now what?
add.stuff(3)
## [1] 5

The add.stuff function looks for add.one each time it is called. If add.one has changed, it uses the new version of the function. It does not save a snapshot at the time add.stuff is defined.

This is great, since it lets you easily edit all your functions and have the changes propagate. However, you do need to be aware and careful.

Debugging

Eventually, you will break something.

Example: Suppose that I want to sample a Pareto random variable. I know that if \(F(x)\) is the CDF for a distribution and \(U\) is a uniform random variable, then \(F^{-1}(U)\) has the F distribution.

First, for good practice, define a version of the quantile function that validates inputs

# Calculate quantiles of the Pareto distribution
# Inputs: desired quantile (p)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
  # flag for whether to give lower or upper quantiles (lower.tail)
# Outputs: the pth quantile
qpareto.4 <- function(p, exponent, threshold, lower.tail=TRUE) {
  #browser()
  stopifnot(p >= 0, p <= 1, exponent > 1, threshold > 0)
  q <- qpareto.3(p,exponent,threshold,lower.tail)
  return(q)
}
#Try it with a bad input
qpareto.4(p=0.92,exponent=2.5,threshold=-1,lower.tail=FALSE)

Now we can define a random pareto function. This version is (intentionally) buggy!

# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto <- function(n,exponent,threshold) {
  x <- vector(length=n)
  for (i in 1:n) {
    x[i] <- qpareto.4(p=runif(1),exponent=exponent,threshold=threshold)
  }
  return(x)
}

We can run it and get an error

# If we run this block in Rmarkdown, it will break our document
# Run this in an R session.
rpareto(10)
rpareto(n=10,exponent=2.5,threshold=1)

Avoid iteration

Compare:

# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.1 <- function(n,exponent,threshold) {
  x <- vector(length=n)
  for (i in 1:n) {
    x[i] <- qpareto.4(p=runif(1),exponent=exponent,threshold=threshold)
  }
  return(x)
}

# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.2 <- function(n,exponent,threshold) {
  x = replicate(n, qpareto.4(p=runif(1),exponent=exponent,threshold=threshold))
  return(x)
}

system.time(rpareto.1(10000,exponent=2.5,threshold=1))
##    user  system elapsed 
##   0.243   0.000   0.243
system.time(rpareto.2(10000,exponent=2.5,threshold=1))
##    user  system elapsed 
##   0.377   0.000   0.409

We could try vectorizing our function better:

#Our `qpareto1` functions can handle vector `p` arguments just fine.
qpareto.4(c(.1,.2,.3), 2.5, 1)
## [1] 1.072766 1.160397 1.268434
# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.3 <- function(n,exponent,threshold) {
  x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
  return(x)
}
system.time(rpareto.3(10000,exponent=2.5,threshold=1))
##    user  system elapsed 
##   0.003   0.000   0.002

Other approaches to removing iteration

A sidenote on matrix arithmetic and subsetting

When subsetting a matrix, we expect R to return another matrix.

#Make a matrix
X = matrix(rnorm(12),nrow=4,ncol=3)
X
##            [,1]       [,2]        [,3]
## [1,] -1.5156656 -0.7527156  0.27717781
## [2,]  0.3802144  0.8397816 -0.90298396
## [3,]  0.5414839 -0.7360646  0.06724232
## [4,]  0.2960943 -0.7469858  0.45159979
#We would expect the following to all return 3x3 matrices
t(X) %*% X
##            [,1]       [,2]       [,3]
## [1,]  2.8226820  0.8404169 -0.5933096
## [2,]  0.8404169  2.3715927 -1.3537787
## [3,] -0.5933096 -1.3537787  1.1006715
t(X[1:3,]) %*% X[1:3,]
##            [,1]      [,2]       [,3]
## [1,]  2.7350101  1.061595 -0.7270257
## [2,]  1.0615951  1.813605 -1.0164401
## [3,] -0.7270257 -1.016440  0.8967291
t(X[1:2,]) %*% X[1:2,]
##            [,1]       [,2]       [,3]
## [1,]  2.4418053  1.4601622 -0.7634363
## [2,]  1.4601622  1.2718139 -0.9669454
## [3,] -0.7634363 -0.9669454  0.8922076
t(X[1,]) %*% X[1,]
##          [,1]
## [1,] 2.940651

What went wrong?

    dim(X)
## [1] 4 3
    dim(X[1:2,])
## [1] 2 3
    dim(X[1,])
## NULL

When we choose a subset with at least one dimension 1, R drops those dimensions! For a 2-D matrix, this becomes a 1-D vector! Many functions treat a vector very differently.

dim(X)
## [1] 4 3
length(X)
## [1] 12
dim(X[,1])
## NULL
length(X[,1])
## [1] 4

To fix this, we can specify that the additional dimension should not be dropped.

dim(X[,1,drop=FALSE])
## [1] 4 1
length(X[,1,drop=FALSE])
## [1] 4
t(X[1,,drop=FALSE]) %*% X[1,,drop=FALSE]
##            [,1]       [,2]        [,3]
## [1,]  2.2972423  1.1408652 -0.42010889
## [2,]  1.1408652  0.5665808 -0.20863607
## [3,] -0.4201089 -0.2086361  0.07682754

Do not be afraid of iteration

sim.ar <- function(n,x1,phi=0,sd=1) {
  x <- vector(length=n)
  x[1] <- x1
  for (t in 2:n) {
    x[t] <- phi*x[t-1]+rnorm(1,mean=0,sd=sd)
  }
  return(x)
}

Returning vectors and lists

A function can return one thing. Often you want to return more than one value. - For several values of the same type, you can return a vector - For several different sorts of values, you can return a list. You can even assign names!

For an example of returning a list, we return to the random correlation matrix example

# Function to return the correlation matrix of 5 length-10 Gaussian vectors
random.correlation.matrix = function(){
  #Draw a 10x5 matrix of Gaussians
  X = matrix(rnorm(50),nrow=10,ncol=5)
  #Compute the correlation matrix
  cor.matrix = cor(X)
  cor.matrix
}
#Function to compute summaries of a correlation matrix
matrix.extremes = function(M){
  #Extract the off-diagonal entries
  off.diagonal = M[upper.tri(M)]
  #Return the largest and smallest off-diagonal entries
  return(list(max = max(off.diagonal), min = min(off.diagonal)))
}
matrix.extremes(random.correlation.matrix())
## $max
## [1] 0.6462206
## 
## $min
## [1] -0.6484274
replicate(5,matrix.extremes(random.correlation.matrix()))
##     [,1]       [,2]       [,3]       [,4]       [,5]      
## max 0.5685958  0.5498803  0.4050465  0.4095796  0.729471  
## min -0.3054157 -0.2361346 -0.5031285 -0.6958686 -0.3911191

For an example of returning a vector, we look at our rpareto function from earlier:

# Generate random numbers from the Pareto distribution
# Inputs: number of random draws (n)
  # exponent of the distribution (exponent)
  # lower threshold of the distribution (threshold)
# Outputs: vector of random numbers
rpareto.3 <- function(n,exponent,threshold) {
  x = qpareto.4(p=runif(n), exponent=exponent, threshold=threshold)
  return(x) #If n>1, this is a vector!
}

#This will return a vector of 10 things:
rpareto.3(10,2.5,1)
##  [1] 1.022777 1.287552 1.280614 2.908297 2.329300 2.340561 2.386480
##  [8] 2.973543 1.347765 2.358582