Data structures tie related values into one object
Functions tie related commands into one object
In both cases: easier to understand, easier to work with, easier to build into larger things
# "Huber" loss function, for outlier-resistant regression
# Inputs: vector of numbers (x)
# Outputs: vector with x^2 for small entries, 2|x|-1 for large ones
psi.1 = function(x) {
  psi = ifelse(x^2 > 1, 2*abs(x)-1, x^2)
  return(psi)
}
Our functions get used just like the built-in ones:
z = c(-0.5,-5,0.9,9)
psi.1(z)
## [1] 0.25 9.00 0.81 17.00
Go back to the declaration and look at the parts:
# "Huber" loss function, for outlier-resistant regression
# Inputs: vector of numbers (x)
# Outputs: vector with x^2 for small entries, 2|x|-1 for large ones
psi.1 = function(x) {
  psi = ifelse(x^2 > 1, 2*abs(x)-1, x^2)
  return(psi)
}
Two interfaces: the inputs or arguments; the outputs or return value
Calls other functions ifelse() and abs(), and the operators ^ and >. Could have also called other functions we’ve written
return() says what the output is. With no explicit return statement, the function just outputs what’s on the last line
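For example, this variant (psi.1b is just an illustrative name) behaves exactly like psi.1, relying on the implicit return:
psi.1b = function(x) {
  # No return(): the value of the last line is the output
  ifelse(x^2 > 1, 2*abs(x)-1, x^2)
}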
Comments: not required by R, but a very good idea! One-line description of purpose; listing of arguments; listing of outputs
# "Hubger" loss function, for outlier-resistant regression
# Inputs: vector of numbers (x), scale for crossover (c)
# Outputs: vector with x^2 for small entries, 2c|x|-c^2 for large ones
psi.2 = function(x, c=1) {
  psi = ifelse(x^2 > c^2, 2*c*abs(x)-c^2, x^2)
  return(psi)
}
psi.1(z)
## [1] 0.25 9.00 0.81 17.00
psi.2(z,1) # Same
## [1] 0.25 9.00 0.81 17.00
Default values get used if arguments are missing:
psi.2(z) # Same
## [1] 0.25 9.00 0.81 17.00
Named arguments can go in any order when explicitly labeled:
psi.2(z,1)
## [1] 0.25 9.00 0.81 17.00
psi.2(z,c=1) # Same
## [1] 0.25 9.00 0.81 17.00
psi.2(x=z,c=1) # Same
## [1] 0.25 9.00 0.81 17.00
psi.2(c=1,x=z) # Same
## [1] 0.25 9.00 0.81 17.00
psi.2(1,z) # Different! Positionally, this sets x=1 and c=z
## [1] -1.25 1.00 0.99 1.00
Odd behavior can occur when arguments are passed that we don’t expect
psi.2(x=z,c=c(1,1,1,10))
## [1] 0.25 9.00 0.81 81.00
psi.2(x=z,c=-1)
## [1] 0.25 -11.00 0.81 -19.00
So we can put a few sanity checks into the code
# "Huber" loss function, for outlier-resistant regression
# Inputs: vector of numbers (x), scale for crossover (c)
# Outputs: vector with x^2 for small entries, 2c|x|-c^2 for large ones
psi.3 = function(x, c=1) {
  # Scale should be a single positive number
  stopifnot(length(c)==1, c>0)
  psi = ifelse(x^2 > c^2, 2*c*abs(x)-c^2, x^2)
  return(psi)
}
Arguments to stopifnot() are a series of expressions which should all be TRUE; execution halts, with an error message, at the first FALSE
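For example, the problematic calls from above now fail fast (error text approximate):
psi.3(x=z, c=c(1,1,1,10)) # error: length(c) == 1 is not TRUE
psi.3(x=z, c=-1)          # error: c > 0 is not TRUE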
Functions have their own environments: arguments and variables assigned inside are local, so assignments inside a function don’t change variables outside it
x = 7
y = c("A","C","G","T","U")
adder = function(y) { x = x+y; return(x) }
adder(1)
## [1] 8
x
## [1] 7
y
## [1] "A" "C" "G" "T" "U"
Functions can still read variables from the outside world when those aren’t defined locally, which is how circle.area finds pi:
circle.area = function(r) { return(pi*r^2) }
circle.area(c(1,2,3))
## [1] 3.141593 12.566371 28.274334
truepi = pi
pi = 3 # Valid in 1800s Indiana
circle.area(c(1,2,3))
## [1] 3 12 27
pi = truepi # Restore sanity
circle.area(c(1,2,3))
## [1] 3.141593 12.566371 28.274334
Interfaces mark out a controlled inner environment for our code
Interact with the rest of the system only at the interface
Advice: the arguments should explicitly give the function all the information it needs
- Reduces risk of confusion and error
- Exception: true universals like \(\pi\)
Likewise, output should only be through the return value
More about breaking up tasks and about environments later
Fact: bigger cities tend to produce more economically per capita
A proposed statistical model (Geoffrey West and others): \[ Y = y_0 N^{a} + \mathrm{noise} \] where \(Y\) is the per-capita “gross metropolitan product” of a city, \(N\) is its population, and \(y_0\) and \(a\) are parameters
gmp = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp-F15/lectures/gmp.dat")
gmp$pop = gmp$gmp/gmp$pcgmp
plot(gmp$pop, gmp$pcgmp, log="x", xlab="Population",
     ylab="Per-capita economic output ($/person-year)",
     main="US metropolitan areas, 2006")
curve(6611*x^(1/8), add=TRUE, col="blue")
We want to fit the model \[ Y = y_0 N^{a} + \mathrm{noise} \] to some data. Take \(y_0=6611\) for today
Unfortunately there’s no closed-form formula for the best-fitting \(a\), but we can estimate it iteratively: approximate the derivative of the mean squared error (MSE) with respect to \(a\), and move \(a\) a small step in the opposite direction. This is gradient descent
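In symbols, each pass of the loop below makes the update \[ a \leftarrow a - \eta \, \frac{\mathrm{MSE}(a+h) - \mathrm{MSE}(a)}{h} \] where \(h\) is deriv.step and \(\eta\) is step.scale in the code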
An actual first attempt at code:
maximum.iterations = 100
deriv.step = 1/1000
step.scale = 1e-12
stopping.deriv = 1/100
iteration = 0
deriv = Inf
a = 0.15
while ((iteration < maximum.iterations) &&
       (deriv > stopping.deriv)) {
  iteration = iteration + 1
  mse.1 = mean((gmp$pcgmp - 6611*gmp$pop^a)^2)
  mse.2 = mean((gmp$pcgmp - 6611*gmp$pop^(a+deriv.step))^2)
  deriv = (mse.2 - mse.1)/deriv.step
  a = a - step.scale*deriv
}
list(a=a, iterations=iteration,
     converged=(iteration<maximum.iterations))
## $a
## [1] 0.1258166
##
## $iterations
## [1] 58
##
## $converged
## [1] TRUE
Let’s turn this into a function and then improve it
Second attempt, as a function and with a logic fix: the loop should run while the absolute derivative is large. Testing deriv > stopping.deriv would quit as soon as the derivative went negative, no matter how steep:
estimate.scaling.exponent.1 = function(a) {
  maximum.iterations = 100
  deriv.step = 1/1000
  step.scale = 1e-12
  stopping.deriv = 1/100
  y0 = 6611 # fixed for now; defined here so the fit can report it
  iteration = 0
  deriv = Inf
  while ((iteration < maximum.iterations) &&
         (abs(deriv) > stopping.deriv)) {
    iteration = iteration + 1
    mse.1 = mean((gmp$pcgmp - y0*gmp$pop^a)^2)
    mse.2 = mean((gmp$pcgmp - y0*gmp$pop^(a+deriv.step))^2)
    deriv = (mse.2 - mse.1)/deriv.step
    a = a - step.scale*deriv
  }
  fit = list(a=a, y0=y0, iterations=iteration,
             converged=(iteration<maximum.iterations))
  return(fit)
}
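Now the whole search is a single call (compare with the loop’s result above):
estimate.scaling.exponent.1(a=0.15)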
All those magic numbers are bad! Let’s make them defaults
estimate.scaling.exponent.2 = function(a, y0=6611,
    maximum.iterations=100, deriv.step=0.001,
    step.scale=1e-12, stopping.deriv=0.01) {
  iteration = 0
  deriv = Inf
  while ((iteration < maximum.iterations) &&
         (abs(deriv) > stopping.deriv)) {
    iteration = iteration + 1
    mse.1 = mean((gmp$pcgmp - y0*gmp$pop^a)^2)
    mse.2 = mean((gmp$pcgmp - y0*gmp$pop^(a+deriv.step))^2)
    deriv = (mse.2 - mse.1)/deriv.step
    a = a - step.scale*deriv
  }
  fit = list(a=a, y0=y0, iterations=iteration,
             converged=(iteration<maximum.iterations))
  return(fit)
}
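With everything exposed as a default, any single setting can now be overridden by name without touching the rest, e.g.
estimate.scaling.exponent.2(a=0.15, stopping.deriv=0.001) # tighter stopping criterion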
Why type out the same calculation of the MSE twice? Let’s create a function for this purpose
mse = function(a, y0, Y, N) { mean((Y-y0*N^a)^2) }
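The helper can be sanity-checked on its own (its value depends on the data, so none is shown here):
mse(0.15, 6611, gmp$pcgmp, gmp$pop) # MSE at the starting guess a = 0.15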
estimate.scaling.exponent.3 = function(a, y0=6611,
    maximum.iterations=100, deriv.step=0.001,
    step.scale=1e-12, stopping.deriv=0.01) {
  iteration = 0
  deriv = Inf
  while ((iteration < maximum.iterations) &&
         (abs(deriv) > stopping.deriv)) {
    iteration = iteration + 1
    deriv = (mse(a+deriv.step,y0,gmp$pcgmp,gmp$pop) -
             mse(a,y0,gmp$pcgmp,gmp$pop)) / deriv.step
    a = a - step.scale*deriv
  }
  fit = list(a=a, y0=y0, iterations=iteration,
             converged=(iteration<maximum.iterations))
  return(fit)
}
We’re locked in to using specific columns of gmp; we shouldn’t have to re-write code just to compare two data sets. Let’s make more arguments, with defaults
estimate.scaling.exponent.4 = function(a, y0=6611,
    Y=gmp$pcgmp, N=gmp$pop,
    maximum.iterations=100, deriv.step=0.001,
    step.scale=1e-12, stopping.deriv=0.01) {
  iteration = 0
  deriv = Inf
  while ((iteration < maximum.iterations) &&
         (abs(deriv) > stopping.deriv)) {
    iteration = iteration + 1
    deriv = (mse(a+deriv.step,y0,Y,N) -
             mse(a,y0,Y,N)) / deriv.step
    a = a - step.scale*deriv
  }
  fit = list(a=a, y0=y0, iterations=iteration,
             converged=(iteration<maximum.iterations))
  return(fit)
}
The final code is shorter, clearer, more flexible, and more re-usable
Exercises:
- Run the code with the default values to get an estimate of \(a\); plot the curve along with the data points
- Randomly remove one data point—how much does the estimate change?
- Run the code from multiple starting points—how different are the estimates of \(a\)?
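The first exercise is worked below. For the second, here is a minimal sketch (assuming gmp is loaded; i is just an illustrative name):
i = sample(nrow(gmp), 1) # pick one row at random to drop
estimate.scaling.exponent.4(0.1, Y=gmp$pcgmp[-i], N=gmp$pop[-i])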
plm = estimate.scaling.exponent.4(0.1)
plm
## $a
## [1] 0.1258166
##
## $y0
## [1] 6611
##
## $iterations
## [1] 62
##
## $converged
## [1] TRUE
plot(gmp$pop, gmp$pcgmp, log="x", xlab="Population",
     ylab="Per-capita economic output ($/person-year)",
     main="US metropolitan areas, 2006")
curve(6611*x^plm$a, add=TRUE, col="blue")
We already wrote code to plot this above … we’ve just copied and pasted it. What should we do if we were writing a report and needed to make many such plots (on, say, different data sets)?
Yes, that’s right: write a function to make these kinds of plots!
plot.plm = function(plm, curve.col="blue", log="x",
                    Y=gmp$pcgmp, N=gmp$pop, ...) {
  # Extract the parameters
  a = plm$a
  y0 = plm$y0
  # Plot the data
  plot(N, Y, log=log, ...)
  # Draw the curve
  f = function(x) { return(y0*x^a) }
  curve(f(x), add=TRUE, col=curve.col)
  invisible(TRUE)
}
The ... is a catch-all for any arguments the user wants to pass to the plot() function (e.g., xlab and ylab)
The function silently returns TRUE (hence the invisible(), instead of a return())
plot.plm(plm)
plot.plm(plm, curve.col="red",
         ylab="Per-capita economic output ($/person-year)",
         main="US metropolitan areas, 2006",
         pch=19, col="gray")
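To see the invisible return value, assign it and then print (res is just an illustrative name):
res = plot.plm(plm) # draws the plot; prints nothing
res # but the value is there
## [1] TRUE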