Statistical Computing, 36-350
Wednesday October 12, 2016
R offers a family of apply functions, which allow you to apply a function across different chunks of data. They offer an alternative to explicit iteration with a for() loop, and can be simpler and faster than a for() loop, though not always
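To make the comparison with explicit iteration concrete, here is a minimal sketch that computes row maxima twice, once with a for() loop and once with apply(). The matrix m and the names row.max.loop and row.max.apply are made up for illustration only.

```r
# Hypothetical 4 x 3 matrix, used only for illustration (filled column by column)
m = matrix(c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8), nrow=4, ncol=3)

# Explicit iteration: preallocate the result, then fill it in row by row
row.max.loop = numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  row.max.loop[i] = max(m[i, ])
}

# Same computation with apply(): MARGIN=1 means "work over rows"
row.max.apply = apply(m, MARGIN=1, FUN=max)

row.max.apply                           # 5 9 5 8
identical(row.max.loop, row.max.apply)  # TRUE
```

The apply() version is shorter and needs no index bookkeeping; whether it is also faster depends on the function and the size of the data.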
Below is a summary. We’ll cover apply() today, and the rest next time
apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply(): same as the above, but simplify the output (if possible)
tapply(): apply a function to levels of a factor vector

apply(), rows or columns of a matrix or data frame

The apply() function takes inputs of the following form:
apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x
apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x
x = matrix(rnorm(9), 3, 3) # Create a 3 x 3 matrix of random normals
x
## [,1] [,2] [,3]
## [1,] 1.7999350 1.529366 0.7187638
## [2,] -0.2829916 -0.319885 -1.9895921
## [3,] 0.3861865 1.944046 -0.7270550
apply(x, MARGIN=1, FUN=min) # Smallest entry in each row
## [1] 0.7187638 -1.9895921 -0.7270550
apply(x, MARGIN=1, FUN=sum) # Sum of entries in each row
## [1] 4.048065 -2.592469 1.603177
head(state.x77) # Matrix of states data, 50 states x 8 variables
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 21198.0 6315.0 2.8 73.6 15.1 67.3
## Frost Area
## 188.0 566432.0
apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 5 2 18 11 1 44
## Frost Area
## 28 2
apply(state.x77, MARGIN=2, FUN=summary) # Summary of each col, get back matrix!
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Min. 365 3098 0.500 67.96 1.400 37.80 0.00 1049
## 1st Qu. 1080 3993 0.625 70.12 4.350 48.05 66.25 36990
## Median 2838 4519 0.950 70.68 6.850 53.25 114.50 54280
## Mean 4246 4436 1.170 70.88 7.378 53.11 104.50 70740
## 3rd Qu. 4968 4814 1.575 71.89 10.680 59.15 139.80 81160
## Max. 21200 6315 2.800 73.60 15.100 67.30 188.00 566400
For a custom function, we can just define it beforehand, and then use apply() as usual
# Our custom function
my.fun = function(v) {
  q1 = quantile(v, probs=0.1) # 0.1 quantile of v
  q2 = quantile(v, probs=0.9) # 0.9 quantile of v
  cat(paste("The 0.1 quantile is", q1, "! "))
  cat(paste("The 0.9 quantile is", q2, "!\n"))
  v.mean = mean(v) # Regular mean
  v.trimmed.mean = mean(v[q1 <= v & v <= q2]) # Trimmed mean!
  c(v.mean, v.trimmed.mean)
}
mat = apply(state.x77, MARGIN=2, FUN=my.fun) # We get back a matrix
## The 0.1 quantile is 632.3 ! The 0.9 quantile is 10781.2 !
## The 0.1 quantile is 3623.3 ! The 0.9 quantile is 5117.5 !
## The 0.1 quantile is 0.6 ! The 0.9 quantile is 2.11 !
## The 0.1 quantile is 69.048 ! The 0.9 quantile is 72.582 !
## The 0.1 quantile is 2.67 ! The 0.9 quantile is 11.66 !
## The 0.1 quantile is 40.96 ! The 0.9 quantile is 62.96 !
## The 0.1 quantile is 20 ! The 0.9 quantile is 168.4 !
## The 0.1 quantile is 7795.5 ! The 0.9 quantile is 114216.5 !
mat # First row is the mean, second row is the trimmed mean
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## [1,] 4246.420 4435.800 1.17000 70.87860 7.3780 53.1080 104.4600
## [2,] 3384.275 4430.075 1.07381 70.91775 7.2975 53.3375 104.6829
## Area
## [1,] 70735.88
## [2,] 56575.72
Instead of defining a custom function beforehand, we can just define it “on-the-fly”. Sometimes this is more convenient
# Compute trimmed means, defining custom function on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) {
  q1 = quantile(v, probs=0.1)
  q2 = quantile(v, probs=0.9)
  mean(v[q1 <= v & v <= q2]) # Trimmed mean!
})
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
Sometimes we want to apply a function over rows or columns of a matrix, and that function takes extra arguments (besides the row or column itself). We can pass these as inputs to apply(), as in: apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2), for two extra arguments extra.arg.1, extra.arg.2 to be passed to my.fun()
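The same mechanism works for built-in functions, not only our own. As a minimal sketch, the call below forwards the probs argument of base R's quantile() through apply(); the name q and the particular probabilities are chosen only for illustration.

```r
# quantile() takes an extra argument probs; apply() forwards it to every call
q = apply(state.x77, MARGIN=2, FUN=quantile, probs=c(0.1, 0.9))

dim(q)       # 2 8: one row per requested quantile, one column per variable
rownames(q)  # "10%" "90%"
```

Passing the extra argument by name (probs=...) rather than by position makes it less likely to be confused with one of apply()'s own arguments.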
# Function that gets indices of the biggest 3 entries of v, then returns the
# corresponding 3 elements of names.v
top.3.names = function(v, names.v) { names.v[order(v, decreasing=TRUE)[1:3]] }
# Now we'll run this function on each column of state.x77. Note: here v will
# be a column, and for names.v, we'll pass in rownames(state.x77), i.e., the
# state names
apply(state.x77, MARGIN=2, FUN=top.3.names, names.v=rownames(state.x77))
## Population Income Illiteracy Life Exp Murder
## [1,] "California" "Alaska" "Louisiana" "Hawaii" "Alabama"
## [2,] "New York" "Connecticut" "Mississippi" "Minnesota" "Georgia"
## [3,] "Texas" "Maryland" "South Carolina" "Utah" "Louisiana"
## HS Grad Frost Area
## [1,] "Utah" "Nevada" "Alaska"
## [2,] "Alaska" "North Dakota" "Texas"
## [3,] "Nevada" "New Hampshire" "California"
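The calls above came back as a vector in some cases and a matrix in others. The following sketch, on a small made-up matrix m (the names m and r1 through r5 are hypothetical), illustrates how the length of what FUN returns determines the shape of the result.

```r
# Hypothetical 3 x 4 matrix, used only for illustration
m = matrix(1:12, nrow=3, ncol=4)

r1 = apply(m, MARGIN=1, FUN=sum)    # One value per row: vector of length 3
r2 = apply(m, MARGIN=1, FUN=range)  # Two values per row: matrix with 2 rows
r3 = apply(m, MARGIN=2, FUN=range)  # Two values per column: still 2 rows
r4 = apply(m, MARGIN=1, FUN=function(v) { v[v > 5] })  # Varying lengths: list
r5 = apply(m, MARGIN=1, FUN=function(v) { list(lo=min(v), hi=max(v)) })  # Lists: list

is.vector(r1)  # TRUE
dim(r2)        # 2 3
dim(r3)        # 2 4
is.list(r4)    # TRUE
is.list(r5)    # TRUE
```

Note that r2 has one column per row of m, so results computed over rows come back transposed relative to the input; t(r2) puts them back in row order if needed.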
What kind of data type will apply() give us? It depends on what function we pass. In summary, say, with FUN=my.fun:
If my.fun() returns a single value, then apply() will return a vector
If my.fun() returns k values (k > 1), then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN=1 or MARGIN=2)
If my.fun() returns different length output for different inputs, then apply() will return a list
If my.fun() returns a list, then apply() will return a list

Don’t overuse the apply paradigm! There are lots of special functions that are optimized and will be both simpler and faster than using apply(). E.g.,
rowSums(), colSums(): for computing row, column sums of a matrix
rowMeans(), colMeans(): for computing row, column means of a matrix
max.col(): for finding the maximum position in each row of a matrix

Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?
x
## [,1] [,2] [,3]
## [1,] 1.7999350 1.529366 0.7187638
## [2,] -0.2829916 -0.319885 -1.9895921
## [3,] 0.3861865 1.944046 -0.7270550
# DON'T do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { sum(v>0) })
## [1] 3 0 2
# DO do this (much faster, simpler for big matrices)
rowSums(x > 0)
## [1] 3 0 2
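The same pattern holds for the other special functions listed above. Here is a minimal sketch checking that they agree with their apply() counterparts, followed by a rough timing comparison. The matrices m and big, the seed, and the sizes are arbitrary choices for illustration, and the timings will vary from machine to machine.

```r
set.seed(1)  # Arbitrary seed, only so the sketch is reproducible
m = matrix(rnorm(20), nrow=5, ncol=4)  # Hypothetical 5 x 4 matrix

# Special functions agree with their apply() counterparts
all.equal(rowMeans(m), apply(m, MARGIN=1, FUN=mean))
all.equal(colSums(m), apply(m, MARGIN=2, FUN=sum))

# max.col() breaks ties at random by default; ties.method="first" makes it
# comparable to which.max(), which returns the first maximum
identical(max.col(m, ties.method="first"), apply(m, MARGIN=1, FUN=which.max))

# Rough timing comparison on a bigger (hypothetical) 1000 x 1000 matrix
big = matrix(rnorm(1e6), nrow=1000)
system.time(apply(big, MARGIN=1, function(v) { sum(v > 0) }))
system.time(rowSums(big > 0))
```

All three comparisons should come back TRUE here, since random normal draws almost never produce ties.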