The apply family

R offers a family of apply functions, which allow you to apply a function across different chunks of data. This offers an alternative to explicit iteration using a for() loop. It can be simpler and faster than a for() loop, though not always
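
As a small illustration of the contrast, the sketch below computes row sums of a matrix twice: once with an explicit for() loop, and once with apply(). The matrix m and the variable names row.sums.loop, row.sums.apply are made up for this example

```r
# A small 3 x 4 matrix, filled column by column with 1, ..., 12
m = matrix(1:12, nrow=3, ncol=4)

# Explicit iteration: preallocate the result, then loop over the rows
row.sums.loop = numeric(nrow(m))
for (i in 1:nrow(m)) {
  row.sums.loop[i] = sum(m[i, ])
}

# Same computation with apply(): MARGIN=1 means one call to sum() per row
row.sums.apply = apply(m, MARGIN=1, FUN=sum)

row.sums.loop  # 22 26 30
row.sums.apply # 22 26 30
```

Both versions give the same numbers; the apply() version just hides the bookkeeping (preallocation and indexing) that the loop spells out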

Below is a summary. We’ll cover apply() today, and the rest next time

apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply(): same as the above, but simplify the output (if possible)
tapply(): apply a function to levels of a factor vector

`apply()`, rows or columns of a matrix or data frame

The apply() function takes inputs of the following form:

apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x
apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x

x = matrix(rnorm(9), 3, 3) # Create a 3 x 3 matrix of random normals
x

##            [,1]      [,2]       [,3]
## [1,]  1.7999350  1.529366  0.7187638
## [2,] -0.2829916 -0.319885 -1.9895921
## [3,]  0.3861865  1.944046 -0.7270550

apply(x, MARGIN=1, FUN=min) # Smallest entry in each row

## [1]  0.7187638 -1.9895921 -0.7270550

apply(x, MARGIN=1, FUN=sum) # Sum of entries in each row

## [1]  4.048065 -2.592469  1.603177

(Continued)

head(state.x77) # Matrix of states data, 50 states x 8 variables

##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column

## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##    21198.0     6315.0        2.8       73.6       15.1       67.3 
##      Frost       Area 
##      188.0   566432.0

apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column

## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##          5          2         18         11          1         44 
##      Frost       Area 
##         28          2

apply(state.x77, MARGIN=2, FUN=summary) # Summary of each col, get back matrix!

##         Population Income Illiteracy Life Exp Murder HS Grad  Frost   Area
## Min.           365   3098      0.500    67.96  1.400   37.80   0.00   1049
## 1st Qu.       1080   3993      0.625    70.12  4.350   48.05  66.25  36990
## Median        2838   4519      0.950    70.68  6.850   53.25 114.50  54280
## Mean          4246   4436      1.170    70.88  7.378   53.11 104.50  70740
## 3rd Qu.       4968   4814      1.575    71.89 10.680   59.15 139.80  81160
## Max.         21200   6315      2.800    73.60 15.100   67.30 188.00 566400

Applying a custom function

For a custom function, we can just define it beforehand, and then use apply() as usual

# Our custom function
my.fun = function(v) {
  q1 = quantile(v, prob=0.1)
  q2 = quantile(v, prob=0.9)
  cat(paste("The 0.1 quantile is", q1, "! "))
  cat(paste("The 0.9 quantile is", q2, "!\n"))
  v.mean = mean(v) # Regular mean
  v.trimmed.mean = mean(v[q1 <= v & v <= q2]) # Trimmed mean!
  c(v.mean, v.trimmed.mean)
}

mat = apply(state.x77, MARGIN=2, FUN=my.fun) # We get back a matrix

## The 0.1 quantile is 632.3 ! The 0.9 quantile is 10781.2 !
## The 0.1 quantile is 3623.3 ! The 0.9 quantile is 5117.5 !
## The 0.1 quantile is 0.6 ! The 0.9 quantile is 2.11 !
## The 0.1 quantile is 69.048 ! The 0.9 quantile is 72.582 !
## The 0.1 quantile is 2.67 ! The 0.9 quantile is 11.66 !
## The 0.1 quantile is 40.96 ! The 0.9 quantile is 62.96 !
## The 0.1 quantile is 20 ! The 0.9 quantile is 168.4 !
## The 0.1 quantile is 7795.5 ! The 0.9 quantile is 114216.5 !

mat # First row is the mean, second row is the trimmed mean

##      Population   Income Illiteracy Life Exp Murder HS Grad    Frost
## [1,]   4246.420 4435.800    1.17000 70.87860 7.3780 53.1080 104.4600
## [2,]   3384.275 4430.075    1.07381 70.91775 7.2975 53.3375 104.6829
##          Area
## [1,] 70735.88
## [2,] 56575.72

Applying a custom function “on-the-fly”

Instead of defining a custom function beforehand, we can just define it “on-the-fly”. Sometimes this is more convenient

# Compute trimmed means, defining custom function on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) { 
  q1 = quantile(v, prob=0.1)
  q2 = quantile(v, prob=0.9)
  mean(v[q1 <= v & v <= q2]) # Trimmed mean!
})

##  Population      Income  Illiteracy    Life Exp      Murder     HS Grad 
##  3384.27500  4430.07500     1.07381    70.91775     7.29750    53.33750 
##       Frost        Area 
##   104.68293 56575.72500

Applying a function that takes extra arguments

Sometimes we want to apply a function over rows or columns of a matrix, where the function takes extra arguments (besides the row or column itself). We can pass these as additional inputs to apply(), after FUN, as in: apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2), for two extra arguments extra.arg.1, extra.arg.2 to be passed to my.fun()

# Function that gets indices of the biggest 3 entries of v, then returns the
# corresponding 3 elements of names.v
top.3.names = function(v, names.v) { names.v[order(v, decreasing=TRUE)[1:3]] }
# Now we'll run this function on each column of state.x77. Note: here v will
# be a column, and for names.v, we'll pass in rownames(state.x77), i.e., the
# state names
apply(state.x77, MARGIN=2, FUN=top.3.names, names.v=rownames(state.x77))

##      Population   Income        Illiteracy       Life Exp    Murder     
## [1,] "California" "Alaska"      "Louisiana"      "Hawaii"    "Alabama"  
## [2,] "New York"   "Connecticut" "Mississippi"    "Minnesota" "Georgia"  
## [3,] "Texas"      "Maryland"    "South Carolina" "Utah"      "Louisiana"
##      HS Grad  Frost           Area        
## [1,] "Utah"   "Nevada"        "Alaska"    
## [2,] "Alaska" "North Dakota"  "Texas"     
## [3,] "Nevada" "New Hampshire" "California"
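
Extra arguments work the same way with built-in functions. A minimal sketch, assuming we want the 0.1 and 0.9 quantiles of each column: apply() forwards probs to quantile(), and likewise forwards na.rm to mean(). The names qs and m, and the small matrix with a missing value, are made up for illustration

```r
# Forward probs= to quantile(); each column yields 2 values, so we get back
# a 2 x 8 matrix (one column per variable in state.x77)
qs = apply(state.x77, MARGIN=2, FUN=quantile, probs=c(0.1, 0.9))
dim(qs) # 2 8

# Made-up 2 x 3 matrix with a missing value, to show forwarding na.rm=
m = matrix(c(1, NA, 3, 4, 5, 6), nrow=2, ncol=3)
m.plain = apply(m, MARGIN=1, FUN=mean)             # Second row gives NA
m.narm = apply(m, MARGIN=1, FUN=mean, na.rm=TRUE)  # NA is dropped: 3 5
```

Naming the extra arguments (probs=, na.rm=) is optional but safer than relying on position, since it makes clear which argument of the applied function each value is meant for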

What’s the return value?

What kind of data type will apply() give us? It depends on what function we pass. In summary, say with FUN=my.fun:

If my.fun() returns a single value, then apply() will return a vector
If my.fun() returns k values, then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN=1 or MARGIN=2)
If my.fun() returns different length output for different inputs, then apply() will return a list
If my.fun() returns a list, then apply() will return a list
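
The four cases above can be checked on a small example. This is just a sketch: the matrix m and the result names r1, ..., r5 are made up for illustration

```r
# Made-up 2 x 3 matrix: rows are (1, 3, 5) and (2, 4, 6)
m = matrix(1:6, nrow=2, ncol=3)

# Single value per row: we get back a vector of length 2
r1 = apply(m, MARGIN=1, FUN=sum)
is.vector(r1)

# k = 2 values per row: matrix with 2 rows, and one column per row of m
r2 = apply(m, MARGIN=1, FUN=range)
dim(r2) # 2 2

# k = 2 values per column: still 2 rows, now one column per column of m
r3 = apply(m, MARGIN=2, FUN=range)
dim(r3) # 2 3

# Different length outputs (0, 2, 2 here): we get back a list
r4 = apply(m, MARGIN=2, FUN=function(v) { v[v > 2] })
is.list(r4)

# Function itself returns a list: we get back a list
r5 = apply(m, MARGIN=1, FUN=function(v) { list(total=sum(v), biggest=max(v)) })
is.list(r5)
```

Note the consequence of the second and third cases: with MARGIN=1, the results for each row end up as columns of the output, so you may need t() to get them back as rows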

Optimized functions for special tasks

Don’t overuse the apply paradigm! There are lots of special functions that are optimized, and will be both simpler and faster than using apply(). E.g.,

rowSums(), colSums(): for computing row, column sums of a matrix
rowMeans(), colMeans(): for computing row, column means of a matrix
max.col(): for finding the maximum position in each row of a matrix

Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?

x # Recall our 3 x 3 matrix from before

##            [,1]      [,2]       [,3]
## [1,]  1.7999350  1.529366  0.7187638
## [2,] -0.2829916 -0.319885 -1.9895921
## [3,]  0.3861865  1.944046 -0.7270550

# DON'T do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { sum(v>0) })

## [1] 3 0 2

# DO do this (much faster, simpler for big matrices)
rowSums(x > 0)

## [1] 3 0 2
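
The same kind of comparison can be made for the other optimized functions listed above. A rough sketch, using a made-up 1000 x 50 matrix called big; the timings will vary by machine, so none are shown here

```r
# Made-up 1000 x 50 matrix of random normals
set.seed(1)
big = matrix(rnorm(1000 * 50), nrow=1000, ncol=50)

# Row means both ways: results agree up to floating point error
same.means = all.equal(apply(big, MARGIN=1, FUN=mean), rowMeans(big))

# Position of the max in each row both ways. By default max.col() breaks ties
# at random; ties.method="first" matches the behavior of which.max()
same.pos = all(apply(big, MARGIN=1, FUN=which.max) ==
               max.col(big, ties.method="first"))

# Rough timing comparison, repeating each computation 20 times
time.apply = system.time(for (i in 1:20) apply(big, MARGIN=1, FUN=mean))
time.fast = system.time(for (i in 1:20) rowMeans(big))
```

Printing time.apply and time.fast shows the elapsed times for the two approaches on your own machine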
