Statistical Computing, 36-350
Tuesday September 14, 2021
- [ ] and [[ ]])
- if(), else if(), else: standard conditionals
- ifelse(): shortcut for using if() and else in combination
- switch(): shortcut for using if(), else if(), and else in combination
- for(), while(): standard loop constructs
- for() loops, vectorization is your friend!
Data frames
The data frame is the format for the “classic” data table in statistics. Many of the “really statistical” parts of the R programming language presume data frames
Difference between data frames and lists? Each column in a data frame must have the same length (each element in the list can be of different lengths)
Use data.frame(), similar to how we create lists
my.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:6],
                   bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.df
##   nums chars bools
## 1  0.1     a FALSE
## 2  0.2     b FALSE
## 3  0.3     c FALSE
## 4  0.4     d  TRUE
## 5  0.5     e  TRUE
## 6  0.6     f FALSE
# Recall, a list can have different lengths for different elements!
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
               bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
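The length rule can be checked directly. The following is a minimal sketch with made-up toy values (the names ok.list, res, and rec.df are hypothetical, not from the lecture); it also shows one caveat, namely that data.frame() recycles a shorter column when its length divides the longest one evenly.

```r
# A list happily stores elements of different lengths
ok.list = list(nums=1:6, chars=letters[1:4])
lengths(ok.list)

# data.frame() refuses columns whose lengths cannot be reconciled;
# tryCatch() captures the error so that the script keeps running
res = tryCatch(data.frame(nums=1:6, chars=letters[1:4]),
               error=function(e) e)
inherits(res, "error")  # TRUE: 6 rows versus 4 rows

# Caveat: a shorter column is recycled when its length divides the
# longest one evenly, so this call succeeds and gives 6 rows
rec.df = data.frame(nums=1:6, flag=c(TRUE,FALSE))
nrow(rec.df)
```

Because of the recycling caveat, it is worth checking nrow() after building a data frame from vectors of unequal length.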
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## [1] "a" "b" "c" "d" "e" "f"
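The four output lines above look like single columns pulled out of my.df. The calls below are an assumption about how such output could be produced, not the original code: a data frame can be indexed by rows and columns like a matrix, or by column like a list.

```r
# Rebuild a small data frame (the bools column is left out for brevity)
my.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:6])

col.a = my.df[,1]        # By position, as for a matrix
col.b = my.df[,"nums"]   # By column name, as for a matrix
col.c = my.df$nums       # By name, as for a list
col.d = my.df[["nums"]]  # Double brackets, as for a list

# All four routes give the same numeric vector
identical(col.a, col.b) && identical(col.b, col.c) && identical(col.c, col.d)
my.df$chars
```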
Oftentimes it’s helpful to start with a matrix, and add columns (of different data types) to make it a data frame
## [1] "matrix" "array"
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
## [1] "factor"
## [1] South West West South West West
## Levels: Northeast South North Central West
## [1] "factor"
## [1] East South Central Pacific Mountain West South Central
## [5] Pacific Mountain
## 9 Levels: New England Middle Atlantic South Atlantic ... Pacific
# Combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
class(state.df)
## [1] "data.frame"
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
## Region Division
## Alabama South East South Central
## Alaska West Pacific
## Arizona West Mountain
## Arkansas South West South Central
## California West Pacific
## Colorado West Mountain
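Note that the column printed as "Life Exp" in the matrix shows up as "Life.Exp" in the data frame. This is because data.frame() makes column names syntactically valid by default. Here is a small sketch on a hypothetical two-row matrix (m, grp, df1, and df2 are illustrative names only):

```r
# Toy matrix with a column name containing a space (made-up subset)
m = matrix(c(69.05, 69.31, 15.1, 11.3), nrow=2,
           dimnames=list(c("Alabama","Alaska"), c("Life Exp","Murder")))
grp = factor(c("South","West"))

# Default: names are passed through make.names(), so the space becomes "."
df1 = data.frame(m, Region=grp)
names(df1)

# check.names=FALSE keeps the original names, at the cost of needing
# backticks or [[ ]] to access the column, e.g. df2[["Life Exp"]]
df2 = data.frame(m, Region=grp, check.names=FALSE)
names(df2)
```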
To add columns: we can either use data.frame(), or directly define a new named column
# First way: use data.frame() to concatenate on a new column
state.df = data.frame(state.df, Cool=sample(c(TRUE,FALSE), nrow(state.df), replace=TRUE))
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool
## Alabama South East South Central FALSE
## Alaska West Pacific FALSE
## Arizona West Mountain TRUE
## Arkansas South West South Central FALSE
# Second way: just directly define a new named column
state.df$Score = sample(1:100, nrow(state.df), replace=TRUE)
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool Score
## Alabama South East South Central FALSE 50
## Alaska West Pacific FALSE 28
## Arizona West Mountain TRUE 87
## Arkansas South West South Central FALSE 27
To delete columns: we can either use negative integer indexing, or set a column to NULL
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool
## Alabama South East South Central FALSE
## Alaska West Pacific FALSE
## Arizona West Mountain TRUE
## Arkansas South West South Central FALSE
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division
## Alabama South East South Central
## Alaska West Pacific
## Arizona West Mountain
## Arkansas South West South Central
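The code for the two deletion methods is not shown above, so here is a minimal sketch on a hypothetical toy data frame (toy.df and its columns are made up for illustration):

```r
toy.df = data.frame(a=1:3, b=letters[1:3], c=c(TRUE,FALSE,TRUE))

# First way: negative integer indexing drops the last column
toy.df = toy.df[, -ncol(toy.df)]
names(toy.df)

# Second way: set a named column to NULL
toy.df$b = NULL
names(toy.df)

# Setting a column to NULL always leaves a data frame behind
class(toy.df)
```

One design note: if negative indexing with [ , ] leaves only a single column, the result is simplified to a plain vector unless drop=FALSE is passed, whereas setting a column to NULL never changes the class.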
With matrices or data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing
# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state.df[state.df$Division == "New England", "Frost"])
## [1] 145.3333
mean(state.df[state.df$Division == "Pacific", "Frost"])
## [1] 49.6
subset(): extract rows based on a condition
The subset() function provides a convenient alternative way of accessing rows for data frames
# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state.df.ne.1 = subset(state.df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
## [1] TRUE
# Same calculation as in the last slide, using subset()
mean(subset(state.df, Division == "New England")$Frost)
## [1] 145.3333
mean(subset(state.df, Division == "Pacific")$Frost)
## [1] 49.6
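For a version that runs without state.df, the same comparison can be sketched on a hypothetical toy data frame (toy.df, by.subset, and by.index are illustrative names). The select argument of subset(), which the slides do not use, picks columns in the same call.

```r
toy.df = data.frame(grp=c("a","b","a","b"), val=c(10,20,30,40), other=1:4)

# Row extraction: subset() versus Boolean indexing
by.subset = subset(toy.df, grp == "a")
by.index = toy.df[toy.df$grp == "a", ]
all(by.subset == by.index)

# select= keeps only the named columns (again, no $ or quotes needed)
subset(toy.df, grp == "a", select=c(grp, val))

# Mean of val within group "a": (10 + 30) / 2
grp.a.mean = mean(subset(toy.df, grp == "a")$val)
grp.a.mean
```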
apply()
R offers a family of apply functions, which allow you to apply a function across different chunks of data. They offer an alternative to explicit iteration with a for() loop; they can be simpler and faster, though not always. Summary of functions:
- apply(): apply a function to rows or columns of a matrix or data frame
- lapply(): apply a function to elements of a list or vector
- sapply(): same as the above, but simplify the output (if possible)
- tapply(): apply a function to levels of a factor vector
apply(): rows or columns of a matrix or data frame
The apply() function takes inputs of the following form:
- apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x
- apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## 365.00 3098.00 0.50 67.96 1.40 37.80 0.00
## Area
## 1049.00
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## 21198.0 6315.0 2.8 73.6 15.1 67.3 188.0
## Area
## 566432.0
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## 5 2 18 11 1 44 28
## Area
## 2
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Min. 365.00 3098.00 0.500 67.9600 1.400 37.800 0.00 1049.00
## 1st Qu. 1079.50 3992.75 0.625 70.1175 4.350 48.050 66.25 36985.25
## Median 2838.50 4519.00 0.950 70.6750 6.850 53.250 114.50 54277.00
## Mean 4246.42 4435.80 1.170 70.8786 7.378 53.108 104.46 70735.88
## 3rd Qu. 4968.50 4813.50 1.575 71.8925 10.675 59.150 139.75 81162.50
## Max. 21198.00 6315.00 2.800 73.6000 15.100 67.300 188.00 566432.00
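The calls that produced the four outputs above are not shown. They are consistent with column-wise minimum, maximum, position of the maximum, and summary, so one plausible reconstruction (an assumption, not the original code) is:

```r
# state.x77 is a built-in matrix (50 states by 8 variables)
col.min = apply(state.x77, MARGIN=2, FUN=min)              # Minimum entry in each column
col.max = apply(state.x77, MARGIN=2, FUN=max)              # Maximum entry in each column
col.which.max = apply(state.x77, MARGIN=2, FUN=which.max)  # Row index of the max in each column
col.summary = apply(state.x77, MARGIN=2, FUN=summary)      # Six summary values per column

# summary() returns 6 values, so the result is a 6 x 8 matrix
dim(col.summary)
```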
For a custom function, we can just define it beforehand, and then use apply() as usual
# Our custom function: trimmed mean
trimmed.mean = function(v) {
  q1 = quantile(v, probs=0.1)
  q2 = quantile(v, probs=0.9)
  return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean)
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
We’ll learn more about functions later (don’t worry too much at this point about the details of the function definition)
Instead of defining a custom function beforehand, we can just define it “on-the-fly”. Sometimes this is more convenient
# Compute trimmed means, defining this on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) {
  q1 = quantile(v, probs=0.1)
  q2 = quantile(v, probs=0.9)
  return(mean(v[q1 <= v & v <= q2]))
})
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
We can tell apply() to pass extra arguments to the function in question. E.g., use apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2) to pass two extra arguments, extra.arg.1 and extra.arg.2, to my.fun()
# Our custom function: trimmed mean, with user-specified percentiles
trimmed.mean = function(v, p1, p2) {
  q1 = quantile(v, probs=p1)
  q2 = quantile(v, probs=p2)
  return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean, p1=0.01, p2=0.99)
## Population Income Illiteracy Life Exp Murder HS Grad
## 3974.125000 4424.520833 1.136735 70.882708 7.341667 53.131250
## Frost Area
## 104.895833 61860.687500
What kind of data type will apply() give us? That depends on the function we pass. In summary, with FUN=my.fun:
- If my.fun() returns a single value, then apply() will return a vector
- If my.fun() returns k values, then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN=1 or MARGIN=2)
- If my.fun() returns different length outputs for different inputs, then apply() will return a list
- If my.fun() returns a list, then apply() will return a list
We’ll grapple with this in the lab. This is one main advantage of the purrr package: its functions have a much more transparent return type
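These cases can be checked on a small example. The sketch below uses a made-up 2 x 3 matrix, and the names r1 through r5 are purely illustrative:

```r
x = matrix(1:6, nrow=2, ncol=3)   # Rows are (1,3,5) and (2,4,6)

# Single value per row: a vector with one entry per row
r1 = apply(x, MARGIN=1, FUN=sum)

# k = 2 values per row: a matrix with 2 rows, one column per row of x
r2 = apply(x, MARGIN=1, FUN=range)

# k = 2 values per column: still 2 rows, one column per column of x
r3 = apply(x, MARGIN=2, FUN=range)

# Outputs of different lengths (1, 2, 2 entries): a list
r4 = apply(x, MARGIN=2, FUN=function(v) v[v > 1])

# Function that itself returns a list: a list
r5 = apply(x, MARGIN=2, FUN=function(v) list(min(v), max(v)))

dim(r2); dim(r3); class(r4); class(r5)
```

Note that with MARGIN=1 the per-row results end up as columns of the output, so a transpose with t() is often wanted afterwards.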
Don’t overuse the apply paradigm! There are lots of special functions that are optimized and will be both simpler and faster than using apply(). E.g.,
- rowSums(), colSums(): for computing row, column sums of a matrix
- rowMeans(), colMeans(): for computing row, column means of a matrix
- max.col(): for finding the maximum position in each row of a matrix
Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?
x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })
## [1] 0 1 0
## [1] 0 1 0
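The call behind the second output line is not shown. A natural vectorized candidate, offered here as an assumption rather than as the original code, is rowSums() applied to the logical matrix x > 0:

```r
set.seed(1)  # Arbitrary seed, only to make this sketch reproducible
x = matrix(rnorm(9), 3, 3)

# Explicit apply() over rows (slower for big matrices)
slow = apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })

# Vectorized: x > 0 is a logical matrix, and rowSums() counts the TRUEs
fast = rowSums(x > 0)

all(slow == fast)
```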
lapply(), sapply(), tapply()
lapply(): elements of a list or vector
The lapply() function takes inputs as in: lapply(x, FUN=my.fun), to apply my.fun() across elements of a list or vector x. The output is always a list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] FALSE FALSE FALSE FALSE TRUE FALSE
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $nums
## [1] 0.35
##
## $chars
## [1] NA
##
## $bools
## [1] 0.1666667
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE
## logical 5 1
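The outputs above are consistent with printing my.list and then applying mean() and summary() to it with lapply(). A sketch under that assumption follows. The bools element is fixed to the values printed above rather than sampled, so the result is reproducible, and len.list and mean.list are hypothetical names.

```r
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
               bools=c(FALSE,FALSE,FALSE,FALSE,TRUE,FALSE))

# The result is a list with one element per element of my.list
len.list = lapply(my.list, FUN=length)
class(len.list)

# mean() warns and returns NA for the character element; the warning is
# suppressed here only to keep the sketch quiet
mean.list = suppressWarnings(lapply(my.list, FUN=mean))
mean.list$nums
mean.list$chars
```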
sapply(): elements of a list or vector
The sapply() function works just like lapply(), but tries to simplify the return value whenever possible. The most common simplification is the conversion from a list to a vector
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## nums chars bools
## 0.3500000 NA 0.1666667
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE
## logical 5 1
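The two outputs above illustrate both outcomes: simplification to a named vector, and a fallback to a list when the pieces have different lengths. Here is a minimal sketch, assuming the same my.list as before with bools fixed rather than sampled. The vapply() call is an addition not covered in the slides: it is the base-R variant in which the expected return type is declared up front.

```r
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
               bools=c(FALSE,FALSE,FALSE,FALSE,TRUE,FALSE))

# One value per element: simplified to a named vector
len.vec = sapply(my.list, FUN=length)
len.vec

# summary() returns outputs of different lengths, so no simplification
# is possible and the result stays a list
sum.out = sapply(my.list, FUN=summary)
class(sum.out)

# vapply() throws an error if a result is not a single integer
len.checked = vapply(my.list, FUN=length, FUN.VALUE=integer(1))
```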
tapply(): levels of a factor vector
The function tapply() takes inputs as in: tapply(x, INDEX=my.index, FUN=my.fun), to apply my.fun() to subsets of entries in x that share a common level in my.index
# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)
## Northeast South North Central West
## 132.7778 64.6250 138.8333 102.1538
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=sd)
## Northeast South North Central West
## 30.89408 31.30682 23.89307 68.87652
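To see what tapply() does under the hood, it can help to work through a tiny example by hand. The sketch below uses made-up values (x, f, grp.mean, and by.split are illustrative names) and compares tapply() with the equivalent split-then-apply computation.

```r
x = c(1, 2, 3, 4, 8)
f = factor(c("a","b","a","b","a"))

# Group "a" holds 1, 3, 8 (mean 4); group "b" holds 2, 4 (mean 3)
grp.mean = tapply(x, INDEX=f, FUN=mean)
grp.mean

# Same numbers by splitting x by the factor and averaging each piece
by.split = sapply(split(x, f), FUN=mean)
all(grp.mean == by.split)
```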
split(): split by levels of a factor
The function split() splits up the rows of a data frame by levels of a factor, as in: split(x, f=my.index) to split a data frame x according to levels of my.index
# Split up the state.x77 matrix according to region
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list
## [1] "list"
## [1] "Northeast" "South" "North Central" "West"
## [1] "data.frame"
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139 4862
## Maine 1058 3694 0.7 70.39 2.7 54.7 161 30920
## Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103 7826
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Illinois 11197 5107 0.9 70.14 10.3 52.6 127 55748
## Indiana 5313 4458 0.7 70.88 7.1 52.9 122 36097
## Iowa 2861 4628 0.5 72.56 2.3 59.0 140 55941
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) {
  return(apply(df, MARGIN=2, mean))
})
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 5495.111111 4570.222222 1.000000 71.264444 4.722222 53.966667
## Frost Area
## 132.777778 18141.000000
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4208.12500 4011.93750 1.73750 69.70625 10.58125 44.34375
## Frost Area
## 64.62500 54605.12500
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4803.00000 4611.08333 0.70000 71.76667 5.27500 54.51667
## Frost Area
## 138.83333 62652.00000
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00 6.200000e+01
## Frost Area
## 1.021538e+02 1.344630e+05
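In line with the earlier advice not to overuse apply(), the inner apply(df, MARGIN=2, mean) can be swapped for colMeans(). The following sketch (the names means.apply, means.col, and means.mat are hypothetical) checks that the two agree, and uses sapply() to collect the results into a single matrix:

```r
state.by.reg = split(data.frame(state.x77), f=state.region)

# Version from above, and the same thing with colMeans()
means.apply = lapply(state.by.reg, FUN=function(df) apply(df, MARGIN=2, mean))
means.col = lapply(state.by.reg, FUN=colMeans)
all.equal(means.apply, means.col)

# sapply() simplifies the four length-8 vectors into an 8 x 4 matrix,
# with variables as rows and regions as columns
means.mat = sapply(state.by.reg, FUN=colMeans)
dim(means.mat)
means.mat["Frost", ]
```

The Frost row of means.mat should match the regional means computed earlier with tapply().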
- subset(): function for extracting rows of a data frame meeting a condition
- split(): function for splitting up rows of a data frame, according to a factor variable
- apply(): function for applying a given routine to rows or columns of a matrix or data frame
- lapply(): similar, but used for applying a routine to elements of a vector or list
- sapply(): similar, but will try to simplify the return type, in comparison to lapply()
- tapply(): function for applying a given routine to groups of elements in a vector or list, according to a factor variable