nchar()
, substr()
: functions for substring extractions and replacementsstrsplit()
, paste()
: functions for splitting and combining stringstable()
: function to get word counts, useful way of summarizing text dataData frames
The format for the “classic” data table in statistics: data frame. Lots of the “really-statistical” parts of the R programming language presume data frames
Difference between data frames and lists? Each column in a data frame must have the same length (each element in the list can be of different lengths)
Use data.frame()
, similar to how we create lists
my.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:6],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.df
## nums chars bools
## 1 0.1 a FALSE
## 2 0.2 b TRUE
## 3 0.3 c FALSE
## 4 0.4 d TRUE
## 5 0.5 e TRUE
## 6 0.6 f TRUE
# Note, a list can have different lengths for different elements!
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] TRUE FALSE TRUE TRUE TRUE FALSE
my.df[,1] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df[,"nums"] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$nums # Doesn't work for a matrix, but works for a list
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$chars # Note: this one has been converted into a factor data type
## [1] a b c d e f
## Levels: a b c d e f
as.character(my.df$chars) # Converting it back to a character data type
## [1] "a" "b" "c" "d" "e" "f"
Often times it’s helpful to start with a matrix, and add columns (of different data types) to make it a data frame
class(state.x77) # Built-in matrix of states data, 50 states x 8 variables
## [1] "matrix"
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
class(state.region) # Factor of regions for the 50 states
## [1] "factor"
head(state.region)
## [1] South West West South West West
## Levels: Northeast South North Central West
class(state.division) # Factor of divisions for the 50 states
## [1] "factor"
head(state.division)
## [1] East South Central Pacific Mountain
## [4] West South Central Pacific Mountain
## 9 Levels: New England Middle Atlantic ... Pacific
# Combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
class(state.df)
## [1] "data.frame"
head(state.df) # Note that the first 8 columns name carried over from state.x77
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area Region Division
## Alabama 50708 South East South Central
## Alaska 566432 West Pacific
## Arizona 113417 West Mountain
## Arkansas 51945 South West South Central
## California 156361 West Pacific
## Colorado 103766 West Mountain
To add columns: we can either use data.frame()
, or directly define a new named column
# First way: use data.frame() to concatenate on a new column
state.df = data.frame(state.df, Cool=sample(c(T,F), nrow(state.df), rep=TRUE))
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool
## Alabama South East South Central FALSE
## Alaska West Pacific TRUE
## Arizona West Mountain TRUE
## Arkansas South West South Central FALSE
# Second way: just directly define a new named column
state.df$Score = sample(1:100, nrow(state.df), replace=TRUE)
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool Score
## Alabama South East South Central FALSE 81
## Alaska West Pacific TRUE 17
## Arizona West Mountain TRUE 50
## Arkansas South West South Central FALSE 14
To delete columns: we can either use negative integer indexing, or set a column to NULL
# First way: use negative integer indexing
state.df = state.df[,-ncol(state.df)]
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division Cool
## Alabama South East South Central FALSE
## Alaska West Pacific TRUE
## Arizona West Mountain TRUE
## Arkansas South West South Central FALSE
# Second way: just directly set a column to NULL
state.df$Cool = NULL
head(state.df, 4)
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Region Division
## Alabama South East South Central
## Alaska West Pacific
## Arizona West Mountain
## Arkansas South West South Central
With matrices or data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with Boolean indexing
# Compare the averages of the Frost column between states in New England and
# Pacific divisions
mean(state.df[state.df$Division == "New England", "Frost"])
## [1] 145.3333
mean(state.df[state.df$Division == "Pacific", "Frost"]) # Those wimps!
## [1] 49.6
subset()
The subset()
function provides a convenient alternative way of accessing rows for data frames
# Using subset(), we can just use the column names directly (i.e., no need for
# using $)
state.df.ne.1 = subset(state.df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
## [1] TRUE
# Same calculation as in the last slide, using subset()
mean(subset(state.df, Division == "New England")$Frost)
## [1] 145.3333
mean(subset(state.df, Division == "Pacific")$Frost) # Wimps
## [1] 49.6
apply()
R offers a family of apply functions, which allow you to apply a function across different chunks of data. Offers an alternative to explicit iteration using for()
loop; can be simpler and faster, though not always. Summary of functions:
apply()
: apply a function to rows or columns of a matrix or data framelapply()
: apply a function to elements of a list or vectorsapply()
: same as the above, but simplify the output (if possible)tapply()
: apply a function to levels of a factor vectorapply()
, rows or columns of a matrix or data frameThe apply()
function takes inputs of the following form:
apply(x, MARGIN=1, FUN=my.fun)
, to apply my.fun()
across rows of a matrix or data frame x
apply(x, MARGIN=2, FUN=my.fun)
, to apply my.fun()
across columns of a matrix or data frame x
apply(state.x77, MARGIN=2, FUN=min) # Minimum entry in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 365.00 3098.00 0.50 67.96 1.40 37.80
## Frost Area
## 0.00 1049.00
apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 21198.0 6315.0 2.8 73.6 15.1 67.3
## Frost Area
## 188.0 566432.0
apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 5 2 18 11 1 44
## Frost Area
## 28 2
apply(state.x77, MARGIN=2, FUN=summary) # Summary of each col, get back matrix!
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Min. 365 3098 0.500 67.96 1.400 37.80 0.00 1049
## 1st Qu. 1080 3993 0.625 70.12 4.350 48.05 66.25 36990
## Median 2838 4519 0.950 70.68 6.850 53.25 114.50 54280
## Mean 4246 4436 1.170 70.88 7.378 53.11 104.50 70740
## 3rd Qu. 4968 4814 1.575 71.89 10.680 59.15 139.80 81160
## Max. 21200 6315 2.800 73.60 15.100 67.30 188.00 566400
For a custom function, we can just define it before hand, and the use apply()
as usual
# Our custom function: trimmed mean
trimmed.mean = function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean)
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
We’ll learn more about functions later (don’t worry too much at this point about the details of the function definition)
Instead of defining a custom function before hand, we can just define it “on-the-fly”. Sometimes this is more convenient
# Compute trimmed means, defining this on-the-fly
apply(state.x77, MARGIN=2, FUN=function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
})
## Population Income Illiteracy Life Exp Murder HS Grad
## 3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
## Frost Area
## 104.68293 56575.72500
Can tell apply()
to pass extra arguments to the function in question. E.g., can use: apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2)
, for two extra arguments extra.arg.1
, extra.arg.2
to be passed to my.fun()
# Our custom function: trimmed mean, with user-specified percentiles
trimmed.mean = function(v, p1, p2) {
q1 = quantile(v, prob=p1)
q2 = quantile(v, prob=p2)
return(mean(v[q1 <= v & v <= q2]))
}
apply(state.x77, MARGIN=2, FUN=trimmed.mean, p1=0.01, p2=0.99)
## Population Income Illiteracy Life Exp Murder
## 3974.125000 4424.520833 1.136735 70.882708 7.341667
## HS Grad Frost Area
## 53.131250 104.895833 61860.687500
What kind of data type will apply()
give us? Depends on what function we pass. Summary, say, with FUN=my.fun()
:
my.fun()
returns a single value, then apply()
will return a vectormy.fun()
returns k values, then apply()
will return a matrix with k rows (note: this is true regardless of whether MARGIN=1
or MARGIN=2
)my.fun()
returns different length outputs for different inputs, then apply()
will return a listmy.fun()
returns a list, then apply()
will return a listWe’ll grapple with this on the lab/hw. (One advantage of **ply()
functions that we’ll see later is that there is a much more transparent return object type)
Don’t overuse the apply paradigm! There’s lots of special functions that optimized are will be both simpler and faster than using apply()
. E.g.,
rowSums()
, colSums()
: for computing row, column sums of a matrixrowMeans()
, colMeans()
: for computing row, column means of a matrixmax.col()
: for finding the maximum position in each row of a matrixCombining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?
x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })
## [1] 1 2 3
# Do this insted (much faster, simpler)
rowSums(x > 0)
## [1] 1 2 3
lapply()
, vapply()
, tapply()
lapply()
, elements of a list or vectorThe lapply()
function takes inputs as in: lapply(x, FUN=my.fun)
, to apply my.fun()
across elements of a list or vector x
. The output is always a list
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
##
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
##
## $bools
## [1] TRUE FALSE TRUE TRUE TRUE FALSE
lapply(my.list, FUN=mean) # Get a warning: mean() can't be applied to chars
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $nums
## [1] 0.35
##
## $chars
## [1] NA
##
## $bools
## [1] 0.6666667
lapply(my.list, FUN=summary)
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE NA's
## logical 2 4 0
sapply()
, elements of a list or vectorThe sapply()
function works just like lapply()
, but tries to simplify the return value whenever possible. E.g., most common is the conversion from a list to a vector
sapply(my.list, FUN=mean) # Simplifies the result, now a vector
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## nums chars bools
## 0.3500000 NA 0.6666667
sapply(my.list, FUN=summary) # Can't simplify, so still a list
## $nums
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.225 0.350 0.350 0.475 0.600
##
## $chars
## Length Class Mode
## 12 character character
##
## $bools
## Mode FALSE TRUE NA's
## logical 2 4 0
tapply()
, levels of a factor vectorThe function tapply()
takes inputs as in: tapply(x, INDEX=my.index, FUN=my.fun)
, to apply my.fun()
to subsets of entries in x
that share a common level in my.index
# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)
## Northeast South North Central West
## 132.7778 64.6250 138.8333 102.1538
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=sd)
## Northeast South North Central West
## 30.89408 31.30682 23.89307 68.87652
split()
, split by levels of a factorThe function split()
split up the rows of a data frame by levels of a factor, as in: split(x, f=my.index)
to split a data frame x
according to levels of my.index
# Split up the state.x77 matrix according to region
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list
## [1] "list"
names(state.by.reg) # This has 4 elements for the 4 regions
## [1] "Northeast" "South" "North Central" "West"
class(state.by.reg[[1]]) # Each element is a data frame
## [1] "data.frame"
# For each region, display the first 3 rows of the data frame
lapply(state.by.reg, FUN=head, 3)
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## Maine 1058 3694 0.7 70.39 2.7 54.7 161
## Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103
## Area
## Connecticut 4862
## Maine 30920
## Massachusetts 7826
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Illinois 11197 5107 0.9 70.14 10.3 52.6 127 55748
## Indiana 5313 4458 0.7 70.88 7.1 52.9 122 36097
## Iowa 2861 4628 0.5 72.56 2.3 59.0 140 55941
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Area
## Alaska 566432
## Arizona 113417
## California 156361
# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) {
return(apply(df, MARGIN=2, mean))
})
## $Northeast
## Population Income Illiteracy Life.Exp Murder
## 5495.111111 4570.222222 1.000000 71.264444 4.722222
## HS.Grad Frost Area
## 53.966667 132.777778 18141.000000
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4208.12500 4011.93750 1.73750 69.70625 10.58125 44.34375
## Frost Area
## 64.62500 54605.12500
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4803.00000 4611.08333 0.70000 71.76667 5.27500 54.51667
## Frost Area
## 138.83333 62652.00000
##
## $West
## Population Income Illiteracy Life.Exp Murder
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00
## HS.Grad Frost Area
## 6.200000e+01 1.021538e+02 1.344630e+05
subset()
: function for extracting rows of a data frame meeting a conditionsplit()
: function for splitting up rows of a data frame, according to a factor variableapply()
: function for applying a given routine to rows or columns of a matrix or data framelapply()
: similar, but used for applying a routine to elements of a vector or listsapply()
: similar, but will try to simplify the return type, in comparison to lapply()
tapply()
: function for applying a given routine to groups of elements in a vector or list, according to a factor variable