Statistical Computing, 36-350
Monday October 10, 2016
The data frame is the format for the “classic” data table in statistics. Many of the “really-statistical” parts of the R programming language presume data frames.
What's the difference between data frames and lists? Every column in a data frame must have the same length, whereas the elements of a list can have different lengths.
# Creating a data frame is like creating a list
my.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:6],
                   bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.df
## nums chars bools
## 1 0.1 a FALSE
## 2 0.2 b TRUE
## 3 0.3 c TRUE
## 4 0.4 d FALSE
## 5 0.5 e TRUE
## 6 0.6 f TRUE
# But note, a list can have elements of different lengths!
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
               bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## $chars
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
## $bools
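A minimal sketch of what happens when the lengths disagree, using made-up vectors (the names bad.df, ok.list and rec.df are hypothetical, not from the lecture). data.frame() refuses columns whose lengths cannot be reconciled, while list() accepts them as they are. One caveat: data.frame() recycles a shorter vector when the longer length is a whole multiple of it, so lengths 6 and 12 do not produce an error.

```r
# Lengths 6 and 5 cannot be reconciled, so data.frame() signals an error
bad.df = tryCatch(data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:5]),
                  error=function(e) e)
inherits(bad.df, "error") # TRUE: no data frame was created

# The same two vectors are fine as elements of a list
ok.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:5])
sapply(ok.list, length) # Lengths 6 and 5

# Lengths 6 and 12: the shorter vector is recycled twice, giving 12 rows
rec.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12])
nrow(rec.df)
```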
# Accessing a data frame is like accessing a matrix, or a list
my.df[,1] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df[,"nums"] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$nums # Doesn't work for a matrix, but works for a list
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$chars # Note: this one was converted into a factor (the default before R 4.0.0)!
## [1] a b c d e f
## Levels: a b c d e f
as.character(my.df$chars) # Converting it back to a character data type
## [1] "a" "b" "c" "d" "e" "f"
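Whether character columns become factors is controlled by the stringsAsFactors argument of data.frame(), whose default changed from TRUE to FALSE in R 4.0.0. The sketch below sets it explicitly both ways so the result does not depend on the R version; the names df.fac and df.chr are hypothetical.

```r
# Ask for factors explicitly (the default behavior before R 4.0.0)
df.fac = data.frame(chars=letters[1:6], stringsAsFactors=TRUE)
class(df.fac$chars) # "factor"

# Ask to keep characters as characters (the default from R 4.0.0 on)
df.chr = data.frame(chars=letters[1:6], stringsAsFactors=FALSE)
class(df.chr$chars) # "character"

# Converting the factor back recovers the original strings
identical(as.character(df.fac$chars), df.chr$chars) # TRUE
```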
Oftentimes it’s helpful to start with a matrix, and add columns (of different data types) to make it a data frame.
class(state.x77) # Matrix of states data, 50 states x 8 variables
## [1] "matrix"
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
class(state.region) # Factor of regions for the 50 states
## [1] "factor"
head(state.region)
## [1] South West West South West West
## Levels: Northeast South North Central West
class(state.division) # Factor of divisions for the 50 states
## [1] "factor"
head(state.division)
## [1] East South Central Pacific Mountain
## [4] West South Central Pacific Mountain
## 9 Levels: New England Middle Atlantic ... Pacific
# Let's combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
class(state.df)
## [1] "data.frame"
head(state.df) # Note that the first 8 column names carried over from state.x77
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area Region Division
## Alabama 50708 South East South Central
## Alaska 566432 West Pacific
## Arizona 113417 West Mountain
## Arkansas 51945 South West South Central
## California 156361 West Pacific
## Colorado 103766 West Mountain
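A quick way to check the combined data frame, sketched here with base R calls only: its dimensions, the class of each column, and which column names were altered. By default data.frame() makes column names syntactically valid (its check.names argument), which is why "Life Exp" shows up as Life.Exp in the output above.

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
dim(state.df)           # 50 rows, 10 columns
sapply(state.df, class) # 8 numeric columns, then 2 factor columns

# Names in the matrix that no longer appear verbatim in the data frame
setdiff(colnames(state.x77), names(state.df)) # "Life Exp" "HS Grad"
```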
With data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with logical indexing.
# Compare the averages of the Frost column between states in the New England
# and Pacific divisions. Pay careful attention, only need two lines of code!
mean(state.df[state.df$Division == "New England", "Frost"])
## [1] 145.3333
mean(state.df[state.df$Division == "Pacific", "Frost"]) # Those wimps!
## [1] 49.6
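The one-liners above can be unpacked into steps. This sketch stores the logical vector first, inspects it, and then uses it as a row index; the names is.ne and is.cold.west, and the Frost > 100 threshold, are illustrative choices rather than part of the lecture.

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

is.ne = state.df$Division == "New England" # Logical vector, one entry per state
sum(is.ne)                     # How many states satisfy the condition
rownames(state.df)[is.ne]      # Which states they are
mean(state.df[is.ne, "Frost"]) # Same average as the one-liner

# Conditions combine with & (and) and | (or)
is.cold.west = state.df$Region == "West" & state.df$Frost > 100
rownames(state.df)[is.cold.west]
```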
Another way of accessing a subset of the rows, sometimes easier with data frames, is through subset().
# Using subset(), we can just use the column names directly, i.e., no need
# for writing state.df$Division, can just use Division
state.df.ne.1 = subset(state.df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
## [1] TRUE
# Pay attention again, only two lines of code, for the same comparison!
mean(subset(state.df, Division == "New England")$Frost)
## [1] 145.3333
mean(subset(state.df, Division == "Pacific")$Frost) # Wimps
## [1] 49.6
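subset() can also pick columns at the same time, through its select argument. A minimal sketch, with hypothetical names ne.small and ne.manual, comparing it against manual indexing:

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

# Keep the New England rows, and only the Population and Frost columns
ne.small = subset(state.df, Division == "New England",
                  select=c(Population, Frost))
dim(ne.small)

# The same rows and columns, extracted manually
ne.manual = state.df[state.df$Division == "New England",
                     c("Population", "Frost")]
all(ne.small == ne.manual)
```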
Using with(), we can refer to the columns in our data frame directly by name. Saves us from writing, e.g., state.df$Population, state.df$Area, etc., and instead we use Population, Area, etc.
pop.dens.1 = state.df$Population/state.df$Area # Compute the population density
pop.dens.2 = with(state.df, Population/Area) # Same calculation
all(pop.dens.1 == pop.dens.2)
## [1] TRUE
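To tie the three approaches together, here is a sketch computing the Pacific division's average Frost with logical indexing, with subset(), and with with(). The names frost.1, frost.2 and frost.3 are hypothetical.

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

# Average Frost in the Pacific division, written three ways
frost.1 = mean(state.df[state.df$Division == "Pacific", "Frost"])
frost.2 = mean(subset(state.df, Division == "Pacific")$Frost)
frost.3 = with(state.df, mean(Frost[Division == "Pacific"]))
c(frost.1, frost.2, frost.3) # All three agree
```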
We’ll learn a lot more about apply() next time, but here’s the basics:
apply(x, 1, FUN) applies a function FUN to rows of a matrix or data frame x
apply(x, 2, FUN) applies a function FUN to columns of a matrix or data frame x
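A tiny made-up matrix makes the row versus column distinction easy to check by hand. This is only a sketch; m, row.sums and col.sums are hypothetical names.

```r
m = matrix(1:6, nrow=2) # 2 x 3 matrix, filled column by column
m
row.sums = apply(m, 1, sum) # One value per row: 9 12
col.sums = apply(m, 2, sum) # One value per column: 3 7 11
```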
head(state.x77) # We'll consider only the numeric variables
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
apply(state.x77, 2, min) # Compute the min of each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 365.00 3098.00 0.50 67.96 1.40 37.80
## Frost Area
## 0.00 1049.00
apply(state.x77, 2, max) # Compute the max of each column
## Population Income Illiteracy Life Exp Murder HS Grad
## 21198.0 6315.0 2.8 73.6 15.1 67.3
## Frost Area
## 188.0 566432.0
# Define our own Range function, then apply it to each column
Range = function(x) { max(x) - min(x) }
apply(state.x77, 2, Range)
## Population Income Illiteracy Life Exp Murder HS Grad
## 20833.00 3217.00 2.30 5.64 13.70 29.50
## Frost Area
## 188.00 565383.00
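The function passed to apply() does not need a name. As a sketch, the same ranges can be computed with an anonymous function written inline, or by combining the built-in range() and diff(); r.named, r.anon and r.diff are hypothetical names.

```r
Range = function(x) { max(x) - min(x) }
r.named = apply(state.x77, 2, Range)

# Same computation with an anonymous function defined inline
r.anon = apply(state.x77, 2, function(x) { max(x) - min(x) })

# And again using the built-in range(), which returns c(min, max)
r.diff = apply(state.x77, 2, function(x) { diff(range(x)) })

all(r.named == r.anon)
all(r.named == r.diff)
```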