Reminder: data frames

The data frame is the format for the “classic” data table in statistics. Many of the “really statistical” parts of the R programming language presume data frames

Difference between data frames and lists? Each column in a data frame must have the same length, whereas the elements of a list can have different lengths

Data frame examples

# Creating a data frame is like creating a list
my.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:6], 
                   bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.df
##   nums chars bools
## 1  0.1     a  TRUE
## 2  0.2     b FALSE
## 3  0.3     c FALSE
## 4  0.4     d FALSE
## 5  0.5     e FALSE
## 6  0.6     f  TRUE
# But note, the list can have different lengths for different elements!
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12], 
               bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
my.list
## $nums
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
## 
## $chars
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
## 
## $bools
## [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE
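One subtlety is worth a quick sketch: data.frame() recycles a shorter column when the lengths divide evenly, and throws an error only when they do not. The names rec.df and bad.df below are made up for illustration.

```r
# Lengths 6 and 12: the shorter column is recycled twice, giving 12 rows
rec.df = data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12])
nrow(rec.df)
## [1] 12

# Lengths 6 and 5: no whole-number recycling is possible, so this is an error.
# tryCatch() captures the error message instead of stopping the script
bad.df = tryCatch(data.frame(nums=seq(0.1,0.6,by=0.1), chars=letters[1:5]),
                  error = function(e) conditionMessage(e))
bad.df # A character string holding the error message, not a data frame

# Under the hood, a data frame is a list of equal-length columns
is.list(rec.df)
## [1] TRUE
```

Either way, the columns of the resulting data frame all end up with the same length.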

Data frame examples (continued)

# Accessing a data frame is like accessing a matrix, or a list
my.df[,1] # Also works for a matrix 
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df[,"nums"] # Also works for a matrix
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$nums # Doesn't work for a matrix, but works for a list
## [1] 0.1 0.2 0.3 0.4 0.5 0.6
my.df$chars # Note: converted into a factor! (Default before R 4.0.0; stays character since)
## [1] a b c d e f
## Levels: a b c d e f
as.character(my.df$chars) # Converting it back to a character data type
## [1] "a" "b" "c" "d" "e" "f"
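Whether a character column becomes a factor depends on the R version, so it can help to set the stringsAsFactors argument of data.frame() explicitly. A minimal sketch, with made-up names df.fac and df.chr:

```r
# Ask for factors explicitly (this was the default before R 4.0.0)
df.fac = data.frame(nums=1:3, chars=letters[1:3], stringsAsFactors=TRUE)
class(df.fac$chars)
## [1] "factor"

# Keep strings as characters (this is the default since R 4.0.0)
df.chr = data.frame(nums=1:3, chars=letters[1:3], stringsAsFactors=FALSE)
class(df.chr$chars)
## [1] "character"

# Double brackets also extract a single column, just as with a list
identical(df.chr[["chars"]], df.chr$chars)
## [1] TRUE
```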

Building a data frame out of a matrix

Often it’s helpful to start with a matrix, and add columns (of different data types) to make it a data frame

class(state.x77) # Matrix of states data, 50 states x 8 variables
## [1] "matrix"
head(state.x77) 
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766
class(state.region) # Factor of regions for the 50 states
## [1] "factor"
head(state.region)
## [1] South West  West  South West  West 
## Levels: Northeast South North Central West
class(state.division) # Factor of divisions for the 50 states
## [1] "factor"
head(state.division) 
## [1] East South Central Pacific            Mountain          
## [4] West South Central Pacific            Mountain          
## 9 Levels: New England Middle Atlantic ... Pacific
# Let's combine these into a data frame with 50 rows and 10 columns
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
class(state.df)
## [1] "data.frame"
head(state.df) # Note that the first 8 column names carried over from state.x77 (spaces became dots)
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area Region           Division
## Alabama     50708  South East South Central
## Alaska     566432   West            Pacific
## Arizona    113417   West           Mountain
## Arkansas    51945  South West South Central
## California 156361   West            Pacific
## Colorado   103766   West           Mountain
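The dots in Life.Exp and HS.Grad come from the check.names argument of data.frame(), which converts column names into syntactically valid ones by default. Here is a small sketch of turning that off; the name state.df.raw is made up for illustration.

```r
# Default (check.names=TRUE): "Life Exp" becomes the syntactic name Life.Exp
state.df = data.frame(state.x77, Region=state.region, Division=state.division)
dim(state.df)
## [1] 50 10

# check.names=FALSE keeps the original column names, spaces included
state.df.raw = data.frame(state.x77, Region=state.region, check.names=FALSE)
names(state.df.raw)

# Names with spaces need backticks when used with $
mean(state.df.raw$`Life Exp`) == mean(state.df$Life.Exp)
## [1] TRUE
```

Keeping the default is usually more convenient, since syntactic names work directly with $, subset(), and with().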

Reminder: accessing rows

With data frames, we’ll often want to access a subset of the rows corresponding to some condition. You already know how to do this, with logical indexing

# Compare the averages of the Frost column between states in the New England
# and Pacific divisions. Pay careful attention: we only need two lines of code!
mean(state.df[state.df$Division == "New England", "Frost"]) 
## [1] 145.3333
mean(state.df[state.df$Division == "Pacific", "Frost"]) # Those wimps!
## [1] 49.6
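Logical indexing extends naturally to several conditions at once. The sketch below rebuilds state.df so that it runs on its own; the threshold of 100 and the names west.cold and frost.both are arbitrary choices for illustration.

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

# Combine conditions with & (and) or | (or); the result is a logical vector
# with one entry per row
west.cold = state.df$Region == "West" & state.df$Frost > 100
state.df[west.cold, c("Frost", "Income")] # Rows satisfying both conditions

# sum() of a logical vector counts the TRUEs, i.e., the matching rows
sum(west.cold) == nrow(state.df[west.cold, ])
## [1] TRUE

# %in% matches any of several values
frost.both = mean(state.df[state.df$Division %in% c("New England", "Pacific"),
                           "Frost"])
```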

subset()

Another way of accessing a subset of the rows, sometimes easier with data frames, is through subset()

# Using subset(), we can just use the column names directly, i.e., no need 
# for writing state.df$Division, can just use Division
state.df.ne.1 = subset(state.df, Division == "New England")
# Get same thing by extracting the appropriate rows manually
state.df.ne.2 = state.df[state.df$Division == "New England", ]
all(state.df.ne.1 == state.df.ne.2)
## [1] TRUE
# Pay attention again, only two lines of code, for the same comparison!
mean(subset(state.df, Division == "New England")$Frost)
## [1] 145.3333
mean(subset(state.df, Division == "Pacific")$Frost) # Wimps
## [1] 49.6
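subset() can also pick out columns through its select argument, again using bare column names. A short sketch, rebuilding state.df so that it runs on its own; ne.small and west.cold are made-up names.

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

# select= chooses columns by name, with no quotes and no state.df$
ne.small = subset(state.df, Division == "New England",
                  select=c(Population, Frost))
dim(ne.small) # 6 New England states, 2 columns
## [1] 6 2

# Conditions can be combined, just as with logical indexing
west.cold = subset(state.df, Region == "West" & Frost > 100, select=Frost)
```

One difference from manual indexing: subset() treats a missing value (NA) in the condition as FALSE, so those rows are dropped rather than returned as rows of NAs.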

with()

Using with(), we can refer to the columns in our data frame directly by name. Saves us from writing, e.g., state.df$Population, state.df$Area, etc., and instead we use Population, Area, etc.

pop.dens.1 = state.df$Population/state.df$Area # Compute the population density 
pop.dens.2 = with(state.df, Population/Area) # Same calculation
all(pop.dens.1 == pop.dens.2)
## [1] TRUE
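with() is not limited to arithmetic on columns; any expression works, including one wrapped in braces. A minimal sketch, rebuilding state.df so that it runs on its own; frost.ne.1, frost.ne.2 and dens.range are made-up names.

```r
state.df = data.frame(state.x77, Region=state.region, Division=state.division)

# The New England Frost average once more, this time through with()
frost.ne.1 = with(state.df, mean(Frost[Division == "New England"]))
frost.ne.2 = mean(state.df[state.df$Division == "New England", "Frost"])
frost.ne.1 == frost.ne.2
## [1] TRUE

# Braces allow several steps; the value of the last expression is returned
dens.range = with(state.df, {
  dens = Population/Area   # Temporary variable, local to this with() call
  max(dens) - min(dens)
})
```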

apply()

We’ll learn a lot more about apply() next time, but here are the basics:

head(state.x77) # We'll consider only the numeric variables
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766
apply(state.x77, 2, min) # Compute the min of each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##     365.00    3098.00       0.50      67.96       1.40      37.80 
##      Frost       Area 
##       0.00    1049.00
apply(state.x77, 2, max) # Compute the max of each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##    21198.0     6315.0        2.8       73.6       15.1       67.3 
##      Frost       Area 
##      188.0   566432.0
# Define our own Range function, then apply it to each column
Range = function(x) { max(x) - min(x) }
apply(state.x77, 2, Range)
## Population     Income Illiteracy   Life Exp     Murder    HS Grad 
##   20833.00    3217.00       2.30       5.64      13.70      29.50 
##      Frost       Area 
##     188.00  565383.00
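Two more features of apply() are worth a quick sketch: extra arguments are passed along to the function, and the second argument chooses between rows (1) and columns (2). The names meds and row.max are made up, and the row-wise maximum only illustrates the mechanics, since the columns are in different units.

```r
Range = function(x) { max(x) - min(x) }

# Same result from built-ins: range() returns c(min, max), diff() subtracts them
all(apply(state.x77, 2, Range) == apply(state.x77, 2, function(x) diff(range(x))))
## [1] TRUE

# Arguments after the function name are passed on to it (here, probs=0.5)
meds = apply(state.x77, 2, quantile, probs=0.5) # Median of each column

# Using 1 instead of 2 applies the function to each row (one value per state)
row.max = apply(state.x77, 1, max)
length(row.max)
## [1] 50
```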