Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Thursday 10pm, this week.
This week’s agenda: investigating the differences between data frames and matrices; practicing how to use the apply family of functions.
We’re going to look at a data set containing the number of assaults, murders, and rapes per 100,000 residents, along with the percentage of the population living in urban areas, in each of the 50 US states in 1973. This comes from a built-in data frame called USArrests. We’ll copy this to a new data frame called crime.df and append a column that gives the region for each state, from the built-in vector state.region. You can learn more about this crime data set by typing ?USArrests into your R console.
crime.df = data.frame(USArrests, Region = state.region)
1a. Report the number of rows of crime.df, and print its first 6 rows. Using the functions is.data.frame() and is.matrix(), confirm that it is a data frame, and not a matrix.
1b. We’re ready to start investigating the differences between data frames and matrices. Use the as.matrix() function to convert crime.df into a matrix, calling the result crime.mat. Print the first 6 rows of crime.mat. Next, convert only the first 4 columns of crime.df into a matrix, and call the result crime.mat.noregion. Print the first 6 rows of crime.mat.noregion. Take a look at the first 6 rows of crime.df, crime.mat, and crime.mat.noregion. There is something unsatisfactory about crime.mat. What is it and why did this happen? If you need some guidance, try using the class() function to figure out the class of the first column in each of the three objects.
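As a hedged illustration of the kind of comparison this question asks for, the sketch below uses a small made-up data frame rather than the crime data. The names toy.df, toy.mat, and toy.mat.num are hypothetical and are not part of the lab.

```r
# Hypothetical toy data frame mixing numeric and factor columns.
toy.df = data.frame(height = c(1.5, 1.8, 1.7),
                    weight = c(50, 80, 65),
                    group = factor(c("a", "b", "a")))

# A matrix holds a single type, so converting all columns at once
# forces every entry to a common type.
toy.mat = as.matrix(toy.df)
class(toy.mat[, 1])      # "character"

# Converting only the numeric columns keeps the values numeric.
toy.mat.num = as.matrix(toy.df[, 1:2])
class(toy.mat.num[, 1])  # "numeric"
```

Comparing class() on the same column across the objects is usually the quickest way to spot this kind of coercion.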
1c. We now move to another difference between data frames and matrices, with regard to column access/indexing. Let’s start with something more typical. You can access the Murder column of crime.df by typing in crime.df[,"Murder"]. Print the result to the console. Then, try using this same strategy to access the Murder column of crime.mat.noregion. Also print this result. Describe the difference (if any) between the two results.
1d. Let’s try a different way to access columns. You can also access the Murder column of crime.df by typing in crime.df$Murder. Print out the result (it should be the same as the one in Q1c). Try using this same strategy to access the Murder column of crime.mat.noregion. Describe the difference (if any) between the two results. Note: you will need to set error=TRUE as an option in this code chunk to allow R Markdown to knit your lab, despite the error you will encounter here.
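The error mentioned in the note can be previewed on a small made-up example. This is only a sketch: toy.df, toy.mat, and dollar.result are hypothetical names, and tryCatch() is used here so the script keeps running, which is a different mechanism from the error=TRUE chunk option the lab asks for.

```r
# Hypothetical toy objects: the same values as a data frame and as a matrix.
toy.df = data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
toy.mat = as.matrix(toy.df)

# The $ operator extracts a column from the data frame.
toy.df$x

# On the matrix, $ raises an error; tryCatch() captures the error message
# as a character string instead of stopping execution.
dollar.result = tryCatch(toy.mat$x, error = function(e) conditionMessage(e))
dollar.result
```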
1e. Lastly, we’ll demonstrate another difference between data frames and matrices, with regard to column additions. Compute a vector called TotalCrime of length 50 that gives the sum of the values in Murder, Assault and Rape for each of the 50 states. The first element of TotalCrime should give the total crime in Alabama, the second element should give that in Alaska, etc. Do not use a for() loop for this; use rowSums() instead. Now, add TotalCrime as a column to crime.df, and make sure your new column is named TotalCrime in the data frame. Note: there are many ways to do this. Print the first 6 rows of the new crime.df data frame.
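A minimal sketch of the rowSums() pattern on made-up data follows. The names toy.df and ab.total are hypothetical, and the $ assignment shown is just one of the many possible ways to add a column.

```r
# Hypothetical toy data frame with three numeric columns.
toy.df = data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))

# Row-wise sum over a chosen subset of columns, with no explicit loop.
ab.total = rowSums(toy.df[, c("a", "b")])

# Assigning with $ creates a new column with the given name.
toy.df$ab.total = ab.total

head(toy.df)
```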
1f. Add the TotalCrime vector as a new column to crime.mat.noregion, and make sure this column is named appropriately. Note: unlike the last question, there are not many ways to do this; there is only one. Print the first 6 rows of the new crime.mat.noregion matrix.
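For matrices, column binding can be sketched on made-up data as below. The names toy.mat and row.total are hypothetical; the sketch assumes cbind() with a named argument, which sets the name of the appended column.

```r
# Hypothetical toy matrix with named columns.
toy.mat = matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
row.total = rowSums(toy.mat)

# Naming the argument inside cbind() names the new column.
toy.mat = cbind(toy.mat, row.total = row.total)
colnames(toy.mat)  # "a" "b" "row.total"
head(toy.mat)
```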
for() loops
The purpose of the next several questions is to help you internalize how the apply functions (specifically, apply(), sapply(), lapply(), and tapply()) are essentially convenient ways to write for() loops.
Here’s an example to get us started. Consider the following list, called lis, which contains 4 vectors of 5 randomly generated numbers.
set.seed(10)
lis = list(rnorm(5), rnorm(5), rnorm(5), rnorm(5))
lis
## [[1]]
## [1] 0.01874617 -0.18425254 -1.37133055 -0.59916772 0.29454513
##
## [[2]]
## [1] 0.3897943 -1.2080762 -0.3636760 -1.6266727 -0.2564784
##
## [[3]]
## [1] 1.1017795 0.7557815 -0.2382336 0.9874447 0.7413901
##
## [[4]]
## [1] 0.08934727 -0.95494386 -0.19515038 0.92552126 0.48297852
Suppose we wanted to compute the mean of each vector (so we’re looking for 4 numbers). We could do this using a for() loop in the following way, storing the results in mean.vector.
# Preallocate a numeric vector with one slot per list element
mean.vector = vector(length=length(lis), mode="numeric")
for (i in 1:length(lis)) {
  mean.vector[i] = mean(lis[[i]])  # mean of the i-th vector
}
mean.vector
## [1] -0.36829190 -0.61302179 0.66963246 0.06955056
We could also do this using a call to sapply(), in the following simpler way, storing the result as mean.vector2. This gives us exactly the same answer.
mean.vector2 = sapply(lis, mean)
all.equal(mean.vector, mean.vector2)
## [1] TRUE
We’re going to ask you to emulate this for each of 3 other apply functions (lapply(), apply() and tapply()) in the next 3 questions. Your goal will be to compute something using both one of the apply functions and a for() loop, and show that the two results are the same. The tricky part here will be formatting the for() loop properly to match the apply function’s output exactly.
2a. Compute the standard deviation of each of the 4 vectors in lis, in two ways. For the first way use lapply(), in just one line of code, and call the result sd.list. For the second, use a for() loop, and call the result sd.list2. Use all.equal() to show that sd.list and sd.list2 are the same. Hint: to construct an empty list of length n, you can use the command vector(length=n, mode="list").
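The list-filling pattern from the hint can be sketched with a different function on made-up data. The names toy.list, len.list, and len.list2 are hypothetical, and length() stands in for whatever function is being applied.

```r
# Hypothetical toy list of numeric vectors.
toy.list = list(c(1, 2, 3), c(10, 20), c(5, 5, 5, 5))

# One line with lapply(): the result is a list with one element per input.
len.list = lapply(toy.list, length)

# The same computation with a for() loop, filling a preallocated list.
# Double brackets assign to the i-th list element itself.
len.list2 = vector(length = length(toy.list), mode = "list")
for (i in seq_along(toy.list)) {
  len.list2[[i]] = length(toy.list[[i]])
}

all.equal(len.list, len.list2)
```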
2b. Using crime.mat.noregion, compute the maximum value in each of the 5 columns, in two ways. For the first way, use apply(), in just one line of code, and call the result max.vector. For the second, use a for() loop, and call the result max.vector2. Use all.equal() to show that max.vector and max.vector2 are equal. Hint: this is a bit tricky because you’ll need to add names to max.vector2 in order to get all.equal() to return TRUE.
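The naming issue in the hint can be seen on a small made-up matrix. This is a sketch with hypothetical names (toy.mat, sum.by.col, sum.by.col2), using sum() in place of the function the question asks about.

```r
# Hypothetical toy matrix with named columns.
toy.mat = matrix(c(3, 1, 2, 9, 7, 8), nrow = 3,
                 dimnames = list(NULL, c("a", "b")))

# apply() over columns (MARGIN = 2) returns a vector named by column.
sum.by.col = apply(toy.mat, 2, sum)

# A for() loop result starts out unnamed, so names are copied explicitly.
sum.by.col2 = vector(length = ncol(toy.mat), mode = "numeric")
for (j in seq_len(ncol(toy.mat))) {
  sum.by.col2[j] = sum(toy.mat[, j])
}
names(sum.by.col2) = colnames(toy.mat)

all.equal(sum.by.col, sum.by.col2)
```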
2c. Using crime.df, compute the minimum value of Murder within each of the four regions (Northeast, South, North Central, and West), in two ways. For the first way, use tapply(), in just one line of code, and call the result min.vector. For the second, use a for() loop, and call the result min.vector2. Use all.equal() to show min.vector and min.vector2 are equal. Hint: the trickiest part to figure out here is how to get the order of values in min.vector and min.vector2 to be the same. Use levels(crime.df$Region) to dictate the order of regions in min.vector2. You’ll also have to cast min.vector2 to be the same data structure as min.vector.
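The ordering and data-structure points in the hint can be sketched on made-up data. All names here (toy.df, mean.by.group, mean.by.group2, grp.levels) are hypothetical, mean() stands in for the function in the question, and as.array() is one possible choice for the cast, not necessarily the intended one.

```r
# Hypothetical toy data with a factor whose levels are not alphabetical.
toy.df = data.frame(score = c(4, 8, 6, 10, 2),
                    group = factor(c("x", "y", "x", "y", "z"),
                                   levels = c("z", "x", "y")))

# tapply() returns a 1-dimensional array, ordered by the factor levels.
mean.by.group = tapply(toy.df$score, toy.df$group, mean)

# Loop over the levels in the same order, then convert the named vector
# to an array so the two structures match.
grp.levels = levels(toy.df$group)
mean.by.group2 = vector(length = length(grp.levels), mode = "numeric")
for (k in seq_along(grp.levels)) {
  mean.by.group2[k] = mean(toy.df$score[toy.df$group == grp.levels[k]])
}
names(mean.by.group2) = grp.levels
mean.by.group2 = as.array(mean.by.group2)

all.equal(mean.by.group, mean.by.group2)
```

Checking class() and dim() on the tapply() result is a quick way to see which structure the loop result needs to match.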
2d. Compute the quantiles of the Murder column in crime.mat.noregion using the quantile() function, and print the result to the console. Now compute the quantiles of each of the columns of crime.mat.noregion, using apply() and quantile(), in just one line of code. Store the resulting matrix as quant.mat, print it out to the console, and comment on its dimensions and row and column names. Now compute the 10%, 20%, etc., through 90% quantiles of the Murder column with a single call to quantile(), and print the result to the console. Hint: look at the documentation for quantile() to figure out what argument to set in order to achieve this result. Do the same for each column of crime.mat.noregion, using apply() and quantile(), and passing additional arguments as appropriate. Store the resulting matrix as quant.mat2, and print it out to the console. Lastly (sorry to do this to you, but you probably guessed we would ask), replicate this with a for() loop, calling the result quant.mat3. Check using all.equal() that quant.mat2 and quant.mat3 match. Hint: you’ll have to set the row and column names of quant.mat3 appropriately.
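Passing additional arguments through apply() can be sketched on a made-up matrix. The names toy.mat and q.mat are hypothetical, and the probabilities used here are chosen only for illustration.

```r
# Hypothetical toy matrix with two named columns of 10 values each.
toy.mat = matrix(c(1:10, seq(10, 100, by = 10)), ncol = 2,
                 dimnames = list(NULL, c("a", "b")))

# Arguments given after the function name are forwarded to quantile().
q.mat = apply(toy.mat, 2, quantile, probs = c(0.25, 0.75))
q.mat
dim(q.mat)       # 2 2
rownames(q.mat)  # "25%" "75%"
```

Each column of the result holds the quantiles for the matching input column, and the row names come from the names that quantile() attaches to its output.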