Your homework must be submitted in R Markdown format. We will not (indeed, cannot) grade homework in other formats. Your responses must be supported by both textual explanations and the code you generate to produce your results. (Just examining your various objects in the “Environment” pane of RStudio is insufficient; you must use scripted commands.)
The data set at http://www.stat.cmu.edu/~ryantibs/statcomp-F15/homework/capa11.csv contains information about the housing stock of California and Pennsylvania, as of 2011. Information is aggregated into “Census tracts”, geographic regions of a few thousand people which are supposed to be fairly homogeneous economically and socially.
General hint: see Recipes 10.1 and 10.2 in “The R Cookbook” for making scatterplots, and 10.18 and 10.26 for general plotting.
Load the data into a data frame called ca_pa. Remember to pass the full URL of the website containing the data file to read.table() (instead of calling read.table() with a path to a local version of the file that you downloaded).
Let ca_pa_mat be the result of casting ca_pa as a matrix, as below. What happens to numerical columns? (Hint: now try to plot a histogram of the column named Median_rooms.) What difference between the two data types, matrix and data frame, does this highlight?
ca_pa_mat = as.matrix(ca_pa)
apply(ca_pa,c(1,2),is.na)
colSums(apply(ca_pa,c(1,2),is.na))
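Both commands above rely on the MARGIN argument of apply(). The following toy example (a hypothetical 2-by-3 matrix, unrelated to the housing data) is a minimal sketch of what each MARGIN value means: 1 works over rows, 2 over columns, and c(1, 2) over every individual entry.

```r
# Toy matrix (hypothetical values): 2 rows, 3 columns, filled column by column
m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 2)

# MARGIN = 1: call the function once per row
row_sums <- apply(m, 1, sum, na.rm = TRUE)

# MARGIN = 2: call the function once per column
col_sums <- apply(m, 2, sum, na.rm = TRUE)

# MARGIN = c(1, 2): call the function once per entry, keeping the matrix shape
doubled <- apply(m, c(1, 2), function(entry) entry * 2)
```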
The function na.omit() takes a data frame and returns a new data frame, omitting any row containing an NA value. Using it, how many rows are eliminated? Use this new dataset to answer questions 2 through 5.
Reimplement na.omit() using your own control flow statements (for/while loops, or apply()). Extra credit for using apply() or its cousins, like sapply().
The variable Built_2005_or_later indicates the percentage of houses in each Census tract built since 2005. Plot median house prices against this variable.
(The state is recorded in the STATEFP variable, with California being state 6 and Pennsylvania state 42.) There is some example plotting code at the end.
The variable COUNTYFP contains a numerical code for counties within each state. We are interested in Alameda County (county 1 in California), Santa Clara County (county 85 in California), and Allegheny County (county 3 in Pennsylvania).
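Two of the ideas above, dropping incomplete rows with na.omit() and picking out tracts by state and county code, can be tried on a toy data frame first. This is only a sketch: the rows and the value column are hypothetical, and only the STATEFP and COUNTYFP names come from the data set.

```r
# Toy data: hypothetical tracts, not taken from capa11.csv
toy <- data.frame(
  STATEFP  = c(6, 6, 42, 42, 6),
  COUNTYFP = c(1, 85, 3, 3, 1),
  value    = c(10, NA, 30, 40, 50)   # hypothetical measurement column
)

# na.omit() returns a new data frame without the rows that contain an NA
toy_clean <- na.omit(toy)
nrow(toy) - nrow(toy_clean)          # number of rows dropped

# County codes are defined within each state, so test both codes together
in_county <- toy_clean$STATEFP == 6 & toy_clean$COUNTYFP == 1
toy_clean[in_county, ]
```

Testing the county code alone could match a different county in the other state, which is why the sketch combines both conditions with &.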
xlim argument in plot(). What are some key differences in their distributions? (Hint: see next problem.)
The cor function calculates the correlation coefficient between two variables. What is the correlation between median house value and the percent of housing built since 2005 in (i) the whole data, (ii) all of California, (iii) all of Pennsylvania, (iv) Alameda County, (v) Santa Clara County and (vi) Allegheny County?
legend(). (This time, the function points(), and general plotting arguments like pch, col, cex are your friends!)
acca <- c()
for (tract in 1:nrow(ca_pa)) {
  if (ca_pa$STATEFP[tract] == 6) {
    if (ca_pa$COUNTYFP[tract] == 1) {
      acca <- c(acca, tract)
    }
  }
}
accamhv <- c()
for (tract in acca) {
  accamhv <- c(accamhv, ca_pa[tract,10])
}
median(accamhv)
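As a small illustration of the cor function mentioned earlier, the sketch below computes a correlation on a whole toy data set and then within each of two groups. The data frame, its group labels, and its values are hypothetical; the point is only that a correlation computed on a subset can differ sharply from the one computed on all rows.

```r
# Toy data (hypothetical): two groups with opposite trends in y against x
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  x     = c(1, 2, 3, 1, 2, 3),
  y     = c(2, 4, 6, 3, 2, 1)
)

# Correlation over all rows
r_all <- cor(toy$x, toy$y)

# Correlation within each group, using a logical vector to select rows
is_a <- toy$group == "a"
r_a  <- cor(toy$x[is_a],  toy$y[is_a])    # y rises with x in group a
r_b  <- cor(toy$x[!is_a], toy$y[!is_a])   # y falls with x in group b

c(all = r_all, a = r_a, b = r_b)
```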
Example plotting code
You may use the following example code for plotting in 2b, 4b and 4e. The easiest way to get started is to change some arguments slightly and see what happens. Detailed help can also be found in Recipes 10.18 and 10.26 of “The R Cookbook”, or by typing ?plot in the R console.
# Tell R to draw plots in a 1 by 3 array
par(mfrow=c(1,3))
plot(x = 1:5, y = 1:5, col = "blue", cex = 1, xlim = c(0,6), ylim = c(0,6),
     type = 'p', main = "this title", xlab = "x", ylab = "y")
# Insert points on an existing plot
points(x = 1:3, y = (3:1)+.5, cex = 2, col = "green", pch = 17)
legend("topright", pt.cex = c(1,2), col = c("blue","green"), pch = c(1,17),
       legend = c("these blue points", "those green points"))
plot(x = 5:1, y = 1:5, col = "red", xlim = c(0,6), ylim = c(0,6), type = 'p',
     main = "that title", xlab = "x", ylab = "y")
legend("bottomleft", col = "red", pch=1, legend = c("more red points"))
hist(rnorm(100,1,2), xlim = c(-9,9), main = "Histogram example",
     xlab = "x label goes here", col = 'pink')