Background. The American Community Survey (ACS) is a yearly survey given to a sample of the U.S. population, collecting information on demographics, occupation, educational attainment, veteran status, home ownership versus renting, and a variety of other attributes. The data set you will use in this exam aggregates responses from 2009 through 2013 for each of the 74,020 tracts (small Census Bureau statistical areas, nested within counties) in the United States. (Thanks to Jerzy Wieczorek for help assembling and cleaning this data.) You are asked to manage, manipulate, and analyze this data using R.


Part 0: Instructions

Make sure you read these.


Part 1: Loading and cleaning [10 pts]

The following code loads the plyr library (hint: it will be very useful for this exam!), and reads in the census data frame from http://www.stat.cmu.edu/~ryantibs/statcomp-F15/exams/census.RData. The result has 74020 rows and 31 columns. Each row represents a census tract, and each column a variable that has been measured.

library(plyr)
load(url("http://www.stat.cmu.edu/~ryantibs/statcomp-F15/exams/census.RData"))
dim(census)
## [1] 74020    31
  1. How many states are represented among the 74020 census tracts? How many counties? [1 pt]
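
For instance (the state column State_name appears later in this exam; the county column name here is an assumption, so check names(census) first):

# Distinct states among the tracts; county names repeat across states,
# so count state-county pairs for the counties
length(unique(census$State_name))
nrow(unique(census[, c("State_name", "County_name")]))  # County_name is assumed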

  2. Columns 8 through 31 of the census data frame represent numeric variables, but columns 8 and 9 are not stored as such. These two are measured in US dollars: median household income (Med_HHD_Inc_ACS_09_13) and median house value (Med_House_value_ACS_09_13). What are the classes of these columns? [1 pt]
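
For instance:

# Storage classes of columns 8 and 9
sapply(census[, 8:9], class)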

  3. Convert these two columns into numbers (in whole US dollars). (Hint: it will be helpful to first convert into strings, then remove any non-numeric characters, then convert into numbers.) For example, $63,030 should be converted into the integer 63030. Check your answer by printing out the summary() of these two new columns. Make sure that empty entries ("") are properly converted to NA. [3 pts]
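
One way to follow the hint, using gsub() to strip everything but digits (note that as.numeric() turns the resulting "" entries into NA, with a coercion warning):

# Convert a dollar-formatted column into whole numbers; "" becomes NA
clean.dollars = function(v) {
  as.numeric(gsub("[^0-9]", "", as.character(v)))
}
census$Med_HHD_Inc_ACS_09_13 = clean.dollars(census$Med_HHD_Inc_ACS_09_13)
census$Med_House_value_ACS_09_13 = clean.dollars(census$Med_House_value_ACS_09_13)
summary(census$Med_HHD_Inc_ACS_09_13)
summary(census$Med_House_value_ACS_09_13)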

  4. Several entries are missing in this data set, including the ones you discovered in the previous question. Compute the number of missing entries in each row, and save the vector as num.na.row. Then, obtain the indices of rows containing any missing values and save them in a vector named contains.na. What is the average number of missing values among the rows that contain at least one missing entry? [2 pts]
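
A sketch:

# NAs per row, rows with any NAs, and the average count among those rows
num.na.row = rowSums(is.na(census))
contains.na = which(num.na.row > 0)
mean(num.na.row[contains.na])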

  5. Are there any states with no missing values? If so, print out the names of all such states. [1 pt]
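
One approach, reusing num.na.row from the previous question:

# Total missing entries per state; report states with none
na.by.state = tapply(num.na.row, census$State_name, sum)
names(na.by.state)[na.by.state == 0]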

  6. Redefine the census data frame by removing rows that have missing values, as per the contains.na vector computed in Part 1 Question 4. Check that the new census data frame now has 70877 rows. How many states and counties are represented in this new data frame? What states (if any) have been thrown out compared to the original data frame? [2 pts]
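
A sketch (save the original state names first, so you can see what was dropped):

old.states = unique(census$State_name)  # record before dropping rows
census = census[-contains.na, ]
nrow(census)                            # should be 70877
setdiff(old.states, unique(census$State_name))  # states thrown out, if any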


Part 2: Exploratory statistics [10 pts]

For this part, and the remainder of the exam, we will use the cleaned census data frame derived from Part 1 Question 6.

  1. There are several variables that record the percentage of the population in various age categories (their names all start with pct_Pop_). Use these to determine, for each census tract, the percentage of the population that is less than 5 years old, and append these values as a new column called pct_Pop_0_4_ACS_09_13 to the census data frame. Then, use this new column, along with the existing Tot_Population_ACS_09_13 column, to determine the number of 0-4 year olds in each state. Which state has the highest number, and what is its value? [3 pts]
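
A sketch, under the assumption (verify it against names(census)) that the existing pct_Pop_ columns are mutually exclusive and together cover everyone aged 5 and up, so that the 0-4 percentage is 100 minus their sum:

# Percentage of 0-4 year olds, as the remainder of the age categories
# (compute age.cols before appending the new column)
age.cols = grep("^pct_Pop_", names(census))
census$pct_Pop_0_4_ACS_09_13 = 100 - rowSums(census[, age.cols])

# Tract-level counts of 0-4 year olds, summed within each state
kids.per.tract = census$pct_Pop_0_4_ACS_09_13 / 100 *
                 census$Tot_Population_ACS_09_13
kids.by.state = tapply(kids.per.tract, census$State_name, sum)
kids.by.state[which.max(kids.by.state)]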

  2. Using your answer from the last question, determine the percentage of 0-4 year olds in each state. Which state has the highest percentage, and what is its value? [1 pt]
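
Continuing the sketch above:

# State-level 0-4 percentages: state counts over state populations
pop.by.state = tapply(census$Tot_Population_ACS_09_13, census$State_name, sum)
pct.by.state = 100 * kids.by.state / pop.by.state
pct.by.state[which.max(pct.by.state)]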

  3. Calculate the correlation between each pair of the numeric variables, columns 8 through 31 of the census data frame. Which two variables have the highest positive correlation? Which two have the most negative correlation? Do these relationships make sense? [3 pts]
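
A sketch (each pair shows up twice in the output, by symmetry):

# Correlation matrix of the numeric columns, self-correlations masked
cor.mat = cor(census[, 8:31])
diag(cor.mat) = NA
which(cor.mat == max(cor.mat, na.rm=TRUE), arr.ind=TRUE)
which(cor.mat == min(cor.mat, na.rm=TRUE), arr.ind=TRUE)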

  4. Plot a histogram of Med_House_value_ACS_09_13, and label the axes appropriately. See that small bump at the value 1000001? This is due to a common practice in public data sets of “top-coding”, i.e., censoring unusually large values to protect the anonymity of survey respondents. [1 pt]
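
For instance:

# Median house value distribution; watch for the spike at 1000001
hist(census$Med_House_value_ACS_09_13, breaks=50,
     xlab="Median house value (US dollars)",
     main="Median house values across census tracts")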

  5. It is possible that the tracts that have been top-coded differ significantly from the other tracts, in ways other than median house value. The following code computes a t-test between two groups of census tracts, on any given variable. The two groups being compared are: all tracts with median house value equal to the max (1000001), and all tracts with median house value less than the max. It then returns a p-value for the test. Note that a lower p-value means a more significant difference between top-coded and non-top-coded tracts, according to the variable in question.

my.test = function(var) {
  # TRUE for top-coded tracts (median house value at the max), FALSE otherwise
  group = census$Med_House_value_ACS_09_13 == 1000001
  # Two-sample t-test of var between the top-coded and remaining tracts
  p.val = t.test(var[group], var[!group])$p.value
  return(p.val)
}

Apply the function my.test() to the variables in columns 10 through 31 of the census data frame. What are the 2 smallest p-values, and which variables do these correspond to? Does this make sense? [2 pts]
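
A sketch:

# p-value for each of columns 10 through 31, sorted smallest first
p.vals = sapply(census[, 10:31], my.test)
head(sort(p.vals), 2)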


Part 3: Sampling and plotting [20 pts]

Plotting data of this size is tricky simply because of the sheer number of data points. For instance, you can try plot(census) on your own to see your computer wheeze, cough, crash, and burn. In this problem, we will write a suite of functions that allow you to more efficiently plot relationships between pairs of variables in the data set.

  1. Write a function plot.sample() that takes in five arguments: x, y, nsample, xlab, and ylab. The first two arguments are variables to be plotted. The third is the number of points to be randomly sampled, with a default of 500. (Hence, if x and y are vectors of length, say, 5000, then a random 500 of the total 5000 x-y pairs will be plotted, by default.) The last two are the x and y labels, both with defaults of "" (the empty string). A few notes:

After writing this function, you can try it out by uncommenting the code below. [5 pts]

#plot.sample(census$Med_HHD_Inc_ACS_09_13, census$Med_House_value_ACS_09_13,
#            xlab="Median HHD income", ylab="Median house value")
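
One possible implementation, sampling pairs without replacement (and taking all of them if there are fewer than nsample):

plot.sample = function(x, y, nsample=500, xlab="", ylab="") {
  # Plot a random subset of at most nsample x-y pairs
  ind = sample(length(x), min(length(x), nsample))
  plot(x[ind], y[ind], xlab=xlab, ylab=ylab)
}
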
  2. Next, write a function add.lineartrend.info(), which is designed to be called after plot.sample(). This takes as input x and y, the variables that already appear on the x- and y-axes, and does two things: it adds the line of best fit from the linear regression of y on x, and it adds a title displaying information about this linear trend.

To reiterate, this function assumes there is a plot already in place; it simply adds the line and the title. Again, after writing this function, you can try it out by uncommenting the code below. [5 pts]

#plot.sample(census$Med_HHD_Inc_ACS_09_13, census$Med_House_value_ACS_09_13,
#            xlab="Median HHD income", ylab="Median house value")
#add.lineartrend.info(census$Med_HHD_Inc_ACS_09_13, census$Med_House_value_ACS_09_13)
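
A sketch; the exact contents of the title were left open above, so the slope and correlation shown here are one reasonable choice:

add.lineartrend.info = function(x, y) {
  # Line of best fit from regressing y on x, drawn over the existing plot
  fit = lm(y ~ x)
  abline(fit, col="red")
  # Title summarizing the linear trend
  title(main=sprintf("Slope: %.3f, correlation: %.3f",
                     coef(fit)[2], cor(x, y, use="complete.obs")))
}
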
  3. Lastly, write a function plot.all(), which takes two inputs: dataset, a data frame, and nsample, the number of points to be sampled. This function will mimic the behavior of plot() when applied to data frames. In other words, if dataset has p columns, then plot.all() should produce a p by p rectangular grid of plots, where the plot in entry i, j of the grid uses column i for the x-axis data, and column j for the y-axis data. Some notes:

A code template for plot.all() is found below. (Hint: using a for() loop for the grid entries is perfectly fine.) Fill it out, and then uncomment the code that follows to plot the pairwise relationships between the 4 census variables Med_HHD_Inc_ACS_09_13, Med_House_value_ACS_09_13, pct_College_ACS_09_13, and Mail_Return_Rate_CEN_2010. Comment on the results; do the relationships make sense to you? [10 pts]

plot.all = function(dataset, nsample=500) {
  p = ncol(dataset)
  orig.par = par()
  
  # Set the margins, and plotting grid
  par(mar=c(2,2,2,2))
  par(mfrow=c(p,p))
  
  # TODO: your plotting code goes here
  
  # Reset the margins, and plotting grid
  par(mar=orig.par$mar)
  par(mfrow=orig.par$mfrow)
}
#mydat = census[,c("Med_HHD_Inc_ACS_09_13", "Med_House_value_ACS_09_13", 
#                  "pct_College_ACS_09_13","Mail_Return_Rate_CEN_2010")]
#plot.all(mydat)
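
For reference, a sketch of one way to fill out the template, reusing plot.sample() and add.lineartrend.info() from the previous questions; overlaying the linear trend info in each panel is an assumption here:

plot.all = function(dataset, nsample=500) {
  p = ncol(dataset)
  orig.par = par()

  # Set the margins, and plotting grid
  par(mar=c(2,2,2,2))
  par(mfrow=c(p,p))

  # Entry i,j of the grid: column i on the x-axis, column j on the y-axis
  # (only a sample of points is drawn; the line and title use all the data)
  for (i in 1:p) {
    for (j in 1:p) {
      plot.sample(dataset[,i], dataset[,j], nsample,
                  xlab=names(dataset)[i], ylab=names(dataset)[j])
      add.lineartrend.info(dataset[,i], dataset[,j])
    }
  }

  # Reset the margins, and plotting grid
  par(mar=orig.par$mar)
  par(mfrow=orig.par$mfrow)
}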

Part 4: Option A, permute away! [30 pts]

Here, we will use what is called a permutation test to answer the following question: are the linear regression coefficients between two different states substantially different? Note: a permutation test is a fairly advanced statistical technique, but not much statistical knowledge is required to answer this question. For a bit more (optional) discussion of the permutation test, see the end of this problem.

  1. From the census data frame, extract all rows that correspond to tracts in Virginia or West Virginia, and call the resulting data frame census.vw. Verify that this has 2321 rows. [1 pt]
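
For instance:

# Tracts in Virginia or West Virginia only
census.vw = census[census$State_name %in% c("Virginia", "West Virginia"), ]
nrow(census.vw)  # should be 2321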

  2. Now perform two separate linear regressions using census.vw, each with Mail_Return_Rate_CEN_2010 as the response and pct_NH_White_alone_ACS_09_13 and pct_Renter_Occp_HU_ACS_09_13 as the predictors: the first using only the tracts in Virginia, and the second using only the tracts in West Virginia. Ignoring the intercept terms, each linear regression model here gives you two coefficients (one for pct_NH_White_alone_ACS_09_13 and one for pct_Renter_Occp_HU_ACS_09_13). Compare the two sets of coefficients. Are they the same? [4 pts]
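
A sketch of the two fits:

# Same model formula, fit separately on each state's tracts
fit.va = lm(Mail_Return_Rate_CEN_2010 ~ pct_NH_White_alone_ACS_09_13 +
              pct_Renter_Occp_HU_ACS_09_13,
            data=census.vw, subset=(State_name == "Virginia"))
fit.wv = lm(Mail_Return_Rate_CEN_2010 ~ pct_NH_White_alone_ACS_09_13 +
              pct_Renter_Occp_HU_ACS_09_13,
            data=census.vw, subset=(State_name == "West Virginia"))
coef(fit.va)[-1]  # drop the intercept
coef(fit.wv)[-1]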

  3. Even though the coefficients for Virginia and West Virginia may be different, it is hard to tell just how different they are. To help answer this question, we will look at the linear regression coefficients if we were to randomly scramble census tracts between Virginia and West Virginia, and then refit the linear regression models. This is the idea behind a permutation test. Vaguely speaking, if the two sets of coefficients were truly the same, then shuffling the labels should have no real effect on our estimated coefficients. First, make a copy of census.vw called census.vw.perm. Then, randomly permute the entries in census.vw.perm$State_name. (Hint: recall the function sample().) Once this is done, rerun the two regression models from Part 4A Question 2 using census.vw.perm, and report the coefficients for Virginia, and for West Virginia. What are their values now? Are the differences between the coefficients for Virginia and West Virginia much smaller than they were in Part 4A Question 2? [3 pts]
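
A sketch of the scrambling step (the refits then proceed exactly as in the previous question, with census.vw.perm in place of census.vw):

set.seed(1)  # any seed, for reproducibility
census.vw.perm = census.vw
census.vw.perm$State_name = sample(census.vw.perm$State_name)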

  4. You will write a function to encapsulate the tasks you performed in the previous questions. The function will be called reg.coef(), and it will take in five arguments: census.subset, a data frame containing a subset of census rows; y, the response variable; x1 and x2, the two predictor variables; and shuffle, a Boolean.

Your function should perform the following. If shuffle is TRUE, then the entries in census.subset$State_name are randomly permuted. If shuffle is FALSE, then this step is skipped (and census.subset$State_name is left alone). Next, for each state in the census.subset data frame, run a regression of the response, represented by y, on the predictor variables, represented by x1 and x2. A matrix with 2 columns, and one row for each state in census.subset, is returned. Each row gives the coefficients for x1 and x2 in the linear regression model computed for that particular state (though they are included in the regressions, the intercept terms are ignored here). (Hint: for running regressions at a state-by-state level, use daply().) Recreate the results of Part 4A Question 2 with your function, to check that it gives the same answers. Be sure to use the set.seed() function to enforce reproducibility. [8 pts]
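
A sketch using daply(), assuming the variable names are passed as strings (the exact argument conventions were left open above):

reg.coef = function(census.subset, y, x1, x2, shuffle=FALSE) {
  # Optionally scramble the state labels
  if (shuffle) {
    census.subset$State_name = sample(census.subset$State_name)
  }
  # Per state: regress y on x1 and x2, keep the two non-intercept coefficients
  daply(census.subset, "State_name", function(df) {
    coef(lm(df[, y] ~ df[, x1] + df[, x2]))[2:3]
  })
}

# Check against Part 4A Question 2 (set.seed() matters when shuffle=TRUE):
reg.coef(census.vw, "Mail_Return_Rate_CEN_2010",
         "pct_NH_White_alone_ACS_09_13", "pct_Renter_Occp_HU_ACS_09_13")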

  5. Lastly, we will implement our permutation test. At a high level, the idea is to compare the differences in coefficients between Virginia and West Virginia, from (say) Part 4A Question 2, to those in Part 4A Question 3 computed using scrambled data. To make our comparisons more meaningful, we will repeat the scrambling of state labels (the permutations) multiple times. Write a function permutation.test() that takes in the following seven arguments: dataset, a data frame; the names of the two states to compare; x1 and x2, the two predictor variables; y, the response variable; and the number of permutations to perform.

This function will carry out the following tasks, in order: extract from dataset the rows for the two given states; compute the two states' regression coefficients on the original data, using reg.coef() with shuffle=FALSE; recompute the coefficients with shuffle=TRUE, once per permutation; and plot the resulting differences in coefficients between the two states, drawing the permuted differences in black and the original differences in red.

Once written, uncomment the code below to run permutation.test() on two different scenarios, and comment on the results. In each case, do the red dots lie roughly in the point cloud of black dots? If so, that points to the differences between states being insignificant; if not, that points to the differences between states being significant. [12 pts]

#set.seed(10)
#out1 = permutation.test(census, "Virginia", "West Virginia", "pct_Males_ACS_09_13",
#                        "pct_Pop_5_17_ACS_09_13", "Tot_Population_ACS_09_13")
#out2 = permutation.test(census, "Wisconsin", "Connecticut", "pct_College_ACS_09_13", 
#                        "pct_Renter_Occp_HU_ACS_09_13", "Mail_Return_Rate_CEN_2010")
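
A sketch consistent with the calls above (argument order dataset, two states, x1, x2, y; the seventh argument nperm, the number of permutations, is an assumed name and default):

permutation.test = function(dataset, state.1, state.2, x1, x2, y, nperm=200) {
  # Restrict to the two states being compared
  census.subset = dataset[dataset$State_name %in% c(state.1, state.2), ]

  # Observed difference in coefficients between the two states
  obs = reg.coef(census.subset, y, x1, x2, shuffle=FALSE)
  obs.diff = obs[state.1, ] - obs[state.2, ]

  # Differences under nperm random relabelings of the tracts
  perm.diff = matrix(0, nperm, 2)
  for (i in 1:nperm) {
    perm = reg.coef(census.subset, y, x1, x2, shuffle=TRUE)
    perm.diff[i, ] = perm[state.1, ] - perm[state.2, ]
  }

  # Black dots for permuted differences, a red dot for the observed one;
  # widen the plot limits so the red dot is always visible
  xlim = range(c(perm.diff[, 1], obs.diff[1]))
  ylim = range(c(perm.diff[, 2], obs.diff[2]))
  plot(perm.diff[, 1], perm.diff[, 2], xlim=xlim, ylim=ylim,
       xlab=paste("Difference in", x1, "coefficients"),
       ylab=paste("Difference in", x2, "coefficients"))
  points(obs.diff[1], obs.diff[2], col="red", pch=19)

  return(list(obs.diff=obs.diff, perm.diff=perm.diff))
}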

Statistical background. For the statistically curious: note that the permutation test described here is one of many ways to determine how substantially different regression coefficients are. Students who have taken other statistics courses might know that using standard errors is another way. The permutation test is in some sense more robust, because it assumes less about the data-generating distribution.