Name(s):
Andrew ID(s):

Instructions

Make sure you read these.

library(plyr, quiet=TRUE)
library(maps, quiet=TRUE)
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:plyr':
## 
##     ozone
library(ggmap, quiet=TRUE)
library(knitr, quiet=TRUE) 
#opts_chunk$set(cache=TRUE, autodep=TRUE, cache.comments=TRUE)

Super important instructions

Doubly make sure you read these.

answers = list(
  name1 = "",
  andrewID1 = "",
  name2 = "",
  andrewID2 = "",
  opt.out = FALSE,
  team.name = ""
)

Background

Believe it or not, back in 1995—an age when computers were mostly desktops, cell phones were huge and clunky, and tweeting was only for the birds (and not US Presidents)—the same colleges that you know and love were still around. This take-home exam will transport you back to this ancient time, before most of you were born, and examine two data sets: the first from the 1995 issue of US News & World Report’s Guide to America’s Best Colleges, and the second from the 1995 annual faculty survey run by the American Association of University Professors.

The first data set is up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/dat_students.csv. This contains variables relating to students (e.g., number of applications received, average tuition cost, graduation rate) for several universities. The second data set is up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/dat_professors.csv. This contains variables relating to faculty (e.g., number of full, associate, and assistant professors, and their salaries), for several universities. In this take-home exam, you will clean the two data sets, merge them, and explore statistical relationships and models. Carnegie Mellon University itself is one of the universities recorded!

To read more about the data sets, including explanations of the variables in each, see http://ww2.amstat.org/publications/jse/datasets/usnews.txt and http://ww2.amstat.org/publications/jse/datasets/aaup.txt.

Part A: Loading, cleaning, exploring (25 points)

  1. Load the two data sets into your R session, saving the data frames as dat.students and dat.professors, respectively. For each load, use read.csv() and set the argument stringsAsFactors to be FALSE. Check that dat.students and dat.professors have dimensions 1302 x 35 and 1161 x 17, respectively. Report the mean of the OutStateTuition column of dat.students, and the mean of the Salary.All column of dat.professors, making sure to ignore NA values in each case.
answers$PartA.Q1.avg.tuition = 0 # numeric
answers$PartA.Q1.avg.salary = 0 # numeric
  2. We will first clean dat.students. Define a new version dat.students.clean of the dat.students data frame by following these two steps: (i) remove all columns that have at least 240 NA values, and then (ii) remove all rows that have at least one NA value. Check that the resulting data frame dat.students.clean has dimension 777 x 22, and no NA values. Report the names of the columns that were removed from the original data frame. We will use this cleaned students data set for the rest of the final.

    Add a new column to dat.students.clean called PercentAccepted, which is formed by dividing Num.Accepted by Num.Received, and multiplying by 100. Then answer, based on the cleaned data set: which university has the lowest acceptance rate? The highest out-of-state tuition? The largest number of full-time students?

answers$PartA.Q2.removed.cols = c() # string vector
answers$PartA.Q2.lowest.accept = "" # string 
answers$PartA.Q2.highest.outstate = "" # string 
answers$PartA.Q2.most.fulltime = "" # string
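The two-step cleaning described in Q2 can be sketched on a toy data frame; this is not the exam data, and the NA threshold of 2 here just stands in for 240:

```r
# Toy data frame with scattered NA values (made-up columns, not the exam data)
df = data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, NA, 1), d = c(5, 6, NA, 8))
col.na = colSums(is.na(df))           # number of NA values per column
df2 = df[, col.na < 2, drop = FALSE]  # (i) drop columns with >= 2 NA values
df3 = df2[complete.cases(df2), ]      # (ii) drop rows with any NA value
dim(df3)  # 2 x 2, and no NA values remain
```

Note that step (i) must come before step (ii): dropping the bad columns first saves rows that would otherwise be discarded.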
  3. Let us now explore a few universities in a bit more depth. Print out the variables (columns) College, Num.Received, Num.Accepted, Top10, OutStateTuition, RoomBoardCosts, SFRatio, PercentAlumni, Expenditure and Graduation for Harvard, Carnegie Mellon, University of Pittsburgh, and Texas A&M at College Station. Use grep() to select the appropriate rows of dat.students.clean for these universities.

    Then report, for each of the 9 numeric variables (all except the College column), which of these 4 colleges had the highest value, along with the value itself.

answers$PartA.Q3.highest.schools = c() # string vector of length 9
answers$PartA.Q3.highest.vals = c() # numeric vector of length 9
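As a reminder of how grep() supports row selection, here is a toy sketch with a made-up vector of college names:

```r
# grep() returns the indices of entries matching a pattern; these indices
# can then be used to subset the rows of a data frame
colleges = c("Harvard University", "Carnegie Mellon University",
             "University of Pittsburgh")
ii = grep("Carnegie", colleges)  # index of the matching entry
colleges[ii]                     # "Carnegie Mellon University"
```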
  4. We are interested in seeing whether universities with a lower acceptance rate generally have a higher out-of-state tuition. Make a scatter plot of PercentAccepted versus OutStateTuition, with properly labeled axes and a proper title. In addition, abide by the following plotting aesthetics.

    • Color points red for public universities, and blue for private universities. Use legend() to make a legend identifying the two types of schools.

    • Use pch=21 for all universities except the 4 universities we discussed in Q3 above, and for these 4 universities, use pch=19.

    • Use cex=1 for all universities except the 4 universities we discussed above, and for these use cex=2.

    • Use text() to put the labels “HVD”, “CMU”, “PIT”, and “TAM” next to the points for Harvard, Carnegie Mellon, University of Pittsburgh, and Texas A&M at College Station.

    After making this plot, report: is there a general trend that you see?
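One way to satisfy aesthetics like these is to build per-point pch, cex, and col vectors before calling plot(); a minimal sketch, with made-up highlight flags and hypothetical public/private codes:

```r
# Hypothetical flags: TRUE marks one of a few highlighted universities
is.special = c(FALSE, TRUE, FALSE)
pch.vec = ifelse(is.special, 19, 21)  # filled circles for highlighted schools
cex.vec = ifelse(is.special, 2, 1)    # and larger point sizes for them
pub.priv = c(1, 2, 1)                 # hypothetical codes: 1 public, 2 private
col.vec = ifelse(pub.priv == 1, "red", "blue")
# then: plot(x, y, pch=pch.vec, cex=cex.vec, col=col.vec), followed by
# legend() for the color key and text() for the school labels
```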

  5. We will now clean dat.professors. Define a new version dat.professors.clean of the dat.professors data frame by removing any rows with NA values. You should now have 1074 rows in the new data frame. We will use this cleaned professors data set for the rest of the final.

    Answer, based on the cleaned data set: which university has the highest (average) salary for full professors? Associate professors? Assistant professors? And the largest number of (all) faculty?

answers$PartA.Q5.highest.full = "" # string
answers$PartA.Q5.highest.associate = "" # string
answers$PartA.Q5.highest.assistant = "" # string
answers$PartA.Q5.largest.faculty = "" # string
  6. We are interested in seeing the relationships between salaries of associate and full professors. Make a scatter plot of Salary.Full versus Salary.Associate, with appropriately labeled axes (including specifying the units of these variables on the axes labels) and an appropriate title. Set the x-axis and y-axis limits so that they are equal, and all points are contained in the plot. Also draw the line \(y=x\) as a thick red line on top of the plot. Identify Carnegie Mellon on your plot by putting a label of “CMU” next to its point, and drawing this point in a different color and/or pch value and/or cex value.

    What is the general trend that you see? Report: are there any schools for which average full professor salary is actually lower than average associate professor salary? Which ones are they? Also, which school has the highest upward jump from average associate professor salary to average full professor salary? Lastly, by how many dollars should Professor Tibshirani expect his salary to rise when (or should we say if?!) he becomes a full professor at Carnegie Mellon? (Currently he is an associate professor; you can ignore any issues to do with inflation here.)

answers$PartA.Q6.full.less.than.assoc = c() # string vector
answers$PartA.Q6.biggest.jump.to.full = "" # string
answers$PartA.Q6.prof.tibs.raise = 0 # numeric

Part B: Aggregating, merging, plotting (30 points)

  1. Using an appropriate method from the apply() family and the dat.students.clean data frame, compute the average in-state tuition per state. Report the averages for Pennsylvania, New York, and California. Also, report: which state has the highest, and which has the lowest? Repeat all of this for out-of-state tuition. Then, additionally report: which state has the biggest absolute difference between average in-state and out-of-state tuitions, and what is this difference?
answers$PartB.Q1.in.tuition.pa = 0 # numeric
answers$PartB.Q1.in.tuition.ny = 0 # numeric
answers$PartB.Q1.in.tuition.ca = 0 # numeric
answers$PartB.Q1.in.tuition.hi = "" # string
answers$PartB.Q1.in.tuition.lo = "" # string
answers$PartB.Q1.out.tuition.pa = 0 # numeric
answers$PartB.Q1.out.tuition.ny = 0 # numeric
answers$PartB.Q1.out.tuition.ca = 0 # numeric
answers$PartB.Q1.out.tuition.hi = "" # string
answers$PartB.Q1.out.tuition.lo = "" # string
answers$PartB.Q1.biggest.diff.state = "" # string
answers$PartB.Q1.biggest.diff.val = 0 # numeric
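A per-group average of this kind can be computed with tapply(); a toy sketch with made-up states and tuitions:

```r
state = c("PA", "PA", "NY", "CA")        # made-up data, not the exam data
tuition = c(10000, 20000, 30000, 24000)
avg.by.state = tapply(tuition, state, mean)  # one average per state
avg.by.state["PA"]                  # 15000
names(which.max(avg.by.state))      # "NY", the state with the highest average
```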
  2. Recompute the average in-state and out-of-state tuitions per state, but now with an appropriate function from the plyr package. Verify (programmatically) that your results match those computed in the last question.

  3. Using the map() function from the maps package, plot a map of the US, and color the states according to their average in-state tuition, as computed in either Q1 or Q2 above. Use heat.colors() as your color palette. Produce a second plot for the out-of-state tuitions. Title the plots appropriately.

    (Hint: the map() function doesn’t recognize state abbreviations properly, so you’ll have to pass in the appropriate full state names, instead. For this, you’ll find the built-in vectors state.abb and state.name helpful; they give state abbreviations and corresponding state names, respectively. Unfortunately, there’s one state abbreviation in your data set that’s not in the state.abb vector, so you’ll only be able to assign state names to 49 of your 50 states; for this last state, just leave it out when plotting with map().)

  4. Using either an apply() function or a plyr function, compute the sum of full-time college students for each state. Report which state has the most full-time students. Then compute the ratio of total full-time students in each state to the population of that state. Recall that the state populations can be found in the built-in matrix state.x77. Here are some important things to be aware of in computing this ratio.

    • The states represented in your dat.students.clean data frame and in the state.x77 matrix aren’t exactly the same; in particular, there’s one state in the dat.students.clean data frame that’s not in the state.x77 matrix. You’ll have to first find the states in dat.students.clean that are also in state.x77. (Hint: for this, you’ll find the built-in vector state.abb helpful, which gives you the state abbreviations for the row names of state.x77.) Then, you’ll compute the ratio for these 49 states, and for the last state (which appears in dat.students.clean but not state.x77) you’ll set its ratio to NA.

    • The population column in state.x77 is measured in thousands of people, so you’ll have to multiply this by 1000.

    • The population column in state.x77 gives state populations in 1975, and the college data was collected in 1995, which means that (in addition to multiplying them by 1000) you’ll have to adjust the populations in state.x77 appropriately. Scale up these populations by noting that there were 212.32 million people living in the US in 1975 and 266.28 million people living in the US in 1995, and assuming that the state populations all increased proportionally (by the same factor) in this time period.

    After computing the appropriate state-wise full-time students to population ratios, report: which state has the highest ratio, and what is this ratio? Then report the ratios for Pennsylvania, New York, and California.

answers$PartB.Q4.biggest.num.fulltime = "" # string
answers$PartB.Q4.biggest.ratio = "" # string
answers$PartB.Q4.ratio.pa = 0 # numeric
answers$PartB.Q4.ratio.ny = 0 # numeric
answers$PartB.Q4.ratio.ca = 0 # numeric
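The population adjustment described above amounts to one multiplication per state; a sketch with hypothetical values in the units of state.x77 (thousands of people):

```r
# Hypothetical 1975 populations in thousands, as stored in state.x77
pop.1975.thousands = c(A = 10000, B = 20000)
# convert to people, then scale 1975 populations up to 1995 levels
pop.1995 = pop.1975.thousands * 1000 * (266.28 / 212.32)
```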
  5. Now we will merge the data sets dat.students.clean and dat.professors.clean together appropriately by college. Our final result will be a data frame whose rows correspond to the common colleges, and whose columns are a subset of the possible 23 + 17 = 40 total, that represent the unique variables measured (i.e., with duplicate columns removed). First, determine which variables are common to the two data frames. You should find that FICE, which represents the federal ID number, is common to both data frames, as well as a few others. Remove these other common columns from dat.professors.clean, and then merge dat.professors.clean with dat.students.clean by FICE. (Hint: you can either do this manually, or more easily, with merge().) Call the resulting data frame dat.college, and report its dimensions.

    Then, report: how many colleges are present in the (cleaned) students data frame, but not in the merged one? Among these colleges, which one has the largest number of full-time students, and what is this number? Lastly, is this a small or big number in the context of the full-time student counts across the students data frame overall? That is, have we excluded any big colleges from the merged data set? (Your answer to this last question can be informal, but should still be supported by some code.)

answers$PartB.Q5.dat.college.dim = c() # numeric vector of length 2
answers$PartB.Q5.colleges.left.out = 0 # numeric
answers$PartB.Q5.colleges.left.out.biggest.name = "" # string
answers$PartB.Q5.colleges.left.out.biggest.num = 0 # numeric
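The behavior of merge() relevant here can be seen on a toy pair of data frames: by default, only rows whose ID appears in both inputs are kept (an "inner join"):

```r
# Toy data frames sharing an ID column (made-up values, not the exam data)
students = data.frame(FICE = c(1, 2, 3), Tuition = c(10, 20, 30))
professors = data.frame(FICE = c(2, 3, 4), Salary = c(50, 60, 70))
combined = merge(students, professors, by = "FICE")
dim(combined)  # 2 x 3: only FICE values 2 and 3 are common to both
```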
  6. Do professors get paid more if students are paying more money? To answer this question, first add a column to dat.college called Tuition, the average of InStateTuition and OutStateTuition. Then make a scatter plot of Salary.All versus Tuition, using the data in dat.college. Label the axes and title the plot appropriately. Report the correlation between these two variables, and draw the line of best fit as a thick black line, on top of your plot.
answers$PartB.Q6.tuition.salary.cor = 0 # numeric
  7. In your plot from the last question, there should be two “obvious” groups of points that exhibit different linear trends. This suggests that a single linear model is not appropriate for regressing Salary.All onto Tuition over all the colleges in dat.college, but two linear models should be fit separately to the two groups of points.

    Replot Salary.All versus Tuition, as in the last question, but now color the points according to whether the schools are public or private, and add a corresponding legend. Did this (more or less) visually separate the two groups? Recompute the correlation of Salary.All and Tuition, separately for the public schools and private schools. Did the two correlation values increase compared to the correlation you computed in the last question? Lastly, add two separate lines of best fit, one for the public schools and one for the private schools, to your plot. How do the slopes of these lines compare?

answers$PartB.Q7.tuition.salary.pub.cor = 0 # numeric
answers$PartB.Q7.tuition.salary.priv.cor = 0 # numeric

Part C: Geocoding (20 points)

  1. We would like to know exactly where the colleges in dat.college lie in the US, and the state-level information is too broad. Specifically, we’re going to “geocode” the colleges, meaning compute their longitudes and latitudes, using the geocode() function from the ggmap package. It would not be a bad idea to read through the help file given by typing ?geocode in your console. As a first step, compute the longitude and latitude of Carnegie Mellon, by calling geocode() with location="Carnegie Mellon University, PA" and source="dsk". Report these values—are they accurate? (We’re not looking for you to answer the last part with code, rather, look up the latitude and longitude of Carnegie Mellon somewhere on the web.)

    Calling geocode() with location="Hogwarts School of Witchcraft and Wizardry" and source="dsk", you’ll see that it throws a warning message and returns NA values for longitude and latitude. Additionally, if you turn off your wireless (or wired) internet connection, and rerun the geocode() command with Carnegie Mellon as above, then you’ll see that it throws a warning message and returns NA values. The geocode() function handles these cases elegantly, but in worse cases it may throw an error.

answers$PartC.Q1.cmu.coords = c() # numeric vector of length 2
  2. Add a new column called Address to the dat.college data frame, given by concatenating together the strings in the College and State columns, separated by a comma and a space. (So, the address for Carnegie Mellon should exactly be “Carnegie Mellon University, PA”.) Also add two new columns to the data frame called Lon and Lat, initially filled with NA values. Using a for() loop, fill out the Lon and Lat columns, by calling geocode() with location equal to the appropriate element of the Address column and source="dsk". Importantly, wrap each call to geocode() with a tryCatch() statement, so that you catch any errors that might occur (as alluded to above); in the case of an error, set the longitude and latitude values to be 999. (Hint/warning: this could take a while—even up to 30 minutes!—so it is highly recommended that you cache your results and/or save them directly to a csv file.)

    After the for() loop has finished running, report: how many warnings were encountered? How many errors were encountered? Check that you computed the same longitude and latitude for Carnegie Mellon as you did in the last question; also, report the longitudes and latitudes for Harvard, University of Pittsburgh, and Texas A&M at College Station.

answers$PartC.Q2.hvd.coords = c() # numeric vector of length 2
answers$PartC.Q2.pit.coords = c() # numeric vector of length 2 
answers$PartC.Q2.tam.coords = c() # numeric vector of length 2
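The tryCatch() pattern asked for above can be sketched with a stand-in function in place of geocode(); lookup() here is hypothetical and always fails, just to exercise the error handler:

```r
# Hypothetical stand-in for geocode() that always throws an error
lookup = function(address) stop("service unavailable")
coords = tryCatch(lookup("Carnegie Mellon University, PA"),
                  error = function(e) c(lon = 999, lat = 999))
coords  # lon = 999, lat = 999: the sentinel values used on error
```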
  3. Are there any longitude and latitude pairs that are duplicated, across rows of dat.college? If so, how many duplicates are there? (To be clear, if 3 rows have the same longitude and latitude pairs, then we will say there are 3 duplicates.) List the longitude and latitude pair with the most duplicates, and report the addresses of the associated colleges. This happened because the “dsk” source isn’t always very accurate. The “google” source, which is the default in geocode(), is more accurate, but it also has a limit to the number of queries you can make per day (and it generally takes longer). For all the college addresses associated with the maximum duplicate longitude and latitude pair, recompute their longitude and latitude coordinates using geocode() with source="google". Are the results all distinct now?
answers$PartC.Q3.num.duplicates = 0 # numeric
answers$PartC.Q3.coords.with.most.dup = c() # numeric vector of length 2
answers$PartC.Q3.addresses.with.most.dup = c() # string vector
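One way to count duplicated coordinate pairs is to paste each longitude/latitude pair into a single string per row and then tabulate; a toy sketch with made-up coordinates:

```r
lon = c(-79.9, -79.9, -71.1, -79.9)  # made-up coordinates
lat = c(40.4, 40.4, 42.4, 40.4)
pair = paste(lon, lat)   # one string per row, combining both coordinates
counts = table(pair)     # number of rows sharing each pair
max(counts)              # 3: the most common pair appears in 3 rows
```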
  4. With the map() function, plot a map of the US, and draw the schools in dat.college on top, according to their longitude and latitude values, as small filled circles in a color of your choosing. What do you notice about the density of the school locations?

  5. Let’s take a closer look at the colleges in Pennsylvania. With the map() function, plot a map of Pennsylvania. Then, on top, draw only the schools in dat.college that are in Pennsylvania, according to the longitude and latitude values, again as filled circles in a color of your choosing. But this time, set the cex parameter to scale with the size of the school, i.e., the number of full-time students: set the smallest school to have a value of cex=0.1, the largest school to have a value of cex=3, and for all other schools, set their cex value by the appropriate interpolation. (E.g., a school that is exactly halfway in between the smallest and largest school would receive a cex value of (0.1 + 3)/2 = 1.55.) What’s the big dot in the middle of the state? Identify Carnegie Mellon on your plot by putting a label of “CMU” next to its point.

Part D: Modeling (35 points)

  1. Plot a histogram of the graduation rates in the dat.college data frame, found in the Graduation variable, with an appropriate x-axis label and title, and 30 breaks. What do you notice about its shape—is it roughly Gaussian? Also, what do you notice about its largest value? Explain why such a large value does not make sense and must be an outlier. Report the name of the corresponding school. Then, remove the corresponding row from dat.college, and replot the histogram.
answers$PartD.Q1.grad.outlying.school = "" # string
  2. Recall that the dat.college data frame contains variables for in-state tuition and out-of-state tuition, called InStateTuition and OutStateTuition, respectively. For public schools, we expect these values to be different: public schools typically cost more for out-of-state students. For private schools, we typically expect these values to be the same. Check: are these two variables the same for all private schools in dat.college? If not, for which private schools are they different? For each one in which there is a difference, report the name of the school, the in-state tuition, and the out-of-state tuition.
answers$PartD.Q2.in.out.diff.priv.schools = c() # string vector
answers$PartD.Q2.in.out.diff.in.tuition = c() # numeric vector
answers$PartD.Q2.in.out.diff.out.tuition = c() # numeric vector
  3. We want to model the graduation rate as a function of cost and other variables. First, add a new column called Cost to the dat.college data frame, given by summing OutStateTuition, RoomBoardCosts, and BookCosts. Note that we are using OutStateTuition so that we can more meaningfully compare public and private schools. Report the school names and costs for the 5 schools with the highest cost and the 5 schools with the lowest cost.
answers$PartD.Q3.cost.hi5.schools = c() # string vector of length 5
answers$PartD.Q3.cost.hi5.costs = c() # numeric vector of length 5
answers$PartD.Q3.cost.lo5.schools = c() # string vector of length 5
answers$PartD.Q3.cost.lo5.costs = c() # numeric vector of length 5
  4. Regress Graduation on Cost, using the data in dat.college. Report the estimated coefficients (intercept and coefficient of Cost), their standard errors, and associated p-values. Does Cost appear to have a significant effect? Make a scatter plot of Graduation versus Cost, with appropriate axes labels and an appropriate title, and draw your estimated regression line on top. Then plot diagnostics for the fitted linear model, and briefly comment on them. Finally, predict from your fitted linear model what the graduation rate would be for a school with an annual cost (in terms of out-of-state tuition plus room and board plus books) of 15,000 dollars.
answers$PartD.Q4.lm.coefs = c() # numeric vector of length 2
answers$PartD.Q4.lm.serrs = c() # numeric vector of length 2
answers$PartD.Q4.lm.pvals = c() # numeric vector of length 2
answers$PartD.Q4.lm.pred.grad = 0 # numeric
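The fit-then-predict workflow here can be sketched on synthetic data; the variable names and numbers below are made up, not the college data:

```r
set.seed(1)
cost = runif(50, 5000, 25000)                 # synthetic costs
grad = 20 + 0.003 * cost + rnorm(50, sd = 5)  # synthetic graduation rates
mod = lm(grad ~ cost)                         # simple linear regression
summary(mod)$coefficients                # estimates, standard errors, p-values
predict(mod, data.frame(cost = 15000))   # prediction at a new cost value
```

The diagnostics mentioned above can then be drawn with plot(mod).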
  5. Instead of just making a single prediction for the graduation rate of a school that costs 15,000 dollars per year, we can form what’s called a prediction interval from our fitted linear model. (Don’t worry—this question doesn’t require any knowledge of prediction intervals; it’s only about translating a precise mathematical idea into R code.) In mathematical terms, we can define this interval as follows. Let \(C_1,\ldots,C_n\) denote the observed costs in our data set, \(C_0\) the new cost (in our example: 15,000), and \(\hat{G}_0\) the predicted graduation rate at \(C_0\) from the fitted linear model. Then the 90% prediction interval is \[ \big[\hat{G}_0 - z_{0.95} \mathrm{se}(\hat{G}_0), \hat{G}_0 + z_{0.95} \mathrm{se}(\hat{G}_0)\big] \] where \(z_{0.95}\) is the 0.95 quantile of the standard normal distribution, and \[ \mathrm{se}(\hat{G}_0) = \hat\sigma \bigg( 1 + \frac{1}{n} + \frac{(C_0-\bar{C})^2} {\sum_{i=1}^n (C_i - \bar{C})^2} \bigg)^{1/2}. \] To explain the rest of the notation, here \(\hat\sigma\) is the estimated standard deviation of the errors, and \(\bar{C}\) is the average of \(C_1,\ldots,C_n\).

    Note that you can find \(\hat\sigma\) from the summary of the fitted linear model object; report this value. Then, compute and report \(\mathrm{se}(\hat{G}_0)\) according to the above formula, when \(C_0=15,000\). Finally, compute and report the prediction interval around \(\hat{G}_0\) according to the above formula.

answers$PartD.Q5.sigma = 0 # numeric 
answers$PartD.Q5.se.g0 = 0 # numeric 
answers$PartD.Q5.pred.int = c() # numeric vector of length 2
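The prediction-interval formula above translates almost line by line into R; a sketch on synthetic data (not the college data), with \(\hat\sigma\) read off the model summary:

```r
set.seed(1)
C = runif(50, 5000, 25000)              # synthetic costs
G = 20 + 0.003 * C + rnorm(50, sd = 5)  # synthetic graduation rates
mod = lm(G ~ C)
C0 = 15000
G0.hat = predict(mod, data.frame(C = C0))  # predicted value at C0
sigma.hat = summary(mod)$sigma             # estimated error std deviation
se.G0 = sigma.hat * sqrt(1 + 1/length(C) +
                         (C0 - mean(C))^2 / sum((C - mean(C))^2))
# qnorm(0.95) is the 0.95 standard normal quantile, z_{0.95}
interval = c(G0.hat - qnorm(0.95) * se.G0, G0.hat + qnorm(0.95) * se.G0)
```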
  6. Is your fitted linear model appropriate for making predictions about graduation rates in 2016, based on raw costs in the year 2016? What could possibly go wrong? Give a numeric example to support your answer.

  7. After having adjusted for cost, is public/private an important variable in predicting graduation rates? To answer this, regress Graduation on Cost and Pub.Priv, using the data in dat.college. You’ll need to make sure that Pub.Priv is treated as a factor (rather than numeric) variable; one way to do this is to convert it as such in the dat.college data frame before running the regression; there are other ways too. What is the coefficient of Pub.Priv? Is the sign what you’d expect? And what is the p-value associated with this coefficient?

answers$PartD.Q7.lm.pub.priv.coef = 0 # numeric
answers$PartD.Q7.lm.pub.priv.pval = 0 # numeric
  8. After having adjusted for cost, is the acceptance rate an important variable in predicting graduation rates? To answer this, regress Graduation on Cost and PercentAccepted, using the data in dat.college. What is the coefficient of PercentAccepted? Is the sign what you’d expect? And what is the p-value associated with this coefficient?
answers$PartD.Q8.lm.perc.appt.coef = 0 # numeric
answers$PartD.Q8.lm.perc.appt.pval = 0 # numeric
  9. Now, to allow for state-specific effects in the linear model, regress Graduation on Cost, PercentAccepted, and State, using the data in dat.college. How many coefficients are there in the fitted linear model (including the intercept), and is this what you would expect? What is the base category, used by the fitted linear model, for the categorical variable State? According to their p-values, are the majority of coefficients for the levels of State significant?

  10. It appears as if many state-level effects are highly significant according to their coefficients; but does the linear model with state-specific effects actually predict better than the linear model without them? To answer this, we’ll perform sample-splitting, splitting the rows of our data frame dat.college into two data frames, called dat.train and dat.test. On the training data frame, dat.train, we’ll fit two linear models: M1, where we’ll regress Graduation on Cost and PercentAccepted; and M2, where we’ll regress Graduation on Cost, PercentAccepted, and State. Then we’ll use each fitted linear model, M1 and M2, to predict the graduation rates on the testing data frame, dat.test, and we’ll compare the average squared prediction error incurred by M1 to that incurred by M2. The winner will be declared the more effective predictive model. (Hint: for more details on this general setup, refer to the “Training and Testing Errors” and “Sample-Splitting and Cross-Validation” mini-lectures.)

    Usually, we can just divide the rows of dat.college randomly into two halves, to form the training and testing data sets. There is one subtlety here. Because the model M2 uses state-specific effects, we have to make sure that the training and testing data frames have roughly equal amounts of data from each state. Therefore, to form dat.train and dat.test, you should loop over each state, and randomly assign (roughly) half of its rows to dat.train and the rest to dat.test. This can be done with a for() loop, or an apply() statement; either is fine. (Hint: in addition to roughly balancing the data between training and testing sets, it is very important that the training data set has at least one row for each state; otherwise you’ll get errors when you try to make predictions on the test set.)

    After forming dat.train and dat.test, demonstrate that they have roughly equal amounts of data from each state. Then perform the training and testing procedure described above, and report the average prediction error of M1 and M2. Which is better, i.e., do the state-specific effects actually help in predicting graduation rates?
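The stratified split described above can be sketched with split() and sample() on a toy data frame; the states and group sizes here are made up (note that sample() misbehaves on a length-1 vector, so states with a single row need extra care):

```r
set.seed(1)
df = data.frame(state = rep(c("PA", "NY", "CA"), times = c(6, 4, 2)))
# for each state, send (roughly) half of its row indices to training
train.rows = unlist(lapply(split(1:nrow(df), df$state), function(ii) {
  sample(ii, ceiling(length(ii) / 2))  # ceiling() keeps >= 1 row per state
}))
test.rows = setdiff(1:nrow(df), train.rows)
```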

# IMPORTANT FINAL STEP: change the filename below to your andrewID (or your
# partner's), uncomment the call to saveRDS() so that when you knit, you will
# save answers as an rds file, and then submit to Blackboard along with your
# final knitted HTML document
# saveRDS(answers, file="yourAndrewIDGoesHere.rds")