Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 9 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 9 questions from other labs. Your homework writeup must start the same way as this one, by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 11:59pm on Tuesday November 8. This document contains 30 of the 45 total points for Homework 9.

Reading in, exploring wage data

A data table of dimension 3000 x 11, containing demographic and economic variables measured on individuals living in the mid-Atlantic region, is up at http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/wage.csv. (This has been adapted from the book An Introduction to Statistical Learning.) Load this data table into your R session with read.csv() and save the resulting data frame as wage.df. (Hint: the first several lines of the linked file just explain the nature of the data; open up the file in your web browser, and count how many lines must be skipped before getting to the data; then use an appropriate setting for the skip argument to read.csv().) Check that wage.df has the right dimensions, and display its first 5 rows.
Identify all of the factor variables in wage.df, set up a plotting grid of appropriate dimensions, and then plot all of these factor variables, with appropriate titles. What do you notice about the distributions?
Identify all of the numeric variables in wage.df, set up a plotting grid of appropriate dimensions, and then plot all of these numeric variables, with appropriate titles and x-axis labels. What do you notice about the distributions? In particular, what do you notice about the distribution of the wage column? Does it appear to be unimodal (having a single mode)? Does what you see make sense?
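The mechanics in the questions above (skipping leading lines, picking out columns by type, setting up a plotting grid) can be sketched on a small made-up file. Everything here is hypothetical: the file contents, the names toy.file and toy.df, and the choice of skip = 2 are for illustration only and say nothing about wage.csv. One assumption worth flagging: since R 4.0.0, read.csv() no longer converts text columns to factors by default, so the sketch sets stringsAsFactors = TRUE explicitly.

```r
# Made-up file: two lines of notes, then a header row and three data rows
toy.file <- tempfile(fileext = ".csv")
writeLines(c("Notes: this toy file starts with two lines of text.",
             "They must be skipped to reach the header row.",
             "height,group,weight",
             "1.2,a,10",
             "3.4,b,20",
             "5.6,a,30"), toy.file)

# skip = 2 drops the notes; stringsAsFactors = TRUE turns text columns into
# factors (the default before R 4.0.0, but not since)
toy.df <- read.csv(toy.file, skip = 2, stringsAsFactors = TRUE)
dim(toy.df)
head(toy.df, 5)

# Logical vectors flagging the factor columns and the numeric columns
is.fac <- sapply(toy.df, is.factor)
is.num <- sapply(toy.df, is.numeric)

# One row of panels, one histogram per numeric column
pdf(NULL)  # null graphics device, only so the sketch runs non-interactively
par(mfrow = c(1, sum(is.num)))
for (v in names(toy.df)[is.num]) {
  hist(toy.df[[v]], main = paste("Histogram of", v), xlab = v)
}
dev.off()
```

When working interactively, the pdf(NULL) and dev.off() lines can be dropped. For factor columns, calling plot() on the column draws a bar chart of the level counts, so the same loop works with is.fac in place of is.num.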

Linear regression modeling

Fit a linear regression model, using lm(), with response variable wage and predictor variables year and age, using the wage.df data frame. Call the result wage.lm. Display the coefficient estimates, using coef(), for year and age. Do they have the signs you would expect, i.e., can you explain their signs? Display a summary, using summary(), of this linear model. Report the standard errors and p-values associated with the coefficient estimates for year and age. Do both of these predictors appear to be significant?
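As a rough illustration of the lm(), coef(), and summary() workflow, here is a minimal sketch on simulated data. The data frame toy.df, its columns x1, x2, and y, and the true coefficients are all made up; nothing here uses or describes wage.df.

```r
set.seed(1)
n <- 200

# Simulated predictors and a response with known slopes 0.5 and -1.5
toy.df <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
toy.df$y <- 2 + 0.5 * toy.df$x1 - 1.5 * toy.df$x2 + rnorm(n)

# Fit the linear model, then look at the estimates and the full summary
toy.lm <- lm(y ~ x1 + x2, data = toy.df)
coef(toy.lm)
summary(toy.lm)
```

Because the data were simulated, the estimated signs can be checked against the slopes used to generate y. With real data, the expected signs have to come from subject-matter reasoning instead.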

Hw9 Bonus. Show how to extract the standard errors from the output of calling summary() on wage.lm, as a numeric vector.

Plot diagnostics of the linear model fit in the previous question, using plot() on wage.lm. Comment on the “Residuals vs Fitted”, “Scale-Location”, and “Residuals vs Leverage” plots—are there any groups of points away from the main bulk of points along the x-axis? Comment on the “Normal Q-Q” plot—do the standardized residuals lie along the line $y=x$? What do you think is causing the discrepancies you are (should be) seeing in these plots? (Hint: look back at the histogram of the wage column you plotted above.)
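A minimal sketch of drawing the four default diagnostic plots on one page, using a simulated fit (the objects x, y, and toy.lm are hypothetical). Since this toy model is correctly specified, its plots show roughly what a well-behaved fit looks like, which can serve as a baseline for comparison.

```r
set.seed(2)

# Simulated data that satisfy the linear model assumptions
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.3)
toy.lm <- lm(y ~ x)

pdf(NULL)  # null graphics device, only so the sketch runs non-interactively
par(mfrow = c(2, 2))  # 2 x 2 grid holds the four default plots
plot(toy.lm)
dev.off()
```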

Hw9 Q3 (10 points). Refit a linear regression model with response variable wage and predictor variables year and age, but this time only using observations in the wage.df data frame for which the wage variable is less than or equal to 250 (note, this is measured in thousands of dollars!). (Hint: you can do this in two ways: (i) you can manually define a new data frame (say) wage.df.lt250 that only keeps observations for which the wage variable is less than or equal to 250, and call lm() on wage.df.lt250; or (ii) you can simply call lm() on wage.df with an appropriate setting to the subset argument.) Call the result wage.lm.lt250. Display a summary, reporting the coefficient estimates of year and age, and their associated standard errors and p-values. Are these coefficients different than before? Are the predictors year and age still significant? Finally, plot diagnostics. Do the “Residuals vs Fitted”, “Normal Q-Q”, “Scale-Location”, and “Residuals vs Leverage” plots still have the same problems?

Finally, use your fitted linear model wage.lm.lt250 to predict: (a) what a 30 year old person should be making this year; (b) what President Obama should be making this year; (c) what you should be making 5 years from now. Comment on the results: which do you think is the most accurate prediction?
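The two subsetting routes in the hint, and prediction at new predictor values, might look like the following on simulated data. All names (toy.df, toy.df.small, fit.manual, fit.subset, new.df) and the cutoff of 15 are hypothetical.

```r
set.seed(3)
n <- 300
toy.df <- data.frame(x = runif(n, 0, 10))
toy.df$y <- 1 + 2 * toy.df$x + rnorm(n)

# Route (i): build the reduced data frame by hand, then fit on it
toy.df.small <- toy.df[toy.df$y <= 15, ]
fit.manual <- lm(y ~ x, data = toy.df.small)

# Route (ii): keep the full data frame and let lm() filter the rows
fit.subset <- lm(y ~ x, data = toy.df, subset = y <= 15)

# Both routes should give the same coefficient estimates
all.equal(coef(fit.manual), coef(fit.subset))

# Predictions at new x values: newdata needs a column for each predictor,
# named exactly as in the model formula
new.df <- data.frame(x = c(2, 5))
predict(fit.subset, newdata = new.df)
```

Note that predict() will happily return a value for any x, including values far outside the range seen in the data, so how far to trust a prediction is a separate judgment.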

Logistic regression modeling

Fit a logistic regression model, using glm() with family="binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year and age, using the wage.df data set. (Hint: you can set this up in two ways: (i) you can manually define a new column (say) wage.high in the wage.df data frame to be the indicator that the wage column is larger than 250; or (ii) you can define an indicator variable “on-the-fly” in the call to glm() with an appropriate usage of I().) Call the result wage.glm. Display a summary, reporting the coefficient estimates for year and age, their standard errors, and associated p-values. Are the predictors year and age both significant?
Refit a logistic regression model with the same response variable as in the last question, but now with predictors year, age, and education. Note that the third predictor is stored as a factor variable, which we call a categorical variable (rather than a continuous variable, like the first two predictors) in the context of regression modeling. Display a summary. What do you notice about the predictor education—how many coefficients are associated with it in the end? Does this make sense? (Hint: how many levels does education have?)
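A minimal sketch of a logistic regression with an indicator response built on the fly with I(), and with one factor predictor, using simulated data. The names toy.df, x, g, y, and toy.glm, and the threshold 11, are made up for illustration.

```r
set.seed(4)
n <- 500

# One continuous predictor, one three-level factor, one continuous outcome
toy.df <- data.frame(x = rnorm(n),
                     g = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
toy.df$y <- 10 + 2 * toy.df$x + rnorm(n)

# I(y > 11) defines the 0/1 response inside the formula, so no extra column
# needs to be added to toy.df
toy.glm <- glm(I(y > 11) ~ x + g, data = toy.df, family = "binomial")
summary(toy.glm)

# Compare the number of factor levels with the coefficient names
nlevels(toy.df$g)
names(coef(toy.glm))
```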

Hw9 Q4 (10 points). In general, one must be careful fitting a logistic regression model on categorical predictors. We require that the following be true: for each level of the categorical predictor, we have observations at this level for which the response variable is 0, and for which the response variable is 1. In the context of our problem, this means that: for each level of the education variable, we should have people with this education level that have a wage less than or equal to 250, and also people with this education level that have a wage above 250. Which levels of education fail to meet this criterion? (Hint: there is at least one.) Let’s call these “incomplete” levels, and the others the “complete” levels. Refit the logistic regression model in the last question, with the same response and predictors, but only on data in wage.df corresponding to the complete education levels. (Hint: as before, there are two ways: (i) manually define a new data frame; or (ii) use the original data frame and the subset argument of glm().) Display a summary, and comment on the differences seen to the summary for the logistic regression model fitted in the last question. Did any predictors become more significant, according to their p-values?
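One possible way to check the criterion above is to cross-tabulate the factor against the 0/1 response and look for empty cells. The sketch below does this on a tiny hand-made data frame. The names toy.df, g, y, counts, complete, and toy.df.complete are hypothetical, and level "c" has been deliberately constructed to have no observations with y equal to 1.

```r
toy.df <- data.frame(
  g = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
  y = c(0, 1, 0, 1, 1, 0, 0, 0, 0)
)

# Rows are levels of g, columns are the response values 0 and 1
counts <- table(toy.df$g, toy.df$y)
counts

# A level is "complete" if every cell in its row is nonzero
complete <- rownames(counts)[apply(counts > 0, 1, all)]
complete

# Keep only the rows belonging to complete levels
toy.df.complete <- toy.df[toy.df$g %in% complete, ]
```

The same %in% condition could instead be passed as the subset argument of glm(), which avoids creating a second data frame.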

Generalized additive modeling

Hw9 Q5 (10 points). Install the gam package, if you haven’t already, and load it into your R session with library(gam). Fit a generalized additive model, using gam() with family="binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year, age, and education; as in the last question, only use observations in wage.df corresponding to the complete education levels. Also, in the call to gam(), allow for age to have a nonlinear effect by using s() (leave year and education alone, and they will have linear effects by default). Call the result wage.gam. Display a summary with summary(). Is the age variable more or less significant, in terms of its p-value, compared to what you saw in the logistic regression model fitted in the last question? Also, plot the effects fit to each predictor, using plot(). Comment on each plot: does the fitted effect make sense to you? In particular, is there a strong nonlinearity associated with the effect of age, and does this make sense?

Finally, using wage.gam, predict the probability that a 30 year old person, who earned a Ph.D., will make over $250,000 in 2016.
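For orientation, here is a minimal sketch of fitting a binomial GAM with one smooth term and turning the fit into a predicted probability, on simulated data. It assumes the gam package is installed (the requireNamespace() guard simply skips the example otherwise). The names toy.df, x1, x2, toy.gam, new.df, and p.hat are hypothetical.

```r
if (requireNamespace("gam", quietly = TRUE)) {
  library(gam)
  set.seed(5)
  n <- 500

  # Simulated data: true log-odds are quadratic in x1 and linear in x2
  toy.df <- data.frame(x1 = runif(n, -2, 2), x2 = rnorm(n))
  p <- plogis(1 - toy.df$x1^2 + 0.5 * toy.df$x2)
  toy.df$y <- rbinom(n, size = 1, prob = p)

  # s(x1) requests a smooth (nonlinear) effect; x2 enters linearly
  toy.gam <- gam(y ~ s(x1) + x2, data = toy.df, family = "binomial")
  print(summary(toy.gam))

  # type = "response" returns a probability rather than log-odds
  new.df <- data.frame(x1 = 0, x2 = 1)
  p.hat <- predict(toy.gam, newdata = new.df, type = "response")
  print(p.hat)
}
```

Without type = "response", predict() returns values on the link (log-odds) scale, which are not probabilities and can fall outside the interval from 0 to 1.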

Hw9 Bonus. For a 30 year old person who earned a Ph.D., how long does he/she have to wait until there is a predicted probability of at least 13% that he/she makes over $250,000 in that year? Plot his/her probability of earning at least $250,000 over the future years—is this strictly increasing?

Lab 10f: Linear Modeling and Beyond

Statistical Computing, 36-350

Friday November 4, 2016
