Name:
Andrew ID:
Collaborated with:

This lab is to be completed in class. You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as an Rmd file on Blackboard, by 11:59pm on the day of the lab.

There are Homework 9 questions dispersed throughout. These must be written up in a separate Rmd document, together with all Homework 9 questions from other labs. Your homework writeup must start like this one: by listing your name, Andrew ID, and who you collaborated with. You must submit your own homework as a knit HTML file on Blackboard, by 11:59pm on Tuesday November 8. This document contains 30 of the 45 total points for Homework 9.

Reading in, exploring wage data

Linear regression modeling

Hw9 Bonus. Show how to extract the standard errors from the output of calling summary() on wage.lm, as a numeric vector.

Hw9 Q3 (10 points). Refit a linear regression model with response variable wage and predictor variables year and age, but this time only using observations in the wage.df data frame for which the wage variable is less than or equal to 250 (note, this is measured in thousands of dollars!). (Hint: you can do this in two ways: (i) you can manually define a new data frame (say) wage.df.lt250 that only keeps observations for which the wage variable is less than or equal to 250, and call lm() on wage.df.lt250; or (ii) you can simply call lm() on wage.df with an appropriate setting of the subset argument.) Call the result wage.lm.lt250. Display a summary, reporting the coefficient estimates of year and age, and their associated standard errors and p-values. Are these coefficients different from before? Are the predictors year and age still significant? Finally, plot diagnostics. Do the “Residuals vs Fitted”, “Normal Q-Q”, “Scale-Location”, and “Residuals vs Leverage” plots still have the same problems?
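The two routes in the hint can be sketched on a toy example. The block below is only an illustration on R's built-in mtcars data, not on wage.df; the names mtcars.lt25, fit.manual, and fit.subset, and the cutoff of 25, are made up for the sketch.

```r
# Toy illustration on the built-in mtcars data (not the lab's wage data).

# Way (i): build the filtered data frame by hand, then fit on it.
mtcars.lt25 <- mtcars[mtcars$mpg <= 25, ]
fit.manual <- lm(mpg ~ wt + hp, data = mtcars.lt25)

# Way (ii): let lm() do the filtering through its subset argument.
fit.subset <- lm(mpg ~ wt + hp, data = mtcars, subset = mpg <= 25)

# Both routes use the same rows, so the coefficient estimates agree.
all.equal(coef(fit.manual), coef(fit.subset))

# Summary with estimates, standard errors, and p-values.
summary(fit.subset)

# The four standard diagnostic plots in a 2 x 2 grid.
par(mfrow = c(2, 2))
plot(fit.subset)
```

The subset expression is evaluated inside the data frame passed as data, so column names can be used directly without the `$` prefix.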

Finally, use your fitted linear model wage.lm.lt250 to predict: (a) what a 30 year old person should be making this year; (b) what President Obama should be making this year; (c) what you should be making 5 years from now. Comment on the results: which do you think is the most accurate prediction?
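Predictions from a fitted lm object come from predict() with a newdata data frame. A minimal sketch, again on mtcars rather than the wage data; the objects fit, new.cars, and pred are hypothetical names chosen for the example.

```r
# Toy illustration on mtcars; wt and hp stand in for the lab's predictors.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# newdata needs one column per predictor, named exactly as in the formula.
new.cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
pred <- predict(fit, newdata = new.cars)
pred

# Compare the new points against the observed predictor ranges:
# predictions outside these ranges are extrapolations.
range(mtcars$wt)
range(mtcars$hp)
```

Checking the observed ranges is one way to judge how far a prediction extrapolates beyond the data the model was fit on.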

Logistic regression modeling

Hw9 Q4 (10 points). In general, one must be careful fitting a logistic regression model on categorical predictors. We require that the following be true: for each level of the categorical predictor, we have observations at this level for which the response variable is 0, and for which the response variable is 1. In the context of our problem, this means that: for each level of the education variable, we should have people with this education level that have a wage less than or equal to 250, and also people with this education level that have a wage above 250. Which levels of education fail to meet this criterion? (Hint: there is at least one.) Let’s call these “incomplete” levels, and the others the “complete” levels. Refit the logistic regression model in the last question, with the same response and predictors, but only on data in wage.df corresponding to the complete education levels. (Hint: as before, there are two ways: (i) manually define a new data frame; or (ii) use the original data frame and the subset argument of glm().) Display a summary, and comment on the differences from the summary for the logistic regression model fitted in the last question. Did any predictors become more significant, according to their p-values?
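The completeness criterion can be checked with a cross-tabulation. The sketch below uses a small made-up data frame (toy.df, with a hypothetical factor group and binary response y), not the wage data, so the levels it flags say nothing about education.

```r
# Toy illustration: a hypothetical data frame, not the lab's wage data.
toy.df <- data.frame(
  group = factor(c("a", "a", "a", "b", "b", "c", "c", "c")),
  y     = c(0, 1, 0, 0, 0, 1, 0, 1)
)

# Cross-tabulate level by response; a zero cell marks an incomplete level.
counts <- table(toy.df$group, toy.df$y)
counts

# A level is complete when both response values occur at least once.
complete.levels <- rownames(counts)[apply(counts > 0, 1, all)]
complete.levels

# Fit only on the complete levels, using the subset argument of glm().
toy.glm <- glm(y ~ group, data = toy.df, family = "binomial",
               subset = group %in% complete.levels)
summary(toy.glm)
```

Here level "b" has only zeros in the response, so it is dropped before fitting.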

Generalized additive modeling

Hw9 Q5 (10 points). Install the gam package, if you haven’t already, and load it into your R session with library(gam). Fit a generalized additive model, using gam() with family="binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year, age, and education; as in the last question, only use observations in wage.df corresponding to the complete education levels. Also, in the call to gam(), allow for age to have a nonlinear effect by using s() (leave year and education alone, and they will have the default, linear effects). Call the result wage.gam. Display a summary with summary(). Is the age variable more or less significant, in terms of its p-value, compared to what you saw in the logistic regression model fitted in the last question? Also, plot the effects fit to each predictor, using plot(). Comment on each plot: does the fitted effect make sense to you? In particular, is there a strong nonlinearity associated with the effect of age, and does this make sense?

Finally, using wage.gam, predict the probability that a 30 year old person, who earned a Ph.D., will make over $250,000 in 2016.
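When predicting from a binomial model, the scale of the prediction matters. The sketch below uses a plain glm() fit on mtcars as a stand-in, since it needs no extra packages; it assumes the predict method for gam objects accepts the same type argument, which is worth confirming in its help page. The names fit, new.car, log.odds, and p are hypothetical.

```r
# Toy stand-in: a logistic regression on mtcars instead of a gam on wage data.
fit <- glm(vs ~ mpg, data = mtcars, family = "binomial")
new.car <- data.frame(mpg = 22)

# Default prediction is on the link scale (log-odds), not a probability.
log.odds <- predict(fit, newdata = new.car)
log.odds

# type = "response" returns the predicted probability instead.
p <- predict(fit, newdata = new.car, type = "response")
p

# The two are related by the inverse logit, plogis().
all.equal(unname(p), plogis(unname(log.odds)))
```

Forgetting type = "response" is a common source of predictions that fall outside the interval from 0 to 1.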

Hw9 Bonus. For a 30 year old person who earned a Ph.D., how long does he/she have to wait until there is a predicted probability of at least 13% that he/she makes over $250,000 in that year? Plot his/her probability of earning over $250,000 over the future years. Is this strictly increasing?