Name:
Andrew ID:
Collaborated with:
This lab is to be done in class (completed outside of class time if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted PDF file on Gradescope, by Thursday 9pm, this week.
This week’s agenda: exploratory data analysis, cleaning data, fitting linear/logistic models, and using associated utility functions.
Below we read in the prostate cancer data set that we looked at in previous labs.
pros.df =
read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
dim(pros.df)
## [1] 97 9
head(pros.df, 3)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
Q1a. Define pros.df.subset to be the subset of observations (rows) of the prostate data set such that the lcp measurement is greater than the minimum value (the minimum value happens to be log(0.25), but you should not hardcode this value and should work it out from the data). As in lecture, plot histograms of all of the variables in pros.df.subset. Comment on any differences you see between these distributions and the ones in lecture.

# YOUR CODE GOES HERE
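A sketch of one possible approach (the grid dimensions and styling here are just one choice):

pros.df.subset = pros.df[pros.df$lcp > min(pros.df$lcp), ]
par(mfrow = c(3, 3), mar = c(4, 4, 2, 0.5))
for (j in 1:ncol(pros.df.subset)) {
  hist(pros.df.subset[, j], main = "", col = "lightblue",
       breaks = 20, xlab = colnames(pros.df.subset)[j])
}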
Q1b. Also as in lecture, compute the correlations between all pairs of variables in pros.df.subset. Report the two highest correlations between pairs of (distinct) variables, and also report the names of the associated variables. Are these different from the answers that were computed on the full data set?

# YOUR CODE GOES HERE
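One way to pull out the top correlations programmatically (a sketch; zeroing out the lower triangle avoids counting each pair twice):

pros.cor = cor(pros.df.subset)
pros.cor.upper = pros.cor
pros.cor.upper[lower.tri(pros.cor.upper, diag = TRUE)] = 0
ind = arrayInd(order(abs(pros.cor.upper), decreasing = TRUE)[1:2],
               dim(pros.cor.upper))
data.frame(var1 = rownames(pros.cor)[ind[, 1]],
           var2 = colnames(pros.cor)[ind[, 2]],
           cor  = pros.cor[ind])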
Q1c. Produce a heatmap of the correlation matrix of the variables in pros.df.subset. For this heatmap, use the full matrix (not just its upper triangular part). Make sure your heatmap is displayed in a sensible way and that it's clear what the variables are in the plot. For full points, create your heatmap using base R graphics (hint: the clockwise90() function from the "Plotting tools" lecture will be handy); for partial points, use an R package.

# YOUR CODE GOES HERE
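A base R sketch, assuming clockwise90() is defined as in the "Plotting tools" lecture (it rotates a matrix so that image() displays it in the usual row/column orientation); the axis placement below is one way to label the variables:

clockwise90 = function(a) { t(a[nrow(a):1, ]) }  # from lecture
pros.cor = cor(pros.df.subset)
image(clockwise90(pros.cor), col = heat.colors(12), axes = FALSE)
axis(1, at = seq(0, 1, length = ncol(pros.cor)),
     labels = colnames(pros.cor), las = 2, cex.axis = 0.7)
axis(2, at = seq(0, 1, length = nrow(pros.cor)),
     labels = rev(rownames(pros.cor)), las = 2, cex.axis = 0.7)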
Q2a. Fit, using lm(), a linear regression model of lpsa (log PSA score) on lcavol (log cancer volume). Do this twice: once with the full data set, pros.df, and once with the subsetted data, pros.df.subset. Save the results as pros.lm and pros.subset.lm, respectively. Using coef(), display the coefficients (intercept and slope) from each linear regression. Are they different?

# YOUR CODE GOES HERE
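A sketch of the two fits:

pros.lm = lm(lpsa ~ lcavol, data = pros.df)
pros.subset.lm = lm(lpsa ~ lcavol, data = pros.df.subset)
coef(pros.lm)
coef(pros.subset.lm)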
Q2b. Plot lpsa versus lcavol, using the full set of observations, in pros.df. Label the axes appropriately. Then, mark the observations in pros.df.subset by small filled red circles. Add a thick black line to your plot, displaying the fitted regression line from pros.lm. Add a thick red line, displaying the fitted regression line from pros.subset.lm. Add a legend that explains the color coding.

# YOUR CODE GOES HERE
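A sketch of the layered plot (the legend position and point sizes are one choice):

plot(pros.df$lcavol, pros.df$lpsa,
     xlab = "Log cancer volume", ylab = "Log PSA score")
points(pros.df.subset$lcavol, pros.df.subset$lpsa,
       pch = 19, cex = 0.8, col = "red")
abline(pros.lm, lwd = 3, col = "black")
abline(pros.subset.lm, lwd = 3, col = "red")
legend("bottomright", pch = c(1, 19), col = c("black", "red"),
       legend = c("All observations", "Observations in pros.df.subset"))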
Q2c. Again fit a linear regression of lpsa on lcavol, but now on two different subsets of the data: the first consisting of patients with SVI, and the second consisting of patients without SVI. Display the resulting coefficients (intercept and slope) from each model, and produce a plot just like the one in the last question, to visualize the two regression lines on top of the data. Do these two regression lines differ, and in what way?

# YOUR CODE GOES HERE
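A sketch, using the subset argument of lm() to split on SVI status (the model names here are placeholders):

pros.svi.lm = lm(lpsa ~ lcavol, data = pros.df, subset = (svi == 1))
pros.nosvi.lm = lm(lpsa ~ lcavol, data = pros.df, subset = (svi == 0))
coef(pros.svi.lm)
coef(pros.nosvi.lm)
plot(pros.df$lcavol, pros.df$lpsa, xlab = "Log cancer volume",
     ylab = "Log PSA score", col = ifelse(pros.df$svi == 1, "red", "black"))
abline(pros.svi.lm, lwd = 3, col = "red")
abline(pros.nosvi.lm, lwd = 3, col = "black")
legend("bottomright", lwd = 3, col = c("red", "black"),
       legend = c("SVI", "No SVI"))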
Q3a. Next we consider a data set on wages. Read the linked wage data file into R, using read.csv(), setting stringsAsFactors = TRUE in this function call, and save the resulting data frame as wage.df. Check that wage.df has the right dimensions, and display its first 3 rows. Hint: the first several lines of the linked file just explain the nature of the data; open up the file (either directly in your web browser or after you download it to your computer), and count how many lines must be skipped before getting to the data; then use an appropriate setting for the skip argument to read.csv().

# YOUR CODE GOES HERE
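A sketch; the file name and skip count below are placeholders (point read.csv() at the actual file location, and set skip to the number of header lines you counted):

wage.df = read.csv("wage.csv", stringsAsFactors = TRUE,
                   skip = 16)  # "wage.csv" and skip = 16: placeholders
dim(wage.df)
head(wage.df, 3)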
Q3b. Identify all of the factor variables in wage.df, set up a plotting grid of appropriate dimensions, and then plot each of these factor variables, with appropriate titles. What do you notice about the distributions?

# YOUR CODE GOES HERE
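A sketch; calling plot() on a factor column produces a barplot of its level counts:

fac.names = names(wage.df)[sapply(wage.df, is.factor)]
par(mfrow = c(ceiling(length(fac.names) / 2), 2), mar = c(4, 4, 2, 0.5))
for (nm in fac.names) plot(wage.df[[nm]], main = nm)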
Q3c. Identify all of the numeric variables in wage.df, set up a plotting grid of appropriate dimensions, and then plot histograms of each of these numeric variables, with appropriate titles and x-axis labels. What do you notice about the distributions? In particular, what do you notice about the distribution of the wage column? Does it appear to be unimodal (having a single mode)? Does what you see make sense?

# YOUR CODE GOES HERE
Q3d. Fit a linear regression model, using lm(), with response variable wage and predictor variables year and age, using the wage.df data frame. Call the result wage.lm. Display the coefficient estimates, using coef(), for year and age. Do they have the signs you would expect, i.e., can you explain their signs? Display a summary, using summary(), of this linear model. Report the standard errors and p-values associated with the coefficient estimates for year and age. Do both of these predictors appear to be significant, based on their p-values?

# YOUR CODE GOES HERE
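A sketch of the fit:

wage.lm = lm(wage ~ year + age, data = wage.df)
coef(wage.lm)
summary(wage.lm)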
Q3e. Save the standard errors of the coefficient estimates for year and age into a vector called wage.se, and print it out to the console. Don't just type in the values you see from summary(); you need to determine these values programmatically. Hint: define wage.sum to be the result of calling summary() on wage.lm; then figure out what kind of R object wage.sum is, and how you can extract the standard errors.

# YOUR CODE GOES HERE
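A sketch; the coefficient table of a "summary.lm" object has a "Std. Error" column:

wage.sum = summary(wage.lm)
class(wage.sum)  # an object of class "summary.lm"
wage.se = wage.sum$coefficients[c("year", "age"), "Std. Error"]
wage.se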
Q3f. Plot diagnostics of the linear model fit, by calling plot() on wage.lm. Look at the "Residuals vs Fitted", "Scale-Location", and "Residuals vs Leverage" plots: are there any groups of points away from the main bulk of points along the x-axis? Look at the "Normal Q-Q" plot: do the standardized residuals lie along the line \(y=x\)? Note: don't worry too much if you're generally unsure how to interpret these diagnostic plots; you'll learn a lot more in your 36-401 Modern Regression course; for now, you can just answer the questions we asked. Challenge: what is causing the discrepancies you are (should be) seeing in these plots? Hint: look back at the histogram of the wage column you plotted above.

# YOUR CODE GOES HERE
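A sketch; a 2 x 2 grid shows all four diagnostic plots at once:

par(mfrow = c(2, 2), mar = c(4, 4, 2, 0.5))
plot(wage.lm)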
Q3g. Refit the linear regression model with response variable wage and predictor variables year and age, but this time only using observations in the wage.df data frame for which the wage variable is less than or equal to 250 (note, this is measured in thousands of dollars!). Call the result wage.lm.lt250. Display a summary, reporting the coefficient estimates of year and age, their standard errors, and associated p-values. Are these coefficients different than before? Are the predictors year and age still significant? Finally, plot diagnostics. Do the "Residuals vs Fitted", "Normal Q-Q", "Scale-Location", and "Residuals vs Leverage" plots still have the same problems as before?

# YOUR CODE GOES HERE
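A sketch, again using the subset argument:

wage.lm.lt250 = lm(wage ~ year + age, data = wage.df, subset = (wage <= 250))
summary(wage.lm.lt250)
par(mfrow = c(2, 2), mar = c(4, 4, 2, 0.5))
plot(wage.lm.lt250)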
Q3h. Use the linear model wage.lm.lt250 to predict: (a) what a 30 year old person should be making this year; (b) what President Trump should be making this year; (c) what you should be making 5 years from now. Comment on the results: which do you think is the most accurate prediction?

# YOUR CODE GOES HERE
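A sketch; the years and ages below are placeholders (the age in part (b), and the age and year in part (c), depend on the person and on when this is run):

predict(wage.lm.lt250,
        data.frame(year = c(2018, 2018, 2023),  # placeholder years
                   age  = c(30, 71, 28)))       # placeholder ages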
Q4a. Fit a logistic regression model, using glm() with family="binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year and age. Call the result wage.glm. Note: you can set this up in two different ways: (i) you can manually define a new column (say) wage.high in the wage.df data frame to be the indicator that the wage column is larger than 250; or (ii) you can define an indicator variable "on-the-fly" in the call to glm() with an appropriate usage of I(). Display a summary, reporting the coefficient estimates for year and age, their standard errors, and associated p-values. Are the predictors year and age both significant?

# YOUR CODE GOES HERE
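A sketch of approach (ii), defining the indicator on-the-fly with I():

wage.glm = glm(I(wage > 250) ~ year + age,
               data = wage.df, family = "binomial")
summary(wage.glm)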
Q4b. Refit the logistic regression model, now with the predictor variables being year, age, and education. Note that the third predictor is stored as a factor variable, which we call a categorical variable (rather than a continuous variable, like the first two predictors) in the context of regression modeling. Display a summary. What do you notice about the predictor education: how many coefficients are associated with it in the end? Can you explain why the number of coefficients associated with education makes sense?

# YOUR CODE GOES HERE
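A sketch (the model name wage.glm.edu is just a placeholder); since education is a factor, glm() automatically expands it into one dummy variable per level beyond the reference level:

wage.glm.edu = glm(I(wage > 250) ~ year + age + education,
                   data = wage.df, family = "binomial")
summary(wage.glm.edu)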
Q4c. For the logistic regression model in the last question to make sense, at each level of the education variable, we should have people at this education level that have a wage less than or equal to 250, and also people at this education level that have a wage above 250. Which levels of education fail to meet this criterion? Let's call these levels "incomplete", and the other levels "complete".

# YOUR CODE GOES HERE
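A sketch; cross-tabulating education against the high-wage indicator makes the incomplete levels visible as rows with a zero count:

edu.tab = table(wage.df$education, wage.df$wage > 250)
edu.tab
incomplete = rownames(edu.tab)[edu.tab[, "FALSE"] == 0 |
                               edu.tab[, "TRUE"] == 0]
incomplete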
Q4d. Refit the logistic regression model, excluding the part of wage.df that corresponds to the incomplete education levels (equivalently, using only the data from the complete education levels). Display a summary, and comment on the differences seen compared to the summary for the logistic regression model fitted in Q4b. Did any predictors become more significant, according to their p-values?

# YOUR CODE GOES HERE
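A sketch, reusing the incomplete levels found above; droplevels() removes the excluded levels from the factor, so glm() doesn't try to estimate coefficients for them (the names wage.df.complete and wage.glm.complete are placeholders):

wage.df.complete = droplevels(subset(wage.df, !(education %in% incomplete)))
wage.glm.complete = glm(I(wage > 250) ~ year + age + education,
                        data = wage.df.complete, family = "binomial")
summary(wage.glm.complete)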
Q5a. Install the gam package, if you haven't already, and load it into your R session with library(gam). Fit a generalized additive model, using gam() with family="binomial", with the response variable being the indicator that wage is larger than 250, and the predictor variables being year, age, and education; as in the last question, only use observations in wage.df corresponding to the complete education levels. Also, in the call to gam(), allow for age to have a nonlinear effect by using s() (leave year and education alone, and they will have the default linear effects). Call the result wage.gam. Display a summary with summary(). Is the age variable more or less significant, in terms of its p-value, compared to what you saw in the logistic regression model fitted in the last question? Also, plot the fitted effect for each predictor, using plot(). Comment on the plots: does the fitted effect make sense to you? In particular, is there a strong nonlinearity associated with the effect of age, and does this make sense?

# YOUR CODE GOES HERE
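A sketch, assuming the complete-levels data frame wage.df.complete from the previous question:

library(gam)
wage.gam = gam(I(wage > 250) ~ year + s(age) + education,
               data = wage.df.complete, family = "binomial")
summary(wage.gam)
par(mfrow = c(1, 3))
plot(wage.gam)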
Q5b. Using wage.gam, predict the probability that a 30 year old person, who earned a Ph.D., will make over $250,000 in 2018.

# YOUR CODE GOES HERE
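A sketch; the education level name below is a guess, so check levels(wage.df.complete$education) for the exact label of the advanced degree (Ph.D.) level:

predict(wage.gam,
        data.frame(year = 2018, age = 30,
                   education = "5. Advanced Degree"),  # label: a guess
        type = "response")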