Today, we will go over some ways to transform variables and increase flexibility / explanatory power of a model, and a paradigm – training/testing – for avoiding overfitting.
Execute the following code chunk to load the necessary data for this lab with the proper transformation to the Cost
variable:
library(tidyverse)
heart_disease <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/health/intro_r/heart_disease.csv") %>%
mutate(log_cost = log(Cost + 1))
To review from yesterday’s lab: This dataset consists of 788 heart disease patients (608 women, 180 men). Your goal is to predict the log_cost
column, which corresponds to the log transformation of the patient’s total cost of claims by subscriber (i.e., log_cost
is the response variable). You have access to the following explanatory variables:
Age
: Age of subscriber (years)Gender
: Gender of subscriberInterventions
: Total number of interventions or procedures carried outDrugs
: Categorized number of drugs prescribed: 0 if none, 1 if one, 2 if more than oneERVisit
: Number of emergency room visitsComplications
: Whether or not the subscriber had complications: 1 if yes, 0 if noComorbidities
: Number of other diseases that the subscriber hadDuration
: Number of days of duration of treatment conditionRun the following code to fit a model using only the Gender
variable:
gender_cost_lm <- lm(log_cost ~ Gender, data = heart_disease)
Next, use the following code to first create a column called model_preds
containing the predictions of the model above, to display the predictions of this model against the actual log_cast
, but facet
by the patient’s Gender
:
heart_disease %>%
mutate(model_preds = predict(gender_cost_lm)) %>%
ggplot(aes(x = log_cost, y = model_preds)) +
geom_point(alpha = 0.5) +
facet_wrap(~ Gender, ncol = 2) +
theme_bw() +
labs(x = "Actual log(Cost + 1)",
y = "Predicted log(Cost + 1)")
As the figure above, categorical variables make it so we are changing the intercept of our regression line. To make this more clear, view the output of the summary:
summary(gender_cost_lm)
Notice how only one coefficient is provided in addition to the intercept. This is because, by default, R
turns the categorical variables of \(m\) levels (e.g., we have 2 genders in this dataset) into \(m - 1\) indicator variables (binary with values of 1 if in that level versus 0 if not that level) for different categories relative to a baseline level. In this example, R
has created an indicator for one gender: Male
. By default, R
will use alphabetical order to determine the baseline category, which in this example is the gender Female
. The values for the coefficient estimates indicate the expected change in the response variable relative to the baseline. In other words, the intercept term gives us the baseline’s average y, e.g. the average log(Cost
) for male patients. This matches what you displayed in the predictions against observed log_cost
scatterplots by Gender
above.
Beware the default baseline R
picks for categorical variables! We typically want to choose the baseline level to be the group with the most observations. In this example, Female
has the most number of observations so the default was appropriate. But in general, we can change the reference level by modifying the factor
levels of the categorical variables (similar to how we reorder things in ggplot2
). For example, we can use the following code to modify the Gender
variable so that Male
is the baseline (we use fct_relevel()
to update Gender
so that Male
is the first factor level - and we do not need to modify the order of the remaining levels):
heart_disease <- heart_disease %>%
mutate(Gender = fct_relevel(Gender, "Male"))
Refit the linear regression model using Gender
above, how has the summary changed?
After you refit the model above, change the reference level back to Female
witht the following code:
heart_disease <- heart_disease %>%
mutate(Gender = fct_relevel(Gender, "Female"))
Pick a single continuous variable from yesterday, use it to replace INSERT_VARIABLE below, then run the code to fit a model with the Gender
included:
x_gender_cost_lm <- lm(log_cost ~ Gender + INSERT_VARIABLE, data = heart_disease)
Create scatterplots with your predictions on the y-axis, your INSERT_VARIABLE on the x-asis, and color by Gender
. What do you observe?
Another categorical we have access to is the Drugs
variable, which is currently coded as numeric. We can first use the fct_recode()
function to modify the Drugs
variable so that the integers are relabeled:
heart_disease <- heart_disease %>%
mutate(Drugs = fct_recode(as.factor(Drugs),
"None" = "0",
"One" = "1",
"> One" = "2"))
Run the following code to fit a model using only the Drugs
variable:
drugs_cost_lm <- lm(log_cost ~ Drugs, data = heart_disease)
Repeat the same from above that you considered for the Gender
variable, viewing the predictions facetted by Drugs
and assess the summary()
output. Do you think an appropriate reference level was used? (HINT: Use the table()
function on the Drugs
variable to view the overall frequency of each level and determine if the most frequent level was used as the reference.)
Given the similar values, we may decide to collapse the level of One
and > One
into a single level >= One
. We can easily collapse the levels together into a smaller number of categories using fct_collapse()
:
heart_disease <- heart_disease %>%
mutate(drugs_group = fct_collapse(Drugs,
None = c("None"),
`>= One` = c("One", "> One")))
Refit the model with this new drugs_group
variable, but assign it to a different name, e.g. drugs_group_cost_lm
. What changed in the summary?
Remember with ggplot2
you can directly compute and plot the results from running linear regression using geom_smooth()
or stat_smooth()
and specifying that method = "lm"
. Try running the following code (replace INSERT_VARIABLE
!) to generate the linear regression fits with geom_smooth
versus your own model’s predictions (note the different y
mapping for the point versus smooth layers):
heart_disease %>%
mutate(model_preds = predict(x_gender_cost_lm)) %>%
ggplot(aes(x = INSERT_VARIABLE,
color = Gender)) +
geom_point(aes(y = model_preds),
alpha = 0.5) +
theme_bw() +
facet_wrap(~ Gender, ncol = 3) +
labs(x = "INSERT YOUR LABEL HERE",
y = "Predicted log(Cost + 1)") +
geom_smooth(aes(y = log_cost),
method = "lm")
The geom_smooth()
regression lines do NOT match! This is because ggplot2
is fitting separate regressions for each gender, meaning the slope for the continuous variable on the x-axis is changing for each gender We can match the output of the geom_smooth()
results with interactions. We can use interaction terms to build more complex models. Interaction terms allow for a different linear model to be fit for each category; that is, they allow for different slopes across different categories. If we believe relationships between continuous variables, and outcomes, differ across categories, we can use interaction terms to better model these relationships.
To fit a model with an interaction term between two variables, include the interaction via the *
operator like so:
gender_int_cost_lm <- lm(log_cost ~ Gender + INSERT_VARIABLE +
Gender * INSERT_VARIABLE,
data = heart_disease)
Replace the predictions in the previous plot’s mutate
code with this interaction model’s predictions. How do they compare to the results from geom_smooth()
now?
You can model interactions between any type of variables using the *
operator, feel free to experiment on your different possible continuous variables.
Another way to increase the explanatory power of your model is to include transformations of continuous variables. For instance you can directly create a column that is a square of a variable with mutate()
and then fit the regression with the original variable and its squared term:
heart_disease <- heart_disease %>%
mutate(duration_squared = Duration^2)
squared_duration_lm <- lm(log_cost ~ Duration + duration_squared,
data = heart_disease)
summary(squared_duration_lm)
What are some difficulties with interpreting this model fit? View the predictions for this model or other covariates you squared.
The poly()
function allows us to build higher-order polynomial transformations of variables easily. Run the following code chunk to fit a 9th-order polynomial model (i.e. \(Y = \beta_0 + \beta_1x + \beta_2x^2 + \ldots + \beta_9x^9\)) between log_cost
and Duration
.
poly_nine_duration_lm <- lm(log_cost ~ poly(Duration, 9),
data = heart_disease)
summary(poly_nine_duration_lm)
Do you think this is appropriate, how did this change your predictions compared to the previous plot or when only using the variable without any transformation?
As we’ve seen, using transformations such as higher-order polynomials may decrease the interpretability and increase the potential for overfitting associated with our models; however, they can also dramatically improve the explanatory power.
We need a way for making sure our more complicated models have not overly fit to the noise present in our data. Another way of saying this is that a good model should generalize to a different sample than the one on which it was fit. This intuition motivates the idea of training/testing. We split our data into two parts, use one part – the training set – to fit our models, and the other part – the testing set – to evaluate our models. Any model which happens to fit to the noise present in our training data should perform poorly on our testing data.
The first thing we will need to do is split our sample. Run the following code chunk to divide our data into two halves, which we will refer to as a training set and a test set. Briefly summarize what each line in the code chunk is doing.
n_patients <- nrow(heart_disease)
train_i <- sample(n_patients, n_patients / 2, replace = FALSE)
test_i <- (1:n_patients)[-train_i]
heart_train <- heart_disease[train_i,]
heart_test <- heart_disease[test_i,]
We will now compare three candidate models for predicting log_cost
using Gender
and Duration
. We will fit these models on the training data only, ignoring the testing data for the moment. Run the below two code chunks to create two candidate models:
candidate_model_1 <- lm(log_cost ~ poly(Duration, 2) + Gender +
Gender * poly(Duration, 2),
data = heart_train)
candidate_model_2 <- lm(log_cost ~ poly(Duration, 2) + Gender,
data = heart_train)
Using summary()
, which of these models has more explanatory power according to the training data? Which of the models is less likely to overfit?
Fit another model to predict log_cost
using a different set of variables / polynomials.
Now that we’ve built our candidate models, we will evaluate them on our test set, using the criterion of mean squared error (MSE). Run the following code chunk to compute, on the test set, the MSE of predictions given by the first model compared to the actual log_cost
.
model_1_preds <- predict(candidate_model_1, newdata = heart_test)
model_1_mse <- mean((model_1_preds - heart_test$log_cost)^2)
Do this for each of your candidate models. Compare the MSE on the test set, which model performed best (lowest test MSE)?