Goals

Today, we will go over some ways to transform variables and increase the flexibility / explanatory power of a model, as well as a paradigm – training/testing – for avoiding overfitting.

Data

Execute the following code chunk to load the necessary data for this lab with the proper transformation to the Cost variable:

library(tidyverse)
heart_disease <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/health/intro_r/heart_disease.csv") %>%
  mutate(log_cost = log(Cost + 1))

To review from yesterday's lab: This dataset consists of 788 heart disease patients (608 women, 180 men). Your goal is to predict the log_cost column, which corresponds to the log transformation of the patient's total cost of claims by subscriber (i.e., log_cost is the response variable). You have access to several explanatory variables, including the Gender, Drugs, and Duration variables used in the exercises below.

Exercises

1. Linear model with one categorical variable

Run the following code to fit a model using only the Gender variable:

gender_cost_lm <- lm(log_cost ~ Gender, data = heart_disease)

Next, use the following code to create a column called model_preds containing the predictions of the model above, and then display the predictions of this model against the actual log_cost, faceted by the patient's Gender:

heart_disease %>%
  mutate(model_preds = predict(gender_cost_lm)) %>%
  ggplot(aes(x = log_cost, y = model_preds)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ Gender, ncol = 2) +
  theme_bw() +
  labs(x = "Actual log(Cost + 1)", 
       y = "Predicted log(Cost + 1)")

As the figure above shows, including a categorical variable changes the intercept of our regression line across categories. To make this clearer, view the output of the summary:

summary(gender_cost_lm)

Notice how only one coefficient is provided in addition to the intercept. This is because, by default, R turns a categorical variable with \(m\) levels (e.g., we have 2 genders in this dataset) into \(m - 1\) indicator variables (binary variables taking the value 1 if the observation is in that level and 0 otherwise), measured relative to a baseline level. In this example, R has created an indicator for one gender: Male. By default, R uses alphabetical order to determine the baseline category, which in this example is the gender Female. The coefficient estimates indicate the expected change in the response variable relative to the baseline. In other words, the intercept term gives us the baseline level's average response, i.e., the average log(Cost + 1) for female patients, while the Male coefficient gives the expected difference for male patients relative to that baseline. This matches what you displayed in the predictions against observed log_cost scatterplots by Gender above.
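
To see this encoding explicitly, you can inspect the design matrix that R builds behind the scenes. Here is a quick check using base R's model.matrix() (the GenderMale column name comes from R's default contrast naming):

# View the first rows of the design matrix: an (Intercept) column plus a
# single GenderMale indicator (1 = Male, 0 = Female, the baseline level)
model.matrix(log_cost ~ Gender, data = heart_disease) %>%
  head()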

Beware the default baseline R picks for categorical variables! We typically want to choose the baseline level to be the group with the most observations. In this example, Female has the most observations, so the default is appropriate. But in general, we can change the reference level by modifying the factor levels of the categorical variable (similar to how we reorder things in ggplot2). For example, we can use the following code to modify the Gender variable so that Male is the baseline (we use fct_relevel() to update Gender so that Male is the first factor level - and we do not need to modify the order of the remaining levels):

heart_disease <- heart_disease %>%
  mutate(Gender = fct_relevel(Gender, "Male")) 

Refit the linear regression model using Gender above. How has the summary changed?

After you refit the model above, change the reference level back to Female with the following code:

heart_disease <- heart_disease %>%
  mutate(Gender = fct_relevel(Gender, "Female")) 

2. Linear model with one categorical AND one continuous variable

Pick a single continuous variable from yesterday, use it to replace INSERT_VARIABLE below, then run the code to fit a model with Gender included:

x_gender_cost_lm <- lm(log_cost ~ Gender + INSERT_VARIABLE, data = heart_disease)

Create scatterplots with your predictions on the y-axis, your INSERT_VARIABLE on the x-axis, and color by Gender. What do you observe?
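
For example, here is a minimal plotting sketch assuming you chose Duration as your continuous variable (swap in whichever variable you picked):

# Plot the model's predictions against Duration, colored by Gender:
# the two genders share a slope but have different intercepts
heart_disease %>%
  mutate(model_preds = predict(x_gender_cost_lm)) %>%
  ggplot(aes(x = Duration, y = model_preds, color = Gender)) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  labs(x = "Duration", y = "Predicted log(Cost + 1)")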

3. Collapsing categorical variables

Another categorical variable we have access to is Drugs, which is currently coded as numeric. We can first use the fct_recode() function to modify the Drugs variable so that the integers are relabeled:

heart_disease <- heart_disease %>%
  mutate(Drugs = fct_recode(as.factor(Drugs),
                            "None" = "0",
                            "One" = "1",
                            "> One" = "2"))

Run the following code to fit a model using only the Drugs variable:

drugs_cost_lm <- lm(log_cost ~ Drugs, data = heart_disease)

Repeat the same steps from above that you used for the Gender variable: view the predictions faceted by Drugs and assess the summary() output. Do you think an appropriate reference level was used? (HINT: Use the table() function on the Drugs variable to view the overall frequency of each level and determine if the most frequent level was used as the reference.)
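
For the hint, the frequency check is a one-liner:

# Frequency of each Drugs level; the most frequent level is the natural
# choice for the reference category
table(heart_disease$Drugs)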

Given the similar values, we may decide to collapse the levels One and > One into a single level, >= One. We can easily collapse levels together into a smaller number of categories using fct_collapse():

heart_disease <- heart_disease %>%
  mutate(drugs_group = fct_collapse(Drugs,
                                       None = c("None"),
                                       `>= One` = c("One", "> One"))) 

Refit the model with this new drugs_group variable, but assign it to a different name, e.g. drugs_group_cost_lm. What changed in the summary?
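
For instance, the refit might look like the following (using the drugs_group_cost_lm name suggested above):

# Refit the model using the collapsed drugs_group variable
drugs_group_cost_lm <- lm(log_cost ~ drugs_group, data = heart_disease)
summary(drugs_group_cost_lm)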

4. Interactions

Remember with ggplot2 you can directly compute and plot the results from running linear regression using geom_smooth() or stat_smooth() and specifying that method = "lm". Try running the following code (replace INSERT_VARIABLE!) to generate the linear regression fits with geom_smooth versus your own model’s predictions (note the different y mapping for the point versus smooth layers):

heart_disease %>%
  mutate(model_preds = predict(x_gender_cost_lm)) %>%
  ggplot(aes(x = INSERT_VARIABLE, 
             color = Gender)) +
  geom_point(aes(y = model_preds),
             alpha = 0.5) +
  theme_bw() +
  facet_wrap(~ Gender, ncol = 3) +
  labs(x = "INSERT YOUR LABEL HERE", 
       y = "Predicted log(Cost + 1)") +
  geom_smooth(aes(y = log_cost),
              method = "lm") 

The geom_smooth() regression lines do NOT match! This is because ggplot2 is fitting separate regressions for each gender, meaning the slope for the continuous variable on the x-axis changes for each gender. We can match the output of the geom_smooth() results with interaction terms, which allow us to build more complex models. Interaction terms allow a different linear model to be fit for each category; that is, they allow for different slopes across different categories. If we believe the relationships between continuous variables and the outcome differ across categories, we can use interaction terms to better model these relationships.

To fit a model with an interaction term between two variables, include the interaction via the * operator like so:

gender_int_cost_lm <- lm(log_cost ~ Gender + INSERT_VARIABLE +
                       Gender * INSERT_VARIABLE, 
                   data = heart_disease)

Replace the predictions in the previous plot’s mutate code with this interaction model’s predictions. How do they compare to the results from geom_smooth() now?
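
For instance, here is a sketch of the updated plot, again assuming Duration is the continuous variable you inserted:

# Same plot as before, but using predictions from the interaction model
heart_disease %>%
  mutate(model_preds = predict(gender_int_cost_lm)) %>%
  ggplot(aes(x = Duration, 
             color = Gender)) +
  geom_point(aes(y = model_preds),
             alpha = 0.5) +
  theme_bw() +
  facet_wrap(~ Gender, ncol = 2) +
  labs(x = "Duration", 
       y = "Predicted log(Cost + 1)") +
  geom_smooth(aes(y = log_cost),
              method = "lm")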

You can model interactions between any types of variables using the * operator; feel free to experiment with your different possible continuous variables.

5. Polynomials

Another way to increase the explanatory power of your model is to include transformations of continuous variables. For instance, you can directly create a column that is the square of a variable with mutate() and then fit the regression with the original variable and its squared term:

heart_disease <- heart_disease %>%
  mutate(duration_squared = Duration^2)
squared_duration_lm <- lm(log_cost ~ Duration + duration_squared, 
                    data = heart_disease)
summary(squared_duration_lm)

What are some difficulties with interpreting this model fit? View the predictions for this model, or repeat the process for other covariates you square.
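
For example, a minimal sketch for viewing the quadratic model's predictions against Duration:

# Plot the quadratic fit's predictions against Duration
heart_disease %>%
  mutate(squared_preds = predict(squared_duration_lm)) %>%
  ggplot(aes(x = Duration, y = squared_preds)) +
  geom_point(alpha = 0.5) +
  theme_bw() +
  labs(x = "Duration", y = "Predicted log(Cost + 1)")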

The poly() function allows us to build higher-order polynomial transformations of variables easily. Run the following code chunk to fit a 9th-order polynomial model (i.e. \(Y = \beta_0 + \beta_1x + \beta_2x^2 + \ldots + \beta_9x^9\)) between log_cost and Duration.

poly_nine_duration_lm <- lm(log_cost ~ poly(Duration, 9), 
                      data = heart_disease)
summary(poly_nine_duration_lm)

Do you think this is appropriate? How did this change your predictions compared to the previous plot, or compared to using the variable without any transformation?
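
One quick way to compare (a sketch) is to overlay the 9th-order polynomial fit directly on the observed data by passing a formula to geom_smooth():

# Overlay the 9th-order polynomial fit on the raw data; look for
# wiggly behavior that may indicate overfitting
heart_disease %>%
  ggplot(aes(x = Duration, y = log_cost)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 9)) +
  theme_bw() +
  labs(x = "Duration", y = "log(Cost + 1)")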

6. Training and testing

As we’ve seen, using transformations such as higher-order polynomials may decrease the interpretability and increase the potential for overfitting associated with our models; however, they can also dramatically improve the explanatory power.

We need a way to make sure our more complicated models have not overfit to the noise present in our data. Another way of saying this is that a good model should generalize to a different sample than the one on which it was fit. This intuition motivates the idea of training/testing. We split our data into two parts, use one part – the training set – to fit our models, and the other part – the testing set – to evaluate our models. Any model which happens to fit the noise present in our training data should perform poorly on our testing data.

The first thing we will need to do is split our sample. Run the following code chunk to divide our data into two halves, which we will refer to as a training set and a test set. Briefly summarize what each line in the code chunk is doing.

n_patients <- nrow(heart_disease)
train_i <- sample(n_patients, n_patients / 2, replace = FALSE)
test_i <- (1:n_patients)[-train_i]
heart_train <- heart_disease[train_i,]
heart_test <- heart_disease[test_i,]

We will now compare three candidate models for predicting log_cost using Gender and Duration: two fit below, plus one you will fit yourself. We will fit these models on the training data only, ignoring the testing data for the moment. Run the code chunk below to create the first two candidate models:

candidate_model_1 <- lm(log_cost ~ poly(Duration, 2) + Gender +
                          Gender * poly(Duration, 2), 
                        data = heart_train)
candidate_model_2 <- lm(log_cost ~ poly(Duration, 2) + Gender, 
                        data = heart_train)

Using summary(), which of these models has more explanatory power according to the training data? Which of the models is less likely to overfit?

Fit another model to predict log_cost using a different set of variables / polynomials.
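
For example, a possible third candidate (candidate_model_3 is just a suggested name - feel free to use different variables or polynomial degrees):

# A more flexible candidate: 5th-order polynomial in Duration plus Gender,
# fit on the training set only
candidate_model_3 <- lm(log_cost ~ poly(Duration, 5) + Gender, 
                        data = heart_train)
summary(candidate_model_3)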

Now that we’ve built our candidate models, we will evaluate them on our test set, using the criterion of mean squared error (MSE). Run the following code chunk to compute, on the test set, the MSE of predictions given by the first model compared to the actual log_cost.
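
For reference, the test-set MSE is \(\text{MSE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(\hat{y}_i - y_i\right)^2\), where \(\hat{y}_i\) is a model's prediction for test observation \(i\) and \(y_i\) is the observed log_cost value.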

model_1_preds <- predict(candidate_model_1, newdata = heart_test)
model_1_mse <- mean((model_1_preds - heart_test$log_cost)^2)

Do this for each of your candidate models. Comparing the MSE on the test set, which model performed best (i.e., which had the lowest test MSE)?
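
For example, the analogous calculation for the second candidate model (model_2_preds and model_2_mse are illustrative names; repeat for any other candidates you fit):

# Test set MSE for the second candidate model
model_2_preds <- predict(candidate_model_2, newdata = heart_test)
model_2_mse <- mean((model_2_preds - heart_test$log_cost)^2)

# Compare the candidates side by side
model_1_mse
model_2_mse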