We will briefly review linear modeling, focusing on building and assessing linear models in R. We have four main goals in this lab:
use exploratory data analysis (EDA) and visualization to determine a) whether two variables have a linear relationship, and b) among a set of explanatory variables, which one(s) seem like the best candidates for predicting a given output variable.
fit and interpret simple regression models,
look at diagnostic plots to determine whether a linear model is a good fit for our data,
assess our fitted linear models.
Execute the following code chunk to (a) load the necessary data for this lab, (b) compute four variables we will use in this lab, (c) remove players with missing data (just to simplify things), and (d) subset out players with low minute totals (fewer than 250 minutes played in a season):
library("tidyverse")
nba_data_2022 <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/intro_r/nba_2022_player_stats.csv")
nba_data_2022 <- nba_data_2022 %>%
# Summarize player stats across multiple teams they played for:
group_by(player) %>%
summarize(age = first(age),
position = first(position),
games = sum(games, na.rm = TRUE),
minutes_played = sum(minutes_played, na.rm = TRUE),
field_goals = sum(field_goals, na.rm = TRUE),
field_goal_attempts = sum(field_goal_attempts, na.rm = TRUE),
three_pointers = sum(three_pointers, na.rm = TRUE),
three_point_attempts = sum(three_point_attempts, na.rm = TRUE),
free_throws = sum(free_throws, na.rm = TRUE),
free_throw_attempts = sum(free_throw_attempts, na.rm = TRUE)) %>%
ungroup() %>%
mutate(field_goal_percentage = field_goals / field_goal_attempts,
three_point_percentage = three_pointers / three_point_attempts,
free_throw_percentage = free_throws / free_throw_attempts,
min_per_game = minutes_played / games) %>%
# Remove rows with missing missing values
drop_na() %>%
filter(minutes_played > 250)
In the National Basketball Association (NBA), and more generally in team sports, a coach must make decisions about how many minutes each player should play. Typically, these decisions are informed by a player’s skills, along with other factors such as fatigue, matchups, etc. Our goal is to use measurements of a few (quantifiable) player attributes to predict the minutes per game a player plays. In particular, we will focus on the following data, measured over the 2022 NBA regular season for over 400 players:
player
: names of each player (not useful for modeling purposes, but just for reference)min_per_game
: our response variable, measuring the minutes per game a player played during the 2022 NBA regular season.field_goal_percentage
: potential (continuous) explanatory variable, calculated as (number of made field goals) / (number of field goals attempted).free_throw_percentage
: potential (continuous) explanatory variable, calculated as (number of made free throws) / (number of free throws attempted).three_point_percentage
: potential (continuous) explanatory variable, calculated as (number of made 3 point shots) / (number of 3 point shots attempted),age
: potential (continuous / discrete) explanatory variable, player’s reported age for the 2022 season,position
: potential (categorical) explanatory variable, one of SG
(shooting guard), PG
(point guard), C
(center), PF
(power forward) or SF
(small forward).Spend time exploring the dataset, to visually assess which of the explanatory variables listed above is most associated with our response the minutes played per game (min_per_game
). Create scatterplots between the response and each continuous explanatory variable. Do any of the relationship appear to be linear? Describe the direction and strength of the association between the explanatory and response variables.
In your opinion, which of the possible continuous explanatory variables displays the strongest relationship with minutes per game?
Create an appropriate visualization comparing the distribution of minutes per game by position. Do you think there is a relationship between minutes per game and position?
Now that you’ve performed some EDA, it’s time to actually fit some linear models to the data. Start the variable you think displays the strongest relationship with the response variable. Update the following code by replacing INSERT_VARIABLE with your selected variable, and run to fit the model:
init_nba_lm <- lm(min_per_game ~ INSERT_VARIABLE, data = nba_data_2022)
Before check out the summary()
of this model, you need to check the diagnostics to see if it meets the necessary assumptions. To do this you can try running plot(init_nba_lm)
in the console (what happens?). Equivalently, another way to make the same plots but with ggplot2
perks is with the ggfortify
package by running the following code:
# First install the package by running the following line (uncomment it!) in the console
# install.packages("ggfortify")
library(ggfortify)
autoplot(init_nba_lm) +
theme_bw()
The first plot is residuals vs. fitted: this plot should NOT display any clear patterns in the data, no obvious outliers, and be symmetric around the horizontal line at zero. The smooth line provided is just for reference to see how the residual average changes. Do you see any obvious patterns in your plot for this model?
The second plot is a Q-Q plot (p. 93). Without getting too much into the math behind them, the closer the observations are to the dashed reference line, the better your model fit is. It is bad for the observations to diverge from the dashed line in a systematic way - that means we are violating the assumption of normality discussed in lecture. How do your points look relative to the dashed reference line?
The third plot looks at the square root of the absolute value of the standardized residiuals. We want to check for homoskedascity of errors (equal, constant variance). If we did have constant variance, what would we expect to see? What does your plot look like?
The fourth plot is residuals vs. leverage which helps us identify influential points. Leverage quanitifies the influence the observed response for a particular observation has on its predicted value, i.e. if the leverage is small then the observed response has a small role in the value of its predicted reponse, while a large leverage indicates the observed response plays a large role in the predicted response. Its a value between 0 and 1, where the sum of all leverage values equals the number of coefficients (including the intercept). Specifically the leverage for observation \(i\) is computed as:
\[h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_i^n (x_i - \bar{x})^2}\] where \(\bar{x}\) is the average value for variable \(x\) across all observations. See page 191 for more details on leverage and the regression hat matrix. We’re looking for points in the upper right or lower right corners, where dashed lines for Cook’s distance values would indicate potential outlier points that are displaying too much influence on the model results. Do you observed any such influential points in upper or lower right corners?
What is your final assessment of the diagnostics, do you believe all assumptions are met? Any potential outlier observations to remove?
Following the example in lecture, interpret the results from the summary()
function on your initial model. Do you think there is sufficient evidence to reject the null hypothesis that the coefficient is 0? What is the interpretation of the \(R^2\) value? Compare the square root of the raw (unadjusted) \(R^2\) of your linear model to the correlation between that explanatory variable and the response using the cor()
function (e.g. cor(nba_data_2022$age, nba_data_2022$min_per_game)
- but replace age
with your variable). What do you notice?
To assess the fit of a linear model, we can also plot the predicted values vs the actual values, to see how closely our predictions align with reality, and to decide whether our model is making any systematic errors. Execute the following code chunk to show the actual minutes per game against our model’s predictions
nba_data_2022 %>%
mutate(init_preds = predict(init_nba_lm)) %>%
ggplot(aes(x = init_preds, y = min_per_game)) +
geom_point(alpha = 0.75) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "red") +
theme_bw() +
labs(x = "Predictions", y = "Observed minutes / game")
Which of the variables do you think is the most appropriate variable for modeling the minutes per game?
Repeat steps 2 and 3 above but including more than one variable in your model. You can easily do this in the lm()
function by adding another variable to the formula with the +
operator as so (but just replace the INSERT_VARIABLE_X
parts):
multi_nba_lm <- lm(min_per_game ~ INSERT_VARIABLE_1 + INSERT_VARIABLE_2,
data = nba_data_2022)
Experiment with different sets of the continuous variables. What sets of continuous variables do you think model minutes per game best? (Remember to use the Adjusted \(R^2\) when comparing models that have different numbers of variables).
Beware collinearity! Load the car
library (install it if necessary!) and use the vif()
function to check for possible (multi)collinearity. The vif()
function computes the variance inflation factor (VIF) where for predictor \(x_j\) for \(j \in 1,\dots, p\):
\[
VIF_j = \frac{1}{1 - R^2_j}
\] where \(R^2_j\) is the \(R^2\) from a variable with variable \(x_j\) as the response and the other \(p-1\) predictors as the explanatory variables. VIF values close to 1 indicate the variable is not correlated with other predictors, while VIF values over 5 indicate strong presence of collinearity. If present, remove a variable with VIF over 5, and redo the fit. Rinse, lather, and repeat until the vif()
outputs are all less than 5. The follow code chunk displays an example of using this function:
# First install the package by uncommenting out the following line:
# install.packages("car")
library(car)
vif(multi_nba_lm)
Tomorrow’s lab will focus on categorical variables, interactions, and holdout data predictions.