class: center, middle, inverse, title-slide

# Supervised Learning
## Intro to variable selection
### June 28th, 2022

---

## The setting

We wish to learn a linear model. Our estimate (denoted by hats) is

$$
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p
$$

--

Why would we attempt to select a __subset__ of the `\(p\)` variables?

--

- *To improve prediction accuracy*

  - Eliminating uninformative predictors lowers the variance of our fitted model, which can lead to lower test-set MSE at the expense of a slight increase in bias

--

- *To improve model interpretability*

  - Eliminating uninformative predictors makes it easier to tell the story of how your predictors are associated with your response

---

## Best subset selection

- Start with the __null model__ `\(\mathcal{M}_0\)` (intercept-only) that has no predictors

  - just predicts the sample mean for each observation

--

- For `\(k = 1, 2, \dots, p\)` (each possible number of predictors)

  - Fit __all__ `\(\binom{p}{k} = \frac{p!}{k!(p-k)!}\)` models with exactly `\(k\)` predictors

  - Pick the best (by some criterion) among these `\(\binom{p}{k}\)` models, call it `\(\mathcal{M}_k\)`

    - The criterion is up to the user: cross-validation error, highest adjusted `\(R^2\)`, etc.

--

- Select a single best model from among `\(\mathcal{M}_0, \dots, \mathcal{M}_p\)` (a short code sketch of this procedure is on the next slide)

--

__This is not typically used in research!__

- only practical for a smaller number of variables

--

- arbitrary way of defining __best__ and ignores __prior knowledge__ about potential predictors
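---

## Best subset selection: a code sketch

A minimal sketch of the procedure above, assuming the [`leaps`](https://cran.r-project.org/package=leaps) package is installed and using the built-in `mtcars` data purely for illustration:

```r
library(leaps)
# exhaustive search: best model of each size k = 1, ..., 10
best_subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
best_summary <- summary(best_subsets)
# one (arbitrary) criterion: the size with the highest adjusted R^2
best_k <- which.max(best_summary$adjr2)
coef(best_subsets, id = best_k)
```

The same caveats apply: the choice of "best" is arbitrary, and the search ignores any prior knowledge about the predictors.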
---

## Use the shoe leather approach

[Prof. David Freedman](https://en.wikipedia.org/wiki/David_A._Freedman):

- algorithms can be tempting, but they are NOT substitutes for careful thought and subject-matter knowledge!

- you should NOT avoid the hard work of EDA in your modeling efforts

--

__Variable selection is a difficult problem!__

- Like much of statistics & data science research, there is not one unique, correct answer

--

You should justify which predictors / variables are used in modeling based on:

- __context__,

- __extensive EDA__, and 

- __model assessment based on holdout predictions__

---

## Covariance and correlation

- __Covariance__ is a measure of the __linear__ dependence between two variables

  - To be _"uncorrelated"_ is not the same as to be _"independent"_...

  - Independence means __there is no dependence__, linear or otherwise

--

- __Correlation__ is a _normalized_ form of covariance, ranging from -1 through 0 to 1

  - -1 means a perfect negative linear relationship: as one variable increases, the other decreases linearly

  - 0 means no linear dependence

  - 1 means a perfect positive linear relationship: the two variables increase linearly together

--

- We can use the `cov()` / `cor()` functions in `R` to generate the __covariance__ / __correlation__ matrices (see the small example on the next slide)
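---

## Uncorrelated does not mean independent

A small simulated example to make the point concrete: a quadratic relationship can have near-zero correlation even though the two variables are clearly dependent.

```r
set.seed(42)
x <- rnorm(1000)
y_linear <- 2 * x + rnorm(1000)        # strong linear dependence
y_quad <- x^2 + rnorm(1000, sd = 0.1)  # strong dependence, but not linear

cov(x, y_linear)  # large positive covariance
cor(x, y_linear)  # correlation close to +1
cor(x, y_quad)    # correlation close to 0, yet y_quad is a function of x
```

This is exactly why correlation values alone are not enough and you should always plot your data.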
---

## Example data: NFL teams summary

Created dataset using [`nflfastR`](https://www.nflfastr.com/) summarizing NFL team performances from 1999 to 2021

```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/regression_examples/nfl_team_season_summary.csv")
nfl_teams_data
```

```
## # A tibble: 733 x 55
##    season team  offense_completi… offense_total_ya… offense_total_y… offense_ave_yar…
##     <dbl> <chr>             <dbl>             <dbl>            <dbl>            <dbl>
##  1   1999 ARI               0.477              2796             1209             4.67
##  2   1999 ATL               0.504              3317             1176             6.08
##  3   1999 BAL               0.452              2805             1663             5.07
##  4   1999 BUF               0.540              3275             2038             6.17
##  5   1999 CAR               0.552              4144             1484             6.68
##  6   1999 CHI               0.561              4090             1359             5.75
##  7   1999 CIN               0.498              3178             1971             5.37
##  8   1999 CLE               0.489              2574             1140             4.71
##  9   1999 DAL               0.560              3083             2054             5.95
## 10   1999 DEN               0.546              3378             1852             5.85
## # … with 723 more rows, and 49 more variables: offense_ave_yards_gained_run <dbl>,
## #   offense_total_air_yards <dbl>, offense_ave_air_yards <dbl>,
## #   offense_total_yac <dbl>, offense_ave_yac <dbl>, offense_n_plays_pass <dbl>,
## #   offense_n_plays_run <dbl>, offense_n_interceptions <dbl>,
## #   offense_n_fumbles_lost_pass <dbl>, offense_n_fumbles_lost_run <dbl>,
## #   offense_total_epa_pass <dbl>, offense_total_epa_run <dbl>,
## #   offense_ave_epa_pass <dbl>, offense_ave_epa_run <dbl>,
## #   offense_total_wpa_pass <dbl>, offense_total_wpa_run <dbl>,
## #   offense_ave_wpa_pass <dbl>, offense_ave_wpa_run <dbl>,
## #   offense_success_rate_pass <dbl>, offense_success_rate_run <dbl>,
## #   defense_completion_percentage <dbl>, defense_total_yards_gained_pass <dbl>,
## #   defense_total_yards_gained_run <dbl>, defense_ave_yards_gained_pass <dbl>,
## #   defense_ave_yards_gained_run <dbl>, defense_total_air_yards <dbl>,
## #   defense_ave_air_yards <dbl>, defense_total_yac <dbl>, defense_ave_yac <dbl>,
## #   defense_n_plays_pass <dbl>, defense_n_plays_run <dbl>,
## #   defense_n_interceptions <dbl>, defense_n_fumbles_lost_pass <dbl>,
## #   defense_n_fumbles_lost_run <dbl>, defense_total_epa_pass <dbl>,
## #   defense_total_epa_run <dbl>, defense_ave_epa_pass <dbl>,
## #   defense_ave_epa_run <dbl>, defense_total_wpa_pass <dbl>,
## #   defense_total_wpa_run <dbl>, defense_ave_wpa_pass <dbl>,
## #   defense_ave_wpa_run <dbl>, defense_success_rate_pass <dbl>,
## #   defense_success_rate_run <dbl>, points_scored <dbl>, points_allowed <dbl>,
## #   wins <dbl>, losses <dbl>, ties <dbl>
```

---

## Modeling NFL score differential

.pull-left[

Interested in modeling a team's __score differential__

```r
nfl_teams_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed)
nfl_teams_data %>%
  ggplot(aes(x = score_diff)) +
  geom_histogram(color = "black", fill = "darkblue", alpha = 0.3) +
  theme_bw() +
  labs(x = "Score differential")
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-2-1.png" width="504" />
]

---

### Correlation matrix of score differential and candidate predictors

.pull-left[

- Interested in `score_diff` relationships with team passing and rush statistics

- View the correlation matrix with [`ggcorrplot`](https://rpkgs.datanovia.com/ggcorrplot/)

```r
library(ggcorrplot)
nfl_model_data <- nfl_teams_data %>%
  dplyr::select(score_diff, offense_ave_epa_pass, offense_ave_epa_run,
                defense_ave_epa_pass, defense_ave_epa_run,
                offense_ave_yards_gained_pass, offense_ave_yards_gained_run,
                defense_ave_yards_gained_pass, defense_ave_yards_gained_run)
*nfl_cor_matrix <- cor(nfl_model_data)
*ggcorrplot(nfl_cor_matrix)
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]

---

## Customize the appearance of the correlation matrix

.pull-left[

- Avoid redundancy by only using one half of matrix with `type`

- Add correlation value labels using `lab` (but round first!)

- Can arrange variables based on clustering...

```r
round_cor_matrix <- 
*  round(cor(nfl_model_data), 2)
ggcorrplot(round_cor_matrix, 
*           hc.order = TRUE,
*           type = "lower",
*           lab = TRUE)
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---

## Clustering variables using the correlation matrix

Apply [hierarchical clustering](http://stat.cmu.edu/cmsac/sure/2021/materials/lectures/slides/07-Hierarchical-clustering.html#1) to variables instead of observations

--

- Select the explanatory variables of interest from our data

```r
nfl_ex_vars <- dplyr::select(nfl_model_data, -score_diff)
```

--

- Compute correlation matrix of these variables:

```r
exp_cor_matrix <- cor(nfl_ex_vars)
```

--

- Correlations measure similarity and can be negative __BUT__ distances measure dissimilarity and __CANNOT__

- Convert your correlations to all be `\(\geq 0\)`: e.g., `\(1 - |\rho|\)` (which drops the sign) or `\(1 - \rho\)`

```r
cor_dist_matrix <- 1 - abs(exp_cor_matrix)
```

--

- Convert to distance matrix before using `hclust`

```r
cor_dist_matrix <- as.dist(cor_dist_matrix)
```

---

## Clustering variables using the correlation matrix

.pull-left[

- Cluster variables using `hclust()` as before!

- Use [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html) to quickly visualize dendrogram

```r
library(ggdendro)
*nfl_exp_hc <- hclust(cor_dist_matrix, "complete")
*ggdendrogram(nfl_exp_hc, 
*             rotate = TRUE,
*             size = 2)
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]

---

## Clustering variables using the correlation matrix

.pull-left[

- Another flexible option is [`dendextend`](https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html)

```r
library(dendextend)
cor_dist_matrix %>%
  hclust() %>%
*  as.dendrogram() %>%
  set("branches_k_col", k = 2) %>%
  set("labels_cex", .5) %>%
  ggplot(horiz = TRUE)
```

- Explore the [package documentation](https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html) for more formatting

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]
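---

## Using the variable clusters for selection

One ad hoc option is to cut the dendrogram into a small number of clusters and keep a single representative per cluster, e.g. the variable most correlated with the response; the choice of `k = 2` below is purely for illustration:

```r
# assign each explanatory variable to a cluster
var_clusters <- cutree(nfl_exp_hc, k = 2)
split(names(var_clusters), var_clusters)

# within each cluster, pick the variable most correlated with score_diff
sapply(split(names(var_clusters), var_clusters), function(cluster_vars) {
  cors <- abs(cor(nfl_model_data[cluster_vars], nfl_model_data$score_diff))
  rownames(cors)[which.max(cors)]
})
```

Treat this as a starting point for EDA, not a replacement for the context-driven justification emphasized earlier.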
---

## Back to the response variable...

.pull-left[

Use the [`GGally`](https://ggobi.github.io/ggally/index.html) package to easily create __pairs__ plots of multiple variables

- __always look at your data__

- correlation values alone are not enough!

- what if a variable displayed a quadratic relationship?

```r
*library(GGally)
*ggpairs(nfl_model_data,
        columns = c("score_diff", "offense_ave_epa_pass",
                    "offense_ave_epa_run", "defense_ave_epa_pass",
*                   "defense_ave_epa_run")) +
  theme_bw()
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-7-1.png" width="504" />
]

---

### Do running statistics matter for modeling score differential?

--

We will use __5-fold cross-validation__ to assess how well different sets of variables (combinations of `pass` & `run` variables) perform in predicting `score_diff`

--

Can add a column of __test__ fold assignments to our dataset with the `sample()` function:

```r
set.seed(2020)
nfl_model_data <- nfl_model_data %>%
  mutate(test_fold = sample(rep(1:5, length.out = n())))
```

--

__Always remember to set your seed prior to any k-fold cross-validation!__

---

## Writing a function for k-fold cross-validation

```r
get_cv_preds <- function(model_formula, data = nfl_model_data) {
  # generate holdout predictions for every row based on its assigned test fold
  map_dfr(unique(data$test_fold), 
          function(holdout) {
            # Separate test and training data:
            test_data <- data %>% filter(test_fold == holdout)
            train_data <- data %>% filter(test_fold != holdout)
            
            # Train model:
            reg_model <- lm(as.formula(model_formula), data = train_data)
            
            # Return tibble of holdout results:
            tibble(test_preds = predict(reg_model, newdata = test_data),
                   test_actual = test_data$score_diff,
                   test_fold = holdout)
          })
}
```

---

## Function enables easy generation of holdout analysis

```r
all_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_pass + offense_ave_epa_run + defense_ave_epa_pass + defense_ave_epa_run")
all_int_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_pass*offense_ave_epa_run + defense_ave_epa_pass*defense_ave_epa_run")
run_only_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_run + defense_ave_epa_run")
pass_only_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_pass + defense_ave_epa_pass")
off_only_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_pass + offense_ave_epa_run")
def_only_cv_preds <- get_cv_preds("score_diff ~ defense_ave_epa_pass + defense_ave_epa_run")
int_only_cv_preds <- get_cv_preds("score_diff ~ 1")
```

--

.pull-left[

Can then summarize together for a single plot:

```r
bind_rows(mutate(all_cv_preds, type = "All"),
          mutate(all_int_cv_preds, type = "All w/ interactions"),
          mutate(pass_only_cv_preds, type = "Passing only"),
          mutate(run_only_cv_preds, type = "Running only"),
          mutate(off_only_cv_preds, type = "Offense only"),
          mutate(def_only_cv_preds, type = "Defense only"),
          mutate(int_only_cv_preds, type = "Intercept-only")) %>%
  group_by(type) %>%
  summarize(rmse = sqrt(mean((test_actual - test_preds)^2))) %>%
  mutate(type = fct_reorder(type, rmse)) %>%
  ggplot(aes(x = type, y = rmse)) +
  geom_point() +
  coord_flip() +
  theme_bw()
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]
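---

## How stable are those RMSE estimates?

A quick follow-up check using the same holdout predictions: compute the RMSE within each test fold, instead of a single overall RMSE per model, to see how much the comparison varies from fold to fold (shown below for two of the candidate models):

```r
# fold-level RMSE for two of the candidate models
bind_rows(mutate(all_cv_preds, type = "All"),
          mutate(pass_only_cv_preds, type = "Passing only")) %>%
  group_by(type, test_fold) %>%
  summarize(rmse = sqrt(mean((test_actual - test_preds)^2)),
            .groups = "drop") %>%
  ggplot(aes(x = test_fold, y = rmse, color = type)) +
  geom_point() +
  geom_line() +
  theme_bw()
```

If the fold-level RMSE values for two models overlap heavily, read the single-number comparison on the previous slide with caution.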
---

## Fit selected model on all data and view summary

```r
all_lm <- lm(score_diff ~ offense_ave_epa_pass + offense_ave_epa_run + 
               defense_ave_epa_pass + defense_ave_epa_run, 
             data = nfl_model_data)
summary(all_lm)
```

```
## 
## Call:
## lm(formula = score_diff ~ offense_ave_epa_pass + offense_ave_epa_run + 
##     defense_ave_epa_pass + defense_ave_epa_run, data = nfl_model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.142 -18.394   0.024  18.680  92.412 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.378      1.648    2.05   0.0407 *  
## offense_ave_epa_pass  463.221      8.529   54.31   <2e-16 ***
## offense_ave_epa_run   336.067     15.283   21.99   <2e-16 ***
## defense_ave_epa_pass -480.406     10.909  -44.04   <2e-16 ***
## defense_ave_epa_run  -302.841     15.883  -19.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.67 on 728 degrees of freedom
## Multiple R-squared:  0.9263, Adjusted R-squared:  0.9259 
## F-statistic:  2288 on 4 and 728 DF,  p-value: < 2.2e-16
```

---

## Do NOT show that summary in a presentation!

.pull-left[

- We can instead display a __coefficient plot__ with confidence intervals based on the reported standard errors

- Use the [`ggcoef()`](https://ggobi.github.io/ggally/articles/ggcoef.html) function from `GGally`

```r
ggcoef(all_lm, 
       exclude_intercept = TRUE,
       vline = TRUE,
       vline_color = "red") + 
  theme_bw()
```

- [__A well formatted table__](https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html) of the summary output is appropriate for a report (not for a presentation)

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-12-1.png" width="504" />
]
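---

## A formatted table for a report

For the well formatted table linked on the previous slide, a minimal sketch is to pass the fitted model to `tab_model()`, assuming the `sjPlot` package is installed (it is not loaded elsewhere in these slides):

```r
library(sjPlot)
# HTML table of estimates, confidence intervals, and p-values
tab_model(all_lm)
```

Because the output is an HTML table, it is suited to reports and websites rather than slides.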