class: center, middle, inverse, title-slide

# Supervised Learning
## Variable selection
### Ron Yurko
### 06/22/2020

---

## The setting

We wish to learn a linear model. Our estimate (denoted by hats) is

$$
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p
$$

--

Why would we attempt to select a __subset__ of the `\(p\)` variables?

--

- *To improve prediction accuracy*

  - Eliminating uninformative predictors can lead to lower variance in the test-set MSE, at the expense of a slight increase in bias

--

- *To improve model interpretability*

  - Eliminating uninformative predictors is obviously a good thing when your goal is to tell the story of how your predictors are associated with your response

---

## Best subset selection

- Start with the __null model__ `\(\mathcal{M}_0\)` (intercept-only) that has no predictors

  - just predicts the sample mean for each observation

--

- For `\(k = 1, 2, \dots, p\)` (each possible number of predictors)

  - Fit __all__ `\(\binom{p}{k} = \frac{p!}{k!(p-k)!}\)` models with exactly `\(k\)` predictors

  - Pick the best (by some criterion) among these `\(\binom{p}{k}\)` models, call it `\(\mathcal{M}_k\)`

    - Best is up to the user: cross-validation error, highest adjusted `\(R^2\)`, etc.

--

- Select a single best model from among `\(\mathcal{M}_0, \dots, \mathcal{M}_p\)`

--

__This is not typically used in research!__

- only practical for a smaller number of variables

--

- arbitrary way of defining __best__ and ignores __prior knowledge__ about potential predictors (see the sketch below)
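
--

For illustration only, a minimal sketch of best subset selection with the `leaps` package on the built-in `mtcars` data (an assumption: neither `leaps` nor `mtcars` is used elsewhere in these slides):

```r
library(leaps)
# Exhaustively fit the best model of each size k = 1, ..., 10
# among the 10 candidate predictors of mpg:
best_subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
# Adjusted R^2 of the best model of each size:
summary(best_subsets)$adjr2
# Model size maximizing adjusted R^2:
which.max(summary(best_subsets)$adjr2)
```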
---

## Use the shoe leather approach

[Prof. David Freedman](https://en.wikipedia.org/wiki/David_A._Freedman):

- algorithms can be tempting but they are NOT substitutes!

  - you should NOT avoid the hard work of EDA in your modeling efforts

--

__Variable selection is a difficult problem!__

- Like much of statistics & data science research, there is not one unique, correct answer

--

You should justify which predictors / variables are used in modeling:

- __based on extensive EDA__ and

- model assessment __based on test set predictions__

---

## Covariance and correlation

- __Covariance__ is a measure of the __linear__ dependence between two variables

  - To be _"uncorrelated"_ is not the same as to be _"independent"_...

  - Independence means __there is no dependence__, linear or otherwise

--

- __Correlation__ is a _normalized_ form of covariance, ranging from -1 through 0 to 1

  - -1 means a perfect negative linear relationship: one variable decreases linearly as the other increases

  - 0 means no linear dependence

  - 1 means a perfect positive linear relationship: the two variables increase linearly together

--

- We can use the `cov()` / `cor()` functions in `R` to generate the __covariance__ / __correlation__ matrices

---

## Example data: NFL teams summary

Created dataset using [`nflscrapR-data`](https://github.com/ryurko/nflscrapR-data) summarizing NFL team performances from 2009 to 2019

```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/eda_projects/nfl_teams_season_summary.csv")
nfl_teams_data
```

```
## # A tibble: 352 x 24
##    team  season points_scored points_allowed  wins losses  ties pass_off_epa_pe…
##    <chr>  <dbl>         <dbl>          <dbl> <dbl>  <dbl> <dbl>            <dbl>
##  1 PIT     2009           368            324     9      7     0           0.119 
##  2 PIT     2010           375            232    12      4     0           0.161 
##  3 PIT     2011           325            227    12      4     0           0.115 
##  4 PIT     2012           336            314     8      8     0           0.0491
##  5 PIT     2013           379            370     8      8     0           0.102 
##  6 PIT     2014           436            368    11      5     0           0.244 
##  7 PIT     2015           423            319    10      6     0           0.108 
##  8 PIT     2016           399            327    11      5     0           0.166 
##  9 PIT     2017           406            308    13      3     0           0.182 
## 10 PIT     2018           428            360     9      6     1           0.179 
## # … with 342 more rows, and 16 more variables: pass_off_total_epa <dbl>,
## #   pass_off_total_yards_gained <dbl>, pass_off_yards_gained_per_att <dbl>,
## #   run_off_epa_per_att <dbl>, run_off_total_epa <dbl>,
## #   run_off_total_yards_gained <dbl>, run_off_yards_gained_per_att <dbl>,
## #   pass_def_epa_per_att <dbl>, pass_def_total_epa <dbl>,
## #   pass_def_total_yards_allowed <dbl>, pass_def_yards_allowed_per_att <dbl>,
## #   run_def_epa_per_att <dbl>, run_def_total_epa <dbl>,
## #   run_def_total_yards_allowed <dbl>, run_def_yards_allowed_per_att <dbl>,
## #   division <chr>
```

---

## Modeling NFL score differential

.pull-left[

Interested in modeling a team's __score differential__

```r
nfl_teams_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed)
nfl_teams_data %>%
  ggplot(aes(x = score_diff)) +
  geom_histogram(color = "black", fill = "darkblue",
                 alpha = 0.3) +
  theme_bw() +
  labs(x = "Score differential")
```

- Distribution is roughly symmetric and unimodal

- Supports the assumption of a normal distribution - no transformation necessary

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-2-1.png" width="504" />
]

---

## Correlation matrix of score differential and candidate predictors

.pull-left[

- Interested in `score_diff` relationships with team passing and rushing statistics

- View the correlation matrix with [`ggcorrplot`](https://rpkgs.datanovia.com/ggcorrplot/)

```r
library(ggcorrplot)
nfl_model_data <- nfl_teams_data %>%
  dplyr::select(score_diff, pass_off_epa_per_att,
                run_off_epa_per_att, pass_def_epa_per_att,
                run_def_epa_per_att,
                pass_off_yards_gained_per_att,
                run_off_yards_gained_per_att,
                pass_def_yards_allowed_per_att,
                run_def_yards_allowed_per_att)
*nfl_cor_matrix <- cor(nfl_model_data)
*ggcorrplot(nfl_cor_matrix)
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]
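
--

To read off each predictor's correlation with the response from that matrix, a quick base `R` sketch:

```r
# Correlations with score_diff, ordered by absolute strength:
response_cors <- nfl_cor_matrix["score_diff", -1]
response_cors[order(abs(response_cors), decreasing = TRUE)]
```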
---

## Customize the appearance of the correlation matrix

.pull-left[

- Avoid redundancy by only using one half of the matrix with `type`

- Add correlation value labels using `lab` (but round first!)

- Can arrange variables based on clustering...

```r
round_cor_matrix <- 
*  round(cor(nfl_model_data), 2)
ggcorrplot(round_cor_matrix, 
*           hc.order = TRUE,
*           type = "lower",
*           lab = TRUE)
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---

## Clustering variables using the correlation matrix

We can apply [hierarchical clustering](http://www.stat.cmu.edu/cmsac/sure/materials/lectures/slides/06-intro-clustering.html#26) to the variables instead of the observations

--

- Select the explanatory variables of interest from our data

```r
nfl_ex_vars <- dplyr::select(nfl_model_data, -score_diff)
```

--

- Compute the correlation matrix of these variables:

```r
exp_cor_matrix <- cor(nfl_ex_vars)
```

--

- Correlations measure similarity and can be negative, while distances measure dissimilarity and cannot be negative. So convert the correlations to one minus their absolute value: correlations near 1 or -1 then map to distances near 0, and correlations near 0 map to distances near 1

```r
cor_dist_matrix <- 1 - abs(exp_cor_matrix)
```

--

- Convert to a distance matrix before using `hclust`

```r
cor_dist_matrix <- as.dist(cor_dist_matrix)
```

---

## Clustering variables using the correlation matrix

.pull-left[

- Cluster variables using `hclust()` as before!

- Use [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html) to quickly visualize the dendrogram

```r
library(ggdendro)
*nfl_exp_hc <- hclust(cor_dist_matrix,
*                     "complete")
*ggdendrogram(nfl_exp_hc,
*             rotate = TRUE,
*             size = 2)
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]

---

## Clustering variables using the correlation matrix

.pull-left[

- Another flexible option is [`dendextend`](https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html)

```r
library(dendextend)
cor_dist_matrix %>%
  hclust() %>%
  as.dendrogram() %>%
  set("branches_k_col", k = 2) %>%
  set("labels_cex", .9) %>%
  ggplot(horiz = TRUE)
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]

---

## Back to the response variable...

.pull-left[

Reminder: use the [`GGally`](https://ggobi.github.io/ggally/index.html) package to easily create __pairs__ plots of multiple variables

- __always look at your data__

- correlation values alone are not enough!

- what if a variable displayed a quadratic relationship?

```r
*library(GGally)
*ggpairs(nfl_model_data,
        columns =
          c("score_diff", "pass_off_epa_per_att",
            "run_off_epa_per_att", "pass_def_epa_per_att",
*           "run_def_epa_per_att"))
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-7-1.png" width="504" />
]

---

## Do running statistics matter for modeling score differential?

--

We will use __5-fold cross-validation__ to assess how well different sets of variables (combinations of `pass` & `run` variables) perform in predicting `score_diff`
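
--

Our comparison criterion will be the holdout __root mean squared error__ (RMSE), pooled over all five test folds - this is the quantity computed by the `summarize()` call later in these slides:

$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2}
$$

where each prediction `\(\hat{y}_i\)` comes from a model trained on the folds that do __not__ contain observation `\(i\)`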
--

Can initialize a column of __test__ fold assignments in our dataset with the `sample()` function:

```r
set.seed(2020)
nfl_model_data <- nfl_model_data %>%
  mutate(test_fold = sample(rep(1:5, length.out = n())))
```

--

__Always remember to set your seed prior to any k-fold cross-validation!__

---

## Writing a function for k-fold cross-validation

```r
get_cv_preds <- function(model_formula, data = nfl_model_data) {
  # generate holdout predictions for every row based on the assigned test fold
  map_dfr(unique(data$test_fold), 
          function(holdout) {
            # Separate test and training data:
            test_data <- data %>%
              filter(test_fold == holdout)
            train_data <- data %>%
              filter(test_fold != holdout)
            
            # Train model:
            reg_model <- lm(as.formula(model_formula), data = train_data)
            
            # Return tibble of holdout results:
            tibble(test_preds = predict(reg_model, newdata = test_data),
                   test_actual = test_data$score_diff,
                   test_fold = holdout)
          })
}
```

---

## Function enables easy generation of holdout analysis

Can get test data predictions using the same folds for different model formulas:

```r
all_cv_preds <- get_cv_preds("score_diff ~ pass_off_epa_per_att + run_off_epa_per_att + pass_def_epa_per_att + run_def_epa_per_att")
all_int_cv_preds <- get_cv_preds("score_diff ~ pass_off_epa_per_att*run_off_epa_per_att + pass_def_epa_per_att*run_def_epa_per_att")
run_only_cv_preds <- get_cv_preds("score_diff ~ run_off_epa_per_att + run_def_epa_per_att")
pass_only_cv_preds <- get_cv_preds("score_diff ~ pass_off_epa_per_att + pass_def_epa_per_att")
off_only_cv_preds <- get_cv_preds("score_diff ~ pass_off_epa_per_att + run_off_epa_per_att")
def_only_cv_preds <- get_cv_preds("score_diff ~ pass_def_epa_per_att + run_def_epa_per_att")
```

--

.pull-left[

Can then summarize together for a single plot:

```r
bind_rows(mutate(all_cv_preds, type = "All"),
          mutate(all_int_cv_preds, type = "All w/ interactions"),
          mutate(pass_only_cv_preds, type = "Passing only"),
          mutate(run_only_cv_preds, type = "Running only"),
          mutate(off_only_cv_preds, type = "Offense only"),
          mutate(def_only_cv_preds, type = "Defense only")) %>%
  group_by(type) %>%
  summarize(rmse = sqrt(mean((test_actual - test_preds)^2))) %>%
  mutate(type = fct_reorder(type, rmse)) %>%
  ggplot(aes(x = type, y = rmse)) +
  geom_point() +
  coord_flip() +
  theme_bw()
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]

---

## Fit selected model on all data and view summary

```r
all_lm <- lm(score_diff ~ pass_off_epa_per_att + run_off_epa_per_att +
               pass_def_epa_per_att + run_def_epa_per_att,
             data = nfl_model_data)
summary(all_lm)
```

```
## 
## Call:
## lm(formula = score_diff ~ pass_off_epa_per_att + run_off_epa_per_att + 
##     pass_def_epa_per_att + run_def_epa_per_att, data = nfl_model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.723 -22.463  -1.139  19.980  99.841 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.918      2.460   1.186    0.236    
## pass_off_epa_per_att  485.822     12.151  39.981   <2e-16 ***
## run_off_epa_per_att   315.772     22.678  13.924   <2e-16 ***
## pass_def_epa_per_att -473.925     16.078 -29.476   <2e-16 ***
## run_def_epa_per_att  -282.359     24.132 -11.701   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.87 on 347 degrees of freedom
## Multiple R-squared:  0.9201,	Adjusted R-squared:  0.9192 
## F-statistic: 998.6 on 4 and 347 DF,  p-value: < 2.2e-16
```

---

## Do NOT show that summary in a presentation!
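
All an audience needs from that output is the coefficient table. A small sketch of pulling it out as a tidy tibble with `broom` (an assumption: `broom` is not used elsewhere in these slides), which you can then plot or format however you like:

```r
library(broom)
# Coefficient estimates with 95% confidence intervals:
tidy(all_lm, conf.int = TRUE)
```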
.pull-left[

- We can instead display a __coefficient plot__, showing each estimate with its confidence interval based on the reported standard errors

- Use the [`ggcoef()`](https://ggobi.github.io/ggally/articles/ggcoef.html) function from `GGally`

```r
ggcoef(all_lm, 
       exclude_intercept = TRUE,
       vline = TRUE,
       vline_color = "red") + 
  theme_bw()
```

- [__A well formatted table__](https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html) of the summary output is appropriate for a report (not for a presentation)

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-12-1.png" width="504" />
]
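
--

A minimal example of building such a table with `sjPlot`'s `tab_model()` (assuming the `sjPlot` package is installed):

```r
library(sjPlot)
# Render the regression summary as a formatted HTML table:
tab_model(all_lm)
```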