class: center, middle, inverse, title-slide

# Supervised Learning

## Principal component regression and partial least squares

### July 6th, 2021

---

## Principal component regression (PCR)

<img src="https://bradleyboehmke.github.io/HOML/images/pcr-steps.png" width="50%" style="display: block; margin: auto;" />

---

## Example data: NFL teams summary

Dataset created with [`nflfastR`](https://www.nflfastr.com/), summarizing NFL team performance from 1999 to 2020


```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/regression_projects/nfl_team_season_summary.csv")
nfl_model_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed) %>%
  # Only use rows with air yards
  filter(season >= 2006) %>%
  dplyr::select(-wins, -losses, -ties, -points_scored, -points_allowed, -season, -team)
nfl_model_data
```

```
## # A tibble: 480 x 49
##    offense_completio… offense_total_yard… offense_total_yard… offense_ave_yard… offense_ave_yard…
##                 <dbl>               <dbl>               <dbl>             <dbl>             <dbl>
##  1              0.561                3662                1350              6.40              3.28
##  2              0.480                2371                2946              5.10              5.56
##  3              0.612                3435                1667              6.41              3.74
##  4              0.564                2718                1555              5.70              3.73
##  5              0.569                3264                1674              5.72              4.10
##  6              0.525                3286                1940              6.12              4.02
##  7              0.588                3827                1648              6.88              3.91
##  8              0.565                2893                1347              5.16              3.69
##  9              0.569                3838                1954              7.04              4.23
## 10              0.529                2799                2167              5.78              4.58
## # … with 470 more rows, and 44 more variables: offense_total_air_yards <dbl>,
## #   offense_ave_air_yards <dbl>, offense_total_yac <dbl>, offense_ave_yac <dbl>,
## #   offense_n_plays_pass <dbl>, offense_n_plays_run <dbl>, offense_n_interceptions <dbl>,
## #   offense_n_fumbles_lost_pass <dbl>, offense_n_fumbles_lost_run <dbl>,
## #   offense_total_epa_pass <dbl>, offense_total_epa_run <dbl>, offense_ave_epa_pass <dbl>,
## #   offense_ave_epa_run <dbl>, offense_total_wpa_pass <dbl>, offense_total_wpa_run <dbl>,
## #   offense_ave_wpa_pass <dbl>, offense_ave_wpa_run <dbl>, offense_success_rate_pass <dbl>,
## #   offense_success_rate_run <dbl>, defense_completion_percentage <dbl>,
## #   defense_total_yards_gained_pass <dbl>, defense_total_yards_gained_run <dbl>,
## #   defense_ave_yards_gained_pass <dbl>, defense_ave_yards_gained_run <dbl>,
## #   defense_total_air_yards <dbl>, defense_ave_air_yards <dbl>, defense_total_yac <dbl>,
## #   defense_ave_yac <dbl>, defense_n_plays_pass <dbl>, defense_n_plays_run <dbl>,
## #   defense_n_interceptions <dbl>, defense_n_fumbles_lost_pass <dbl>,
## #   defense_n_fumbles_lost_run <dbl>, defense_total_epa_pass <dbl>, defense_total_epa_run <dbl>,
## #   defense_ave_epa_pass <dbl>, defense_ave_epa_run <dbl>, defense_total_wpa_pass <dbl>,
## #   defense_total_wpa_run <dbl>, defense_ave_wpa_pass <dbl>, defense_ave_wpa_run <dbl>,
## #   defense_success_rate_pass <dbl>, defense_success_rate_run <dbl>, score_diff <dbl>
```

---

## Implement PCR with [`pls` package](https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)

Uses formula syntax similar to `lm`, but you also specify the number of PCs (`ncomp`)


```r
library(pls)
nfl_pcr_fit <- pcr(score_diff ~ ., ncomp = 2, scale = TRUE, data = nfl_model_data)
summary(nfl_pcr_fit)
```

```
## Data: X dimension: 480 48
## 	Y dimension: 480 1
## Fit method: svdpc
## Number of components considered: 2
## TRAINING: % variance explained
##             1 comps  2 comps
## X             21.49    41.62
## score_diff    84.01    87.49
```
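---

## PCR by hand (sketch)

To make the diagram's two steps concrete, here is a minimal sketch, assuming the `nfl_model_data` from the previous slides (illustration only, not the `pls` package internals): run PCA on the standardized predictors, then regress the response on the leading PC scores.


```r
# Step 1: PCA on the centered and scaled predictors
nfl_x <- dplyr::select(nfl_model_data, -score_diff)
nfl_pca <- prcomp(nfl_x, center = TRUE, scale. = TRUE)
# Step 2: regress the response on the first two PC scores;
# this should recover the same fit as pcr(..., ncomp = 2, scale = TRUE)
pcr_by_hand <- lm(nfl_model_data$score_diff ~ nfl_pca$x[, 1:2])
```

The separation of the two steps is exactly why the number of retained components has to be tuned: PCA never looks at `score_diff` when constructing the directions.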
---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

To perform PCR __we need to tune the number of principal components__

.pull-left[

- Tune the number of components in PCR with [`caret`](http://topepo.github.io/caret/index.html)

- `train` with 10-fold CV using `pcr` from [`pls`](https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)


```r
set.seed(2013)
library(caret)
cv_model_pcr <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pcr",
  trControl = trainControl(method = "cv", number = 10),
* preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
ggplot(cv_model_pcr) + theme_bw()
```

]

.pull-right[

<img src="17-PCR-PLS_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />

]

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

By default, `train` returns the model with the minimum CV error as `finalModel`


```r
summary(cv_model_pcr$finalModel)
```

```
## Data: X dimension: 480 48
## 	Y dimension: 480 1
## Fit method: svdpc
## Number of components considered: 43
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
## X           21.49    41.62    53.04    61.70    66.63    70.79     74.5    77.81    80.56
## .outcome    84.01    87.49    87.75    89.69    89.69    90.06     90.2    90.31    90.32
##           10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps
## X            83.10     85.27     87.18     88.99     90.43     91.85     93.10     94.00
## .outcome     90.66     90.82     91.03     91.05     91.12     91.15     91.28     91.28
##           18 comps  19 comps  20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## X            94.76     95.49     96.18     96.79     97.30     97.77     98.21     98.57
## .outcome     91.28     91.28     91.34     91.71     91.73     91.80     91.95     91.96
##           26 comps  27 comps  28 comps  29 comps  30 comps  31 comps  32 comps  33 comps
## X            98.84     99.07     99.29     99.47     99.61     99.74     99.85     99.88
## .outcome     92.64     92.72     92.79     93.46     93.53     93.81     94.51     94.52
##           34 comps  35 comps  36 comps  37 comps  38 comps  39 comps  40 comps  41 comps
## X            99.91     99.94     99.95     99.97     99.98     99.99     99.99     99.99
## .outcome     94.55     94.61     94.79     94.80     94.82     94.82     94.83     94.83
##           42 comps  43 comps
## X            99.99    100.00
## .outcome     94.89     94.96
```

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

Modify `selectionFunction` in `train` to use the `oneSE` rule

.pull-left[

```r
set.seed(2013)
cv_model_pcr_onese <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pcr",
  trControl = trainControl(method = "cv", number = 10,
*                          selectionFunction = "oneSE"),
  preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
```

]

.pull-right[

```r
summary(cv_model_pcr_onese$finalModel)
```

```
## Data: X dimension: 480 48
## 	Y dimension: 480 1
## Fit method: svdpc
## Number of components considered: 32
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
## X           21.49    41.62    53.04    61.70    66.63    70.79     74.5    77.81    80.56
## .outcome    84.01    87.49    87.75    89.69    89.69    90.06     90.2    90.31    90.32
##           10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps
## X            83.10     85.27     87.18     88.99     90.43     91.85     93.10     94.00
## .outcome     90.66     90.82     91.03     91.05     91.12     91.15     91.28     91.28
##           18 comps  19 comps  20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## X            94.76     95.49     96.18     96.79     97.30     97.77     98.21     98.57
## .outcome     91.28     91.28     91.34     91.71     91.73     91.80     91.95     91.96
##           26 comps  27 comps  28 comps  29 comps  30 comps  31 comps  32 comps
## X            98.84     99.07     99.29     99.47     99.61     99.74     99.85
## .outcome     92.64     92.72     92.79     93.46     93.53     93.81     94.51
```

]

---

## Partial least squares (PLS)

__PCR is agnostic of the response variable__

<img src="https://bradleyboehmke.github.io/HOML/images/pls-vs-pcr.png" width="80%" style="display: block; margin: auto;" />

---

## PLS as supervised dimension reduction

__First principal component__ in PCA:

`$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$`

--

In PLS we set `\(\phi_{j1}\)` to the coefficient from the __simple linear regression__ of `\(Y\)` on each `\(X_j\)`

- Remember this slope is proportional to the correlation! `\(\widehat{\beta} = r_{X,Y} \cdot \frac{s_Y}{s_X}\)`

- Thus `\(Z_1\)` in PLS places the most weight on variables strongly related to the response `\(Y\)`

--

To compute `\(Z_2\)` for PLS:

- Regress each `\(X_j\)` on `\(Z_1\)`; the residuals capture the signal not explained by `\(Z_1\)`

- Set `\(\phi_{j2}\)` to the coefficient from the __simple linear regression__ of `\(Y\)` on these residuals for each variable

--

Repeat this process until all of `\(Z_1, Z_2, \dots, Z_p\)` are computed (the __PLS components__)

Then regress `\(Y\)` on `\(Z_1, Z_2, \dots, Z_{p^*}\)`, where `\(p^* < p\)` is a tuning parameter (the next slide sketches `\(Z_1\)` in code)
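---

## First PLS direction by hand (sketch)

A minimal sketch of the `\(Z_1\)` recipe above, assuming `nfl_model_data` from the earlier slides (illustration only; the `pls` package uses a more efficient algorithm).


```r
# Standardize predictors so each simple regression slope is comparable
x_std <- scale(dplyr::select(nfl_model_data, -score_diff))
y <- nfl_model_data$score_diff
# phi_j1: slope from the simple linear regression of y on each X_j
phi_1 <- apply(x_std, 2, function(x_j) coef(lm(y ~ x_j))[2])
# First PLS component: weighted sum of the standardized predictors
z_1 <- x_std %*% phi_1
```

Unlike the first PC from `prcomp`, this direction is built from `score_diff` itself, which is why PLS components are guided by the response.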
---

## Tuning PLS with [`caret`](http://topepo.github.io/caret/index.html)

.pull-left[

```r
set.seed(2013)
cv_model_pls <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pls",
  trControl = trainControl(method = "cv", number = 10,
                           selectionFunction = "oneSE"),
  preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
ggplot(cv_model_pls) + theme_bw()
```

A sharp contrast with the PCR results! Fewer PLS components are needed because they are guided by the response variable

]

.pull-right[

<img src="17-PCR-PLS_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" />

]

--

_But how do we summarize variable relationships without a single coefficient?_

---

## Variable importance with [`vip` package](https://cran.r-project.org/web/packages/vip/vignettes/vip-introduction.pdf)

__Variable importance__ attempts to quantify how influential variables are in the model

- e.g., the absolute value of the `\(t\)`-statistic in regression

--

__For PLS__: weighted sums of the absolute regression coefficients across components

- The weights are a function of the reduction in RSS across the number of PLS components

--

.pull-left[

```r
# Check out `cv_model_pls$finalModel$coefficients`
library(vip)
*vip(cv_model_pls, num_features = 10,
*    method = "model") +
  theme_bw()
```

]

.pull-right[

<img src="17-PCR-PLS_files/figure-html/unnamed-chunk-8-1.png" width="576" />

]

---

## Partial dependence plots (PDP) with [`pdp` package](https://bgreenwell.github.io/pdp/index.html)

PDPs display the change in the average predicted response as a predictor varies over its marginal distribution

- More useful for the non-linear models later on!


```r
library(pdp)
*partial(cv_model_pls, "offense_total_epa_pass", plot = TRUE)
```

<img src="17-PCR-PLS_files/figure-html/pdp-example-1.png" width="576" style="display: block; margin: auto;" />
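---

## Comparing the tuned PCR and PLS fits (sketch)

A hedged sketch for a head-to-head comparison: both `train()` calls were preceded by `set.seed(2013)` with the same 10-fold CV setup, so the folds should line up and `caret::resamples()` can pair the fold-level errors.


```r
# Collect the matched CV results from the two tuned models
pcr_pls_resamples <- resamples(list(pcr = cv_model_pcr_onese,
                                    pls = cv_model_pls))
# Fold-level RMSE, MAE, and R-squared summaries for each model
summary(pcr_pls_resamples)
```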