class: center, middle, inverse, title-slide

# Supervised Learning
## Principal component regression and partial least squares
### July 7th, 2022

---

## Principal component regression (PCR)

<img src="https://bradleyboehmke.github.io/HOML/images/pcr-steps.png" width="50%" style="display: block; margin: auto;" />

---

## Example data: NFL teams summary

Created dataset using [`nflfastR`](https://www.nflfastr.com/) summarizing NFL team performances from 1999 to 2021


```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/regression_examples/nfl_team_season_summary.csv")
nfl_model_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed) %>%
  # Only use rows with air yards
  filter(season >= 2006) %>%
  dplyr::select(-wins, -losses, -ties, -points_scored, -points_allowed,
                -season, -team)
nfl_model_data
```

```
## # A tibble: 512 x 49
##    offense_completi… offense_total_y… offense_total_y… offense_ave_yar… offense_ave_yar…
##                <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
##  1             0.561             3662             1350             6.40             3.28
##  2             0.480             2371             2946             5.10             5.56
##  3             0.612             3435             1667             6.41             3.74
##  4             0.564             2718             1555             5.70             3.73
##  5             0.569             3264             1674             5.72             4.10
##  6             0.525             3286             1940             6.12             4.02
##  7             0.588             3827             1648             6.88             3.91
##  8             0.565             2893             1347             5.16             3.69
##  9             0.569             3838             1954             7.04             4.23
## 10             0.529             2799             2167             5.78             4.58
## # … with 502 more rows, and 44 more variables: offense_total_air_yards <dbl>,
## #   offense_ave_air_yards <dbl>, offense_total_yac <dbl>, offense_ave_yac <dbl>,
## #   offense_n_plays_pass <dbl>, offense_n_plays_run <dbl>,
## #   offense_n_interceptions <dbl>, offense_n_fumbles_lost_pass <dbl>,
## #   offense_n_fumbles_lost_run <dbl>, offense_total_epa_pass <dbl>,
## #   offense_total_epa_run <dbl>, offense_ave_epa_pass <dbl>, offense_ave_epa_run <dbl>,
## #   offense_total_wpa_pass <dbl>, offense_total_wpa_run <dbl>,
## #   offense_ave_wpa_pass <dbl>, offense_ave_wpa_run <dbl>,
## #   offense_success_rate_pass <dbl>, offense_success_rate_run <dbl>,
## #   defense_completion_percentage <dbl>, defense_total_yards_gained_pass <dbl>,
## #   defense_total_yards_gained_run <dbl>, defense_ave_yards_gained_pass <dbl>,
## #   defense_ave_yards_gained_run <dbl>, defense_total_air_yards <dbl>,
## #   defense_ave_air_yards <dbl>, defense_total_yac <dbl>, defense_ave_yac <dbl>,
## #   defense_n_plays_pass <dbl>, defense_n_plays_run <dbl>,
## #   defense_n_interceptions <dbl>, defense_n_fumbles_lost_pass <dbl>,
## #   defense_n_fumbles_lost_run <dbl>, defense_total_epa_pass <dbl>,
## #   defense_total_epa_run <dbl>, defense_ave_epa_pass <dbl>, defense_ave_epa_run <dbl>,
## #   defense_total_wpa_pass <dbl>, defense_total_wpa_run <dbl>,
## #   defense_ave_wpa_pass <dbl>, defense_ave_wpa_run <dbl>,
## #   defense_success_rate_pass <dbl>, defense_success_rate_run <dbl>, score_diff <dbl>
```

---

## Implement PCR with the [`pls` package](https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)

Similar syntax to an `lm` formula, but we also specify the number of PCs (`ncomp`)


```r
library(pls)
nfl_pcr_fit <- pcr(score_diff ~ ., ncomp = 2, scale = TRUE, data = nfl_model_data)
summary(nfl_pcr_fit)
```

```
## Data:    X dimension: 512 48
##          Y dimension: 512 1
## Fit method: svdpc
## Number of components considered: 2
## TRAINING: % variance explained
##             1 comps  2 comps
## X             21.41    41.47
## score_diff    80.84    87.86
```

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

To perform PCR, __we need to tune the number of principal components__

.pull-left[

- Tune # components in PCR with [`caret`](http://topepo.github.io/caret/index.html)

- `train` with 10-fold CV using `pcr` from [`pls`](https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)


```r
set.seed(2013)
library(caret)
cv_model_pcr <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pcr",
  trControl = trainControl(method = "cv", number = 10),
* preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
ggplot(cv_model_pcr) + theme_bw()
```

]

.pull-right[
<img src="15-PCR-PLS_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />

]

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

By default, `train` returns the model with the minimum CV error as `finalModel`


```r
summary(cv_model_pcr$finalModel)
```

```
## Data:    X dimension: 512 48
##          Y dimension: 512 1
## Fit method: svdpc
## Number of components considered: 37
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X           21.41    41.47    52.74    61.47    66.30     70.4    74.24    77.56
## .outcome    80.84    87.86    88.09    89.82    89.82     90.1    90.21    90.38
##           9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps
## X           80.31     82.82     84.96     86.86     88.69      90.2     91.61     92.86
## .outcome    90.38     90.79     90.97     91.20     91.22      91.3     91.31     91.44
##           17 comps  18 comps  19 comps  20 comps  21 comps  22 comps  23 comps
## X            93.76     94.61     95.36     96.06     96.74     97.28     97.75
## .outcome     91.44     91.46     91.46     91.63     91.63     91.83     91.84
##           24 comps  25 comps  26 comps  27 comps  28 comps  29 comps  30 comps
## X            98.20     98.56     98.83     99.06     99.28     99.47     99.61
## .outcome     92.16     92.16     92.83     92.88     92.92     93.55     93.63
##           31 comps  32 comps  33 comps  34 comps  35 comps  36 comps  37 comps
## X            99.74     99.85     99.88     99.91     99.93     99.95     99.97
## .outcome     93.90     94.61     94.61     94.65     94.69     94.84     94.86
```

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

Modify `selectionFunction` in `train` to be the `oneSE` rule

.pull-left[


```r
set.seed(2013)
cv_model_pcr_onese <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pcr",
  trControl = trainControl(method = "cv", number = 10,
*                          selectionFunction = "oneSE"),
  preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
```

]

.pull-right[


```r
summary(cv_model_pcr_onese$finalModel)
```

```
## Data:    X dimension: 512 48
##          Y dimension: 512 1
## Fit method: svdpc
## Number of components considered: 32
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X           21.41    41.47    52.74    61.47    66.30     70.4    74.24    77.56
## .outcome    80.84    87.86    88.09    89.82    89.82     90.1    90.21    90.38
##           9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps
## X           80.31     82.82     84.96     86.86     88.69      90.2     91.61     92.86
## .outcome    90.38     90.79     90.97     91.20     91.22      91.3     91.31     91.44
##           17 comps  18 comps  19 comps  20 comps  21 comps  22 comps  23 comps
## X            93.76     94.61     95.36     96.06     96.74     97.28     97.75
## .outcome     91.44     91.46     91.46     91.63     91.63     91.83     91.84
##           24 comps  25 comps  26 comps  27 comps  28 comps  29 comps  30 comps
## X            98.20     98.56     98.83     99.06     99.28     99.47     99.61
## .outcome     92.16     92.16     92.83     92.88     92.92     93.55     93.63
##           31 comps  32 comps
## X            99.74     99.85
## .outcome     93.90     94.61
```

]

---

## Partial least squares (PLS)

__PCR is agnostic to the response variable__

<img src="https://bradleyboehmke.github.io/HOML/images/pls-vs-pcr.png" width="80%" style="display: block; margin: auto;" />

---

## PLS as supervised dimension reduction

__First principal component__ in PCA:

`$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$`

--

In PLS we set `\(\phi_{j1}\)` to the coefficient from __simple linear regression__ of `\(Y\)` on each `\(X_j\)`

- Remember this slope is proportional to the correlation!
  `\(\widehat{\beta} = r_{X,Y} \cdot \frac{s_Y}{s_X}\)`

- Thus `\(Z_1\)` in PLS places most weight on variables strongly related to the response `\(Y\)`

--

To compute `\(Z_2\)` for PLS:

- Regress each `\(X_j\)` on `\(Z_1\)`; the residuals capture the signal not explained by `\(Z_1\)`

- Set `\(\phi_{j2}\)` to the coefficient from __simple linear regression__ of `\(Y\)` on these residuals for each variable

--

Repeat this process until all `\(Z_1, Z_2, \dots, Z_p\)` are computed (__PLS components__)

Then regress `\(Y\)` on `\(Z_1, Z_2, \dots, Z_{p^*}\)`, where `\(p^* < p\)` is a tuning parameter

---

## Tuning PLS with [`caret`](http://topepo.github.io/caret/index.html)

.pull-left[


```r
set.seed(2013)
cv_model_pls <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pls",
  trControl = trainControl(method = "cv", number = 10,
                           selectionFunction = "oneSE"),
  preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
ggplot(cv_model_pls) + theme_bw()
```

Sharp contrast with the PCR results! Fewer PLS components are needed because they are guided by the response variable

]

.pull-right[

<img src="15-PCR-PLS_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" />

]

--

_But how do we summarize variable relationships without a single coefficient?_

---

## Variable importance with the [`vip` package](https://cran.r-project.org/web/packages/vip/vignettes/vip-introduction.pdf)

__Variable importance__ attempts to quantify how influential variables are in the model

- e.g., absolute value of the `\(t\)`-statistic in regression

--

__For PLS__: weighted sums of the absolute regression coefficients across components

- Weights are a function of the reduction in RSS across the number of PLS components

--

.pull-left[


```r
# Check out `cv_model_pls$finalModel$coefficients`
library(vip)
*vip(cv_model_pls, num_features = 10,
*    method = "model") +
  theme_bw()
```

]

.pull-right[

<img src="15-PCR-PLS_files/figure-html/unnamed-chunk-8-1.png" width="576" />

]

---

## Partial dependence plots (PDP) with the [`pdp` package](https://bgreenwell.github.io/pdp/index.html)

PDPs display the change in the average predicted response as a predictor varies over its marginal distribution

- More useful for the non-linear models later on!


```r
library(pdp)
*partial(cv_model_pls, "offense_total_epa_pass", plot = TRUE)
```

<img src="15-PCR-PLS_files/figure-html/pdp-example-1.png" width="576" style="display: block; margin: auto;" />
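---

## What is `partial()` actually computing?

A minimal sketch of a partial dependence curve computed "by hand", to make the averaging explicit. This uses simulated data and a plain `lm()` fit as stand-ins for the NFL data and the tuned PLS model; the variable names `x1`, `x2`, and `y` are made up for illustration


```r
# Hedged sketch: partial dependence by hand on simulated data
# (lm() stands in for the tuned PLS model)
set.seed(1)
sim_data <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim_data$y <- 2 * sim_data$x1 - sim_data$x2 + rnorm(200)
lm_fit <- lm(y ~ ., data = sim_data)

# Grid of values over the marginal distribution of the predictor of interest
x1_grid <- seq(min(sim_data$x1), max(sim_data$x1), length.out = 20)

# For each grid value: fix x1 at that value for EVERY row,
# predict, and average the predictions
pd_curve <- sapply(x1_grid, function(x) {
  mean(predict(lm_fit, newdata = transform(sim_data, x1 = x)))
})

plot(x1_grid, pd_curve, type = "l",
     xlab = "x1", ylab = "Average predicted response")
```

For a linear model the resulting curve is just a straight line with slope equal to the fitted `x1` coefficient; the averaging step only becomes interesting for the non-linear models later on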