class: center, middle, inverse, title-slide

# Supervised Learning

## Principal component regression and partial least squares

### July 6th, 2021

---

## Principal component regression (PCR)

<img src="https://bradleyboehmke.github.io/HOML/images/pcr-steps.png" width="50%" style="display: block; margin: auto;" />

---

## Example data: NFL teams summary

Dataset created with [`nflfastR`](https://www.nflfastr.com/), summarizing NFL team performance from 1999 to 2020


```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/regression_projects/nfl_team_season_summary.csv")
nfl_model_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed) %>%
  # Only use rows with air yards
  filter(season >= 2006) %>%
  dplyr::select(-wins, -losses, -ties, -points_scored, -points_allowed, -season, -team)
nfl_model_data
```

```
## # A tibble: 480 x 49
##    offense_completio… offense_total_yard… offense_total_yard… offense_ave_yard… offense_ave_yard…
##                 <dbl>               <dbl>               <dbl>             <dbl>             <dbl>
##  1              0.561                3662                1350              6.40              3.28
##  2              0.480                2371                2946              5.10              5.56
##  3              0.612                3435                1667              6.41              3.74
##  4              0.564                2718                1555              5.70              3.73
##  5              0.569                3264                1674              5.72              4.10
##  6              0.525                3286                1940              6.12              4.02
##  7              0.588                3827                1648              6.88              3.91
##  8              0.565                2893                1347              5.16              3.69
##  9              0.569                3838                1954              7.04              4.23
## 10              0.529                2799                2167              5.78              4.58
## # … with 470 more rows, and 44 more variables: offense_total_air_yards <dbl>,
## #   offense_ave_air_yards <dbl>, offense_total_yac <dbl>, offense_ave_yac <dbl>,
## #   offense_n_plays_pass <dbl>, offense_n_plays_run <dbl>, offense_n_interceptions <dbl>,
## #   offense_n_fumbles_lost_pass <dbl>, offense_n_fumbles_lost_run <dbl>,
## #   offense_total_epa_pass <dbl>, offense_total_epa_run <dbl>, offense_ave_epa_pass <dbl>,
## #   offense_ave_epa_run <dbl>, offense_total_wpa_pass <dbl>, offense_total_wpa_run <dbl>,
## #   offense_ave_wpa_pass <dbl>, offense_ave_wpa_run <dbl>, offense_success_rate_pass <dbl>,
## #   offense_success_rate_run <dbl>, defense_completion_percentage <dbl>,
## #   defense_total_yards_gained_pass <dbl>, defense_total_yards_gained_run <dbl>,
## #   defense_ave_yards_gained_pass <dbl>, defense_ave_yards_gained_run <dbl>,
## #   defense_total_air_yards <dbl>, defense_ave_air_yards <dbl>, defense_total_yac <dbl>,
## #   defense_ave_yac <dbl>, defense_n_plays_pass <dbl>, defense_n_plays_run <dbl>,
## #   defense_n_interceptions <dbl>, defense_n_fumbles_lost_pass <dbl>,
## #   defense_n_fumbles_lost_run <dbl>, defense_total_epa_pass <dbl>, defense_total_epa_run <dbl>,
## #   defense_ave_epa_pass <dbl>, defense_ave_epa_run <dbl>, defense_total_wpa_pass <dbl>,
## #   defense_total_wpa_run <dbl>, defense_ave_wpa_pass <dbl>, defense_ave_wpa_run <dbl>,
## #   defense_success_rate_pass <dbl>, defense_success_rate_run <dbl>, score_diff <dbl>
```

---

## Implement PCR with [`pls` package](https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)

Uses formula syntax similar to `lm`, but you also specify the number of PCs (`ncomp`)


```r
library(pls)
nfl_pcr_fit <- pcr(score_diff ~ ., ncomp = 2, scale = TRUE, data = nfl_model_data)
summary(nfl_pcr_fit)
```

```
## Data: X dimension: 480 48
## 	Y dimension: 480 1
## Fit method: svdpc
## Number of components considered: 2
## TRAINING: % variance explained
##             1 comps  2 comps
## X             21.49    41.62
## score_diff    84.01    87.49
```
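---

## PCR by hand (sketch)

To make the diagram's two steps concrete, here is a minimal sketch, assuming the `nfl_model_data` from the previous slides (illustration only, not the `pls` package internals): run PCA on the standardized predictors, then regress the response on the leading PC scores.


```r
# Step 1: PCA on the centered and scaled predictors
nfl_x <- dplyr::select(nfl_model_data, -score_diff)
nfl_pca <- prcomp(nfl_x, center = TRUE, scale. = TRUE)
# Step 2: regress the response on the first two PC scores;
# this should recover the same fit as pcr(..., ncomp = 2, scale = TRUE)
pcr_by_hand <- lm(nfl_model_data$score_diff ~ nfl_pca$x[, 1:2])
```

The separation of the two steps is exactly why the number of retained components has to be tuned: PCA never looks at `score_diff` when constructing the directions.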
---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

To perform PCR __we need to tune the number of principal components__

.pull-left[

- Tune the number of components in PCR with [`caret`](http://topepo.github.io/caret/index.html)

- `train` with 10-fold CV using `pcr` from [`pls`](https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)


```r
set.seed(2013)
library(caret)
cv_model_pcr <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pcr",
  trControl = trainControl(method = "cv", number = 10),
* preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
ggplot(cv_model_pcr) + theme_bw()
```

]

.pull-right[

<img src="17-PCR-PLS_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />

]

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

By default, `train` returns the model with the minimum CV error as `finalModel`


```r
summary(cv_model_pcr$finalModel)
```

```
## Data: X dimension: 480 48
## 	Y dimension: 480 1
## Fit method: svdpc
## Number of components considered: 43
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
## X           21.49    41.62    53.04    61.70    66.63    70.79     74.5    77.81    80.56
## .outcome    84.01    87.49    87.75    89.69    89.69    90.06     90.2    90.31    90.32
##           10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps
## X            83.10     85.27     87.18     88.99     90.43     91.85     93.10     94.00
## .outcome     90.66     90.82     91.03     91.05     91.12     91.15     91.28     91.28
##           18 comps  19 comps  20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## X            94.76     95.49     96.18     96.79     97.30     97.77     98.21     98.57
## .outcome     91.28     91.28     91.34     91.71     91.73     91.80     91.95     91.96
##           26 comps  27 comps  28 comps  29 comps  30 comps  31 comps  32 comps  33 comps
## X            98.84     99.07     99.29     99.47     99.61     99.74     99.85     99.88
## .outcome     92.64     92.72     92.79     93.46     93.53     93.81     94.51     94.52
##           34 comps  35 comps  36 comps  37 comps  38 comps  39 comps  40 comps  41 comps
## X            99.91     99.94     99.95     99.97     99.98     99.99     99.99     99.99
## .outcome     94.55     94.61     94.79     94.80     94.82     94.82     94.83     94.83
##           42 comps  43 comps
## X            99.99    100.00
## .outcome     94.89     94.96
```

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

Modify `selectionFunction` in `train` to use the `oneSE` rule

.pull-left[

```r
set.seed(2013)
cv_model_pcr_onese <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pcr",
  trControl = trainControl(method = "cv", number = 10,
*                          selectionFunction = "oneSE"),
  preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
```

]

.pull-right[

```r
summary(cv_model_pcr_onese$finalModel)
```

```
## Data: X dimension: 480 48
## 	Y dimension: 480 1
## Fit method: svdpc
## Number of components considered: 32
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
## X           21.49    41.62    53.04    61.70    66.63    70.79     74.5    77.81    80.56
## .outcome    84.01    87.49    87.75    89.69    89.69    90.06     90.2    90.31    90.32
##           10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps
## X            83.10     85.27     87.18     88.99     90.43     91.85     93.10     94.00
## .outcome     90.66     90.82     91.03     91.05     91.12     91.15     91.28     91.28
##           18 comps  19 comps  20 comps  21 comps  22 comps  23 comps  24 comps  25 comps
## X            94.76     95.49     96.18     96.79     97.30     97.77     98.21     98.57
## .outcome     91.28     91.28     91.34     91.71     91.73     91.80     91.95     91.96
##           26 comps  27 comps  28 comps  29 comps  30 comps  31 comps  32 comps
## X            98.84     99.07     99.29     99.47     99.61     99.74     99.85
## .outcome     92.64     92.72     92.79     93.46     93.53     93.81     94.51
```

]

---

## Partial least squares (PLS)

__PCR is agnostic of the response variable__

<img src="https://bradleyboehmke.github.io/HOML/images/pls-vs-pcr.png" width="80%" style="display: block; margin: auto;" />

---

## PLS as supervised dimension reduction

__First principal component__ in PCA:

`$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$`

--

In PLS we set `\(\phi_{j1}\)` to the coefficient from the __simple linear regression__ of `\(Y\)` on each `\(X_j\)`

- Remember this slope is proportional to the correlation! `\(\widehat{\beta} = r_{X,Y} \cdot \frac{s_Y}{s_X}\)`

- Thus `\(Z_1\)` in PLS places the most weight on variables strongly related to the response `\(Y\)`

--

To compute `\(Z_2\)` for PLS:

- Regress each `\(X_j\)` on `\(Z_1\)`; the residuals capture the signal not explained by `\(Z_1\)`

- Set `\(\phi_{j2}\)` to the coefficient from the __simple linear regression__ of `\(Y\)` on these residuals for each variable

--

Repeat this process until all of `\(Z_1, Z_2, \dots, Z_p\)` are computed (the __PLS components__)

Then regress `\(Y\)` on `\(Z_1, Z_2, \dots, Z_{p^*}\)`, where `\(p^* < p\)` is a tuning parameter (the next slide sketches `\(Z_1\)` in code)
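---

## First PLS direction by hand (sketch)

A minimal sketch of the `\(Z_1\)` recipe above, assuming `nfl_model_data` from the earlier slides (illustration only; the `pls` package uses a more efficient algorithm).


```r
# Standardize predictors so each simple regression slope is comparable
x_std <- scale(dplyr::select(nfl_model_data, -score_diff))
y <- nfl_model_data$score_diff
# phi_j1: slope from the simple linear regression of y on each X_j
phi_1 <- apply(x_std, 2, function(x_j) coef(lm(y ~ x_j))[2])
# First PLS component: weighted sum of the standardized predictors
z_1 <- x_std %*% phi_1
```

Unlike the first PC from `prcomp`, this direction is built from `score_diff` itself, which is why PLS components are guided by the response.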
---

## Tuning PLS with [`caret`](http://topepo.github.io/caret/index.html)

.pull-left[

```r
set.seed(2013)
cv_model_pls <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pls",
  trControl = trainControl(method = "cv", number = 10,
                           selectionFunction = "oneSE"),
  preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
ggplot(cv_model_pls) + theme_bw()
```

A sharp contrast with the PCR results! Fewer PLS components are needed because they are guided by the response variable

]

.pull-right[

<img src="17-PCR-PLS_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" />

]

--

_But how do we summarize variable relationships without a single coefficient?_

---

## Variable importance with [`vip` package](https://cran.r-project.org/web/packages/vip/vignettes/vip-introduction.pdf)

__Variable importance__ attempts to quantify how influential variables are in the model

- e.g., the absolute value of the `\(t\)`-statistic in regression

--

__For PLS__: weighted sums of the absolute regression coefficients across components

- The weights are a function of the reduction in RSS across the number of PLS components

--

.pull-left[

```r
# Check out `cv_model_pls$finalModel$coefficients`
library(vip)
*vip(cv_model_pls, num_features = 10,
*    method = "model") +
  theme_bw()
```

]

.pull-right[

<img src="17-PCR-PLS_files/figure-html/unnamed-chunk-8-1.png" width="576" />

]

---

## Partial dependence plots (PDP) with [`pdp` package](https://bgreenwell.github.io/pdp/index.html)

PDPs display the change in the average predicted response as a predictor varies over its marginal distribution

- More useful for the non-linear models later on!


```r
library(pdp)
*partial(cv_model_pls, "offense_total_epa_pass", plot = TRUE)
```

<img src="17-PCR-PLS_files/figure-html/pdp-example-1.png" width="576" style="display: block; margin: auto;" />
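---

## Comparing the tuned PCR and PLS fits (sketch)

A hedged sketch for a head-to-head comparison: both `train()` calls were preceded by `set.seed(2013)` with the same 10-fold CV setup, so the folds should line up and `caret::resamples()` can pair the fold-level errors.


```r
# Collect the matched CV results from the two tuned models
pcr_pls_resamples <- resamples(list(pcr = cv_model_pcr_onese,
                                    pls = cv_model_pls))
# Fold-level RMSE, MAE, and R-squared summaries for each model
summary(pcr_pls_resamples)
```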