class: center, middle, inverse, title-slide

# Supervised Learning
## Principal component regression and partial least squares
### July 7th, 2022

---

## Principal component regression (PCR)

<img src="https://bradleyboehmke.github.io/HOML/images/pcr-steps.png" width="50%" style="display: block; margin: auto;" />

---

## Example data: NFL teams summary

Created dataset using [`nflfastR`](https://www.nflfastr.com/) summarizing NFL team performances from 1999 to 2021


```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/regression_examples/nfl_team_season_summary.csv")
nfl_model_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed) %>%
  # Only use rows with air yards
  filter(season >= 2006) %>%
  dplyr::select(-wins, -losses, -ties, -points_scored, -points_allowed,
                -season, -team)
nfl_model_data
```

```
## # A tibble: 512 x 49
##    offense_completi… offense_total_y… offense_total_y… offense_ave_yar… offense_ave_yar…
##                <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
##  1             0.561             3662             1350             6.40             3.28
##  2             0.480             2371             2946             5.10             5.56
##  3             0.612             3435             1667             6.41             3.74
##  4             0.564             2718             1555             5.70             3.73
##  5             0.569             3264             1674             5.72             4.10
##  6             0.525             3286             1940             6.12             4.02
##  7             0.588             3827             1648             6.88             3.91
##  8             0.565             2893             1347             5.16             3.69
##  9             0.569             3838             1954             7.04             4.23
## 10             0.529             2799             2167             5.78             4.58
## # … with 502 more rows, and 44 more variables: offense_total_air_yards <dbl>,
## #   offense_ave_air_yards <dbl>, offense_total_yac <dbl>, offense_ave_yac <dbl>,
## #   offense_n_plays_pass <dbl>, offense_n_plays_run <dbl>,
## #   offense_n_interceptions <dbl>, offense_n_fumbles_lost_pass <dbl>,
## #   offense_n_fumbles_lost_run <dbl>, offense_total_epa_pass <dbl>,
## #   offense_total_epa_run <dbl>, offense_ave_epa_pass <dbl>, offense_ave_epa_run <dbl>,
## #   offense_total_wpa_pass <dbl>, offense_total_wpa_run <dbl>,
## #   offense_ave_wpa_pass <dbl>, offense_ave_wpa_run <dbl>,
## #   offense_success_rate_pass <dbl>, offense_success_rate_run <dbl>,
## #   defense_completion_percentage <dbl>, defense_total_yards_gained_pass <dbl>,
## #   defense_total_yards_gained_run <dbl>, defense_ave_yards_gained_pass <dbl>,
## #   defense_ave_yards_gained_run <dbl>, defense_total_air_yards <dbl>,
## #   defense_ave_air_yards <dbl>, defense_total_yac <dbl>, defense_ave_yac <dbl>,
## #   defense_n_plays_pass <dbl>, defense_n_plays_run <dbl>,
## #   defense_n_interceptions <dbl>, defense_n_fumbles_lost_pass <dbl>,
## #   defense_n_fumbles_lost_run <dbl>, defense_total_epa_pass <dbl>,
## #   defense_total_epa_run <dbl>, defense_ave_epa_pass <dbl>, defense_ave_epa_run <dbl>,
## #   defense_total_wpa_pass <dbl>, defense_total_wpa_run <dbl>,
## #   defense_ave_wpa_pass <dbl>, defense_ave_wpa_run <dbl>,
## #   defense_success_rate_pass <dbl>, defense_success_rate_run <dbl>, score_diff <dbl>
```

---

## Implement PCR with the [`pls` package](https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)

Similar syntax to an `lm` formula, but we also specify the number of PCs (`ncomp`)


```r
library(pls)
nfl_pcr_fit <- pcr(score_diff ~ ., ncomp = 2, scale = TRUE, data = nfl_model_data)
summary(nfl_pcr_fit)
```

```
## Data:    X dimension: 512 48
##          Y dimension: 512 1
## Fit method: svdpc
## Number of components considered: 2
## TRAINING: % variance explained
##             1 comps  2 comps
## X             21.41    41.47
## score_diff    80.84    87.86
```

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

To perform PCR, __we need to tune the number of principal components__

.pull-left[

- Tune # components in PCR with [`caret`](http://topepo.github.io/caret/index.html)

- `train` with 10-fold CV using `pcr` from [`pls`](https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf)


```r
set.seed(2013)
library(caret)
cv_model_pcr <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pcr",
  trControl = trainControl(method = "cv", number = 10),
* preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
ggplot(cv_model_pcr) + theme_bw()
```

]

.pull-right[
<img src="15-PCR-PLS_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />

]

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

By default, `train` returns the model with the minimum CV error as `finalModel`


```r
summary(cv_model_pcr$finalModel)
```

```
## Data:    X dimension: 512 48
##          Y dimension: 512 1
## Fit method: svdpc
## Number of components considered: 37
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X           21.41    41.47    52.74    61.47    66.30     70.4    74.24    77.56
## .outcome    80.84    87.86    88.09    89.82    89.82     90.1    90.21    90.38
##           9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps
## X           80.31     82.82     84.96     86.86     88.69      90.2     91.61     92.86
## .outcome    90.38     90.79     90.97     91.20     91.22      91.3     91.31     91.44
##           17 comps  18 comps  19 comps  20 comps  21 comps  22 comps  23 comps
## X            93.76     94.61     95.36     96.06     96.74     97.28     97.75
## .outcome     91.44     91.46     91.46     91.63     91.63     91.83     91.84
##           24 comps  25 comps  26 comps  27 comps  28 comps  29 comps  30 comps
## X            98.20     98.56     98.83     99.06     99.28     99.47     99.61
## .outcome     92.16     92.16     92.83     92.88     92.92     93.55     93.63
##           31 comps  32 comps  33 comps  34 comps  35 comps  36 comps  37 comps
## X            99.74     99.85     99.88     99.91     99.93     99.95     99.97
## .outcome     93.90     94.61     94.61     94.65     94.69     94.84     94.86
```

---

## Tuning PCR with [`caret`](http://topepo.github.io/caret/index.html)

Modify `selectionFunction` in `train` to be the `oneSE` rule

.pull-left[


```r
set.seed(2013)
cv_model_pcr_onese <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pcr",
  trControl = trainControl(method = "cv", number = 10,
*                          selectionFunction = "oneSE"),
  preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
```

]

.pull-right[


```r
summary(cv_model_pcr_onese$finalModel)
```

```
## Data:    X dimension: 512 48
##          Y dimension: 512 1
## Fit method: svdpc
## Number of components considered: 32
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X           21.41    41.47    52.74    61.47    66.30     70.4    74.24    77.56
## .outcome    80.84    87.86    88.09    89.82    89.82     90.1    90.21    90.38
##           9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps
## X           80.31     82.82     84.96     86.86     88.69      90.2     91.61     92.86
## .outcome    90.38     90.79     90.97     91.20     91.22      91.3     91.31     91.44
##           17 comps  18 comps  19 comps  20 comps  21 comps  22 comps  23 comps
## X            93.76     94.61     95.36     96.06     96.74     97.28     97.75
## .outcome     91.44     91.46     91.46     91.63     91.63     91.83     91.84
##           24 comps  25 comps  26 comps  27 comps  28 comps  29 comps  30 comps
## X            98.20     98.56     98.83     99.06     99.28     99.47     99.61
## .outcome     92.16     92.16     92.83     92.88     92.92     93.55     93.63
##           31 comps  32 comps
## X            99.74     99.85
## .outcome     93.90     94.61
```

]

---

## Partial least squares (PLS)

__PCR is agnostic to the response variable__

<img src="https://bradleyboehmke.github.io/HOML/images/pls-vs-pcr.png" width="80%" style="display: block; margin: auto;" />

---

## PLS as supervised dimension reduction

__First principal component__ in PCA:

`$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$`

--

In PLS we set `\(\phi_{j1}\)` to the coefficient from __simple linear regression__ of `\(Y\)` on each `\(X_j\)`

- Remember this slope is proportional to the correlation!
  `\(\widehat{\beta} = r_{X,Y} \cdot \frac{s_Y}{s_X}\)`

- Thus `\(Z_1\)` in PLS places most weight on variables strongly related to the response `\(Y\)`

--

To compute `\(Z_2\)` for PLS:

- Regress each `\(X_j\)` on `\(Z_1\)`; the residuals capture the signal not explained by `\(Z_1\)`

- Set `\(\phi_{j2}\)` to the coefficient from __simple linear regression__ of `\(Y\)` on these residuals for each variable

--

Repeat this process until all `\(Z_1, Z_2, \dots, Z_p\)` are computed (__PLS components__)

Then regress `\(Y\)` on `\(Z_1, Z_2, \dots, Z_{p^*}\)`, where `\(p^* < p\)` is a tuning parameter

---

## Tuning PLS with [`caret`](http://topepo.github.io/caret/index.html)

.pull-left[


```r
set.seed(2013)
cv_model_pls <- train(
  score_diff ~ .,
  data = nfl_model_data,
* method = "pls",
  trControl = trainControl(method = "cv", number = 10,
                           selectionFunction = "oneSE"),
  preProcess = c("center", "scale"),
  tuneLength = ncol(nfl_model_data) - 1)
ggplot(cv_model_pls) + theme_bw()
```

Sharp contrast with the PCR results! Fewer PLS components are needed because they are guided by the response variable

]

.pull-right[

<img src="15-PCR-PLS_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" />

]

--

_But how do we summarize variable relationships without a single coefficient?_

---

## Variable importance with the [`vip` package](https://cran.r-project.org/web/packages/vip/vignettes/vip-introduction.pdf)

__Variable importance__ attempts to quantify how influential variables are in the model

- e.g., absolute value of the `\(t\)`-statistic in regression

--

__For PLS__: weighted sums of the absolute regression coefficients across components

- Weights are a function of the reduction in RSS across the number of PLS components

--

.pull-left[


```r
# Check out `cv_model_pls$finalModel$coefficients`
library(vip)
*vip(cv_model_pls, num_features = 10,
*    method = "model") +
  theme_bw()
```

]

.pull-right[

<img src="15-PCR-PLS_files/figure-html/unnamed-chunk-8-1.png" width="576" />

]

---

## Partial dependence plots (PDP) with the [`pdp` package](https://bgreenwell.github.io/pdp/index.html)

PDPs display the change in the average predicted response as a predictor varies over its marginal distribution

- More useful for the non-linear models later on!


```r
library(pdp)
*partial(cv_model_pls, "offense_total_epa_pass", plot = TRUE)
```

<img src="15-PCR-PLS_files/figure-html/pdp-example-1.png" width="576" style="display: block; margin: auto;" />
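---

## What is `partial()` actually computing?

A minimal sketch of a partial dependence curve computed "by hand", to make the averaging explicit. This uses simulated data and a plain `lm()` fit as stand-ins for the NFL data and the tuned PLS model; the variable names `x1`, `x2`, and `y` are made up for illustration


```r
# Hedged sketch: partial dependence by hand on simulated data
# (lm() stands in for the tuned PLS model)
set.seed(1)
sim_data <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim_data$y <- 2 * sim_data$x1 - sim_data$x2 + rnorm(200)
lm_fit <- lm(y ~ ., data = sim_data)

# Grid of values over the marginal distribution of the predictor of interest
x1_grid <- seq(min(sim_data$x1), max(sim_data$x1), length.out = 20)

# For each grid value: fix x1 at that value for EVERY row,
# predict, and average the predictions
pd_curve <- sapply(x1_grid, function(x) {
  mean(predict(lm_fit, newdata = transform(sim_data, x1 = x)))
})

plot(x1_grid, pd_curve, type = "l",
     xlab = "x1", ylab = "Average predicted response")
```

For a linear model the resulting curve is just a straight line with slope equal to the fitted `x1` coefficient; the averaging step only becomes interesting for the non-linear models later on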