class: center, middle, inverse, title-slide

# Supervised Learning
## Intro to variable selection
### June 28th, 2022

---

## The setting

We wish to learn a linear model. Our estimate (denoted by hats) is

$$
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p
$$

--

Why would we attempt to select a __subset__ of the `\(p\)` variables?

--

- *To improve prediction accuracy*

  - Eliminating uninformative predictors lowers the variance of our fitted model, which can lead to lower test-set MSE at the expense of a slight increase in bias

--

- *To improve model interpretability*

  - Eliminating uninformative predictors makes it easier to tell the story of how your predictors are associated with your response

---

## Best subset selection

- Start with the __null model__ `\(\mathcal{M}_0\)` (intercept-only) that has no predictors

  - just predicts the sample mean for each observation

--

- For `\(k = 1, 2, \dots, p\)` (each possible number of predictors)

  - Fit __all__ `\(\binom{p}{k} = \frac{p!}{k!(p-k)!}\)` models with exactly `\(k\)` predictors

  - Pick the best (by some criterion) among these `\(\binom{p}{k}\)` models, call it `\(\mathcal{M}_k\)`

    - The criterion is up to the user: cross-validation error, highest adjusted `\(R^2\)`, etc.

--

- Select a single best model from among `\(\mathcal{M}_0, \dots, \mathcal{M}_p\)` (a short code sketch of this procedure is on the next slide)

--

__This is not typically used in research!__

- only practical for a smaller number of variables

--

- arbitrary way of defining __best__ and ignores __prior knowledge__ about potential predictors
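---

## Best subset selection: a code sketch

A minimal sketch of the procedure above, assuming the [`leaps`](https://cran.r-project.org/package=leaps) package is installed and using the built-in `mtcars` data purely for illustration:

```r
library(leaps)
# exhaustive search: best model of each size k = 1, ..., 10
best_subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
best_summary <- summary(best_subsets)
# one (arbitrary) criterion: the size with the highest adjusted R^2
best_k <- which.max(best_summary$adjr2)
coef(best_subsets, id = best_k)
```

The same caveats apply: the choice of "best" is arbitrary, and the search ignores any prior knowledge about the predictors.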
---

## Use the shoe leather approach

[Prof. David Freedman](https://en.wikipedia.org/wiki/David_A._Freedman):

- algorithms can be tempting, but they are NOT substitutes for careful thought and subject-matter knowledge!

- you should NOT avoid the hard work of EDA in your modeling efforts

--

__Variable selection is a difficult problem!__

- Like much of statistics & data science research, there is not one unique, correct answer

--

You should justify which predictors / variables are used in modeling based on:

- __context__,

- __extensive EDA__, and 

- __model assessment based on holdout predictions__

---

## Covariance and correlation

- __Covariance__ is a measure of the __linear__ dependence between two variables

  - To be _"uncorrelated"_ is not the same as to be _"independent"_...

  - Independence means __there is no dependence__, linear or otherwise

--

- __Correlation__ is a _normalized_ form of covariance, ranging from -1 through 0 to 1

  - -1 means a perfect negative linear relationship: as one variable increases, the other decreases linearly

  - 0 means no linear dependence

  - 1 means a perfect positive linear relationship: the two variables increase linearly together

--

- We can use the `cov()` / `cor()` functions in `R` to generate the __covariance__ / __correlation__ matrices (see the small example on the next slide)
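---

## Uncorrelated does not mean independent

A small simulated example to make the point concrete: a quadratic relationship can have near-zero correlation even though the two variables are clearly dependent.

```r
set.seed(42)
x <- rnorm(1000)
y_linear <- 2 * x + rnorm(1000)        # strong linear dependence
y_quad <- x^2 + rnorm(1000, sd = 0.1)  # strong dependence, but not linear

cov(x, y_linear)  # large positive covariance
cor(x, y_linear)  # correlation close to +1
cor(x, y_quad)    # correlation close to 0, yet y_quad is a function of x
```

This is exactly why correlation values alone are not enough and you should always plot your data.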
---

## Example data: NFL teams summary

Created dataset using [`nflfastR`](https://www.nflfastr.com/) summarizing NFL team performances from 1999 to 2021

```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/regression_examples/nfl_team_season_summary.csv")
nfl_teams_data
```

```
## # A tibble: 733 x 55
##    season team  offense_completi… offense_total_ya… offense_total_y… offense_ave_yar…
##     <dbl> <chr>             <dbl>             <dbl>            <dbl>            <dbl>
##  1   1999 ARI               0.477              2796             1209             4.67
##  2   1999 ATL               0.504              3317             1176             6.08
##  3   1999 BAL               0.452              2805             1663             5.07
##  4   1999 BUF               0.540              3275             2038             6.17
##  5   1999 CAR               0.552              4144             1484             6.68
##  6   1999 CHI               0.561              4090             1359             5.75
##  7   1999 CIN               0.498              3178             1971             5.37
##  8   1999 CLE               0.489              2574             1140             4.71
##  9   1999 DAL               0.560              3083             2054             5.95
## 10   1999 DEN               0.546              3378             1852             5.85
## # … with 723 more rows, and 49 more variables: offense_ave_yards_gained_run <dbl>,
## #   offense_total_air_yards <dbl>, offense_ave_air_yards <dbl>,
## #   offense_total_yac <dbl>, offense_ave_yac <dbl>, offense_n_plays_pass <dbl>,
## #   offense_n_plays_run <dbl>, offense_n_interceptions <dbl>,
## #   offense_n_fumbles_lost_pass <dbl>, offense_n_fumbles_lost_run <dbl>,
## #   offense_total_epa_pass <dbl>, offense_total_epa_run <dbl>,
## #   offense_ave_epa_pass <dbl>, offense_ave_epa_run <dbl>,
## #   offense_total_wpa_pass <dbl>, offense_total_wpa_run <dbl>,
## #   offense_ave_wpa_pass <dbl>, offense_ave_wpa_run <dbl>,
## #   offense_success_rate_pass <dbl>, offense_success_rate_run <dbl>,
## #   defense_completion_percentage <dbl>, defense_total_yards_gained_pass <dbl>,
## #   defense_total_yards_gained_run <dbl>, defense_ave_yards_gained_pass <dbl>,
## #   defense_ave_yards_gained_run <dbl>, defense_total_air_yards <dbl>,
## #   defense_ave_air_yards <dbl>, defense_total_yac <dbl>, defense_ave_yac <dbl>,
## #   defense_n_plays_pass <dbl>, defense_n_plays_run <dbl>,
## #   defense_n_interceptions <dbl>, defense_n_fumbles_lost_pass <dbl>,
## #   defense_n_fumbles_lost_run <dbl>, defense_total_epa_pass <dbl>,
## #   defense_total_epa_run <dbl>, defense_ave_epa_pass <dbl>,
## #   defense_ave_epa_run <dbl>, defense_total_wpa_pass <dbl>,
## #   defense_total_wpa_run <dbl>, defense_ave_wpa_pass <dbl>,
## #   defense_ave_wpa_run <dbl>, defense_success_rate_pass <dbl>,
## #   defense_success_rate_run <dbl>, points_scored <dbl>, points_allowed <dbl>,
## #   wins <dbl>, losses <dbl>, ties <dbl>
```

---

## Modeling NFL score differential

.pull-left[

Interested in modeling a team's __score differential__

```r
nfl_teams_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed)
nfl_teams_data %>%
  ggplot(aes(x = score_diff)) +
  geom_histogram(color = "black", fill = "darkblue", alpha = 0.3) +
  theme_bw() +
  labs(x = "Score differential")
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-2-1.png" width="504" />
]

---

### Correlation matrix of score differential and candidate predictors

.pull-left[

- Interested in `score_diff` relationships with team passing and rush statistics

- View the correlation matrix with [`ggcorrplot`](https://rpkgs.datanovia.com/ggcorrplot/)

```r
library(ggcorrplot)
nfl_model_data <- nfl_teams_data %>%
  dplyr::select(score_diff, offense_ave_epa_pass, offense_ave_epa_run,
                defense_ave_epa_pass, defense_ave_epa_run,
                offense_ave_yards_gained_pass, offense_ave_yards_gained_run,
                defense_ave_yards_gained_pass, defense_ave_yards_gained_run)
*nfl_cor_matrix <- cor(nfl_model_data)
*ggcorrplot(nfl_cor_matrix)
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]

---

## Customize the appearance of the correlation matrix

.pull-left[

- Avoid redundancy by only using one half of matrix with `type`

- Add correlation value labels using `lab` (but round first!)

- Can arrange variables based on clustering...

```r
round_cor_matrix <- 
*  round(cor(nfl_model_data), 2)
ggcorrplot(round_cor_matrix, 
*           hc.order = TRUE,
*           type = "lower",
*           lab = TRUE)
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---

## Clustering variables using the correlation matrix

Apply [hierarchical clustering](http://stat.cmu.edu/cmsac/sure/2021/materials/lectures/slides/07-Hierarchical-clustering.html#1) to variables instead of observations

--

- Select the explanatory variables of interest from our data

```r
nfl_ex_vars <- dplyr::select(nfl_model_data, -score_diff)
```

--

- Compute correlation matrix of these variables:

```r
exp_cor_matrix <- cor(nfl_ex_vars)
```

--

- Correlations measure similarity and can be negative __BUT__ distances measure dissimilarity and __CANNOT__

- Convert your correlations to all be `\(\geq 0\)`: e.g., `\(1 - |\rho|\)` (which drops the sign) or `\(1 - \rho\)`

```r
cor_dist_matrix <- 1 - abs(exp_cor_matrix)
```

--

- Convert to distance matrix before using `hclust`

```r
cor_dist_matrix <- as.dist(cor_dist_matrix)
```

---

## Clustering variables using the correlation matrix

.pull-left[

- Cluster variables using `hclust()` as before!

- Use [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html) to quickly visualize dendrogram

```r
library(ggdendro)
*nfl_exp_hc <- hclust(cor_dist_matrix, "complete")
*ggdendrogram(nfl_exp_hc, 
*             rotate = TRUE,
*             size = 2)
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]

---

## Clustering variables using the correlation matrix

.pull-left[

- Another flexible option is [`dendextend`](https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html)

```r
library(dendextend)
cor_dist_matrix %>%
  hclust() %>%
*  as.dendrogram() %>%
  set("branches_k_col", k = 2) %>%
  set("labels_cex", .5) %>%
  ggplot(horiz = TRUE)
```

- Explore the [package documentation](https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html) for more formatting

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]
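---

## Using the variable clusters for selection

One ad hoc option is to cut the dendrogram into a small number of clusters and keep a single representative per cluster, e.g. the variable most correlated with the response; the choice of `k = 2` below is purely for illustration:

```r
# assign each explanatory variable to a cluster
var_clusters <- cutree(nfl_exp_hc, k = 2)
split(names(var_clusters), var_clusters)

# within each cluster, pick the variable most correlated with score_diff
sapply(split(names(var_clusters), var_clusters), function(cluster_vars) {
  cors <- abs(cor(nfl_model_data[cluster_vars], nfl_model_data$score_diff))
  rownames(cors)[which.max(cors)]
})
```

Treat this as a starting point for EDA, not a replacement for the context-driven justification emphasized earlier.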
---

## Back to the response variable...

.pull-left[

Use the [`GGally`](https://ggobi.github.io/ggally/index.html) package to easily create __pairs__ plots of multiple variables

- __always look at your data__

- correlation values alone are not enough!

- what if a variable displayed a quadratic relationship?

```r
*library(GGally)
*ggpairs(nfl_model_data,
        columns = c("score_diff", "offense_ave_epa_pass",
                    "offense_ave_epa_run", "defense_ave_epa_pass",
*                   "defense_ave_epa_run")) +
  theme_bw()
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-7-1.png" width="504" />
]

---

### Do running statistics matter for modeling score differential?

--

We will use __5-fold cross-validation__ to assess how well different sets of variables (combinations of `pass` & `run` variables) perform in predicting `score_diff`

--

Can add a column of __test__ fold assignments to our dataset with the `sample()` function:

```r
set.seed(2020)
nfl_model_data <- nfl_model_data %>%
  mutate(test_fold = sample(rep(1:5, length.out = n())))
```

--

__Always remember to set your seed prior to any k-fold cross-validation!__

---

## Writing a function for k-fold cross-validation

```r
get_cv_preds <- function(model_formula, data = nfl_model_data) {
  # generate holdout predictions for every row based on its assigned test fold
  map_dfr(unique(data$test_fold), 
          function(holdout) {
            # Separate test and training data:
            test_data <- data %>% filter(test_fold == holdout)
            train_data <- data %>% filter(test_fold != holdout)
            
            # Train model:
            reg_model <- lm(as.formula(model_formula), data = train_data)
            
            # Return tibble of holdout results:
            tibble(test_preds = predict(reg_model, newdata = test_data),
                   test_actual = test_data$score_diff,
                   test_fold = holdout)
          })
}
```

---

## Function enables easy generation of holdout analysis

```r
all_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_pass + offense_ave_epa_run + defense_ave_epa_pass + defense_ave_epa_run")
all_int_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_pass*offense_ave_epa_run + defense_ave_epa_pass*defense_ave_epa_run")
run_only_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_run + defense_ave_epa_run")
pass_only_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_pass + defense_ave_epa_pass")
off_only_cv_preds <- get_cv_preds("score_diff ~ offense_ave_epa_pass + offense_ave_epa_run")
def_only_cv_preds <- get_cv_preds("score_diff ~ defense_ave_epa_pass + defense_ave_epa_run")
int_only_cv_preds <- get_cv_preds("score_diff ~ 1")
```

--

.pull-left[

Can then summarize together for a single plot:

```r
bind_rows(mutate(all_cv_preds, type = "All"),
          mutate(all_int_cv_preds, type = "All w/ interactions"),
          mutate(pass_only_cv_preds, type = "Passing only"),
          mutate(run_only_cv_preds, type = "Running only"),
          mutate(off_only_cv_preds, type = "Offense only"),
          mutate(def_only_cv_preds, type = "Defense only"),
          mutate(int_only_cv_preds, type = "Intercept-only")) %>%
  group_by(type) %>%
  summarize(rmse = sqrt(mean((test_actual - test_preds)^2))) %>%
  mutate(type = fct_reorder(type, rmse)) %>%
  ggplot(aes(x = type, y = rmse)) +
  geom_point() +
  coord_flip() +
  theme_bw()
```

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]
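---

## How stable are those RMSE estimates?

A quick follow-up check using the same holdout predictions: compute the RMSE within each test fold, instead of a single overall RMSE per model, to see how much the comparison varies from fold to fold (shown below for two of the candidate models):

```r
# fold-level RMSE for two of the candidate models
bind_rows(mutate(all_cv_preds, type = "All"),
          mutate(pass_only_cv_preds, type = "Passing only")) %>%
  group_by(type, test_fold) %>%
  summarize(rmse = sqrt(mean((test_actual - test_preds)^2)),
            .groups = "drop") %>%
  ggplot(aes(x = test_fold, y = rmse, color = type)) +
  geom_point() +
  geom_line() +
  theme_bw()
```

If the fold-level RMSE values for two models overlap heavily, read the single-number comparison on the previous slide with caution.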
---

## Fit selected model on all data and view summary

```r
all_lm <- lm(score_diff ~ offense_ave_epa_pass + offense_ave_epa_run + 
               defense_ave_epa_pass + defense_ave_epa_run, 
             data = nfl_model_data)
summary(all_lm)
```

```
## 
## Call:
## lm(formula = score_diff ~ offense_ave_epa_pass + offense_ave_epa_run + 
##     defense_ave_epa_pass + defense_ave_epa_run, data = nfl_model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.142 -18.394   0.024  18.680  92.412 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.378      1.648    2.05   0.0407 *  
## offense_ave_epa_pass  463.221      8.529   54.31   <2e-16 ***
## offense_ave_epa_run   336.067     15.283   21.99   <2e-16 ***
## defense_ave_epa_pass -480.406     10.909  -44.04   <2e-16 ***
## defense_ave_epa_run  -302.841     15.883  -19.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.67 on 728 degrees of freedom
## Multiple R-squared:  0.9263, Adjusted R-squared:  0.9259 
## F-statistic:  2288 on 4 and 728 DF,  p-value: < 2.2e-16
```

---

## Do NOT show that summary in a presentation!

.pull-left[

- We can instead display a __coefficient plot__ with confidence intervals based on the reported standard errors

- Use the [`ggcoef()`](https://ggobi.github.io/ggally/articles/ggcoef.html) function from `GGally`

```r
ggcoef(all_lm, 
       exclude_intercept = TRUE,
       vline = TRUE,
       vline_color = "red") + 
  theme_bw()
```

- [__A well formatted table__](https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html) of the summary output is appropriate for a report (not for a presentation)

]

.pull-right[
<img src="12-Varsel_files/figure-html/unnamed-chunk-12-1.png" width="504" />
]
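---

## A formatted table for a report

For the well formatted table linked on the previous slide, a minimal sketch is to pass the fitted model to `tab_model()`, assuming the `sjPlot` package is installed (it is not loaded elsewhere in these slides):

```r
library(sjPlot)
# HTML table of estimates, confidence intervals, and p-values
tab_model(all_lm)
```

Because the output is an HTML table, it is suited to reports and websites rather than slides.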