class: center, middle, inverse, title-slide

# Supervised Learning
## Variable selection
### Ron Yurko
### 06/22/2020

---

## The setting

We wish to learn a linear model. Our estimate (denoted by hats) is

$$
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p
$$

--

Why would we attempt to select a __subset__ of the `\(p\)` variables?

--

- *To improve prediction accuracy*

  - Eliminating uninformative predictors can lead to lower variance in the test-set MSE, at the expense of a slight increase in bias

--

- *To improve model interpretability*

  - Eliminating uninformative predictors is obviously a good thing when your goal is to tell the story of how your predictors are associated with your response

---

## Best subset selection

- Start with the __null model__ `\(\mathcal{M}_0\)` (intercept-only) that has no predictors

  - just predicts the sample mean for each observation

--

- For `\(k = 1, 2, \dots, p\)` (each possible number of predictors)

  - Fit __all__ `\(\binom{p}{k} = \frac{p!}{k!(p-k)!}\)` models with exactly `\(k\)` predictors

  - Pick the best (by some criterion) among these `\(\binom{p}{k}\)` models, call it `\(\mathcal{M}_k\)`

    - Best is up to the user: cross-validation error, highest adjusted `\(R^2\)`, etc.

--

- Select a single best model from among `\(\mathcal{M}_0, \dots, \mathcal{M}_p\)`

--

__This is not typically used in research!__

- only practical for a smaller number of variables

--

- arbitrary way of defining __best__ and ignores __prior knowledge__ about potential predictors (see the sketch below)
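
--

For illustration only, a minimal sketch of best subset selection with the `leaps` package on the built-in `mtcars` data (an assumption: neither `leaps` nor `mtcars` is used elsewhere in these slides):

```r
library(leaps)
# Exhaustively fit the best model of each size k = 1, ..., 10
# among the 10 candidate predictors of mpg:
best_subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
# Adjusted R^2 of the best model of each size:
summary(best_subsets)$adjr2
# Model size maximizing adjusted R^2:
which.max(summary(best_subsets)$adjr2)
```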
---

## Use the shoe leather approach

[Prof. David Freedman](https://en.wikipedia.org/wiki/David_A._Freedman):

- algorithms can be tempting but they are NOT substitutes!

  - you should NOT avoid the hard work of EDA in your modeling efforts

--

__Variable selection is a difficult problem!__

- Like much of statistics & data science research, there is not one unique, correct answer

--

You should justify which predictors / variables are used in modeling:

- __based on extensive EDA__ and

- model assessment __based on test set predictions__

---

## Covariance and correlation

- __Covariance__ is a measure of the __linear__ dependence between two variables

  - To be _"uncorrelated"_ is not the same as to be _"independent"_...

  - Independence means __there is no dependence__, linear or otherwise

--

- __Correlation__ is a _normalized_ form of covariance, ranging from -1 through 0 to 1

  - -1 means a perfect negative linear relationship: one variable decreases linearly as the other increases

  - 0 means no linear dependence

  - 1 means a perfect positive linear relationship: the two variables increase linearly together

--

- We can use the `cov()` / `cor()` functions in `R` to generate the __covariance__ / __correlation__ matrices

---

## Example data: NFL teams summary

Created dataset using [`nflscrapR-data`](https://github.com/ryurko/nflscrapR-data) summarizing NFL team performances from 2009 to 2019

```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/eda_projects/nfl_teams_season_summary.csv")
nfl_teams_data
```

```
## # A tibble: 352 x 24
##    team  season points_scored points_allowed  wins losses  ties pass_off_epa_pe…
##    <chr>  <dbl>         <dbl>          <dbl> <dbl>  <dbl> <dbl>            <dbl>
##  1 PIT     2009           368            324     9      7     0           0.119 
##  2 PIT     2010           375            232    12      4     0           0.161 
##  3 PIT     2011           325            227    12      4     0           0.115 
##  4 PIT     2012           336            314     8      8     0           0.0491
##  5 PIT     2013           379            370     8      8     0           0.102 
##  6 PIT     2014           436            368    11      5     0           0.244 
##  7 PIT     2015           423            319    10      6     0           0.108 
##  8 PIT     2016           399            327    11      5     0           0.166 
##  9 PIT     2017           406            308    13      3     0           0.182 
## 10 PIT     2018           428            360     9      6     1           0.179 
## # … with 342 more rows, and 16 more variables: pass_off_total_epa <dbl>,
## #   pass_off_total_yards_gained <dbl>, pass_off_yards_gained_per_att <dbl>,
## #   run_off_epa_per_att <dbl>, run_off_total_epa <dbl>,
## #   run_off_total_yards_gained <dbl>, run_off_yards_gained_per_att <dbl>,
## #   pass_def_epa_per_att <dbl>, pass_def_total_epa <dbl>,
## #   pass_def_total_yards_allowed <dbl>, pass_def_yards_allowed_per_att <dbl>,
## #   run_def_epa_per_att <dbl>, run_def_total_epa <dbl>,
## #   run_def_total_yards_allowed <dbl>, run_def_yards_allowed_per_att <dbl>,
## #   division <chr>
```

---

## Modeling NFL score differential

.pull-left[

Interested in modeling a team's __score differential__

```r
nfl_teams_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed)
nfl_teams_data %>%
  ggplot(aes(x = score_diff)) +
  geom_histogram(color = "black", fill = "darkblue",
                 alpha = 0.3) +
  theme_bw() +
  labs(x = "Score differential")
```

- Distribution is roughly symmetric and unimodal

- Supports the assumption of a normal distribution - no transformation necessary

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-2-1.png" width="504" />
]

---

## Correlation matrix of score differential and candidate predictors

.pull-left[

- Interested in `score_diff` relationships with team passing and rushing statistics

- View the correlation matrix with [`ggcorrplot`](https://rpkgs.datanovia.com/ggcorrplot/)

```r
library(ggcorrplot)
nfl_model_data <- nfl_teams_data %>%
  dplyr::select(score_diff, pass_off_epa_per_att,
                run_off_epa_per_att, pass_def_epa_per_att,
                run_def_epa_per_att,
                pass_off_yards_gained_per_att,
                run_off_yards_gained_per_att,
                pass_def_yards_allowed_per_att,
                run_def_yards_allowed_per_att)
*nfl_cor_matrix <- cor(nfl_model_data)
*ggcorrplot(nfl_cor_matrix)
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]
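
--

To read off each predictor's correlation with the response from that matrix, a quick base `R` sketch:

```r
# Correlations with score_diff, ordered by absolute strength:
response_cors <- nfl_cor_matrix["score_diff", -1]
response_cors[order(abs(response_cors), decreasing = TRUE)]
```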
---

## Customize the appearance of the correlation matrix

.pull-left[

- Avoid redundancy by only using one half of the matrix with `type`

- Add correlation value labels using `lab` (but round first!)

- Can arrange variables based on clustering...

```r
round_cor_matrix <- 
*  round(cor(nfl_model_data), 2)
ggcorrplot(round_cor_matrix, 
*           hc.order = TRUE,
*           type = "lower",
*           lab = TRUE)
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-4-1.png" width="504" />
]

---

## Clustering variables using the correlation matrix

We can apply [hierarchical clustering](http://www.stat.cmu.edu/cmsac/sure/materials/lectures/slides/06-intro-clustering.html#26) to the variables instead of the observations

--

- Select the explanatory variables of interest from our data

```r
nfl_ex_vars <- dplyr::select(nfl_model_data, -score_diff)
```

--

- Compute the correlation matrix of these variables:

```r
exp_cor_matrix <- cor(nfl_ex_vars)
```

--

- Correlations measure similarity and can be negative, while distances measure dissimilarity and cannot be negative. So convert the correlations to one minus their absolute value: correlations near 1 or -1 then map to distances near 0, and correlations near 0 map to distances near 1

```r
cor_dist_matrix <- 1 - abs(exp_cor_matrix)
```

--

- Convert to a distance matrix before using `hclust`

```r
cor_dist_matrix <- as.dist(cor_dist_matrix)
```

---

## Clustering variables using the correlation matrix

.pull-left[

- Cluster variables using `hclust()` as before!

- Use [`ggdendro`](https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html) to quickly visualize the dendrogram

```r
library(ggdendro)
*nfl_exp_hc <- hclust(cor_dist_matrix,
*                     "complete")
*ggdendrogram(nfl_exp_hc,
*             rotate = TRUE,
*             size = 2)
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]

---

## Clustering variables using the correlation matrix

.pull-left[

- Another flexible option is [`dendextend`](https://cran.r-project.org/web/packages/dendextend/vignettes/dendextend.html)

```r
library(dendextend)
cor_dist_matrix %>%
  hclust() %>%
  as.dendrogram() %>%
  set("branches_k_col", k = 2) %>%
  set("labels_cex", .9) %>%
  ggplot(horiz = TRUE)
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-6-1.png" width="504" />
]

---

## Back to the response variable...

.pull-left[

Reminder: use the [`GGally`](https://ggobi.github.io/ggally/index.html) package to easily create __pairs__ plots of multiple variables

- __always look at your data__

- correlation values alone are not enough!

- what if a variable displayed a quadratic relationship?

```r
*library(GGally)
*ggpairs(nfl_model_data,
        columns =
          c("score_diff", "pass_off_epa_per_att",
            "run_off_epa_per_att", "pass_def_epa_per_att",
*           "run_def_epa_per_att"))
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-7-1.png" width="504" />
]

---

## Do running statistics matter for modeling score differential?

--

We will use __5-fold cross-validation__ to assess how well different sets of variables (combinations of `pass` & `run` variables) perform in predicting `score_diff`
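
--

Our comparison criterion will be the holdout __root mean squared error__ (RMSE), pooled over all five test folds - this is the quantity computed by the `summarize()` call later in these slides:

$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2}
$$

where each prediction `\(\hat{y}_i\)` comes from a model trained on the folds that do __not__ contain observation `\(i\)`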
--

Can initialize a column of __test__ fold assignments in our dataset with the `sample()` function:

```r
set.seed(2020)
nfl_model_data <- nfl_model_data %>%
  mutate(test_fold = sample(rep(1:5, length.out = n())))
```

--

__Always remember to set your seed prior to any k-fold cross-validation!__

---

## Writing a function for k-fold cross-validation

```r
get_cv_preds <- function(model_formula, data = nfl_model_data) {
  # generate holdout predictions for every row based on the assigned test fold
  map_dfr(unique(data$test_fold), 
          function(holdout) {
            # Separate test and training data:
            test_data <- data %>%
              filter(test_fold == holdout)
            train_data <- data %>%
              filter(test_fold != holdout)
            
            # Train model:
            reg_model <- lm(as.formula(model_formula), data = train_data)
            
            # Return tibble of holdout results:
            tibble(test_preds = predict(reg_model, newdata = test_data),
                   test_actual = test_data$score_diff,
                   test_fold = holdout)
          })
}
```

---

## Function enables easy generation of holdout analysis

Can get test data predictions using the same folds for different model formulas:

```r
all_cv_preds <- get_cv_preds("score_diff ~ pass_off_epa_per_att + run_off_epa_per_att + pass_def_epa_per_att + run_def_epa_per_att")
all_int_cv_preds <- get_cv_preds("score_diff ~ pass_off_epa_per_att*run_off_epa_per_att + pass_def_epa_per_att*run_def_epa_per_att")
run_only_cv_preds <- get_cv_preds("score_diff ~ run_off_epa_per_att + run_def_epa_per_att")
pass_only_cv_preds <- get_cv_preds("score_diff ~ pass_off_epa_per_att + pass_def_epa_per_att")
off_only_cv_preds <- get_cv_preds("score_diff ~ pass_off_epa_per_att + run_off_epa_per_att")
def_only_cv_preds <- get_cv_preds("score_diff ~ pass_def_epa_per_att + run_def_epa_per_att")
```

--

.pull-left[

Can then summarize together for a single plot:

```r
bind_rows(mutate(all_cv_preds, type = "All"),
          mutate(all_int_cv_preds, type = "All w/ interactions"),
          mutate(pass_only_cv_preds, type = "Passing only"),
          mutate(run_only_cv_preds, type = "Running only"),
          mutate(off_only_cv_preds, type = "Offense only"),
          mutate(def_only_cv_preds, type = "Defense only")) %>%
  group_by(type) %>%
  summarize(rmse = sqrt(mean((test_actual - test_preds)^2))) %>%
  mutate(type = fct_reorder(type, rmse)) %>%
  ggplot(aes(x = type, y = rmse)) +
  geom_point() +
  coord_flip() +
  theme_bw()
```

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]

---

## Fit selected model on all data and view summary

```r
all_lm <- lm(score_diff ~ pass_off_epa_per_att + run_off_epa_per_att +
               pass_def_epa_per_att + run_def_epa_per_att,
             data = nfl_model_data)
summary(all_lm)
```

```
## 
## Call:
## lm(formula = score_diff ~ pass_off_epa_per_att + run_off_epa_per_att + 
##     pass_def_epa_per_att + run_def_epa_per_att, data = nfl_model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.723 -22.463  -1.139  19.980  99.841 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.918      2.460   1.186    0.236    
## pass_off_epa_per_att  485.822     12.151  39.981   <2e-16 ***
## run_off_epa_per_att   315.772     22.678  13.924   <2e-16 ***
## pass_def_epa_per_att -473.925     16.078 -29.476   <2e-16 ***
## run_def_epa_per_att  -282.359     24.132 -11.701   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.87 on 347 degrees of freedom
## Multiple R-squared:  0.9201,	Adjusted R-squared:  0.9192 
## F-statistic: 998.6 on 4 and 347 DF,  p-value: < 2.2e-16
```

---

## Do NOT show that summary in a presentation!
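
All an audience needs from that output is the coefficient table. A small sketch of pulling it out as a tidy tibble with `broom` (an assumption: `broom` is not used elsewhere in these slides), which you can then plot or format however you like:

```r
library(broom)
# Coefficient estimates with 95% confidence intervals:
tidy(all_lm, conf.int = TRUE)
```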
.pull-left[

- We can instead display a __coefficient plot__, showing each estimate with its confidence interval based on the reported standard errors

- Use the [`ggcoef()`](https://ggobi.github.io/ggally/articles/ggcoef.html) function from `GGally`

```r
ggcoef(all_lm, 
       exclude_intercept = TRUE,
       vline = TRUE,
       vline_color = "red") + 
  theme_bw()
```

- [__A well formatted table__](https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html) of the summary output is appropriate for a report (not for a presentation)

]

.pull-right[
<img src="12-varsel_files/figure-html/unnamed-chunk-12-1.png" width="504" />
]
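
--

A minimal example of building such a table with `sjPlot`'s `tab_model()` (assuming the `sjPlot` package is installed):

```r
library(sjPlot)
# Render the regression summary as a formatted HTML table:
tab_model(all_lm)
```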