class: center, middle, inverse, title-slide

# Supervised Learning
## Regularization
### June 29th, 2022

---

## Previously...

We talked about variable selection in the linear model context:

$$
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p
$$

Why would we attempt to select a __subset__ of the `\(p\)` variables?

--

- *To improve prediction accuracy*

  - Eliminating uninformative predictors can lead to lower variance in the test-set MSE, at the expense of a slight increase in bias

--

- *To improve model interpretability*

  - Eliminating uninformative predictors is obviously a good thing when your goal is to tell the story of how your predictors are associated with your response.

--

__Remember the bias-variance tradeoff__:

$$
{\rm MSE} = {\rm (Bias)}^2 + {\rm Variance}
$$

- Introduce bias but decrease variance to improve predictions

---

## Shrinkage methods: Ridge regression

__Remember__: linear regression estimates coefficients by minimizing:

`$$\text{RSS} = \sum_i^n \big(Y_i - \beta_0 - \sum_j^p \beta_j x_{ij}\big)^2$$`

--

__Ridge regression__ introduces a __shrinkage penalty__ `\(\lambda \geq 0\)` by minimizing:

`$$\sum_i^n \big(Y_i - \beta_0 - \sum_j^p \beta_j x_{ij}\big)^2 + \lambda \sum_j^p \beta_j^2 = \text{RSS} + \lambda \sum_j^p \beta_j^2$$`

--

- as `\(\lambda\)` increases `\(\Rightarrow\)` flexibility of models decreases

  - __increases bias, but decreases variance__

--

- for a fixed value of `\(\lambda\)`, ridge regression fits only a single model

  - need to use cross-validation to __tune__ `\(\lambda\)`

---

## Shrinkage methods: Ridge regression

For example: note how the magnitude of the coefficient for `Income` trends as `\(\lambda \rightarrow \infty\)`

<img src="http://www.stat.cmu.edu/~pfreeman/Ridge.png" width="40%" style="display: block; margin: auto;" />

--

The coefficient __shrinks towards zero__, but never actually reaches it

- `Income` is always a variable in the learned model, regardless of the value of `\(\lambda\)`

---

## Shrinkage methods: Lasso regression

Ridge regression __keeps all variables__

But we may believe there is a __sparse__ solution

--

__Lasso__ enables variable selection with `\(\lambda\)` by minimizing:

`$$\sum_i^n \big(Y_i - \beta_0 - \sum_j^p \beta_j x_{ij}\big)^2 + \lambda \sum_j^p\vert \beta_j \vert = \text{RSS} + \lambda \sum_j^p \vert \beta_j \vert$$`

- Lasso uses an `\(\ell_1\)` ("ell 1") penalty

--

- as `\(\lambda\)` increases `\(\Rightarrow\)` flexibility of models decreases

  - __increases bias, but decreases variance__

--

- Can handle the `\(p > n\)` case, i.e. more variables than observations!

---

## Shrinkage methods: Lasso regression

Lasso regression __performs variable selection__ yielding __sparse__ models

<img src="http://www.stat.cmu.edu/~pfreeman/Lasso.png" width="40%" style="display: block; margin: auto;" />

--

The coefficient shrinks towards and __eventually equals zero__ at `\(\lambda \approx 1000\)`

- if the optimal value of `\(\lambda\)` is larger, then `Income` would NOT be included in the learned model
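---

## Aside: seeing ridge shrinkage numerically

A minimal sketch on simulated data (the `x_sim`, `y_sim` objects and coefficient values below are made up for illustration), using the closed-form ridge solution for a model with no intercept, `\(\hat{\beta}^{\rm ridge} = (X^\top X + \lambda I)^{-1} X^\top Y\)`, which minimizes `\(\text{RSS} + \lambda \sum_j \beta_j^2\)`:


```r
set.seed(1)
n <- 100
p <- 3
x_sim <- scale(matrix(rnorm(n * p), n, p))   # simulated, standardized predictors
y_sim <- x_sim %*% c(3, -2, 0) + rnorm(n)    # hypothetical true coefficients: 3, -2, 0
# Ridge estimates for increasing lambda (one column per lambda value)
sapply(c(0, 10, 100, 1000),
       function(lambda) solve(t(x_sim) %*% x_sim + lambda * diag(p)) %*% t(x_sim) %*% y_sim)
```

Every coefficient moves toward zero as `\(\lambda\)` grows, but none is ever exactly zero - unlike lasso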
---

## Which do we use?

--

<img src="https://c.tenor.com/odyVsZbC-OYAAAAC/why-not-both-why-not.gif" width="80%" style="display: block; margin: auto;" />

---

## Best of both worlds? Elastic net

`$$\sum_{i}^{n}\left(Y_{i}-\beta_{0}-\sum_{j}^{p} \beta_{j} X_{i j}\right)^{2}+\lambda\left[(1-\alpha)\|\beta\|_{2}^{2} / 2+\alpha\|\beta\|_{1} \right]$$`

- `\(\vert \vert \beta \vert \vert_1\)` is the `\(\ell_1\)` norm: `\(\vert \vert \beta \vert \vert_1 = \sum_j^p \vert \beta_j \vert\)`

- `\(\vert \vert \beta \vert \vert_2\)` is the `\(\ell_2\)` (Euclidean) norm: `\(\vert \vert \beta \vert \vert_2 = \sqrt{\sum_j^p \beta_j^2}\)`

--

- Ridge penalty: `\(\lambda \cdot (1 - \alpha) / 2\)`

- Lasso penalty: `\(\lambda \cdot \alpha\)`

--

- `\(\alpha\)` controls the __mixing__ between the two types, and ranges from 0 to 1

  - `\(\alpha = 1\)` returns lasso

  - `\(\alpha = 0\)` returns ridge

---

## Caveats to consider...

- For ridge, lasso, or elastic net: __you should standardize your data__

- Common convention: within each column, subtract off the sample mean and then divide by the sample standard deviation (see the short `scale()` sketch a few slides ahead):

`$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_{x,j}}$$`

- The [`glmnet`](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) package does this by default and reports coefficients on the original scale

--

- `\(\lambda\)` and `\(\alpha\)` are __tuning parameters__

- Have to select appropriate values based on test data / cross-validation

- When using `glmnet`, the `cv.glmnet()` function will perform the cross-validation for you

---

## Example data: NFL teams summary

Created dataset using [`nflfastR`](https://www.nflfastr.com/) summarizing NFL team performances from 1999 to 2021


```r
library(tidyverse)
nfl_teams_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/regression_examples/nfl_team_season_summary.csv")
nfl_model_data <- nfl_teams_data %>%
  mutate(score_diff = points_scored - points_allowed) %>%
  # Only use rows with air yards
  filter(season >= 2006) %>%
  dplyr::select(-wins, -losses, -ties, -points_scored, -points_allowed,
                -season, -team)
```

---

## Introduction to `glmnet`

We will use the [`glmnet`](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#lin) package for ridge, lasso, and elastic net


```r
library(glmnet)
```

--

__Important__: `glmnet` does NOT use the formula interface; instead it takes an `x` matrix of predictors and a `y` vector of responses

- could use the `model.matrix()` function (which converts factors to 0-1 dummy variables!)


```r
model_x <- nfl_model_data %>%
  dplyr::select(-score_diff) %>%
  as.matrix()
model_y <- nfl_model_data$score_diff
# model_x <- model.matrix(score_diff ~ ., nfl_model_data)[, -1]
```

---

## Initial model with `lm()`

.pull-left[

- What do the initial regression coefficients look like?

- Use [`broom`](https://broom.tidymodels.org/reference/tidy.cv.glmnet.html) to tidy model output for plotting


```r
init_reg_fit <- lm(score_diff ~ ., nfl_model_data)
library(broom)
tidy(init_reg_fit) %>%
  mutate(coef_sign = as.factor(sign(estimate)),
         term = fct_reorder(term, estimate)) %>%
  ggplot(aes(x = term, y = estimate, fill = coef_sign)) +
  geom_bar(stat = "identity", color = "white") +
  scale_fill_manual(values = c("darkred", "darkblue"), guide = FALSE) +
  coord_flip() +
  theme_bw()
```

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-5-1.png" width="504" />

]
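---

## Aside: standardizing by hand

`glmnet` standardizes the predictors internally, so you would not normally do this yourself. As a quick illustrative sketch of the convention from the caveats slide (the `model_x_std` object is only for demonstration), `scale()` performs that column-wise centering and scaling:


```r
# Subtract each column's sample mean and divide by its sample standard deviation
model_x_std <- scale(model_x)
# Columns now have mean (approximately) 0 and standard deviation 1
head(colMeans(model_x_std))
head(apply(model_x_std, 2, sd))
```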
---

## Ridge regression example

Perform ridge regression using `glmnet` with `alpha = 0` (more on that later)

By default it standardizes your predictors and fits the model across a range of `\(\lambda\)` values (can plot these!)


```r
*init_ridge_fit <- glmnet(model_x, model_y, alpha = 0)
*plot(init_ridge_fit, xvar = "lambda")
```

<img src="13-Regularization_files/figure-html/init-ridge-ex-1.png" width="504" style="display: block; margin: auto;" />

---

## Ridge regression example

We use cross-validation to select `\(\lambda\)` with `cv.glmnet()` which uses 10 folds by default

- specify ridge regression with `alpha = 0`


```r
*fit_ridge_cv <- cv.glmnet(model_x, model_y, alpha = 0)
plot(fit_ridge_cv)
```

<img src="13-Regularization_files/figure-html/ridge-ex-1.png" width="504" style="display: block; margin: auto;" />

---

## Tidy ridge regression

.pull-left[


```r
*tidy_ridge_coef <- tidy(fit_ridge_cv$glmnet.fit)
tidy_ridge_coef %>%
  ggplot(aes(x = lambda, y = estimate, group = term)) +
  scale_x_log10() +
  geom_line(alpha = 0.75) +
  geom_vline(xintercept = fit_ridge_cv$lambda.min) +
  geom_vline(xintercept = fit_ridge_cv$lambda.1se,
             linetype = "dashed", color = "red") +
  theme_bw()
```

- Could easily add color with legend for variables...

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Tidy ridge regression

.pull-left[


```r
tidy_ridge_cv <- tidy(fit_ridge_cv)
tidy_ridge_cv %>%
  ggplot(aes(x = lambda, y = estimate)) +
  geom_line() +
  scale_x_log10() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .25) +
  geom_vline(xintercept = fit_ridge_cv$lambda.min) +
  geom_vline(xintercept = fit_ridge_cv$lambda.1se,
             linetype = "dashed", color = "red") +
  theme_bw()
```

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Ridge regression coefficients

.pull-left[

Coefficients using the __1 standard error rule__ `\(\lambda\)`


```r
tidy_ridge_coef %>%
  filter(lambda == fit_ridge_cv$lambda.1se) %>%
  mutate(coef_sign = as.factor(sign(estimate)),
         term = fct_reorder(term, estimate)) %>%
  ggplot(aes(x = term, y = estimate, fill = coef_sign)) +
  geom_bar(stat = "identity", color = "white") +
  scale_fill_manual(values = c("darkred", "darkblue"), guide = FALSE) +
  coord_flip() +
  theme_bw()
```

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Lasso regression example

.pull-left[

Similar syntax to ridge but specify `alpha = 1`:


```r
fit_lasso_cv <- cv.glmnet(model_x, model_y,
*                         alpha = 1)
tidy_lasso_coef <- tidy(fit_lasso_cv$glmnet.fit)
tidy_lasso_coef %>%
  ggplot(aes(x = lambda, y = estimate, group = term)) +
  scale_x_log10() +
  geom_line(alpha = 0.75) +
  geom_vline(xintercept = fit_lasso_cv$lambda.min) +
  geom_vline(xintercept = fit_lasso_cv$lambda.1se,
             linetype = "dashed", color = "red") +
  theme_bw()
```

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-9-1.png" width="504" />

]

---

## Lasso regression example

.pull-left[

Number of non-zero predictors by `\(\lambda\)`


```r
tidy_lasso_cv <- tidy(fit_lasso_cv)
tidy_lasso_cv %>%
  ggplot(aes(x = lambda, y = nzero)) +
  geom_line() +
  geom_vline(xintercept = fit_lasso_cv$lambda.min) +
  geom_vline(xintercept = fit_lasso_cv$lambda.1se,
             linetype = "dashed", color = "red") +
  scale_x_log10() +
  theme_bw()
```

Reduction in variables using the __1 standard error rule__ `\(\lambda\)`

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-10-1.png" width="504" />

]
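---

## Lasso regression example

As an alternative to filtering the tidy coefficients (next slide), a short sketch using `glmnet`'s `coef()` method pulls the coefficients at a chosen `\(\lambda\)` directly; variables dropped by the lasso show up as `.` in the sparse matrix it returns:


```r
# Coefficients at the 1 standard error rule lambda
# (use s = "lambda.min" for the CV-minimizing value instead)
coef(fit_lasso_cv, s = "lambda.1se")
```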
---

## Lasso regression example

.pull-left[

Coefficients using the __1 standard error rule__ `\(\lambda\)`


```r
tidy_lasso_coef %>%
  filter(lambda == fit_lasso_cv$lambda.1se) %>%
  mutate(coef_sign = as.factor(sign(estimate)),
         term = fct_reorder(term, estimate)) %>%
  ggplot(aes(x = term, y = estimate, fill = coef_sign)) +
  geom_bar(stat = "identity", color = "white") +
  scale_fill_manual(values = c("darkred", "darkblue"), guide = FALSE) +
  coord_flip() +
  theme_bw()
```

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-11-1.png" width="504" />

]

---

## Elastic net example

Need to tune both `\(\lambda\)` and `\(\alpha\)` - can do so manually with our own folds


```r
set.seed(2020)
fold_id <- sample(rep(1:10, length.out = nrow(model_x)))
```

Then use cross-validation with these folds for different candidate `alpha` values:


```r
cv_en_25 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = .25)
cv_en_50 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = .5)
cv_ridge <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0)
cv_lasso <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 1)
```

Can see which one had the lowest CV error among its candidate `\(\lambda\)` values:


```r
which.min(c(min(cv_en_25$cvm), min(cv_en_50$cvm), min(cv_ridge$cvm), min(cv_lasso$cvm)))
```

```
## [1] 4
```

---

## Elastic net example

.pull-left[

Can view the same type of summary


```r
tidy(cv_en_50) %>%
  ggplot(aes(x = lambda, y = nzero)) +
  geom_line() +
  geom_vline(xintercept = cv_en_50$lambda.min) +
  geom_vline(xintercept = cv_en_50$lambda.1se,
             linetype = "dashed", color = "red") +
  scale_x_log10() +
  theme_bw()
```

- More relaxed than lasso for variable entry

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-15-1.png" width="504" />

]

---

#### Comparison of models based on holdout performance


```r
set.seed(2020)
nfl_model_data <- nfl_model_data %>%
  mutate(test_fold = sample(rep(1:5, length.out = n())))
holdout_predictions <-
  map_dfr(unique(nfl_model_data$test_fold),
          function(holdout) {
            # Separate test and training data:
            test_data <- nfl_model_data %>% filter(test_fold == holdout)
            train_data <- nfl_model_data %>% filter(test_fold != holdout)

            # Repeat for matrices
            test_x <- as.matrix(dplyr::select(test_data, -score_diff))
            train_x <- as.matrix(dplyr::select(train_data, -score_diff))

            # Train models:
            lm_model <- lm(score_diff ~ ., data = train_data)
            ridge_model <- cv.glmnet(train_x, train_data$score_diff, alpha = 0)
            lasso_model <- cv.glmnet(train_x, train_data$score_diff, alpha = 1)
            en_model <- cv.glmnet(train_x, train_data$score_diff, alpha = .5)

            # Return tibble of holdout results:
            tibble(lm_preds = predict(lm_model, newdata = test_data),
                   ridge_preds = as.numeric(predict(ridge_model, newx = test_x)),
                   lasso_preds = as.numeric(predict(lasso_model, newx = test_x)),
                   en_preds = as.numeric(predict(en_model, newx = test_x)),
                   test_actual = test_data$score_diff,
                   test_fold = holdout)
          })
```

---

## Predictions compared to `lm`?

.pull-left[

Compute RMSE across folds with standard error intervals


```r
holdout_predictions %>%
  pivot_longer(lm_preds:en_preds,
               names_to = "type", values_to = "test_preds") %>%
  group_by(type, test_fold) %>%
  summarize(rmse = sqrt(mean((test_actual - test_preds)^2))) %>%
  ggplot(aes(x = type, y = rmse)) +
  geom_point() +
  theme_bw() +
  stat_summary(fun = mean, geom = "point", color = "red") +
  stat_summary(fun.data = mean_se, geom = "errorbar", color = "red")
```

]

.pull-right[

<img src="13-Regularization_files/figure-html/unnamed-chunk-17-1.png" width="504" />

]

In this case `lm` actually "beat" regularization, but the differences are within the standard error intervals
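---

## Holdout RMSE, numerically

If you want the numbers behind that plot, here is a quick follow-up sketch (using the same `holdout_predictions` object) that averages the fold-level RMSEs for each method:


```r
holdout_predictions %>%
  pivot_longer(lm_preds:en_preds,
               names_to = "type", values_to = "test_preds") %>%
  # RMSE within each test fold for each method:
  group_by(type, test_fold) %>%
  summarize(rmse = sqrt(mean((test_actual - test_preds)^2)),
            .groups = "drop") %>%
  # Average and standard error across the five folds:
  group_by(type) %>%
  summarize(avg_rmse = mean(rmse),
            se_rmse = sd(rmse) / sqrt(n()))
```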