class: center, middle, inverse, title-slide

# Supervised Learning
## Regularization
### Ron Yurko
### 06/23/2020

---

## Yesterday...

We talked about variable selection in the linear model context:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$$

Why would we attempt to select a __subset__ of the `\(p\)` variables?

--

- *To improve prediction accuracy*

  - Eliminating uninformative predictors can lead to lower variance in the test-set MSE, at the expense of a slight increase in bias

--

- *To improve model interpretability*

  - Eliminating uninformative predictors is obviously a good thing when your goal is to tell the story of how your predictors are associated with your response.

--

__Remember the bias-variance tradeoff__:

$${\rm MSE} = {\rm (Bias)}^2 + {\rm Variance}$$

- Introduce bias but decrease variance to improve predictions

---

## Shrinkage methods: Ridge regression

__Remember__: linear regression estimates coefficients by minimizing:

`$$\text{RSS} = \sum_i^n \big(Y_i - \beta_0 - \sum_j^p \beta_j X_{ij}\big)^2$$`

--

__Ridge regression__ introduces a __shrinkage penalty__ `\(\lambda \geq 0\)` by minimizing:

`$$\sum_i^n \big(Y_i - \beta_0 - \sum_j^p \beta_j X_{ij}\big)^2 + \lambda \sum_j^p \beta_j^2 = \text{RSS} + \lambda \sum_j^p \beta_j^2$$`

--

- as `\(\lambda\)` increases `\(\Rightarrow\)` flexibility of models decreases

  - __increases bias, but decreases variance__

--

- for a fixed value of `\(\lambda\)`, ridge regression fits only a single model

  - need to use cross-validation to __tune__ `\(\lambda\)`

---

## Shrinkage methods: Ridge regression

For example: note how the magnitude of the coefficient for `Income` trends as `\(\lambda \rightarrow \infty\)`

<img src="http://www.stat.cmu.edu/~pfreeman/Ridge.png" width="40%" style="display: block; margin: auto;" />

--

The coefficient __shrinks towards zero__, but never actually reaches it

- `Income` is always a variable in the learned model, regardless of the value of `\(\lambda\)`

---

## Shrinkage methods: Lasso regression

Ridge regression __keeps all variables__

But we may believe there is a __sparse__ solution

--

__Lasso__ enables variable selection with `\(\lambda\)` by minimizing:

`$$\sum_i^n \big(Y_i - \beta_0 - \sum_j^p \vert \beta_j \vert X_{ij}\big)^2 + \lambda \sum_j^p \vert \beta_j \vert = \text{RSS} + \lambda \sum_j^p \vert \beta_j \vert$$`

- Lasso uses an `\(\ell_1\)` ("ell 1") penalty

--

- as `\(\lambda\)` increases `\(\Rightarrow\)` flexibility of models decreases

  - __increases bias, but decreases variance__

--

- Can handle the `\(p > n\)` case, i.e. more variables than observations!

---

## Shrinkage methods: Lasso regression

Lasso regression __performs variable selection__, yielding __sparse__ models

<img src="http://www.stat.cmu.edu/~pfreeman/Lasso.png" width="40%" style="display: block; margin: auto;" />

--

The coefficient shrinks towards and __eventually equals zero__, at `\(\lambda \approx 1000\)`

- if the optimal value of `\(\lambda\)` is larger, then `Income` would NOT be included in the learned model
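---

## Shrinkage methods: Lasso regression

You can draw coefficient path plots like the ones above yourself with the [`glmnet`](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) package used later in these slides. A minimal sketch, using the `Hitters` data from the upcoming example (object names here are just illustrative):

```r
library(ISLR)
library(glmnet)
# drop players with missing Salary, then build a numeric predictor matrix
# and response vector (glmnet does not take a formula)
hitters <- Hitters[!is.na(Hitters$Salary), ]
x <- model.matrix(Salary ~ ., hitters)[, -1]
y <- hitters$Salary
# fit the entire lasso coefficient path over a grid of lambda values;
# alpha = 0 would trace the ridge path instead
lasso_path <- glmnet(x, y, alpha = 1)
plot(lasso_path, xvar = "lambda", label = TRUE)
```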
---

## Which do we use?

--

<img src="https://i.gifer.com/Xnj.gif" width="80%" style="display: block; margin: auto;" />

---

## Best of both worlds?

__Elastic net__ combines the two penalties, minimizing:

`$$\sum_{i}^{n}\left(Y_{i}-\beta_{0}-\sum_{j}^{p} \beta_{j} X_{i j}\right)^{2}+\lambda\left[(1-\alpha)\|\beta\|_{2}^{2} / 2+\alpha\|\beta\|_{1} \right]$$`

- `\(\vert \vert \beta \vert \vert_1\)` is the `\(\ell_1\)` norm: `\(\vert \vert \beta \vert \vert_1 = \sum_j^p \vert \beta_j \vert\)`

- `\(\vert \vert \beta \vert \vert_2\)` is the `\(\ell_2\)` (Euclidean) norm: `\(\vert \vert \beta \vert \vert_2 = \sqrt{\sum_j^p \beta_j^2}\)`

--

- Ridge penalty: `\(\lambda \cdot (1 - \alpha) / 2\)`

- Lasso penalty: `\(\lambda \cdot \alpha\)`

--

- `\(\alpha\)` controls the __mixing__ between the two types and ranges from 0 to 1

  - `\(\alpha = 1\)` returns lasso

  - `\(\alpha = 0\)` returns ridge

---

## Caveats to consider...

- For ridge, lasso, and elastic net: __you should standardize your data__

  - Common convention: within each column, subtract off the sample mean and divide by the sample standard deviation:

`$$\tilde{X}_{ij} = \frac{X_{ij} - \bar{X}_j}{s_{X,j}}$$`

  - The [`glmnet`](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) package does this by default and reports coefficients on the original scale

--

- `\(\lambda\)` and `\(\alpha\)` are __tuning parameters__

  - Have to select appropriate values based on test data / cross-validation

  - When using `glmnet`, the `cv.glmnet()` function will perform the cross-validation for you

---

## Regularization example data

We will use the `Hitters` dataset from the [`ISLR`](https://cran.r-project.org/web/packages/ISLR/index.html) package

```r
library(tidyverse)
library(ISLR)
data("Hitters")
Hitters <- as_tibble(Hitters) %>%
  filter(!is.na(Salary))
Hitters
```

```
## # A tibble: 263 x 20
##    AtBat  Hits HmRun  Runs   RBI Walks Years CAtBat CHits CHmRun CRuns  CRBI CWalks League
##    <int> <int> <int> <int> <int> <int> <int>  <int> <int>  <int> <int> <int>  <int> <fct>
##  1   315    81     7    24    38    39    14   3449   835     69   321   414    375 N
##  2   479   130    18    66    72    76     3   1624   457     63   224   266    263 A
##  3   496   141    20    65    78    37    11   5628  1575    225   828   838    354 N
##  4   321    87    10    39    42    30     2    396   101     12    48    46     33 N
##  5   594   169     4    74    51    35    11   4408  1133     19   501   336    194 A
##  6   185    37     1    23     8    21     2    214    42      1    30     9     24 N
##  7   298    73     0    24    24     7     3    509   108      0    41    37     12 A
##  8   323    81     6    26    32     8     2    341    86      6    32    34      8 N
##  9   401    92    17    49    66    65    13   5206  1332    253   784   890    866 A
## 10   574   159    21   107    75    59    10   4631  1300     90   702   504    488 A
## # … with 253 more rows, and 6 more variables: Division <fct>, PutOuts <int>, Assists <int>,
## #   Errors <int>, Salary <dbl>, NewLeague <fct>
```

---

## Introduction to `glmnet`

We will use the [`glmnet`](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#lin) package for ridge, lasso, and elastic net

```r
library(glmnet)
```

--

__Important__: `glmnet` does NOT use a formula interface; instead it takes an `x` matrix of predictors and a `y` vector of responses

- use the `model.matrix()` function (which converts factors to 0-1 dummy variables!)

```r
model_x <- model.matrix(Salary ~ ., Hitters)[, -1]
model_y <- Hitters$Salary
```
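--

For example, we can quickly verify that the factor columns (`League`, `Division`, `NewLeague`) were expanded into 0-1 dummy variables:

```r
# dimensions and column names of the design matrix built by model.matrix()
dim(model_x)
colnames(model_x)
```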
--

What do the initial regression coefficients look like?

```r
init_reg_fit <- lm(Salary ~ ., Hitters)
coef(init_reg_fit)
```

```
##  (Intercept)        AtBat         Hits        HmRun         Runs          RBI        Walks
##  163.1035878   -1.9798729    7.5007675    4.3308829   -2.3762100   -1.0449620    6.2312863
##        Years       CAtBat        CHits       CHmRun        CRuns         CRBI       CWalks
##   -3.4890543   -0.1713405    0.1339910   -0.1728611    1.4543049    0.8077088   -0.8115709
##      LeagueN    DivisionW      PutOuts      Assists       Errors   NewLeagueN
##   62.5994230 -116.8492456    0.2818925    0.3710692   -3.3607605  -24.7623251
```

---

## Ridge regression example

We use cross-validation to select `\(\lambda\)` with `cv.glmnet()`, which uses 10 folds by default

- specify ridge regression with `alpha = 0`

```r
fit_ridge_cv <- cv.glmnet(model_x, model_y, alpha = 0)
plot(fit_ridge_cv)
```

<img src="13-shrinkage_files/figure-html/ridge-ex-1.png" width="504" style="display: block; margin: auto;" />

---

## Ridge regression example

.pull-left[

Coefficients using the __1 standard error rule__ `\(\lambda\)`

```r
coef(fit_ridge_cv)
```

```
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 185.946732481
## AtBat         0.096634022
## Hits          0.408580477
## HmRun         1.242303538
## Runs          0.650047294
## RBI           0.642033634
## Walks         0.848737420
## Years         2.608433223
## CAtBat        0.008188531
## CHits         0.031829975
## CHmRun        0.235663247
## CRuns         0.063816873
## CRBI          0.066045116
## CWalks        0.062642350
## LeagueN       4.252099472
## DivisionW   -25.296959245
## PutOuts       0.059902888
## Assists       0.008305300
## Errors       -0.185603401
## NewLeagueN    3.676189320
```

]

.pull-right[

Or specify `\(\lambda\)` with minimum cross-validation error

```r
coef(fit_ridge_cv, fit_ridge_cv$lambda.min)
```

```
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 11.089425330
## AtBat       -0.007441506
## Hits         1.116358483
## HmRun       -0.155150648
## Runs         1.136603508
## RBI          0.866089829
## Walks        1.921230858
## Years       -0.774801522
## CAtBat       0.010834527
## CHits        0.070207840
## CHmRun       0.481333855
## CRuns        0.139206329
## CRBI         0.148340356
## CWalks       0.009253631
## LeagueN     30.327721487
## DivisionW  -98.193701089
## PutOuts      0.205034904
## Assists      0.052568534
## Errors      -2.091238621
## NewLeagueN   5.279536552
```

]

---

## Lasso regression example

Similar to ridge, but specify `alpha = 1`:

```r
fit_lasso_cv <- cv.glmnet(model_x, model_y, alpha = 1)
plot(fit_lasso_cv)
```

<img src="13-shrinkage_files/figure-html/lasso-ex-1.png" width="504" style="display: block; margin: auto;" />

---

## Lasso regression example

.pull-left[

Coefficients using the __1 standard error rule__ `\(\lambda\)`

```r
coef(fit_lasso_cv)
```

```
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 193.74263886
## AtBat         .
## Hits          1.21471320
## HmRun         .
## Runs          .
## RBI           .
## Walks         1.28957901
## Years         .
## CAtBat        .
## CHits         .
## CHmRun        .
## CRuns         0.12923755
## CRBI          0.31515925
## CWalks        .
## LeagueN       .
## DivisionW     .
## PutOuts       0.02533305
## Assists       .
## Errors        .
## NewLeagueN    .
```

]

.pull-right[

Or specify `\(\lambda\)` with minimum cross-validation error

```r
coef(fit_lasso_cv, fit_lasso_cv$lambda.min)
```

```
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  1.395005e+02
## AtBat       -1.734125e+00
## Hits         6.059861e+00
## HmRun        1.708102e-01
## Runs         .
## RBI          .
## Walks        5.053826e+00
## Years       -1.061379e+01
## CAtBat      -2.483637e-05
## CHits        .
## CHmRun       5.721414e-01
## CRuns        7.218106e-01
## CRBI         3.856339e-01
## CWalks      -6.033666e-01
## LeagueN      3.339735e+01
## DivisionW   -1.193924e+02
## PutOuts      2.772560e-01
## Assists      2.104828e-01
## Errors      -2.371468e+00
## NewLeagueN   .
```

]
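---

## Lasso regression example

A quick follow-up check (exact values will vary with the cross-validation folds): compare the two `\(\lambda\)` choices and count how many coefficients each one keeps

```r
# lambda from the 1 standard error rule vs. lambda minimizing CV error
fit_lasso_cv$lambda.1se
fit_lasso_cv$lambda.min
# number of non-zero coefficients at each choice (dropping the intercept row)
sum(as.matrix(coef(fit_lasso_cv, s = "lambda.1se"))[-1, ] != 0)
sum(as.matrix(coef(fit_lasso_cv, s = "lambda.min"))[-1, ] != 0)
```

The 1 standard error rule favors the sparser model, matching the coefficient listings above.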
---

## Elastic net example

Need to tune both `\(\lambda\)` and `\(\alpha\)`

- can do so manually with our own folds

```r
set.seed(2020)
fold_id <- sample(rep(1:10, length.out = nrow(model_x)))
```

Then use cross-validation with these folds for different candidate `alpha` values:

```r
cv_en_25 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = .25)
cv_en_50 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = .5)
cv_ridge <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0)
cv_lasso <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 1)
```

Can see which one had the lowest CV error among its candidate `\(\lambda\)` values:

```r
which.min(c(min(cv_en_25$cvm), min(cv_en_50$cvm), min(cv_ridge$cvm), min(cv_lasso$cvm)))
```

```
## [1] 4
```
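---

## Elastic net example

Here the fourth candidate, `cv_lasso`, had the lowest CV error. One possible next step (a minimal sketch using `glmnet`'s `predict()` method and the objects created above) is to generate predictions from the selected fit:

```r
# predicted salaries at the lambda minimizing CV error;
# swap in s = "lambda.1se" to use the 1 standard error rule instead
head(predict(cv_lasso, newx = model_x, s = "lambda.min"))
```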