class: center, middle, inverse, title-slide

# Supervised Learning
## Regularization
### Ron Yurko
### 06/23/2020

---

## Yesterday...

We talked about variable selection in the linear model context:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$$

Why would we attempt to select a __subset__ of the `\(p\)` variables?

--

- *To improve prediction accuracy*

  - Eliminating uninformative predictors can lead to lower variance in the test-set MSE, at the expense of a slight increase in bias

--

- *To improve model interpretability*

  - Eliminating uninformative predictors is obviously a good thing when your goal is to tell the story of how your predictors are associated with your response.

--

__Remember the bias-variance tradeoff__:

$${\rm MSE} = {\rm (Bias)}^2 + {\rm Variance}$$

- Introduce bias but decrease variance to improve predictions

---

## Shrinkage methods: Ridge regression

__Remember__: linear regression estimates coefficients by minimizing:

`$$\text{RSS} = \sum_i^n \big(Y_i - \beta_0 - \sum_j^p \beta_j X_{ij}\big)^2$$`

--

__Ridge regression__ introduces a __shrinkage penalty__ `\(\lambda \geq 0\)` by minimizing:

`$$\sum_i^n \big(Y_i - \beta_0 - \sum_j^p \beta_j X_{ij}\big)^2 + \lambda \sum_j^p \beta_j^2 = \text{RSS} + \lambda \sum_j^p \beta_j^2$$`

--

- as `\(\lambda\)` increases `\(\Rightarrow\)` flexibility of models decreases

  - __increases bias, but decreases variance__

--

- for a fixed value of `\(\lambda\)`, ridge regression fits only a single model

  - need to use cross-validation to __tune__ `\(\lambda\)`

---

## Shrinkage methods: Ridge regression

For example: note how the magnitude of the coefficient for `Income` trends as `\(\lambda \rightarrow \infty\)`

<img src="http://www.stat.cmu.edu/~pfreeman/Ridge.png" width="40%" style="display: block; margin: auto;" />

--

The coefficient __shrinks towards zero__, but never actually reaches it

- `Income` is always a variable in the learned model, regardless of the value of `\(\lambda\)`

---

## Shrinkage methods: Lasso regression

Ridge regression __keeps all variables__

But we may believe there is a __sparse__ solution

--

__Lasso__ enables variable selection with `\(\lambda\)` by minimizing:

`$$\sum_i^n \big(Y_i - \beta_0 - \sum_j^p \vert \beta_j \vert X_{ij}\big)^2 + \lambda \sum_j^p \vert \beta_j \vert = \text{RSS} + \lambda \sum_j^p \vert \beta_j \vert$$`

- Lasso uses an `\(\ell_1\)` ("ell 1") penalty

--

- as `\(\lambda\)` increases `\(\Rightarrow\)` flexibility of models decreases

  - __increases bias, but decreases variance__

--

- Can handle the `\(p > n\)` case, i.e. more variables than observations!

---

## Shrinkage methods: Lasso regression

Lasso regression __performs variable selection__, yielding __sparse__ models

<img src="http://www.stat.cmu.edu/~pfreeman/Lasso.png" width="40%" style="display: block; margin: auto;" />

--

The coefficient shrinks towards and __eventually equals zero__, at `\(\lambda \approx 1000\)`

- if the optimal value of `\(\lambda\)` is larger, then `Income` would NOT be included in the learned model
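---

## Shrinkage methods: Lasso regression

You can draw coefficient path plots like the ones above yourself with the [`glmnet`](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) package used later in these slides. A minimal sketch, using the `Hitters` data from the upcoming example (object names here are just illustrative):

```r
library(ISLR)
library(glmnet)
# drop players with missing Salary, then build a numeric predictor matrix
# and response vector (glmnet does not take a formula)
hitters <- Hitters[!is.na(Hitters$Salary), ]
x <- model.matrix(Salary ~ ., hitters)[, -1]
y <- hitters$Salary
# fit the entire lasso coefficient path over a grid of lambda values;
# alpha = 0 would trace the ridge path instead
lasso_path <- glmnet(x, y, alpha = 1)
plot(lasso_path, xvar = "lambda", label = TRUE)
```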
---

## Which do we use?

--

<img src="https://i.gifer.com/Xnj.gif" width="80%" style="display: block; margin: auto;" />

---

## Best of both worlds?

__Elastic net__ combines the two penalties, minimizing:

`$$\sum_{i}^{n}\left(Y_{i}-\beta_{0}-\sum_{j}^{p} \beta_{j} X_{i j}\right)^{2}+\lambda\left[(1-\alpha)\|\beta\|_{2}^{2} / 2+\alpha\|\beta\|_{1} \right]$$`

- `\(\vert \vert \beta \vert \vert_1\)` is the `\(\ell_1\)` norm: `\(\vert \vert \beta \vert \vert_1 = \sum_j^p \vert \beta_j \vert\)`

- `\(\vert \vert \beta \vert \vert_2\)` is the `\(\ell_2\)` (Euclidean) norm: `\(\vert \vert \beta \vert \vert_2 = \sqrt{\sum_j^p \beta_j^2}\)`

--

- Ridge penalty: `\(\lambda \cdot (1 - \alpha) / 2\)`

- Lasso penalty: `\(\lambda \cdot \alpha\)`

--

- `\(\alpha\)` controls the __mixing__ between the two types and ranges from 0 to 1

  - `\(\alpha = 1\)` returns lasso

  - `\(\alpha = 0\)` returns ridge

---

## Caveats to consider...

- For ridge, lasso, and elastic net: __you should standardize your data__

  - Common convention: within each column, subtract off the sample mean and divide by the sample standard deviation:

`$$\tilde{X}_{ij} = \frac{X_{ij} - \bar{X}_j}{s_{X,j}}$$`

  - The [`glmnet`](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) package does this by default and reports coefficients on the original scale

--

- `\(\lambda\)` and `\(\alpha\)` are __tuning parameters__

  - Have to select appropriate values based on test data / cross-validation

  - When using `glmnet`, the `cv.glmnet()` function will perform the cross-validation for you

---

## Regularization example data

We will use the `Hitters` dataset from the [`ISLR`](https://cran.r-project.org/web/packages/ISLR/index.html) package

```r
library(tidyverse)
library(ISLR)
data("Hitters")
Hitters <- as_tibble(Hitters) %>%
  filter(!is.na(Salary))
Hitters
```

```
## # A tibble: 263 x 20
##    AtBat  Hits HmRun  Runs   RBI Walks Years CAtBat CHits CHmRun CRuns  CRBI CWalks League
##    <int> <int> <int> <int> <int> <int> <int>  <int> <int>  <int> <int> <int>  <int> <fct>
##  1   315    81     7    24    38    39    14   3449   835     69   321   414    375 N
##  2   479   130    18    66    72    76     3   1624   457     63   224   266    263 A
##  3   496   141    20    65    78    37    11   5628  1575    225   828   838    354 N
##  4   321    87    10    39    42    30     2    396   101     12    48    46     33 N
##  5   594   169     4    74    51    35    11   4408  1133     19   501   336    194 A
##  6   185    37     1    23     8    21     2    214    42      1    30     9     24 N
##  7   298    73     0    24    24     7     3    509   108      0    41    37     12 A
##  8   323    81     6    26    32     8     2    341    86      6    32    34      8 N
##  9   401    92    17    49    66    65    13   5206  1332    253   784   890    866 A
## 10   574   159    21   107    75    59    10   4631  1300     90   702   504    488 A
## # … with 253 more rows, and 6 more variables: Division <fct>, PutOuts <int>, Assists <int>,
## #   Errors <int>, Salary <dbl>, NewLeague <fct>
```

---

## Introduction to `glmnet`

We will use the [`glmnet`](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#lin) package for ridge, lasso, and elastic net

```r
library(glmnet)
```

--

__Important__: `glmnet` does NOT use a formula interface; instead it takes an `x` matrix of predictors and a `y` vector of responses

- use the `model.matrix()` function (which converts factors to 0-1 dummy variables!)

```r
model_x <- model.matrix(Salary ~ ., Hitters)[, -1]
model_y <- Hitters$Salary
```
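--

For example, we can quickly verify that the factor columns (`League`, `Division`, `NewLeague`) were expanded into 0-1 dummy variables:

```r
# dimensions and column names of the design matrix built by model.matrix()
dim(model_x)
colnames(model_x)
```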
--

What do the initial regression coefficients look like?

```r
init_reg_fit <- lm(Salary ~ ., Hitters)
coef(init_reg_fit)
```

```
##  (Intercept)        AtBat         Hits        HmRun         Runs          RBI        Walks
##  163.1035878   -1.9798729    7.5007675    4.3308829   -2.3762100   -1.0449620    6.2312863
##        Years       CAtBat        CHits       CHmRun        CRuns         CRBI       CWalks
##   -3.4890543   -0.1713405    0.1339910   -0.1728611    1.4543049    0.8077088   -0.8115709
##      LeagueN    DivisionW      PutOuts      Assists       Errors   NewLeagueN
##   62.5994230 -116.8492456    0.2818925    0.3710692   -3.3607605  -24.7623251
```

---

## Ridge regression example

We use cross-validation to select `\(\lambda\)` with `cv.glmnet()`, which uses 10 folds by default

- specify ridge regression with `alpha = 0`

```r
fit_ridge_cv <- cv.glmnet(model_x, model_y, alpha = 0)
plot(fit_ridge_cv)
```

<img src="13-shrinkage_files/figure-html/ridge-ex-1.png" width="504" style="display: block; margin: auto;" />

---

## Ridge regression example

.pull-left[

Coefficients using the __1 standard error rule__ `\(\lambda\)`

```r
coef(fit_ridge_cv)
```

```
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 185.946732481
## AtBat         0.096634022
## Hits          0.408580477
## HmRun         1.242303538
## Runs          0.650047294
## RBI           0.642033634
## Walks         0.848737420
## Years         2.608433223
## CAtBat        0.008188531
## CHits         0.031829975
## CHmRun        0.235663247
## CRuns         0.063816873
## CRBI          0.066045116
## CWalks        0.062642350
## LeagueN       4.252099472
## DivisionW   -25.296959245
## PutOuts       0.059902888
## Assists       0.008305300
## Errors       -0.185603401
## NewLeagueN    3.676189320
```

]

.pull-right[

Or specify `\(\lambda\)` with minimum cross-validation error

```r
coef(fit_ridge_cv, fit_ridge_cv$lambda.min)
```

```
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 11.089425330
## AtBat       -0.007441506
## Hits         1.116358483
## HmRun       -0.155150648
## Runs         1.136603508
## RBI          0.866089829
## Walks        1.921230858
## Years       -0.774801522
## CAtBat       0.010834527
## CHits        0.070207840
## CHmRun       0.481333855
## CRuns        0.139206329
## CRBI         0.148340356
## CWalks       0.009253631
## LeagueN     30.327721487
## DivisionW  -98.193701089
## PutOuts      0.205034904
## Assists      0.052568534
## Errors      -2.091238621
## NewLeagueN   5.279536552
```

]

---

## Lasso regression example

Similar to ridge, but specify `alpha = 1`:

```r
fit_lasso_cv <- cv.glmnet(model_x, model_y, alpha = 1)
plot(fit_lasso_cv)
```

<img src="13-shrinkage_files/figure-html/lasso-ex-1.png" width="504" style="display: block; margin: auto;" />

---

## Lasso regression example

.pull-left[

Coefficients using the __1 standard error rule__ `\(\lambda\)`

```r
coef(fit_lasso_cv)
```

```
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 193.74263886
## AtBat         .
## Hits          1.21471320
## HmRun         .
## Runs          .
## RBI           .
## Walks         1.28957901
## Years         .
## CAtBat        .
## CHits         .
## CHmRun        .
## CRuns         0.12923755
## CRBI          0.31515925
## CWalks        .
## LeagueN       .
## DivisionW     .
## PutOuts       0.02533305
## Assists       .
## Errors        .
## NewLeagueN    .
```

]

.pull-right[

Or specify `\(\lambda\)` with minimum cross-validation error

```r
coef(fit_lasso_cv, fit_lasso_cv$lambda.min)
```

```
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  1.395005e+02
## AtBat       -1.734125e+00
## Hits         6.059861e+00
## HmRun        1.708102e-01
## Runs         .
## RBI          .
## Walks        5.053826e+00
## Years       -1.061379e+01
## CAtBat      -2.483637e-05
## CHits        .
## CHmRun       5.721414e-01
## CRuns        7.218106e-01
## CRBI         3.856339e-01
## CWalks      -6.033666e-01
## LeagueN      3.339735e+01
## DivisionW   -1.193924e+02
## PutOuts      2.772560e-01
## Assists      2.104828e-01
## Errors      -2.371468e+00
## NewLeagueN   .
```

]
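---

## Lasso regression example

A quick follow-up check (exact values will vary with the cross-validation folds): compare the two `\(\lambda\)` choices and count how many coefficients each one keeps

```r
# lambda from the 1 standard error rule vs. lambda minimizing CV error
fit_lasso_cv$lambda.1se
fit_lasso_cv$lambda.min
# number of non-zero coefficients at each choice (dropping the intercept row)
sum(as.matrix(coef(fit_lasso_cv, s = "lambda.1se"))[-1, ] != 0)
sum(as.matrix(coef(fit_lasso_cv, s = "lambda.min"))[-1, ] != 0)
```

The 1 standard error rule favors the sparser model, matching the coefficient listings above.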
---

## Elastic net example

Need to tune both `\(\lambda\)` and `\(\alpha\)`

- can do so manually with our own folds

```r
set.seed(2020)
fold_id <- sample(rep(1:10, length.out = nrow(model_x)))
```

Then use cross-validation with these folds for different candidate `alpha` values:

```r
cv_en_25 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = .25)
cv_en_50 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = .5)
cv_ridge <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0)
cv_lasso <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 1)
```

Can see which one had the lowest CV error among its candidate `\(\lambda\)` values:

```r
which.min(c(min(cv_en_25$cvm), min(cv_en_50$cvm), min(cv_ridge$cvm), min(cv_lasso$cvm)))
```

```
## [1] 4
```
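---

## Elastic net example

Here the fourth candidate, `cv_lasso`, had the lowest CV error. One possible next step (a minimal sketch using `glmnet`'s `predict()` method and the objects created above) is to generate predictions from the selected fit:

```r
# predicted salaries at the lambda minimizing CV error;
# swap in s = "lambda.1se" to use the 1 standard error rule instead
head(predict(cv_lasso, newx = model_x, s = "lambda.min"))
```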