class: center, middle, inverse, title-slide

# Machine learning
## Random forests and gradient-boosted trees
### July 14th, 2021

---

## Decision trees review

Decision trees partition the training data into __homogeneous nodes / subgroups__ with similar response values

__Pros__

- Decision trees are __very easy to explain__ to non-statisticians

- Easy to visualize and thus easy to interpret __without assuming a parametric form__

--

__Cons__

- High variance, i.e., split a dataset in half and grow trees on each half, and the results will be very different

- Related note: __they generalize poorly, resulting in higher test set error rates__

--

But there are several ways we can overcome this via __ensemble models__

---

## Bagging

__Bootstrap aggregation__ (aka bagging) is a general approach for overcoming high variance

--

- __Bootstrap__: sample the training data _with replacement_

<img src="https://bradleyboehmke.github.io/HOML/images/bootstrap-scheme.png" width="60%" style="display: block; margin: auto;" />

--

- __Aggregation__: combine the results from many trees together, each constructed with a different bootstrapped sample of the data

---

## Bagging algorithm

Start with a __specified number of trees `\(B\)`__:

--

- For each tree `\(b\)` in `\(1, \dots, B\)`:

  - Construct a bootstrap sample from the training data

--

  - Grow a deep, unpruned, complicated (aka really overfit!) tree

--

To generate a prediction for a new point (a toy `R` sketch follows on the next slide):

- __Regression__: take the __average__ across the `\(B\)` trees

--

- __Classification__: take the __majority vote__ across the `\(B\)` trees

  - assuming each tree predicts a single class (could use probabilities instead...)

--

Improves prediction accuracy via the __wisdom of the crowd__ - but at the expense of interpretability

- Easy to read one tree, but how do you read `\(B = 500\)`?

--

But we can still use the measures of __variable importance__ and __partial dependence__ to summarize our models
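
---

## Sketch: bagging by hand

A minimal sketch of the bagging procedure from the previous slide, fitting `rpart` trees to bootstrap samples of the built-in `mtcars` data and averaging their predictions (illustration only - the data and tree settings here are placeholders; in practice use a dedicated implementation like `ranger`)


```r
library(rpart)

set.seed(2021)
n_trees <- 100                     # the number of trees B
n <- nrow(mtcars)

bagged_trees <- lapply(1:n_trees, function(b) {
  # bootstrap sample: draw n rows with replacement
  boot_rows <- sample(n, size = n, replace = TRUE)
  # grow a deep, unpruned tree on the bootstrap sample
  rpart(mpg ~ ., data = mtcars[boot_rows, ],
        control = rpart.control(minsplit = 2, cp = 0))
})

# regression prediction = average of the B trees' predictions
bagged_preds <- rowMeans(sapply(bagged_trees, predict, newdata = mtcars))
head(bagged_preds)
```
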

---

## Random forests algorithm

Random forests are __an extension of bagging__

--

- For each tree `\(b\)` in `\(1, \dots, B\)`:

  - Construct a bootstrap sample from the training data

--

  - Grow a deep, unpruned, complicated (aka really overfit!) tree __but with a twist__

--

  - __At each split__: limit the variables considered to a __random subset__ `\(m_{try}\)` of the original `\(p\)` variables

--

Predictions are made the same way as bagging:

- __Regression__: take the __average__ across the `\(B\)` trees

- __Classification__: take the __majority vote__ across the `\(B\)` trees

--

__Split-variable randomization__ adds more randomness to make __the trees more independent of one another__

--

Introduce `\(m_{try}\)` as a tuning parameter: typically use `\(p / 3\)` (regression) or `\(\sqrt{p}\)` (classification)

- `\(m_{try} = p\)`

--

  is bagging

---

## Example data: MLB 2021 batting statistics

Downloaded MLB 2021 batting statistics leaderboard from [Fangraphs](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2021&month=0&season1=2021&ind=0)


```r
library(tidyverse)
mlb_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/fg_batting_2021.csv") %>%
  janitor::clean_names() %>%
  mutate_at(vars(bb_percent:k_percent), parse_number)
model_mlb_data <- mlb_data %>%
  dplyr::select(-name, -team, -playerid)
head(model_mlb_data)
```

```
## # A tibble: 6 x 20
##       g    pa    hr     r   rbi    sb bb_percent k_percent
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>
## 1    82   354    27    66    69     2       14.4      17.2
## 2    68   288    27    66    58    18       12.5      28.1
## 3    79   347    16    61    52     0       13.5      17  
## 4    82   372    21    63    54    10        8.9      23.9
## 5    78   342    23    67    51    16       13.2      24.3
## 6    82   322    31    60    67    12       11.2      28  
## # … with 12 more variables: iso <dbl>, babip <dbl>,
## #   avg <dbl>, obp <dbl>, slg <dbl>, w_oba <dbl>,
## #   xw_oba <dbl>, w_rc <dbl>, bs_r <dbl>, off <dbl>,
## #   def <dbl>, war <dbl>
```

---

## Example using [`ranger`](https://github.com/imbs-hl/ranger)

The `ranger` package is a popular / fast implementation ([see `randomForest`](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) for the original)


```r
library(ranger)
init_mlb_rf <- ranger(war ~ ., data = model_mlb_data, num.trees = 50,
                      importance = "impurity")
init_mlb_rf
```

```
## Ranger result
## 
## Call:
##  ranger(war ~ ., data = model_mlb_data, num.trees = 50, importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      135 
## Number of independent variables:  19 
## Mtry:                             4 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.1646824 
## R squared (OOB):                  0.8554191 
```

---

## Out-of-bag estimate

Since the trees are constructed from bootstrapped data (samples drawn with replacement), each sample _is likely to have duplicate observations / rows_

__Out-of-bag (OOB)__ - original observations not contained in a single bootstrap sample

- Can use the OOB samples to estimate predictive performance (the OOB estimate becomes better with larger datasets)

- On average, `\(\approx 63\)`% of the original data ends up in any particular bootstrap sample (a quick check follows on the next slide)
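
---

## Aside: checking the 63% rule

A quick simulation-based sanity check of the `\(\approx 63\)`% claim, written in base `R` (not part of the original analysis - the sample size simply mirrors the `\(n = 135\)` hitters in the MLB data)


```r
set.seed(2021)
n <- 135   # sample size, matching the MLB batting data

# average proportion of distinct original rows appearing in a bootstrap sample
mean(replicate(1000, {
  boot_rows <- sample(n, size = n, replace = TRUE)
  length(unique(boot_rows)) / n
}))

# closed-form version: P(row i is in a bootstrap sample) = 1 - (1 - 1/n)^n,
# which approaches 1 - exp(-1), i.e. roughly 0.632, as n grows
1 - (1 - 1/n)^n
```
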

---

## Variable importance


```r
library(vip)
vip(init_mlb_rf, geom = "point") + theme_bw()
```

<img src="21-Tree-ensembles_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---

## Tuning random forests

Unfortunately, `caret` does not let you tune the number of trees - typically the error goes down with more trees

(_Exercise: check out CV performance as a function of the number of trees on your own, and compare with the OOB error_)

.pull-left[

- __Important__: `\(m_{try}\)`

- Marginal: tree complexity, splitting rule, sampling scheme


```r
library(caret)
rf_tune_grid <- 
  expand.grid(mtry = seq(3, 18, by = 3),
              splitrule = "variance",
              min.node.size = 5)
set.seed(1917)
caret_mlb_rf <- 
  train(war ~ ., data = model_mlb_data,
        method = "ranger", num.trees = 50,
        trControl = trainControl(method = "cv", number = 5),
        tuneGrid = rf_tune_grid)
```

]

.pull-right[


```r
ggplot(caret_mlb_rf) + theme_bw()
```

<img src="21-Tree-ensembles_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Boosting

Build ensemble models __sequentially__

--

- start with a __weak learner__, e.g., a small decision tree with few splits

--

- each model in the sequence _slightly_ improves upon the predictions of the previous models __by focusing on the observations with the largest errors / residuals__

<img src="https://bradleyboehmke.github.io/HOML/images/boosted-trees-process.png" width="80%" style="display: block; margin: auto;" />

---

## Boosted trees algorithm

Write the prediction at step `\(t\)` of the search as `\(\hat{y}_i^{(t)}\)`, and start with `\(\hat{y}_i^{(0)} = 0\)`

- Fit the first decision tree `\(f_1\)` to the data: `\(\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\)`

--

- Fit the next tree `\(f_2\)` to the residuals of the previous: `\(y_i - \hat{y}_i^{(1)}\)`

- Add this to the prediction: `\(\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + f_2(x_i) = f_1(x_i) + f_2(x_i)\)`

--

- Fit the next tree `\(f_3\)` to the residuals of the previous: `\(y_i - \hat{y}_i^{(2)}\)`

- Add this to the prediction: `\(\hat{y}_i^{(3)} = \hat{y}_i^{(2)} + f_3(x_i) = f_1(x_i) + f_2(x_i) + f_3(x_i)\)`

--

__Continue until some stopping criterion is reached__, giving a final model that is a __sum of trees__ (a toy `R` version of this loop follows after the next slide):

`$$\hat{y}_i = f(x_i) = \sum_{b=1}^B f_b(x_i)$$`

---

## Visual example of boosting in action

<img src="https://bradleyboehmke.github.io/HOML/10-gradient-boosting_files/figure-html/boosting-in-action-1.png" width="80%" style="display: block; margin: auto;" />
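
---

## Sketch: the boosting loop by hand

A minimal sketch of the boosting loop from the previous slides, using shallow `rpart` trees as the weak learners on the MLB data (illustration only - the number of iterations and tree depth are arbitrary placeholders; `xgboost`, coming up shortly, is what you would actually use)


```r
library(rpart)

n_rounds <- 100                          # number of boosting iterations B
y <- model_mlb_data$war
y_hat <- rep(0, length(y))               # start with predictions of 0
boost_trees <- vector("list", n_rounds)

for (b in 1:n_rounds) {
  # fit a shallow tree (a stump) to the current residuals
  train_b <- model_mlb_data
  train_b$war <- y - y_hat               # replace the response with the residuals
  boost_trees[[b]] <- rpart(war ~ ., data = train_b,
                            control = rpart.control(maxdepth = 1))
  # add this tree's predictions to the running total
  y_hat <- y_hat + predict(boost_trees[[b]], newdata = model_mlb_data)
}

# y_hat is now the sum of the B trees' predictions
```

Note there is no learning rate here - shrinking each tree's contribution (the `eta` on the upcoming slides) is one of the key additions in gradient boosting
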
<img src="https://bradleyboehmke.github.io/HOML/10-gradient-boosting_files/figure-html/stochastic-gradient-descent-fig-1.png" width="100%" style="display: block; margin: auto;" /> --- ## eXtreme gradient boosting with [XGBoost](https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html) <img src="https://media1.tenor.com/images/6e86cad4899cd02047c33ecfc0e4052b/tenor.gif?itemid=11446810" width="80%" style="display: block; margin: auto;" /> --- ## Tuning GBMs with [`xgboost`](https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html) __XGBoost__ (extreme gradient boosting) is a very powerful, efficient boosting library that is available to use within `R` via the [`xgboost`](https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html) package -- What we have to consider tuning (our __hyperparameters__): - number of trees `\(B\)` (`nrounds`) - learning rate (`eta`), i.e. how much we update in each step - these two really have to be tuned together -- - complexity of the trees (depth, number of observations in nodes) -- - XGBoost also provides more __regularization__ (via `gamma`) and early stopping -- __More work to tune properly as compared to random forests__ - But GBMs have more flexibility in their usage for particular objective functions - _Insert with great power comes great responsibility meme_ --- ## XGBoost example ```r library(xgboost) xgboost_tune_grid <- expand.grid(nrounds = seq(from = 20, to = 200, by = 20), eta = c(0.025, 0.05, 0.1, 0.3), gamma = 0, max_depth = c(1, 2, 3, 4), colsample_bytree = 1, min_child_weight = 1, subsample = 1) xgboost_tune_control <- trainControl(method = "cv", number = 5, verboseIter = FALSE) set.seed(1937) xgb_tune <- train(x = as.matrix(dplyr::select(model_mlb_data, -war)), y = model_mlb_data$war, trControl = xgboost_tune_control, tuneGrid = xgboost_tune_grid, objective = "reg:squarederror", method = "xgbTree", verbose = TRUE) xgb_tune$bestTune ``` ``` ## nrounds max_depth eta gamma colsample_bytree ## 130 200 1 0.3 0 1 ## min_child_weight subsample ## 130 1 1 ``` --- ## XGBoost example ```r xgb_fit_final <- xgboost(data = as.matrix(dplyr::select(model_mlb_data, -war)), label = model_mlb_data$war, objective = "reg:squarederror", nrounds = xgb_tune$bestTune$nrounds, params = as.list(dplyr::select(xgb_tune$bestTune, -nrounds)), verbose = 0) vip(xgb_fit_final) + theme_bw() ``` <img src="21-Tree-ensembles_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" /> --- ## XGBoost example ```r library(pdp) partial(xgb_fit_final, pred.var = "off", train = as.matrix(dplyr::select(model_mlb_data, -war)), plot.engine = "ggplot2", plot = TRUE, type = "regression") + theme_bw() ``` <img src="21-Tree-ensembles_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />