class: center, middle, inverse, title-slide

# Machine learning
## Random forests and gradient-boosted trees
### Ron Yurko
### 07/15/2020

---

## Decision trees review

Decision trees partition training data into __homogeneous nodes / subgroups__ with similar response values

__Pros__

- Decision trees are __very easy to explain__ to non-statisticians.

- Easy to visualize and thus easy to interpret __without assuming a parametric form__

--

__Cons__

- High variance, i.e. split a dataset in half and grow trees in each half, and the results will be very different

- Related note: __they generalize poorly, resulting in higher test set error rates__

--

But there are several ways we can overcome this via __ensemble models__

---

## Bagging

__Bootstrap aggregation__ (aka bagging) is a general approach for overcoming high variance

--

- __Bootstrap__: sample the training data _with replacement_

<img src="" width="60%" style="display: block; margin: auto;" />

--

- __Aggregation__: Combine the results from many trees together, each constructed with a different bootstrapped sample of the data

---

## Bagging algorithm

Start with a __specified number of trees `\(B\)`__:

--

- For each tree `\(b\)` in `\(1, \dots, B\)`:

  - Construct a bootstrap sample from the training data

--

  - Grow a deep, unpruned, complicated (aka really overfit!) tree

--

To generate a prediction for a new point:

- __Regression__: take the __average__ across the `\(B\)` trees

--

- __Classification__: take the __majority vote__ across the `\(B\)` trees

  - assuming each tree predicts a single class (could use probabilities instead...)

--

Improves prediction accuracy via __wisdom of the crowds__ - but at the expense of interpretability

- Easy to read one tree, but how do you read `\(B = 500\)`?
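
The bagging steps above can be sketched by hand in a few lines. This is only an illustrative sketch, not how any package implements it: it assumes `rpart` as the tree learner and the built-in `mtcars` data (neither appears elsewhere in these slides), and `B = 100` is an arbitrary choice.

```r
library(rpart)  # assumption: rpart is used here as the tree learner

set.seed(2020)
B <- 100            # number of trees (illustrative choice)
n <- nrow(mtcars)   # mtcars is a stand-in dataset, not the MLB data

# For each b in 1, ..., B: bootstrap the rows, then grow a deep, unpruned tree
bagged_trees <- lapply(seq_len(B), function(b) {
  boot_rows <- sample(n, size = n, replace = TRUE)
  rpart(mpg ~ ., data = mtcars[boot_rows, ],
        control = rpart.control(cp = 0, minsplit = 2, xval = 0))
})

# Regression prediction: average the B individual tree predictions
tree_preds <- sapply(bagged_trees, predict, newdata = mtcars)  # n x B matrix
bagged_pred <- rowMeans(tree_preds)
head(bagged_pred)
```

Setting `cp = 0` and `minsplit = 2` turns off `rpart`'s usual pruning safeguards, which is what makes each individual tree deliberately overfit.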
--

But we can still use the measures of __variable importance__ and __partial dependence__ to summarize our models

---

## Random forests algorithm

Random forests are __an extension of bagging__

--

- For each tree `\(b\)` in `\(1, \dots, B\)`:

  - Construct a bootstrap sample from the training data

--

  - Grow a deep, unpruned, complicated (aka really overfit!) tree __but with a twist__

--

  - __At each split__: limit the variables considered to a __random subset__ of `\(m_{try}\)` of the original `\(p\)` variables

--

Predictions are made the same way as bagging:

- __Regression__: take the __average__ across the `\(B\)` trees

- __Classification__: take the __majority vote__ across the `\(B\)` trees

--

__Split-variable randomization__ adds more randomness to make __the trees more independent of one another__

--

Introduce `\(m_{try}\)` as a tuning parameter: typically use `\(p / 3\)` (regression) or `\(\sqrt{p}\)` (classification)

- `\(m_{try} = p\)` is bagging

---

## Example data: MLB 2019 batting statistics

Downloaded MLB 2019 batting statistics leaderboard from [Fangraphs](

```r
library(tidyverse)
# Read the leaderboard, standardize column names, and
# convert the percentage strings to numeric values
mlb_data <- read_csv("") %>%
  janitor::clean_names() %>%
  mutate_at(vars(bb_percent:k_percent), parse_number)
# Drop the identifier columns before modeling
model_mlb_data <- mlb_data %>%
  dplyr::select(-name, -team, -playerid)
head(model_mlb_data)
```

```
## # A tibble: 6 x 19
##       g    pa    hr     r   rbi    sb bb_percent k_percent   iso babip   avg   obp   slg w_oba  w_rc  bs_r   off
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   134   600    45   110   104    11       18.3        20 0.353 0.298 0.291 0.438 0.645 0.436   180   7.1  68.2
## 2   156   690    41   122   112     5       17.2        12 0.296 0.281 0.296 0.423 0.592 0.418   168  -2.1  58.2
## 3   130   580    44   100    97    30       13.8      20.3 0.342 0.355 0.329 0.429 0.671 0.442   174   8.5  65.2
## 4   156   660    47   121   115    15       14.4      16.4 0.324 0.302 0.305 0.406 0.629 0.415   162   1.4  55.3
## 5   162   747    33   123    92    10       11.6      13.7 0.237 0.294 0.285 0.369 0.522 0.373   137   1.7  37.5
## 6   144   628    32    97    92    10        8.4      13.7 0.264 0.342 0.329 0.389 0.592 0.405   150   4.2    46
## # … with 2 more variables: def <dbl>, war <dbl>
```

---

## Example using [`ranger`](

The `ranger` package is a popular, fast implementation ([see `randomForest`]( for the original)

```r
library(ranger)
# Fit a random forest for WAR using all other columns as predictors
init_mlb_rf <- ranger(war ~ ., data = model_mlb_data,
                      num.trees = 50, importance = "impurity")
init_mlb_rf
```

```
## Ranger result
## 
## Call:
##  ranger(war ~ ., data = model_mlb_data, num.trees = 50, importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      135 
## Number of independent variables:  18 
## Mtry:                             4 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.6561859 
## R squared (OOB):                  0.8387718
```

---

## Out-of-bag estimate

Since the trees are constructed via bootstrapped data (samples with replacement), each sample _is likely to have duplicate observations / rows_

__Out-of-bag (OOB)__: the original observations not contained in a given bootstrap sample

- Can use the OOB samples to estimate predictive performance (the OOB estimate becomes more reliable with larger datasets)

- On average `\(\approx 63\)`% of the original data ends up in any particular bootstrap sample

---

## Variable importance

```r
library(vip)
vip(init_mlb_rf, geom = "point") + theme_bw()
```

<img src="18-rf-boosting_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---

## Tuning random forests

Unfortunately `caret` does not let you tune the number of trees - typically the error goes down with more trees (_Exercise: check out CV performance as a function of the number of trees on your own, compare with OOB error_)

.pull-left[

- __Important__: `\(m_{try}\)`

- Marginal: tree complexity, splitting rule, sampling scheme

```r
library(caret)
# Only mtry varies; the splitting rule and node size are held fixed
rf_tune_grid <-
  expand.grid(mtry = seq(3, 18, by = 3),
              splitrule = "variance",
              min.node.size = 5)
set.seed(1917)
caret_mlb_rf <-
  train(war ~ ., data = model_mlb_data,
        method = "ranger", num.trees = 50,
        trControl =
          trainControl(method = "cv", number = 5),
        tuneGrid = rf_tune_grid)
```
]

.pull-right[

```r
plot(caret_mlb_rf)
```

<img src="18-rf-boosting_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Boosting

Build ensemble models __sequentially__

--

- start with a __weak learner__, e.g. small decision tree with few splits

--

- each model in the sequence _slightly_ improves upon the predictions of the previous models __by focusing on the observations with the largest errors / residuals__

<img src="" width="80%" style="display: block; margin: auto;" />

---

## Boosted trees algorithm

Write the prediction at step `\(t\)` of the search as `\(\hat{y}_i^{(t)}\)`, start with `\(\hat{y}_i^{(0)} = 0\)`

- Fit the first decision tree `\(f_1\)` to the data: `\(\hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i) = f_1(x_i)\)`

--

- Fit the next tree `\(f_2\)` to the residuals of the previous: `\(y_i - \hat{y}_i^{(1)}\)`

- Add this to the prediction: `\(\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + f_2(x_i) = f_1(x_i) + f_2(x_i)\)`

--

- Fit the next tree `\(f_3\)` to the residuals of the previous: `\(y_i - \hat{y}_i^{(2)}\)`

- Add this to the prediction: `\(\hat{y}_i^{(3)} = \hat{y}_i^{(2)} + f_3(x_i) = f_1(x_i) + f_2(x_i) + f_3(x_i)\)`

--

__Continue until some stopping criterion is met__ to reach the final model as a __sum of trees__:

`$$\hat{y}_i = f(x_i) = \sum_{b=1}^B f_b(x_i)$$`

---

## Visual example of boosting in action

<img src="" width="80%" style="display: block; margin: auto;" />

---

## Gradient boosted trees

The regression boosting algorithm can be generalized to other loss functions via __gradient descent__ - leading to gradient boosted trees, aka __gradient boosting machines (GBMs)__

Update the model parameters in the direction of the loss function's descending gradient

<img src="" width="60%" style="display: block; margin: auto;" />

---

## Tune the learning rate in gradient descent

We need to control how much we update by in each step - __the learning rate__

<img src="" width="100%" style="display: block; margin: auto;" />

---

## Stochastic gradient descent can help with complex loss functions

Can take random samples of the data when updating - makes the algorithm faster and adds randomness that can help it get closer to the global minimum (no guarantees!)

<img src="" width="100%" style="display: block; margin: auto;" />

---

## eXtreme gradient boosting with [XGBoost](

<img src="" width="80%" style="display: block; margin: auto;" />

---

## Tuning GBMs with [`xgboost`](

__XGBoost__ (extreme gradient boosting) is a very powerful, efficient boosting library that is available to use within `R` via the [`xgboost`]( package

--

What we have to consider tuning (our __hyperparameters__):

- number of trees `\(B\)` (`nrounds`)

- learning rate (`eta`), i.e. how much we update in each step

- these two really have to be tuned together

--

- complexity of the trees (depth, number of observations in nodes)

--

- XGBoost also provides more __regularization__ (via `gamma`) and early stopping

--

__More work to tune properly as compared to random forests__

- But GBMs have more flexibility in their usage for particular objective functions

- _Insert with great power comes great responsibility meme_

---

## XGBoost example

```r
library(xgboost)
# Vary the number of trees, learning rate, and tree depth; hold the rest fixed
xgboost_tune_grid <- expand.grid(nrounds = seq(from = 20, to = 200, by = 20),
                                 eta = c(0.025, 0.05, 0.1, 0.3),
                                 gamma = 0,
                                 max_depth = c(1, 2, 3, 4),
                                 colsample_bytree = 1,
                                 min_child_weight = 1,
                                 subsample = 1)
xgboost_tune_control <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
set.seed(1937)
xgb_tune <- train(x = as.matrix(dplyr::select(model_mlb_data, -war)),
                  y = model_mlb_data$war,
                  trControl = xgboost_tune_control,
                  tuneGrid = xgboost_tune_grid,
                  method = "xgbTree",
                  verbose = TRUE)
xgb_tune$bestTune
```

```
##     nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 130     200         1 0.3     0                1                1         1
```

---

## XGBoost example

```r
xgb_fit_final <- xgboost(data =
                           as.matrix(dplyr::select(model_mlb_data, -war)),
                         label = model_mlb_data$war,
                         # "reg:linear" is deprecated in favor of "reg:squarederror"
                         objective = "reg:squarederror",
                         nrounds = xgb_tune$bestTune$nrounds,
                         params = as.list(dplyr::select(xgb_tune$bestTune, -nrounds)),
                         verbose = 0)
vip(xgb_fit_final) + theme_bw()
```

<img src="18-rf-boosting_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />

---

## XGBoost example

```r
library(pdp)
# Partial dependence of predicted WAR on the `off` variable
partial(xgb_fit_final, pred.var = "off",
        train = as.matrix(dplyr::select(model_mlb_data, -war)),
        plot.engine = "ggplot2", plot = TRUE) + theme_bw()
```

<img src="18-rf-boosting_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />
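
---

## Boosting by hand: a minimal sketch

To make the residual-fitting recursion from the boosted trees algorithm slide concrete, here is a rough sketch of it for squared error loss. It is not how XGBoost works internally: it assumes `rpart` stumps as the weak learner and the built-in `mtcars` data (both stand-ins that do not appear elsewhere in these slides), and the `learning_rate` shrinkage is an addition to the plain sum of trees (set it to 1 to recover that recursion exactly).

```r
library(rpart)  # assumption: rpart stumps stand in for the weak learner

features <- mtcars[, setdiff(names(mtcars), "mpg")]  # stand-in predictors
y <- mtcars$mpg                                      # stand-in response

n_trees <- 50                  # B (illustrative choice)
learning_rate <- 0.1           # shrinkage; 1 gives the plain sum of trees
pred <- rep(0, length(y))      # initial prediction is 0, as on the slide
train_mse <- numeric(n_trees)

for (b in seq_len(n_trees)) {
  # Fit a stump (a single split) to the current residuals
  boost_data <- cbind(features, resid = y - pred)
  stump <- rpart(resid ~ ., data = boost_data,
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  # Add the shrunken stump to the running prediction
  pred <- pred + learning_rate * predict(stump, newdata = boost_data)
  train_mse[b] <- mean((y - pred)^2)
}

# Training error after 1, 10, and 50 trees
round(train_mse[c(1, 10, 50)], 2)
```

The training error can only go down here because each stump is a least-squares fit to the current residuals; test error is a different story, which is why `nrounds` and `eta` need to be tuned together.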