class: center, middle, inverse, title-slide

# Machine learning
## Random forests and gradient-boosted trees
### July 14th, 2021

---

## Decision trees review

Decision trees partition the training data into __homogeneous nodes / subgroups__ with similar response values

__Pros__

- Decision trees are __very easy to explain__ to non-statisticians

- Easy to visualize and thus easy to interpret __without assuming a parametric form__

--

__Cons__

- High variance, i.e., split a dataset in half and grow trees on each half, and the results will be very different

- Related note: __they generalize poorly, resulting in higher test set error rates__

--

But there are several ways we can overcome this via __ensemble models__

---

## Bagging

__Bootstrap aggregation__ (aka bagging) is a general approach for overcoming high variance

--

- __Bootstrap__: sample the training data _with replacement_

<img src="https://bradleyboehmke.github.io/HOML/images/bootstrap-scheme.png" width="60%" style="display: block; margin: auto;" />

--

- __Aggregation__: combine the results from many trees together, each constructed with a different bootstrapped sample of the data

---

## Bagging algorithm

Start with a __specified number of trees `\(B\)`__:

--

- For each tree `\(b\)` in `\(1, \dots, B\)`:

  - Construct a bootstrap sample from the training data

--

  - Grow a deep, unpruned, complicated (aka really overfit!) tree

--

To generate a prediction for a new point (a toy `R` sketch follows on the next slide):

- __Regression__: take the __average__ across the `\(B\)` trees

--

- __Classification__: take the __majority vote__ across the `\(B\)` trees

  - assuming each tree predicts a single class (could use probabilities instead...)

--

Improves prediction accuracy via the __wisdom of the crowd__ - but at the expense of interpretability

- Easy to read one tree, but how do you read `\(B = 500\)`?

--

But we can still use the measures of __variable importance__ and __partial dependence__ to summarize our models
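
---

## Sketch: bagging by hand

A minimal sketch of the bagging procedure from the previous slide, fitting `rpart` trees to bootstrap samples of the built-in `mtcars` data and averaging their predictions (illustration only - the data and tree settings here are placeholders; in practice use a dedicated implementation like `ranger`)


```r
library(rpart)

set.seed(2021)
n_trees <- 100                     # the number of trees B
n <- nrow(mtcars)

bagged_trees <- lapply(1:n_trees, function(b) {
  # bootstrap sample: draw n rows with replacement
  boot_rows <- sample(n, size = n, replace = TRUE)
  # grow a deep, unpruned tree on the bootstrap sample
  rpart(mpg ~ ., data = mtcars[boot_rows, ],
        control = rpart.control(minsplit = 2, cp = 0))
})

# regression prediction = average of the B trees' predictions
bagged_preds <- rowMeans(sapply(bagged_trees, predict, newdata = mtcars))
head(bagged_preds)
```
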

---

## Random forests algorithm

Random forests are __an extension of bagging__

--

- For each tree `\(b\)` in `\(1, \dots, B\)`:

  - Construct a bootstrap sample from the training data

--

  - Grow a deep, unpruned, complicated (aka really overfit!) tree __but with a twist__

--

  - __At each split__: limit the variables considered to a __random subset__ `\(m_{try}\)` of the original `\(p\)` variables

--

Predictions are made the same way as bagging:

- __Regression__: take the __average__ across the `\(B\)` trees

- __Classification__: take the __majority vote__ across the `\(B\)` trees

--

__Split-variable randomization__ adds more randomness to make __the trees more independent of one another__

--

Introduce `\(m_{try}\)` as a tuning parameter: typically use `\(p / 3\)` (regression) or `\(\sqrt{p}\)` (classification)

- `\(m_{try} = p\)`

--

  is bagging

---

## Example data: MLB 2021 batting statistics

Downloaded MLB 2021 batting statistics leaderboard from [Fangraphs](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2021&month=0&season1=2021&ind=0)


```r
library(tidyverse)
mlb_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/fg_batting_2021.csv") %>%
  janitor::clean_names() %>%
  mutate_at(vars(bb_percent:k_percent), parse_number)
model_mlb_data <- mlb_data %>%
  dplyr::select(-name, -team, -playerid)
head(model_mlb_data)
```

```
## # A tibble: 6 x 20
##       g    pa    hr     r   rbi    sb bb_percent k_percent
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>
## 1    82   354    27    66    69     2       14.4      17.2
## 2    68   288    27    66    58    18       12.5      28.1
## 3    79   347    16    61    52     0       13.5      17  
## 4    82   372    21    63    54    10        8.9      23.9
## 5    78   342    23    67    51    16       13.2      24.3
## 6    82   322    31    60    67    12       11.2      28  
## # … with 12 more variables: iso <dbl>, babip <dbl>,
## #   avg <dbl>, obp <dbl>, slg <dbl>, w_oba <dbl>,
## #   xw_oba <dbl>, w_rc <dbl>, bs_r <dbl>, off <dbl>,
## #   def <dbl>, war <dbl>
```

---

## Example using [`ranger`](https://github.com/imbs-hl/ranger)

The `ranger` package is a popular / fast implementation ([see `randomForest`](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) for the original)


```r
library(ranger)
init_mlb_rf <- ranger(war ~ ., data = model_mlb_data, num.trees = 50,
                      importance = "impurity")
init_mlb_rf
```

```
## Ranger result
## 
## Call:
##  ranger(war ~ ., data = model_mlb_data, num.trees = 50, importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      135 
## Number of independent variables:  19 
## Mtry:                             4 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.1646824 
## R squared (OOB):                  0.8554191 
```

---

## Out-of-bag estimate

Since the trees are constructed from bootstrapped data (samples drawn with replacement), each sample _is likely to have duplicate observations / rows_

__Out-of-bag (OOB)__ - original observations not contained in a single bootstrap sample

- Can use the OOB samples to estimate predictive performance (the OOB estimate becomes better with larger datasets)

- On average, `\(\approx 63\)`% of the original data ends up in any particular bootstrap sample (a quick check follows on the next slide)
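
---

## Aside: checking the 63% rule

A quick simulation-based sanity check of the `\(\approx 63\)`% claim, written in base `R` (not part of the original analysis - the sample size simply mirrors the `\(n = 135\)` hitters in the MLB data)


```r
set.seed(2021)
n <- 135   # sample size, matching the MLB batting data

# average proportion of distinct original rows appearing in a bootstrap sample
mean(replicate(1000, {
  boot_rows <- sample(n, size = n, replace = TRUE)
  length(unique(boot_rows)) / n
}))

# closed-form version: P(row i is in a bootstrap sample) = 1 - (1 - 1/n)^n,
# which approaches 1 - exp(-1), i.e. roughly 0.632, as n grows
1 - (1 - 1/n)^n
```
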

---

## Variable importance


```r
library(vip)
vip(init_mlb_rf, geom = "point") + theme_bw()
```

<img src="21-Tree-ensembles_files/figure-html/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---

## Tuning random forests

Unfortunately, `caret` does not let you tune the number of trees - typically the error goes down with more trees

(_Exercise: check out CV performance as a function of the number of trees on your own, and compare with the OOB error_)

.pull-left[

- __Important__: `\(m_{try}\)`

- Marginal: tree complexity, splitting rule, sampling scheme


```r
library(caret)
rf_tune_grid <- 
  expand.grid(mtry = seq(3, 18, by = 3),
              splitrule = "variance",
              min.node.size = 5)
set.seed(1917)
caret_mlb_rf <- 
  train(war ~ ., data = model_mlb_data,
        method = "ranger", num.trees = 50,
        trControl = trainControl(method = "cv", number = 5),
        tuneGrid = rf_tune_grid)
```

]

.pull-right[


```r
ggplot(caret_mlb_rf) + theme_bw()
```

<img src="21-Tree-ensembles_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />

]

---

## Boosting

Build ensemble models __sequentially__

--

- start with a __weak learner__, e.g., a small decision tree with few splits

--

- each model in the sequence _slightly_ improves upon the predictions of the previous models __by focusing on the observations with the largest errors / residuals__

<img src="https://bradleyboehmke.github.io/HOML/images/boosted-trees-process.png" width="80%" style="display: block; margin: auto;" />

---

## Boosted trees algorithm

Write the prediction at step `\(t\)` of the search as `\(\hat{y}_i^{(t)}\)`, and start with `\(\hat{y}_i^{(0)} = 0\)`

- Fit the first decision tree `\(f_1\)` to the data: `\(\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\)`

--

- Fit the next tree `\(f_2\)` to the residuals of the previous: `\(y_i - \hat{y}_i^{(1)}\)`

- Add this to the prediction: `\(\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + f_2(x_i) = f_1(x_i) + f_2(x_i)\)`

--

- Fit the next tree `\(f_3\)` to the residuals of the previous: `\(y_i - \hat{y}_i^{(2)}\)`

- Add this to the prediction: `\(\hat{y}_i^{(3)} = \hat{y}_i^{(2)} + f_3(x_i) = f_1(x_i) + f_2(x_i) + f_3(x_i)\)`

--

__Continue until some stopping criterion is reached__, giving a final model that is a __sum of trees__ (a toy `R` version of this loop follows after the next slide):

`$$\hat{y}_i = f(x_i) = \sum_{b=1}^B f_b(x_i)$$`

---

## Visual example of boosting in action

<img src="https://bradleyboehmke.github.io/HOML/10-gradient-boosting_files/figure-html/boosting-in-action-1.png" width="80%" style="display: block; margin: auto;" />
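
---

## Sketch: the boosting loop by hand

A minimal sketch of the boosting loop from the previous slides, using shallow `rpart` trees as the weak learners on the MLB data (illustration only - the number of iterations and tree depth are arbitrary placeholders; `xgboost`, coming up shortly, is what you would actually use)


```r
library(rpart)

n_rounds <- 100                          # number of boosting iterations B
y <- model_mlb_data$war
y_hat <- rep(0, length(y))               # start with predictions of 0
boost_trees <- vector("list", n_rounds)

for (b in 1:n_rounds) {
  # fit a shallow tree (a stump) to the current residuals
  train_b <- model_mlb_data
  train_b$war <- y - y_hat               # replace the response with the residuals
  boost_trees[[b]] <- rpart(war ~ ., data = train_b,
                            control = rpart.control(maxdepth = 1))
  # add this tree's predictions to the running total
  y_hat <- y_hat + predict(boost_trees[[b]], newdata = model_mlb_data)
}

# y_hat is now the sum of the B trees' predictions
```

Note there is no learning rate here - shrinking each tree's contribution (the `eta` on the upcoming slides) is one of the key additions in gradient boosting
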
<img src="https://bradleyboehmke.github.io/HOML/10-gradient-boosting_files/figure-html/stochastic-gradient-descent-fig-1.png" width="100%" style="display: block; margin: auto;" /> --- ## eXtreme gradient boosting with [XGBoost](https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html) <img src="https://media1.tenor.com/images/6e86cad4899cd02047c33ecfc0e4052b/tenor.gif?itemid=11446810" width="80%" style="display: block; margin: auto;" /> --- ## Tuning GBMs with [`xgboost`](https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html) __XGBoost__ (extreme gradient boosting) is a very powerful, efficient boosting library that is available to use within `R` via the [`xgboost`](https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html) package -- What we have to consider tuning (our __hyperparameters__): - number of trees `\(B\)` (`nrounds`) - learning rate (`eta`), i.e. how much we update in each step - these two really have to be tuned together -- - complexity of the trees (depth, number of observations in nodes) -- - XGBoost also provides more __regularization__ (via `gamma`) and early stopping -- __More work to tune properly as compared to random forests__ - But GBMs have more flexibility in their usage for particular objective functions - _Insert with great power comes great responsibility meme_ --- ## XGBoost example ```r library(xgboost) xgboost_tune_grid <- expand.grid(nrounds = seq(from = 20, to = 200, by = 20), eta = c(0.025, 0.05, 0.1, 0.3), gamma = 0, max_depth = c(1, 2, 3, 4), colsample_bytree = 1, min_child_weight = 1, subsample = 1) xgboost_tune_control <- trainControl(method = "cv", number = 5, verboseIter = FALSE) set.seed(1937) xgb_tune <- train(x = as.matrix(dplyr::select(model_mlb_data, -war)), y = model_mlb_data$war, trControl = xgboost_tune_control, tuneGrid = xgboost_tune_grid, objective = "reg:squarederror", method = "xgbTree", verbose = TRUE) xgb_tune$bestTune ``` ``` ## nrounds max_depth eta gamma colsample_bytree ## 130 200 1 0.3 0 1 ## min_child_weight subsample ## 130 1 1 ``` --- ## XGBoost example ```r xgb_fit_final <- xgboost(data = as.matrix(dplyr::select(model_mlb_data, -war)), label = model_mlb_data$war, objective = "reg:squarederror", nrounds = xgb_tune$bestTune$nrounds, params = as.list(dplyr::select(xgb_tune$bestTune, -nrounds)), verbose = 0) vip(xgb_fit_final) + theme_bw() ``` <img src="21-Tree-ensembles_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" /> --- ## XGBoost example ```r library(pdp) partial(xgb_fit_final, pred.var = "off", train = as.matrix(dplyr::select(model_mlb_data, -war)), plot.engine = "ggplot2", plot = TRUE, type = "regression") + theme_bw() ``` <img src="21-Tree-ensembles_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />