class: center, middle, inverse, title-slide

# Machine learning
## Tree-based models
### Ron Yurko
### 07/13/2020

---

## What is Machine Learning?

--

The short version:

- Machine learning (ML) is a subset of statistical learning that focuses on prediction.

--

The longer version:

- ML is the idea of constructing data-driven algorithms that *learn* the mapping between predictor variables and response variable(s).

- We do not assume a parametric form for the mapping *a priori*, even if technically one can write one down *a posteriori* (e.g., by translating a tree model into an indicator-variable mathematical expression)

- Linear regression, for instance, is not an ML algorithm since we can write down the linear equation ahead of time, but random forest is an ML algorithm since we have no idea how many splits will end up in each of its individual trees.

---

## Which algorithm is best?

__That's not the right question to ask.__

(And the answer is *not* deep learning. Because if the underlying relationship between your predictors and your response is truly linear, *you do not need to apply deep learning*! Just do linear regression. Really. It's OK.)

--

The right question to ask is: __why should I try different algorithms?__

--

The answer is that without superhuman powers, you cannot visualize the distribution of the predictor variables in their native space.

- Of course, you can visualize these data *in projection*, for instance when we perform EDA

- And the performance of different algorithms will depend on how the predictor data are distributed...

---

## Data geometry

<img src="http://www.stat.cmu.edu/~pfreeman/data_geometry.png" width="70%" style="display: block; margin: auto;" />

The picture above shows data with two predictor variables (along the x-axis and the y-axis) and a binary response variable: x's and o's.

An algorithm that utilizes linear boundaries or segments the plane into rectangles will do well given the data on the left, whereas an algorithm that utilizes circular boundaries will fare better given the data on the right.

"Do well" / "fare better" here means: do a better job of predicting whether a new observation is actually an x or an o.

---

## Decision trees

Decision trees partition training data into __homogeneous nodes / subgroups__ with similar response values

--

The subgroups are found __recursively using binary partitions__

- i.e., by asking a series of yes-no questions about the predictor variables

We stop splitting the tree once a __stopping criterion__ has been reached (e.g., maximum depth allowed)

--

For each subgroup / node, predictions are made with:

- Regression tree: __the average of the response values__ in the node

- Classification tree: __the most popular class__ in the node

--

The most popular approach is Leo Breiman's __C__lassification __A__nd __R__egression __T__ree (CART) algorithm

---

## Decision tree structure

<img src="https://bradleyboehmke.github.io/HOML/images/decision-tree-terminology.png" width="100%" style="display: block; margin: auto;" />

---

## Decision tree structure

We make a prediction for an observation by __following its path along the tree__

<img src="https://bradleyboehmke.github.io/HOML/images/exemplar-decision-tree.png" width="100%" style="display: block; margin: auto;" />

--

- Decision trees are __very easy to explain__ to non-statisticians.

- Easy to visualize and thus easy to interpret __without assuming a parametric form__
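---

## Aside: what a node predicts (sketch)

To make the prediction rule concrete, here is a minimal sketch on a small simulated dataset (the variables `x`, `y_num`, and `y_cat` are made up for illustration and are not part of the lecture examples): once an observation lands in a terminal node, a regression tree predicts the node's average response and a classification tree predicts the node's most popular class.


```r
# Hypothetical toy data, only to illustrate node-level predictions
library(tidyverse)
set.seed(1)
toy <- tibble(x = runif(100),
              y_num = 3 * (x > 0.5) + rnorm(100),  # numeric response
              y_cat = ifelse(x > 0.5, "o", "x"))   # categorical response

# Pretend a single split sends observations with x > 0.5 to one node
node <- toy %>% filter(x > 0.5)

# Regression tree prediction for this node: the average response value
mean(node$y_num)

# Classification tree prediction for this node: the most popular class
names(which.max(table(node$y_cat)))
```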
---

## Binary recursive partitioning

CART uses __recursive__ splits where each _split / rule_ depends on the previous split / rule _above_ it

--

__Objective at each split__: find the __best__ variable to partition the data into one of two regions, `\(R_1\)` & `\(R_2\)`, to __minimize the error__ between the actual response, `\(y_i\)`, and the region's predicted constant, `\(c_1\)` or `\(c_2\)`

- For regression we minimize the sum of squared errors (SSE):

`$$SSE = \sum_{i \in R_1}\left(y_i - c_1\right)^2 + \sum_{i \in R_2}\left(y_i - c_2\right)^2$$`

--

- For classification trees we minimize the node's _impurity_, measured by the __Gini index__

  - where `\(p_k\)` is the proportion of observations in the node belonging to class `\(k\)` out of `\(K\)` total classes

  - we want to minimize `\(Gini\)`: small values indicate a node has primarily one class (_is more pure_)

`$$Gini = 1 - \sum_k^K p_k^2$$`

--

Splits yield __locally optimal__ results, so we are NOT guaranteed to train a model that is globally optimal

--

_How do we control the complexity of the tree?_

---

## Tune the __maximum tree depth__ or __minimum node size__

<img src="https://bradleyboehmke.github.io/HOML/07-decision-trees_files/figure-html/dt-early-stopping-1.png" width="60%" style="display: block; margin: auto;" />

---

## Prune the tree by tuning __cost complexity__

We can grow a very large, complicated tree and then __prune__ it back to an optimal __subtree__ using a __cost complexity__ parameter `\(\alpha\)` (like `\(\lambda\)` for elastic net)

- `\(\alpha\)` penalizes the objective as a function of the number of __terminal nodes__

- e.g., we want to minimize `\(SSE + \alpha \cdot (\# \text{ of terminal nodes})\)`

<img src="https://bradleyboehmke.github.io/HOML/07-decision-trees_files/figure-html/pruned-tree-1.png" width="80%" style="display: block; margin: auto;" />

---

## Example data: MLB 2019 batting statistics

Downloaded the MLB 2019 batting statistics leaderboard from [Fangraphs](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2019&month=0&season1=2019&ind=0)


```r
library(tidyverse)
mlb_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/fg_batting_2019.csv") %>%
  janitor::clean_names() %>%
  mutate_at(vars(bb_percent:k_percent), parse_number)
head(mlb_data)
```

```
## # A tibble: 6 x 22
##   name  team      g    pa    hr     r   rbi    sb bb_percent k_percent   iso babip   avg   obp   slg w_oba  w_rc
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mike… Ange…   134   600    45   110   104    11       18.3      20   0.353 0.298 0.291 0.438 0.645 0.436   180
## 2 Alex… Astr…   156   690    41   122   112     5       17.2      12   0.296 0.281 0.296 0.423 0.592 0.418   168
## 3 Chri… Brew…   130   580    44   100    97    30       13.8      20.3 0.342 0.355 0.329 0.429 0.671 0.442   174
## 4 Cody… Dodg…   156   660    47   121   115    15       14.4      16.4 0.324 0.302 0.305 0.406 0.629 0.415   162
## 5 Marc… Athl…   162   747    33   123    92    10       11.6      13.7 0.237 0.294 0.285 0.369 0.522 0.373   137
## 6 Kete… Diam…   144   628    32    97    92    10        8.4      13.7 0.264 0.342 0.329 0.389 0.592 0.405   150
## # … with 5 more variables: bs_r <dbl>, off <dbl>, def <dbl>, war <dbl>, playerid <dbl>
```
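---

## Aside: hand-checking the SSE split criterion (sketch)

Before fitting a tree with `rpart`, here is a minimal sketch (not part of the original analysis) of the SSE objective from the binary recursive partitioning slide, computed directly on `mlb_data` for a single candidate split of `w_oba` on `iso`. The threshold 0.226 is the root split that `rpart` reports on the next slide.


```r
# SSE for splitting the data into iso < threshold vs. iso >= threshold,
# where each region predicts its mean response (the constants c_1 and c_2)
split_sse <- function(data, threshold) {
  left <- data %>% filter(iso < threshold)
  right <- data %>% filter(iso >= threshold)
  sum((left$w_oba - mean(left$w_oba))^2) +
    sum((right$w_oba - mean(right$w_oba))^2)
}

# SSE with no split at all (the root node)
sum((mlb_data$w_oba - mean(mlb_data$w_oba))^2)

# SSE after splitting at iso < 0.226
split_sse(mlb_data, threshold = 0.226)
```

These are the same quantities that appear in the `deviance` column of the `rpart` output on the next slide (roughly 0.151 at the root, and 0.043 + 0.039 for the two child nodes).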
---

## Regression tree example with the [`rpart` package](https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)

Revisit the modeling of `w_oba` from the [KNN slides](http://www.stat.cmu.edu/cmsac/sure/materials/lectures/slides/15-knn-kernel.html#21)


```r
library(rpart)
init_mlb_tree <- rpart(formula = w_oba ~ bb_percent + k_percent + iso,
                       data = mlb_data, method = "anova")
init_mlb_tree
```

```
## n= 135 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 135 0.151064900 0.3455778  
##    2) iso< 0.226 79 0.043023370 0.3265696  
##      4) iso< 0.1885 49 0.016439350 0.3158163  
##        8) iso< 0.1285 12 0.003165000 0.2975000 *
##        9) iso>=0.1285 37 0.007942811 0.3217568 *
##      5) iso>=0.1885 30 0.011663470 0.3441333 *
##    3) iso>=0.226 56 0.039231360 0.3723929  
##      6) iso< 0.294 47 0.021281320 0.3655957  
##       12) k_percent>=26.55 7 0.003192000 0.3420000 *
##       13) k_percent< 26.55 40 0.013509980 0.3697250  
##         26) k_percent>=14.65 33 0.008832909 0.3658182  
##           52) iso< 0.2375 7 0.001612000 0.3510000 *
##           53) iso>=0.2375 26 0.005270038 0.3698077 *
##         27) k_percent< 14.65 7 0.001798857 0.3881429 *
##      7) iso>=0.294 9 0.004438889 0.4078889 *
```

---

## Display the tree with [`rpart.plot`](http://www.milbo.org/rpart-plot/)

.pull-left[


```r
library(rpart.plot)
rpart.plot(init_mlb_tree)
```

<img src="17-trees-rf_files/figure-html/plot-tree-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

- `rpart()` runs 10-fold CV to tune `\(\alpha\)` for pruning

- Selects the number of terminal nodes via the 1 SE rule


```r
plotcp(init_mlb_tree)
```

<img src="17-trees-rf_files/figure-html/plot-complexity-1.png" width="504" style="display: block; margin: auto;" />
]

---

## What about the full tree? (check out `rpart.control`)

.pull-left[


```r
full_mlb_tree <- rpart(formula = w_oba ~ bb_percent + k_percent + iso,
                       data = mlb_data, method = "anova",
                       control = list(cp = 0, xval = 10))
rpart.plot(full_mlb_tree)
```

<img src="17-trees-rf_files/figure-html/plot-full-tree-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[


```r
plotcp(full_mlb_tree)
```

<img src="17-trees-rf_files/figure-html/plot-full-complexity-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Train with `caret`


```r
library(caret)
caret_mlb_tree <- train(w_oba ~ bb_percent + k_percent + iso + avg + obp + slg + war,
                        data = mlb_data, method = "rpart",
                        trControl = trainControl(method = "cv", number = 10),
                        tuneLength = 20)

ggplot(caret_mlb_tree) + theme_bw()
```

<img src="17-trees-rf_files/figure-html/caret-tree-1.png" width="504" style="display: block; margin: auto;" />

---

## Display the final model


```r
rpart.plot(caret_mlb_tree$finalModel)
```

<img src="17-trees-rf_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />

---

## Preview of summarizing variables in tree-based models

.pull-left[

__Variable importance__: based on reduction in SSE (_notice anything odd?_)


```r
library(vip)
vip(caret_mlb_tree, bar = FALSE) + theme_bw()
```

<img src="17-trees-rf_files/figure-html/var-imp-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

- Summarize a single variable's relationship with the response using a __partial dependence plot__


```r
library(pdp)
partial(caret_mlb_tree, pred.var = "obp") %>%
  autoplot() + theme_bw()
```

<img src="17-trees-rf_files/figure-html/pdp-1.png" width="504" style="display: block; margin: auto;" />
]
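---

## Aside: partial dependence by hand (sketch)

To see what `partial()` is computing, here is a minimal sketch (an illustration of the general partial dependence recipe, not the `pdp` package's exact implementation) using the `caret` model from above: hold `obp` fixed at each value in a grid, predict for every row of `mlb_data`, and average the predictions.


```r
# Manual partial dependence of the fitted tree's predictions on obp
# (assumes caret_mlb_tree and mlb_data from the earlier slides exist)
obp_grid <- seq(min(mlb_data$obp), max(mlb_data$obp), length.out = 20)

manual_pdp <- map_dfr(obp_grid, function(obp_value) {
  data_copy <- mutate(mlb_data, obp = obp_value)  # hold obp fixed for all rows
  tibble(obp = obp_value,
         yhat = mean(predict(caret_mlb_tree, newdata = data_copy)))
})

ggplot(manual_pdp, aes(x = obp, y = yhat)) +
  geom_line() +
  theme_bw()
```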