class: center, middle, inverse, title-slide

# Machine learning
## Tree-based models
### Ron Yurko
### 07/13/2020

---

## What is Machine Learning?

--

The short version:

- Machine learning (ML) is a subset of statistical learning that focuses on prediction.

--

The longer version:

- ML is the idea of constructing data-driven algorithms that *learn* the mapping between predictor variables and response variable(s).

- We do not assume a parametric form for the mapping *a priori*, even if technically one can write one down *a posteriori* (e.g., by translating a tree model into an indicator-variable mathematical expression)

- Linear regression, for instance, is not an ML algorithm since we can write down the linear equation ahead of time, but random forest is an ML algorithm since we have no idea how many splits will end up in each of its individual trees.

---

## Which algorithm is best?

__That's not the right question to ask.__

(And the answer is *not* deep learning. Because if the underlying relationship between your predictors and your response is truly linear, *you do not need to apply deep learning*! Just do linear regression. Really. It's OK.)

--

The right question to ask is: __why should I try different algorithms?__

--

The answer is that without superhuman powers, you cannot visualize the distribution of the predictor variables in their native space.

- Of course, you can visualize these data *in projection*, for instance when we perform EDA

- And the performance of different algorithms will depend on how the predictor data are distributed...

---

## Data geometry

<img src="http://www.stat.cmu.edu/~pfreeman/data_geometry.png" width="70%" style="display: block; margin: auto;" />

The picture above shows data with two predictor variables (along the x-axis and the y-axis) and a binary response variable: x's and o's.

An algorithm that utilizes linear boundaries or segments the plane into rectangles will do well given the data on the left, whereas an algorithm that utilizes circular boundaries will fare better given the data on the right.

"Do well" / "fare better" here means: do a better job of predicting whether a new observation is actually an x or an o.

---

## Decision trees

Decision trees partition training data into __homogeneous nodes / subgroups__ with similar response values

--

The subgroups are found __recursively using binary partitions__

- i.e., by asking a series of yes-no questions about the predictor variables

We stop splitting the tree once a __stopping criterion__ has been reached (e.g., maximum depth allowed)

--

For each subgroup / node, predictions are made with:

- Regression tree: __the average of the response values__ in the node

- Classification tree: __the most popular class__ in the node

--

The most popular approach is Leo Breiman's __C__lassification __A__nd __R__egression __T__ree (CART) algorithm

---

## Decision tree structure

<img src="https://bradleyboehmke.github.io/HOML/images/decision-tree-terminology.png" width="100%" style="display: block; margin: auto;" />

---

## Decision tree structure

We make a prediction for an observation by __following its path along the tree__

<img src="https://bradleyboehmke.github.io/HOML/images/exemplar-decision-tree.png" width="100%" style="display: block; margin: auto;" />

--

- Decision trees are __very easy to explain__ to non-statisticians.

- Easy to visualize and thus easy to interpret __without assuming a parametric form__
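---

## Aside: what a node predicts (sketch)

To make the prediction rule concrete, here is a minimal sketch on a small simulated dataset (the variables `x`, `y_num`, and `y_cat` are made up for illustration and are not part of the lecture examples): once an observation lands in a terminal node, a regression tree predicts the node's average response and a classification tree predicts the node's most popular class.


```r
# Hypothetical toy data, only to illustrate node-level predictions
library(tidyverse)
set.seed(1)
toy <- tibble(x = runif(100),
              y_num = 3 * (x > 0.5) + rnorm(100),  # numeric response
              y_cat = ifelse(x > 0.5, "o", "x"))   # categorical response

# Pretend a single split sends observations with x > 0.5 to one node
node <- toy %>% filter(x > 0.5)

# Regression tree prediction for this node: the average response value
mean(node$y_num)

# Classification tree prediction for this node: the most popular class
names(which.max(table(node$y_cat)))
```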
---

## Binary recursive partitioning

CART uses __recursive__ splits where each _split / rule_ depends on the previous split / rule _above_ it

--

__Objective at each split__: find the __best__ variable to partition the data into one of two regions, `\(R_1\)` & `\(R_2\)`, to __minimize the error__ between the actual response, `\(y_i\)`, and the region's predicted constant, `\(c_1\)` or `\(c_2\)`

- For regression we minimize the sum of squared errors (SSE):

`$$SSE = \sum_{i \in R_1}\left(y_i - c_1\right)^2 + \sum_{i \in R_2}\left(y_i - c_2\right)^2$$`

--

- For classification trees we minimize the node's _impurity_, measured by the __Gini index__

  - where `\(p_k\)` is the proportion of observations in the node belonging to class `\(k\)` out of `\(K\)` total classes

  - we want to minimize `\(Gini\)`: small values indicate a node has primarily one class (_is more pure_)

`$$Gini = 1 - \sum_k^K p_k^2$$`

--

Splits yield __locally optimal__ results, so we are NOT guaranteed to train a model that is globally optimal

--

_How do we control the complexity of the tree?_

---

## Tune the __maximum tree depth__ or __minimum node size__

<img src="https://bradleyboehmke.github.io/HOML/07-decision-trees_files/figure-html/dt-early-stopping-1.png" width="60%" style="display: block; margin: auto;" />

---

## Prune the tree by tuning __cost complexity__

We can grow a very large, complicated tree and then __prune__ it back to an optimal __subtree__ using a __cost complexity__ parameter `\(\alpha\)` (like `\(\lambda\)` for elastic net)

- `\(\alpha\)` penalizes the objective as a function of the number of __terminal nodes__

- e.g., we want to minimize `\(SSE + \alpha \cdot (\# \text{ of terminal nodes})\)`

<img src="https://bradleyboehmke.github.io/HOML/07-decision-trees_files/figure-html/pruned-tree-1.png" width="80%" style="display: block; margin: auto;" />

---

## Example data: MLB 2019 batting statistics

Downloaded the MLB 2019 batting statistics leaderboard from [Fangraphs](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2019&month=0&season1=2019&ind=0)


```r
library(tidyverse)
mlb_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/fg_batting_2019.csv") %>%
  janitor::clean_names() %>%
  mutate_at(vars(bb_percent:k_percent), parse_number)
head(mlb_data)
```

```
## # A tibble: 6 x 22
##   name  team      g    pa    hr     r   rbi    sb bb_percent k_percent   iso babip   avg   obp   slg w_oba  w_rc
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mike… Ange…   134   600    45   110   104    11       18.3      20   0.353 0.298 0.291 0.438 0.645 0.436   180
## 2 Alex… Astr…   156   690    41   122   112     5       17.2      12   0.296 0.281 0.296 0.423 0.592 0.418   168
## 3 Chri… Brew…   130   580    44   100    97    30       13.8      20.3 0.342 0.355 0.329 0.429 0.671 0.442   174
## 4 Cody… Dodg…   156   660    47   121   115    15       14.4      16.4 0.324 0.302 0.305 0.406 0.629 0.415   162
## 5 Marc… Athl…   162   747    33   123    92    10       11.6      13.7 0.237 0.294 0.285 0.369 0.522 0.373   137
## 6 Kete… Diam…   144   628    32    97    92    10        8.4      13.7 0.264 0.342 0.329 0.389 0.592 0.405   150
## # … with 5 more variables: bs_r <dbl>, off <dbl>, def <dbl>, war <dbl>, playerid <dbl>
```
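---

## Aside: hand-checking the SSE split criterion (sketch)

Before fitting a tree with `rpart`, here is a minimal sketch (not part of the original analysis) of the SSE objective from the binary recursive partitioning slide, computed directly on `mlb_data` for a single candidate split of `w_oba` on `iso`. The threshold 0.226 is the root split that `rpart` reports on the next slide.


```r
# SSE for splitting the data into iso < threshold vs. iso >= threshold,
# where each region predicts its mean response (the constants c_1 and c_2)
split_sse <- function(data, threshold) {
  left <- data %>% filter(iso < threshold)
  right <- data %>% filter(iso >= threshold)
  sum((left$w_oba - mean(left$w_oba))^2) +
    sum((right$w_oba - mean(right$w_oba))^2)
}

# SSE with no split at all (the root node)
sum((mlb_data$w_oba - mean(mlb_data$w_oba))^2)

# SSE after splitting at iso < 0.226
split_sse(mlb_data, threshold = 0.226)
```

These are the same quantities that appear in the `deviance` column of the `rpart` output on the next slide (roughly 0.151 at the root, and 0.043 + 0.039 for the two child nodes).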
---

## Regression tree example with the [`rpart` package](https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)

Revisit the modeling of `w_oba` from the [KNN slides](http://www.stat.cmu.edu/cmsac/sure/materials/lectures/slides/15-knn-kernel.html#21)


```r
library(rpart)
init_mlb_tree <- rpart(formula = w_oba ~ bb_percent + k_percent + iso,
                       data = mlb_data, method = "anova")
init_mlb_tree
```

```
## n= 135 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 135 0.151064900 0.3455778  
##    2) iso< 0.226 79 0.043023370 0.3265696  
##      4) iso< 0.1885 49 0.016439350 0.3158163  
##        8) iso< 0.1285 12 0.003165000 0.2975000 *
##        9) iso>=0.1285 37 0.007942811 0.3217568 *
##      5) iso>=0.1885 30 0.011663470 0.3441333 *
##    3) iso>=0.226 56 0.039231360 0.3723929  
##      6) iso< 0.294 47 0.021281320 0.3655957  
##       12) k_percent>=26.55 7 0.003192000 0.3420000 *
##       13) k_percent< 26.55 40 0.013509980 0.3697250  
##         26) k_percent>=14.65 33 0.008832909 0.3658182  
##           52) iso< 0.2375 7 0.001612000 0.3510000 *
##           53) iso>=0.2375 26 0.005270038 0.3698077 *
##         27) k_percent< 14.65 7 0.001798857 0.3881429 *
##      7) iso>=0.294 9 0.004438889 0.4078889 *
```

---

## Display the tree with [`rpart.plot`](http://www.milbo.org/rpart-plot/)

.pull-left[


```r
library(rpart.plot)
rpart.plot(init_mlb_tree)
```

<img src="17-trees-rf_files/figure-html/plot-tree-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

- `rpart()` runs 10-fold CV to tune `\(\alpha\)` for pruning

- Selects the number of terminal nodes via the 1 SE rule


```r
plotcp(init_mlb_tree)
```

<img src="17-trees-rf_files/figure-html/plot-complexity-1.png" width="504" style="display: block; margin: auto;" />
]

---

## What about the full tree? (check out `rpart.control`)

.pull-left[


```r
full_mlb_tree <- rpart(formula = w_oba ~ bb_percent + k_percent + iso,
                       data = mlb_data, method = "anova",
                       control = list(cp = 0, xval = 10))
rpart.plot(full_mlb_tree)
```

<img src="17-trees-rf_files/figure-html/plot-full-tree-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[


```r
plotcp(full_mlb_tree)
```

<img src="17-trees-rf_files/figure-html/plot-full-complexity-1.png" width="504" style="display: block; margin: auto;" />
]

---

## Train with `caret`


```r
library(caret)
caret_mlb_tree <- train(w_oba ~ bb_percent + k_percent + iso + avg + obp + slg + war,
                        data = mlb_data, method = "rpart",
                        trControl = trainControl(method = "cv", number = 10),
                        tuneLength = 20)

ggplot(caret_mlb_tree) + theme_bw()
```

<img src="17-trees-rf_files/figure-html/caret-tree-1.png" width="504" style="display: block; margin: auto;" />

---

## Display the final model


```r
rpart.plot(caret_mlb_tree$finalModel)
```

<img src="17-trees-rf_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />

---

## Preview of summarizing variables in tree-based models

.pull-left[

__Variable importance__: based on reduction in SSE (_notice anything odd?_)


```r
library(vip)
vip(caret_mlb_tree, bar = FALSE) + theme_bw()
```

<img src="17-trees-rf_files/figure-html/var-imp-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

- Summarize a single variable's relationship with the response using a __partial dependence plot__


```r
library(pdp)
partial(caret_mlb_tree, pred.var = "obp") %>%
  autoplot() + theme_bw()
```

<img src="17-trees-rf_files/figure-html/pdp-1.png" width="504" style="display: block; margin: auto;" />
]
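---

## Aside: partial dependence by hand (sketch)

To see what `partial()` is computing, here is a minimal sketch (an illustration of the general partial dependence recipe, not the `pdp` package's exact implementation) using the `caret` model from above: hold `obp` fixed at each value in a grid, predict for every row of `mlb_data`, and average the predictions.


```r
# Manual partial dependence of the fitted tree's predictions on obp
# (assumes caret_mlb_tree and mlb_data from the earlier slides exist)
obp_grid <- seq(min(mlb_data$obp), max(mlb_data$obp), length.out = 20)

manual_pdp <- map_dfr(obp_grid, function(obp_value) {
  data_copy <- mutate(mlb_data, obp = obp_value)  # hold obp fixed for all rows
  tibble(obp = obp_value,
         yhat = mean(predict(caret_mlb_tree, newdata = data_copy)))
})

ggplot(manual_pdp, aes(x = obp, y = yhat)) +
  geom_line() +
  theme_bw()
```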