class: center, middle, inverse, title-slide

# Supervised Learning
## Smoothing splines and GAMs
### Ron Yurko
### 07/09/2020

---

## Kernel regression

__Nadaraya-Watson kernel regression__

- given training data with explanatory variable `\(x\)` and continuous response `\(y\)`

- _bandwidth_ `\(h > 0\)`

- and a new point `\((x_{new}, y_{new})\)`:

`$$\hat{y}_{new} = \sum_{i=1}^n w_i(x_{new})\cdot y_i \,,$$`

where

`$$w_i(x_{new}) = \frac{K_h\left(|x_{new}-x_i|\right)}{\sum_{j=1}^n K_h\left(|x_{new}-x_j|\right)} \text{ with } K_h(x) = K\left(\frac{x}{h}\right)$$`

Example of a __linear smoother__

- a class of models where predictions are _weighted_ sums of the response variable

---

## Local regression

We can fit a linear model __at each point `\(x_{new}\)`__ with weights given by a kernel function centered on `\(x_{new}\)`

- we can additionally combine this with _polynomial regression_

--

Local regression of the `\(k^{th}\)` order with kernel function `\(K\)` solves the following:

`$$\hat{\beta}(x_{new}) = \underset{\beta}{\text{arg min}}\Big\{ \sum_i K_h(|x_{new} - x_i|) \cdot (y_i - \sum_{j=0}^k x_i^j \cdot \beta_j )^2 \Big\}$$`

--

__Yes, this means every single point `\(x_{new}\)` we predict at has its own set of coefficients__

--

The predicted value is then:

`$$\hat{y}_{new} = \sum_{j=0}^k x_{new}^j \cdot \hat{\beta}_j(x_{new})$$`

--

Predictions are smoother than kernel regression, but this comes at a __higher computational cost__

- __LOESS__ replaces the kernel with the k nearest neighbors

- faster than local regression, but the fit has discontinuities when the neighbor set changes
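---

## Sketch: Nadaraya-Watson weights in R

To make the weighted-sum view concrete, here is a minimal sketch of kernel regression at a single new point. The helper name `nw_predict()`, the Gaussian kernel via `dnorm()`, and the toy data are illustrative choices, not code from this lecture:

```r
# Nadaraya-Watson kernel regression at a single point x_new
nw_predict <- function(x_new, x, y, h) {
  # kernel weights K_h(|x_new - x_i|), here a Gaussian kernel
  w <- dnorm(abs(x_new - x) / h)
  # normalized weighted sum of the observed responses
  sum(w * y) / sum(w)
}

# toy data: noisy sine curve
set.seed(1)
x <- runif(100, 0, 2 * pi)
y <- sin(x) + rnorm(100, sd = 0.3)
nw_predict(x_new = pi / 2, x = x, y = y, h = 0.5)
```

Note that a larger bandwidth `\(h\)` spreads weight over more training points and gives smoother predictions.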
---

## Smoothing splines

Let `\(s(x)\)` be a __smooth function__ to predict `\(y\)`

- we control smoothness directly by minimizing the __spline objective function__:

`$$\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int (s''(x))^2 dx$$`

--

`$$= \text{fit data} + \text{impose smoothness}$$`

--

`$$\Rightarrow \text{model fit} = \text{likelihood} - \lambda \cdot \text{wiggliness}$$`

Estimate the __smoothing spline__ `\(\hat{s}(x)\)` that __balances the tradeoff between the model fit and wiggliness__

--

<img src="https://github.com/noamross/gams-in-r-course/blob/master/images/diffsmooth-1.png?raw=true" width="70%" style="display: block; margin: auto;" />

---

## Basis functions

Splines are _piecewise cubic polynomials_, with __knots__ (the points where the cubic pieces join) at every data point

--

Practical alternative: write the function as a linear combination of a set of __basis functions__

--

__Cubic polynomial example__: define four basis functions:

- `\(B_1(x) = 1\)`, `\(B_2(x) = x\)`, `\(B_3(x) = x^2\)`, `\(B_4(x) = x^3\)`

where the regression function `\(r(x)\)` is written as:

`$$r(x) = \sum_{j=1}^{4} \beta_j B_j(x)$$`

--

- __linear in the transformed variables__ `\(B_1(x), \dots, B_4(x)\)` but __nonlinear in `\(x\)`__

--

We extend this idea to splines by defining the basis functions _piecewise_ (using indicator functions for the regions between knots), so the spline is still a weighted sum:

`$$s(x) = \sum_{j=1}^{m} \beta_j B_j(x)$$`

---

## Number of basis functions is another tuning parameter

<img src="https://github.com/noamross/gams-in-r-course/blob/master/images/diffbasis-1.png?raw=true" width="80%" style="display: block; margin: auto;" />
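---

## Sketch: two ways to fit splines in R

A minimal sketch of both ideas above, on made-up data (the object names and the toy noisy sine curve are for illustration only): `splines::bs()` builds a cubic B-spline basis for use in `lm()`, while `smooth.spline()` fits the penalized smoothing spline and picks `\(\lambda\)` by (generalized) cross-validation.

```r
library(splines)

# toy data: noisy sine curve
set.seed(2)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.3)

# (1) basis-function view: linear regression on a cubic B-spline basis,
#     df controls how many basis functions are used
basis_fit <- lm(y ~ bs(x, df = 8))

# (2) penalty view: smoothing spline with lambda chosen automatically
penalty_fit <- smooth.spline(x, y)
penalty_fit$lambda  # the selected smoothing parameter

# force rougher / smoother fits by hand with the spar argument
rough_fit  <- smooth.spline(x, y, spar = 0.2)
smooth_fit <- smooth.spline(x, y, spar = 1)
```

Either route gives a fit that is linear in the basis coefficients but nonlinear in `\(x\)`, exactly as on the previous slides.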
---

## Generalized additive models (GAMs)

GAMs were introduced by [Trevor Hastie and Rob Tibshirani in 1986](https://projecteuclid.org/euclid.ss/1177013604) with an intuitive construction:

- relationships between individual explanatory variables and the response variable are smooth (either linear or nonlinear via basis functions)

- estimate the smooth relationships __simultaneously__ to predict the response by just adding them up

__Generalized__ like GLMs, where `\(g()\)` is the link function for the expected value of the response `\(E(Y)\)`, and __additive__ over the `\(p\)` variables:

`$$g(E(Y)) = \beta_0 + s_1(x_1) + s_2(x_2) + \dots + s_p(x_p)$$`

<img src="https://multithreaded.stitchfix.com/assets/images/blog/fig1.svg" width="70%" style="display: block; margin: auto;" />

--

- can be a convenient balance between flexibility and interpretability

- you can combine linear and nonlinear terms!

---

## Example: predicting MLB HR probability

Used the [`baseballr`](http://billpetti.github.io/baseballr/) package to scrape all batted balls from the 2019 season:

```r
library(tidyverse)
batted_ball_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/mlb_2019_batted_balls.csv") %>%
  mutate(is_hr = as.numeric(events == "home_run"))
head(batted_ball_data)
```

```
## # A tibble: 6 x 4
##   events    launch_speed launch_angle is_hr
##   <chr>            <dbl>        <dbl> <dbl>
## 1 field_out           89           38     0
## 2 field_out           89           38     0
## 3 force_out           84          -13     0
## 4 field_out           89           38     0
## 5 single              90           15     0
## 6 single              90           15     0
```

```r
table(batted_ball_data$is_hr)
```

```
## 
##      0      1 
## 120661   6871
```

---

## Fitting GAMs with [`mgcv`](https://cran.r-project.org/web/packages/mgcv/index.html)

First, set up the training data:

```r
set.seed(2004)
batted_ball_data <- batted_ball_data %>%
  mutate(is_train = sample(rep(0:1, length.out = nrow(batted_ball_data))))
```

Next, fit the initial logistic GAM using smooth functions via `s()`:

```r
library(mgcv)
init_logit_gam <- gam(is_hr ~ s(launch_speed) + s(launch_angle),
                      data = filter(batted_ball_data, is_train == 1),
                      family = binomial)
```

---

## Visualizing partial response functions

```r
plot(init_logit_gam, pages = 1)
```

<img src="16-splines-gams_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />

---

## Convert to probability scale

```r
plot(init_logit_gam, pages = 1, trans = plogis)
```

<img src="16-splines-gams_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />

---

## Model checking for number of basis functions

Use `gam.check()` to see if we need more basis functions based on an approximate test (a sketch of refitting with a larger `k` is in the backup slides at the end):

```r
gam.check(init_logit_gam)
```

```
## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 10 iterations.
## Gradient range [-4.058613e-07,-5.76112e-08]
## (score -0.8744526 & scale 1).
## Hessian positive definite, eigenvalue range [5.742399e-08,0.001000449].
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                   k'  edf k-index p-value
## s(launch_speed) 9.00 1.00    1.00    0.65
## s(launch_angle) 9.00 4.43    1.01    0.81
```

---

## Check the predictions?

```r
batted_ball_data <- batted_ball_data %>%
  mutate(init_gam_hr_prob = predict(init_logit_gam, newdata = batted_ball_data,
                                    type = "response"),
         init_gam_hr_class = as.numeric(init_gam_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == init_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.971
## 2        1   0.973
```

---

## What about the linear model?

```r
init_linear_logit <- glm(is_hr ~ launch_speed + launch_angle,
                         data = filter(batted_ball_data, is_train == 1),
                         family = binomial)
batted_ball_data <- batted_ball_data %>%
  mutate(init_glm_hr_prob = predict(init_linear_logit, newdata = batted_ball_data,
                                    type = "response"),
         init_glm_hr_class = as.numeric(init_glm_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == init_glm_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.949
## 2        1   0.950
```

__There are very few situations in practice where a linear regression does better than an additive model using smooth functions__

---

## Continuous interactions as a smooth surface

```r
multi_logit_gam <- gam(is_hr ~ s(launch_speed, launch_angle),
                       data = filter(batted_ball_data, is_train == 1),
                       family = binomial)
```

Plot the predicted heatmap (a tensor product alternative via `te()` is sketched in the backup slides):

```r
plot(multi_logit_gam, scheme = 2)
```

<img src="16-splines-gams_files/figure-html/unnamed-chunk-9-1.png" width="504" style="display: block; margin: auto;" />

---

## Check the predictions?

```r
batted_ball_data <- batted_ball_data %>%
  mutate(multi_gam_hr_prob = predict(multi_logit_gam, newdata = batted_ball_data,
                                     type = "response"),
         multi_gam_hr_class = as.numeric(multi_gam_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == multi_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.971
## 2        1   0.973
```

__A more complicated model, yet it does not help!__

---

## Useful resources

- Free online course: [GAMs in R by Noam Ross](https://noamross.github.io/gams-in-r-course/)

- [Stitch Fix post: GAM: The Predictive Modeling Silver Bullet](https://multithreaded.stitchfix.com/blog/2015/07/30/gam/)

- [Gavin Simpson's `gratia` package](https://gavinsimpson.github.io/gratia/index.html) for `ggplot2`-based visualizations

- Chapters 7 and 8 of [Advanced Data Analysis from an Elementary Point of View by Prof Cosma Shalizi](https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf)

  - I strongly recommend you download this book; you will refer back to it for the rest of your life
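---

## Backup: increasing the basis dimension

This backup slide is a sketch added for illustration (the model name `bigger_k_gam` and the choice `k = 20` are arbitrary, not results we examined). If `gam.check()` had reported low p-values with `edf` close to `k'`, we could give each smooth more basis functions via the `k` argument of `s()` and re-run the check:

```r
library(mgcv)
# same structure as init_logit_gam, but with room for wigglier smooths
bigger_k_gam <- gam(is_hr ~ s(launch_speed, k = 20) + s(launch_angle, k = 20),
                    data = filter(batted_ball_data, is_train == 1),
                    family = binomial)
gam.check(bigger_k_gam)
```

The penalty `\(\lambda\)` still controls the effective wiggliness; `k` only sets an upper bound on it, so a generous `k` mostly costs extra computation.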
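---

## Backup: tensor product smooths for interactions

Another added sketch, not part of the original analysis: `s(launch_speed, launch_angle)` is an isotropic smooth, which implicitly treats its inputs as being on comparable scales. `mgcv` also offers tensor product smooths via `te()` (or `ti()` for the interaction alone), which allow different amounts of smoothness for variables measured in different units. Whether this would change the results here was not checked; the name `te_logit_gam` is just for illustration:

```r
library(mgcv)
# tensor product smooth: separate penalty for each margin
te_logit_gam <- gam(is_hr ~ te(launch_speed, launch_angle),
                    data = filter(batted_ball_data, is_train == 1),
                    family = binomial)
plot(te_logit_gam, scheme = 2)
```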