class: center, middle, inverse, title-slide

# Supervised Learning
## From nonparametric regression to GAMs
### July 14th, 2022

---

## Model flexibility vs interpretability

[Figure 2.7, Introduction to Statistical Learning with Applications in R (ISLR)](https://www.statlearning.com/)

<img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" />

__Tradeoff__ between a model's _flexibility_ (i.e. how "curvy" it is) and how __interpretable__ it is

- The simpler the parametric form of the model `\(\Rightarrow\)` the easier it is to interpret

---

## Model flexibility vs interpretability

<img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" />

- __Parametric__ models, for which we can write down a mathematical expression for `\(f(X)\)` __before observing the data__, _a priori_ (e.g. linear regression), __are inherently less flexible__

--

- __Nonparametric__ models, in which `\(f(X)\)` is __estimated from the data__ (e.g. kernel regression), __are more flexible__

---

## K Nearest Neighbors (KNN)

- Find the `\(k\)` data points __closest__ to an observation `\(x\)`, use these to predict

- Need to use some measure of distance, e.g., Euclidean distance

--

- KNN is data-driven, but we can actually write down the model _a priori_

--

- Regression:

$$ {\hat Y} \vert X = \frac{1}{k} \sum_{i=1}^k Y_i \,, $$

- Classification:

$$ \hat{P}[Y = j \vert X] = \frac{1}{k} \sum_{i=1}^k 1(Y_i = j) \,, $$

- `\(1(\cdot)\)` is the indicator function: returns 1 if TRUE, and 0 otherwise.

- Summation yields the proportion of neighbors that are of class `\(j\)`

---

## Finding the optimal number of neighbors `\(k\)`

__The number of neighbors `\(k\)` is a tuning parameter__ (like `\(\lambda\)` is for ridge / lasso)

--

Determining the optimal value of `\(k\)` requires balancing bias and variance:

- If `\(k\)` is too small, the resulting model is *too flexible*:
  - low bias (it is right on average... if we apply KNN to an infinite number of datasets sampled from the same parent population)
  - high variance (the predictions have a large spread in values when we apply KNN to those infinite datasets); see the panels to the left on the next slide

--

- If `\(k\)` is too large, the resulting model is *not flexible enough*:
  - high bias (wrong on average) and
  - low variance (nearly the same predictions, every time); see the panels to the right on the next slide

---

## Finding the optimal number of neighbors `\(k\)`

<img src="http://www.stat.cmu.edu/~pfreeman/Fig_3.16.png" width="40%" style="display: block; margin: auto;" />

<img src="http://www.stat.cmu.edu/~pfreeman/Fig_2.16.png" width="40%" style="display: block; margin: auto;" />

(Figures 3.16 [top] and 2.16 [bottom], *Introduction to Statistical Learning* by James et al.)

---

## What does KNN remind you of?...

<img src="https://media1.giphy.com/media/12Gyz2J1b9SjD2/200w.gif" width="40%" style="display: block; margin: auto;" />
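---

### KNN regression by hand (sketch)

To make the averaging formula concrete, here is a minimal sketch of KNN regression in base `R` with a single predictor; `knn_predict` and the toy data are made up purely for illustration.

```r
# toy helper: predict y at x_new by averaging the k nearest responses
knn_predict <- function(x_new, x, y, k = 5) {
  nearest <- order(abs(x - x_new))[1:k]   # Euclidean distance in 1D = absolute difference
  mean(y[nearest])                        # average of the k nearest Y_i
}

set.seed(100)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
knn_predict(0.5, x, y, k = 10)
```

For classification we would replace `mean(y[nearest])` with the proportion of neighbors in each class.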
---

## Kernels

A kernel `\(K(x)\)` is a weighting function used in estimators, and technically has only one required property:

- `\(K(x) \geq 0\)` for all `\(x\)`

However, in the manner that kernels are used in statistics, there are two other properties that are usually satisfied:

- `\(\int_{-\infty}^\infty K(x) dx = 1\)`; and

- `\(K(-x) = K(x)\)` for all `\(x\)`.

In short: __a kernel is a symmetric PDF!__

---

## Kernel density estimation

__Goal__: estimate the PDF `\(f(x)\)` for all possible values (assuming it is continuous / smooth)

--

$$ \text{Kernel density estimate: } \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) $$

--

- `\(n =\)` sample size, `\(x =\)` new point at which to estimate `\(f(x)\)` (does NOT have to be in the dataset!)

--

- `\(h =\)` __bandwidth__, analogous to histogram bin width, ensures `\(\hat{f}(x)\)` integrates to 1

- `\(x_i =\)` `\(i\)`th observation in the dataset

--

- `\(K_h(x - x_i)\)` is the __kernel__ function, which creates a __weight__ given the distance of the `\(i\)`th observation from the new point

  - as `\(|x - x_i| \rightarrow \infty\)` then `\(K_h(x - x_i) \rightarrow 0\)`, i.e. the further the `\(i\)`th observation is from `\(x\)`, the smaller the weight
  - as the __bandwidth__ `\(h \uparrow\)` weights are more evenly spread out (as `\(h \downarrow\)` weights are more concentrated around `\(x\)`)
  - typically use the [__Gaussian__ / Normal](https://en.wikipedia.org/wiki/Normal_distribution) kernel: `\(\propto e^{-(x - x_i)^2 / 2h^2}\)`
  - `\(K_h(x - x_i)\)` is large when `\(x_i\)` is close to `\(x\)`

---

## Commonly Used Kernels

<img src="http://www.stat.cmu.edu/~pfreeman/kernels.png" width="40%" style="display: block; margin: auto;" />

A general rule of thumb: the choice of kernel will have little effect on estimation, particularly if the sample size is large! The Gaussian kernel (i.e., a normal PDF) is by far the most common choice, and is the default for `R` functions that utilize kernels.

---

## Kernel regression

We can apply kernels in the regression setting as well as in the density estimation setting! The classic kernel regression estimator is the __Nadaraya-Watson__ estimator:

`$$\hat{y}_h(x) = \sum_{i=1}^n w_i(x) Y_i \,,$$`

where

`$$w_i(x) = \frac{K\left(\frac{x-X_i}{h}\right)}{\sum_{j=1}^n K\left(\frac{x-X_j}{h}\right)} \,.$$`

The regression estimate is a *weighted* average of the observed response values:

- The farther `\(x\)` is from an observation `\(\Rightarrow\)` the less weight that observation has in determining the regression estimate at `\(x\)`

---

## Kernel regression

__Nadaraya-Watson kernel regression__

- given training data with explanatory variable `\(x\)` and continuous response `\(y\)`

- _bandwidth_ `\(h > 0\)`

- and a new point `\((x_{new}, y_{new})\)`:

`$$\hat{y}_{new} = \sum_{i=1}^n w_i(x_{new})\cdot y_i \,,$$`

where

`$$w_i(x) = \frac{K_h\left(|x_{new}-x_i|\right)}{\sum_{j=1}^n K_h\left(|x_{new}-x_j|\right)} \text{ with } K_h(x) = K(\frac{x}{h})$$`

This is an example of a __linear smoother__

- class of models where predictions are _weighted_ sums of the response variable

---

## Local regression

We can fit a linear model __at each point `\(x_{new}\)`__ with weights given by a kernel function centered on `\(x_{new}\)`

- we can additionally combine this with _polynomial regression_

--

Local regression of the `\(k^{th}\)` order with kernel function `\(K\)` solves the following:

`$$\hat{\beta}(x_{new}) = \underset{\beta}{\text{arg min}}\Big\{ \sum_i K_h(|x_{new} - x_i|) \cdot (y_i - \sum_{j=0}^k x_i^j \cdot \beta_j )^2 \Big\}$$`

--

__Yes, this means every single observation has its own set of coefficients__

--

Predicted value is then:

`$$\hat{y}_{new} = \sum_{j=0}^k x_{new}^j \cdot \hat{\beta}_j(x_{new})$$`

--

Smoother predictions than kernel regression but comes at __higher computational cost__

- __LOESS__ replaces the kernel with the k nearest neighbors
  - faster than local regression but discontinuities when neighbors change
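---

### Nadaraya-Watson by hand (sketch)

A minimal sketch of the weighted-average formula above using a Gaussian kernel; `nw_predict`, the toy data, and the bandwidth values are made up for illustration only. Base `R`'s `loess()` provides local polynomial regression out of the box.

```r
# toy Nadaraya-Watson estimator: weighted average of y with Gaussian kernel weights
nw_predict <- function(x_new, x, y, h = 0.1) {
  w <- dnorm((x_new - x) / h)   # weights shrink as |x_new - x_i| grows
  sum(w * y) / sum(w)
}

set.seed(100)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
nw_predict(0.5, x, y, h = 0.05)

# local polynomial regression with a nearest-neighbor span
loess_fit <- loess(y ~ x, span = 0.3, degree = 2)
predict(loess_fit, newdata = data.frame(x = 0.5))
```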
---

## Smoothing splines

Use a __smooth function__ `\(s(x)\)` to predict `\(y\)`, control smoothness directly by minimizing the __spline objective function__:

`$$\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int(s''(x))^2dx$$`

--

`$$= \text{fit data} + \text{impose smoothness}$$`

--

`$$\Rightarrow \text{model fit} = \text{likelihood} - \lambda \cdot \text{wiggliness}$$`

Estimate the __smoothing spline__ `\(\hat{s}(x)\)` that __balances the tradeoff between the model fit and wiggliness__

--

<img src="https://github.com/noamross/gams-in-r-course/blob/master/images/diffsmooth-1.png?raw=true" width="70%" style="display: block; margin: auto;" />

---

## Basis functions

Splines are _piecewise cubic polynomials_ with __knots__ (boundary points between the pieces) at every data point

--

Practical alternative: represent the spline as a linear combination of a set of __basis functions__

--

__Cubic polynomial example__: define four basis functions:

- `\(B_1(x) = 1\)`, `\(B_2(x) = x\)`, `\(B_3(x) = x^2\)`, `\(B_4(x) = x^3\)`

where the regression function `\(r(x)\)` is written as:

`$$r(x) = \sum_{j=1}^4 \beta_j B_j(x)$$`

--

- __linear in the transformed variables__ `\(B_1(x), \dots, B_4(x)\)` but __nonlinear in `\(x\)`__

--

We extend this idea to splines _piecewise_ using indicator functions, so the spline is a weighted sum (see the sketch after the next slide):

`$$s(x) = \sum_{j=1}^m \beta_j B_j(x)$$`

---

## Number of basis functions is another tuning parameter

<img src="https://github.com/noamross/gams-in-r-course/blob/master/images/diffbasis-1.png?raw=true" width="80%" style="display: block; margin: auto;" />
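---

### Basis functions in R (sketch)

A small sketch of the basis-function idea: the fit is an ordinary linear model in the transformed variables. The toy data and the choice of `df = 8` are made up for illustration; `splines::bs()` builds a B-spline (piecewise cubic) basis.

```r
library(splines)

set.seed(101)
x <- runif(100, -1, 1)
y <- sin(3 * x) + rnorm(100, sd = 0.2)

# global cubic basis: B_2(x) = x, B_3(x) = x^2, B_4(x) = x^3 (B_1 is the intercept column)
cubic_fit <- lm(y ~ x + I(x^2) + I(x^3))
coef(cubic_fit)

# piecewise version: more basis functions = more flexibility (a tuning choice)
spline_fit <- lm(y ~ bs(x, df = 8))
```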
---

## Generalized additive models (GAMs)

GAMs were created by [Trevor Hastie and Rob Tibshirani in 1986](https://projecteuclid.org/euclid.ss/1177013604) with an intuitive construction:

- relationships between individual explanatory variables and the response variable are smooth (either linear or nonlinear via basis functions)

- estimate the smooth relationships __simultaneously__ to predict the response by just adding them up

__Generalized__ like GLMs, where `\(g()\)` is the link function for the expected value of the response `\(E(Y)\)`, and __additive__ over the `\(p\)` variables:

`$$g(E(Y)) = \beta_0 + s_1(x_1) + s_2(x_2) + \dots + s_p(x_p)$$`

<img src="https://multithreaded.stitchfix.com/assets/images/blog/fig1.svg" width="70%" style="display: block; margin: auto;" />

--

- can be a convenient balance between flexibility and interpretability

- you can combine linear and nonlinear terms!

---

## Example: predicting MLB HR probability

Used the [`baseballr`](http://billpetti.github.io/baseballr/) package to scrape all batted balls from the 2022 season for the EDA project:

```r
library(tidyverse)
batted_ball_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/eda_projects/mlb_batted_balls_2022.csv") %>%
  mutate(is_hr = as.numeric(events == "home_run")) %>%
  filter(!is.na(launch_angle), !is.na(launch_speed), !is.na(is_hr))
head(batted_ball_data)
```

```
## # A tibble: 6 x 32
##   player_name   batter stand events    hc_x  hc_y hit_distance_sc launch_speed launch_angle
##   <chr>          <dbl> <chr> <chr>    <dbl> <dbl>           <dbl>        <dbl>        <dbl>
## 1 Daza, Yonath… 602074 R     force_o… 103.  150.               18         97.4           -7
## 2 Robles, Vict… 645302 R     single    58.6 120.              158         80.2           14
## 3 Hoerner, Nico 663538 R     field_o…  99.3 166.               20        101.            -9
## 4 Clemens, Kody 665019 L     field_o… 126.  191.              165         84             64
## 5 Rosario, Amed 642708 R     field_o…  97.4 170.                9         94.3          -14
## 6 Castro, Willi 650489 L     sac_fly  178.   58.9             369         96             29
## # … with 23 more variables: hit_location <dbl>, bb_type <chr>, barrel <dbl>,
## #   pitch_type <chr>, release_speed <dbl>, effective_speed <dbl>,
## #   if_fielding_alignment <chr>, of_fielding_alignment <chr>, game_date <date>,
## #   balls <dbl>, strikes <dbl>, outs_when_up <dbl>, on_1b <dbl>, on_2b <dbl>, on_3b <dbl>,
## #   inning <dbl>, inning_topbot <chr>, home_score <dbl>, away_score <dbl>,
## #   post_home_score <dbl>, post_away_score <dbl>, des <chr>, is_hr <dbl>
```

```r
table(batted_ball_data$is_hr)
```

```
## 
##    0    1 
## 6702  333
```

---

## Predict HRs with launch angle and exit velocity?

.pull-left[

```r
batted_ball_data %>%
  ggplot(aes(x = launch_speed, y = launch_angle,
             color = as.factor(is_hr))) +
  geom_point(alpha = 0.5) +
  ggthemes::scale_color_colorblind(labels = c("No", "Yes")) +
  labs(x = "Exit velocity", y = "Launch angle", color = "HR?") +
  theme_bw() +
  theme(legend.position = "bottom")
```

- HRs are relatively rare and confined to one area of this plot

]

.pull-right[

<img src="18-np-gams_files/figure-html/unnamed-chunk-10-1.png" width="504" />

]

---

## Fitting GAMs with [`mgcv`](https://cran.r-project.org/web/packages/mgcv/index.html)

First, set up the training data

```r
set.seed(2004)
batted_ball_data <- batted_ball_data %>%
  mutate(is_train = sample(rep(0:1, length.out = nrow(batted_ball_data))))
```

Next, fit the initial model using smooth functions via `s()`:

```r
library(mgcv)
*init_logit_gam <- gam(is_hr ~ s(launch_speed) + s(launch_angle),
                       data = filter(batted_ball_data, is_train == 1),
*                      family = binomial, method = "REML")
```

- Use [REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) instead of the default for a more stable solution

---

### GAM summary

```r
summary(init_logit_gam)
```

```
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## is_hr ~ s(launch_speed) + s(launch_angle)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -26.96      10.31  -2.614  0.00895 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df Chi.sq p-value    
## s(launch_speed) 1.000  1.000  151.5  <2e-16 ***
## s(launch_angle) 2.962  3.305  112.0  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.588   Deviance explained = 68.3%
## -REML = 231.49  Scale est. = 1         n = 3517
```

---

### Visualizing partial response functions with [gratia](https://gavinsimpson.github.io/gratia/index.html)

Displays the partial effect of each term in the model `\(\Rightarrow\)` they add up to the overall prediction

```r
library(gratia)
draw(init_logit_gam)
```

<img src="18-np-gams_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />

---

## Convert to probability scale with `plogis` function

```r
*draw(init_logit_gam, fun = plogis)
```

<img src="18-np-gams_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

--

- centered on an average value of 0.5 because it's the partial effect without the intercept

---

## Include intercept in plot...

```r
*draw(init_logit_gam, fun = plogis, constant = coef(init_logit_gam)[1])
```

<img src="18-np-gams_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />

Intercept reflects the relatively rare occurrence of HRs!
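A quick way to see both points on the console (a sketch using the fitted model from above; output not shown):

```r
plogis(0)                        # 0.5, i.e. the center of the partial-effect plot
plogis(coef(init_logit_gam)[1])  # baseline HR probability implied by the intercept alone: tiny
```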
---

## Model checking for number of basis functions

Use `gam.check()` to see if we need more basis functions based on an approximate test

```r
gam.check(init_logit_gam)
```

```
## 
## Method: REML   Optimizer: outer newton
## full convergence after 11 iterations.
## Gradient range [-5.632542e-05,-2.964163e-06]
## (score 231.4864 & scale 1).
## Hessian positive definite, eigenvalue range [5.631851e-05,0.8679399].
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                   k'  edf k-index p-value
## s(launch_speed) 9.00 1.00    1.05    1.00
## s(launch_angle) 9.00 2.96    0.97    0.08 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## Check the predictions?

```r
batted_ball_data <- batted_ball_data %>%
  mutate(init_gam_hr_prob = as.numeric(predict(init_logit_gam,
                                               newdata = batted_ball_data,
                                               type = "response")),
         init_gam_hr_class = as.numeric(init_gam_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == init_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.977
## 2        1   0.972
```

---

## What about the linear model?

```r
init_linear_logit <- glm(is_hr ~ launch_speed + launch_angle,
                         data = filter(batted_ball_data, is_train == 1),
                         family = binomial)
batted_ball_data <- batted_ball_data %>%
  mutate(init_glm_hr_prob = predict(init_linear_logit,
                                    newdata = batted_ball_data,
                                    type = "response"),
         init_glm_hr_class = as.numeric(init_glm_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == init_glm_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.960
## 2        1   0.951
```

__Very few situations in reality where linear regressions perform better than an additive model using smooth functions__

- especially since smooth functions can just capture linear models...

---

## Continuous interactions as a smooth surface

```r
multi_logit_gam <- gam(is_hr ~ s(launch_speed, launch_angle),
                       data = filter(batted_ball_data, is_train == 1),
                       family = binomial)
```

Plot the predicted heatmap:

```r
draw(multi_logit_gam)
```

<img src="18-np-gams_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" />

---

## Check the predictions?

```r
batted_ball_data <- batted_ball_data %>%
  mutate(multi_gam_hr_prob = as.numeric(predict(multi_logit_gam,
                                                newdata = batted_ball_data,
                                                type = "response")),
         multi_gam_hr_class = as.numeric(multi_gam_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == multi_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.976
## 2        1   0.972
```

- This has one smoothing parameter for the 2D smooth...
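Accuracy alone barely separates these fits, so one quick additional comparison on the training data is an information criterion; a sketch (output not shown), relying on the `AIC()` method that `mgcv` provides for `gam` objects:

```r
# compare the additive fit and the 2D smooth surface fit
AIC(init_logit_gam, multi_logit_gam)
```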
---

### Separate interactions from individual terms with tensor smooths

```r
tensor_logit_gam <- gam(is_hr ~ s(launch_speed) + s(launch_angle) +
*                         ti(launch_speed, launch_angle),
                        data = filter(batted_ball_data, is_train == 1),
                        family = binomial)
```

--

Check the predictions

```r
batted_ball_data %>%
  mutate(tensor_gam_hr_prob = as.numeric(predict(tensor_logit_gam,
                                                 newdata = batted_ball_data,
                                                 type = "response")),
         tensor_gam_hr_class = as.numeric(tensor_gam_hr_prob >= 0.5)) %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == tensor_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.977
## 2        1   0.972
```

--

__A more complicated model, yet it does not help!__

---

## Recap and useful resources

.pull-left[

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">140 char vrsn<br><br>1 GAMs are just GLMs<br>2 GAMs fit wiggly terms<br>3 use + s(foo) not foo in frmla<br>4 use method = "REML"<br>5 gam.check()</p>— Dr Gavin Simpson 😷🇪🇺🇩🇰 (@ucfagls) <a href="https://twitter.com/ucfagls/status/842444686513991680?ref_src=twsrc%5Etfw">March 16, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

]

.pull-right[

- [GAMs in R by Noam Ross](https://noamross.github.io/gams-in-r-course/)

- [`mgcv` course](https://eric-pedersen.github.io/mgcv-esa-workshop/)

- [Stitch Fix post: GAM: The Predictive Modeling Silver Bullet](https://multithreaded.stitchfix.com/blog/2015/07/30/gam/)

- Chapters 7 and 8 of [Advanced Data Analysis from an Elementary Point of View by Prof Cosma Shalizi](https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf)

  - I strongly recommend you download this book, and you will refer back to it for the rest of your life

]

---

class: inverse center middle

# BONUS CONTENT

---

## KNN in context

Here are two quotes from ISLR to keep in mind when thinking about KNN:

- "As a general rule, parametric methods [like linear regression] will tend to outperform non-parametric approaches [like KNN] when there is a small number of observations per predictor." This is the *curse of dimensionality*: for data-driven models, the amount of data you need to get similar model performance goes up exponentially with `\(p\)`.

--

`\(\Rightarrow\)` KNN might not be a good model to learn when the number of predictor variables is very large.

--

- "Even in problems in which the dimension is small, we might prefer linear regression to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression, we might be willing to forego a little bit of prediction accuracy for the sake of a simple model..."

--

`\(\Rightarrow\)` KNN is not the best model to learn if inference is the goal of an analysis.

---

## KNN: two critical points to remember

1. To determine which neighbors are the nearest neighbors, pairwise Euclidean distances are computed... so we may need to scale (or standardize) the individual predictor variables so that the distances are not skewed by the one predictor that has the largest variance.

--

2. Don't blindly compute a pairwise distance matrix! For instance, if `\(n\)` = 100,000, then your pairwise distance matrix will have `\(10^{10}\)` elements, each of which uses 8 bytes in memory... resulting in a memory usage of 80 GB! Your laptop cannot handle this. It can barely handle 1-2 GB at this point.
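The back-of-the-envelope arithmetic behind that 80 GB figure, as a quick sketch:

```r
n <- 1e5         # 100,000 observations
n^2 * 8 / 1e9    # n^2 pairwise distances x 8 bytes each = 80 GB
```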
If `\(n\)` is large, you have three options:

a. subsample your data, limiting `\(n\)` to be `\(\lesssim\)` 15,000-20,000;

b. use a variant of KNN that works with sparse matrices (matrices that can be compressed since most values are zero); or

c. make use of a "kd tree" to more effectively (but only approximately) identify nearest neighbors.

The [`FNN` package in `R`](https://daviddalpiaz.github.io/r4sl/knn-reg.html) has an option to search for neighbors via the use of a kd tree.

--

But instead we will use the [`caret`](http://topepo.github.io/caret/index.html) package...

---

## Example data: MLB 2021 batting statistics

Downloaded the MLB 2021 batting statistics leaderboard from [Fangraphs](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2019&month=0&season1=2019&ind=0)

```r
library(tidyverse)
mlb_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/fg_batting_2021.csv")
head(mlb_data)
```

```
## # A tibble: 6 x 23
##   Name  Team      G    PA    HR     R   RBI    SB `BB%` `K%`    ISO BABIP   AVG   OBP   SLG
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Vlad… TOR      82   354    27    66    69     2 14.4% 17.2% 0.336 0.346 0.336 0.438 0.671
## 2 Fern… SDP      68   288    27    66    58    18 12.5% 28.1% 0.395 0.333 0.302 0.385 0.698
## 3 Carl… HOU      79   347    16    61    52     0 13.5% 17.0% 0.231 0.324 0.298 0.398 0.529
## 4 Marc… TOR      82   372    21    63    54    10 8.9%  23.9% 0.256 0.329 0.286 0.349 0.542
## 5 Rona… ATL      78   342    23    67    51    16 13.2% 24.3% 0.313 0.306 0.278 0.386 0.592
## 6 Shoh… LAA      82   322    31    60    67    12 11.2% 28.0% 0.418 0.29  0.277 0.363 0.695
## # … with 8 more variables: wOBA <dbl>, xwOBA <dbl>, wRC+ <dbl>, BsR <dbl>, Off <dbl>,
## #   Def <dbl>, WAR <dbl>, playerid <dbl>
```

---

## Data cleaning

- [`janitor`](http://sfirke.github.io/janitor/) package has convenient functions for data cleaning like `clean_names()`

- `parse_number()` function provides an easy way to convert character columns to numeric

```r
library(janitor)
mlb_data_clean <- clean_names(mlb_data)
mlb_data_clean <- mlb_data_clean %>%
  mutate_at(vars(bb_percent:k_percent), parse_number)
head(mlb_data_clean)
```

```
## # A tibble: 6 x 23
##   name     team      g    pa    hr     r   rbi    sb bb_percent k_percent   iso babip   avg
##   <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl> <dbl> <dbl> <dbl>
## 1 Vladimi… TOR      82   354    27    66    69     2       14.4      17.2 0.336 0.346 0.336
## 2 Fernand… SDP      68   288    27    66    58    18       12.5      28.1 0.395 0.333 0.302
## 3 Carlos … HOU      79   347    16    61    52     0       13.5      17   0.231 0.324 0.298
## 4 Marcus … TOR      82   372    21    63    54    10        8.9      23.9 0.256 0.329 0.286
## 5 Ronald … ATL      78   342    23    67    51    16       13.2      24.3 0.313 0.306 0.278
## 6 Shohei … LAA      82   322    31    60    67    12       11.2      28   0.418 0.29  0.277
## # … with 10 more variables: obp <dbl>, slg <dbl>, w_oba <dbl>, xw_oba <dbl>, w_rc <dbl>,
## #   bs_r <dbl>, off <dbl>, def <dbl>, war <dbl>, playerid <dbl>
```

---

## KNN example

`caret` is a package of functions designed to simplify training, tuning, and testing statistical learning methods

- first create partitions for training and test data using `createDataPartition()`

```r
library(caret)
set.seed(1960)
train_i <- createDataPartition(y = mlb_data_clean$w_oba, p = 0.7, list = FALSE) %>%
  as.numeric()
train_mlb_data <- mlb_data_clean[train_i,]
test_mlb_data <- mlb_data_clean[-train_i,]
```

--

- next use [`train()`](http://topepo.github.io/caret/model-training-and-tuning.html) to find the optimal `k` on the training data with cross-validation

```r
set.seed(1971)
init_knn_mlb_train <- 
  train(w_oba ~ bb_percent + k_percent + iso, data = train_mlb_data,
        method = "knn", trControl = trainControl("cv", number = 10),
        preProcess = c("center", "scale"), tuneLength = 10)
```
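---

### Aside: direct KNN regression with `FNN` (sketch)

For larger datasets, the `FNN` package mentioned earlier can fit the same kind of KNN regression directly (with an optional kd-tree search). A sketch, with the interface (the `knn.reg()` arguments and its `$pred` component) assumed from the `FNN` documentation and `k = 10` chosen arbitrarily:

```r
library(FNN)
knn_vars <- c("bb_percent", "k_percent", "iso")

# scale the predictors using the training means / SDs (see critical point 1 above)
train_x <- scale(train_mlb_data[, knn_vars])
test_x <- scale(test_mlb_data[, knn_vars],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

fnn_preds <- knn.reg(train = train_x, test = test_x,
                     y = train_mlb_data$w_oba, k = 10)$pred
RMSE(fnn_preds, test_mlb_data$w_oba)
```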
---

## KNN example

```r
ggplot(init_knn_mlb_train) + theme_bw()
```

<img src="18-np-gams_files/figure-html/plot-knn-1.png" width="504" style="display: block; margin: auto;" />

---

## KNN example

Can manually create a __tuning grid__ to search over values of the tuning parameter `k`

```r
set.seed(1979)
tune_knn_mlb_train <- 
  train(w_oba ~ bb_percent + k_percent + iso, data = train_mlb_data,
        method = "knn", trControl = trainControl("cv", number = 10),
        preProcess = c("center", "scale"),
*       tuneGrid = expand.grid(k = 2:20))
tune_knn_mlb_train$results
```

```
##     k       RMSE  Rsquared        MAE      RMSESD RsquaredSD       MAESD
## 1   2 0.02664407 0.4918627 0.02190051 0.005738257  0.2369480 0.005007983
## 2   3 0.02396789 0.6196434 0.01953764 0.004799971  0.1591891 0.004490957
## 3   4 0.02390454 0.6207057 0.01964823 0.004899709  0.1827772 0.004354430
## 4   5 0.02336004 0.6244300 0.01938406 0.005161427  0.1949702 0.004402360
## 5   6 0.02303508 0.6335212 0.01917729 0.004972606  0.1561728 0.004147938
## 6   7 0.02366376 0.6262434 0.01946073 0.005400162  0.1666846 0.004507068
## 7   8 0.02424653 0.6101839 0.01990244 0.005227669  0.1568779 0.004293239
## 8   9 0.02421224 0.6337331 0.01979337 0.005645555  0.1611892 0.004766752
## 9  10 0.02415043 0.6448377 0.01986623 0.005677580  0.1629994 0.005095074
## 10 11 0.02464093 0.6455271 0.02023478 0.005640372  0.1534025 0.004948348
## 11 12 0.02487926 0.6445562 0.02042024 0.005492106  0.1663580 0.004889936
## 12 13 0.02504601 0.6374670 0.02045053 0.005894155  0.1922161 0.005138379
## 13 14 0.02501372 0.6494843 0.02040749 0.005674674  0.1756100 0.004930679
## 14 15 0.02486026 0.6691205 0.02037055 0.005720033  0.1546572 0.004830014
## 15 16 0.02508766 0.6651779 0.02044740 0.005831991  0.1872543 0.004866796
## 16 17 0.02533878 0.6609326 0.02061879 0.006109175  0.1821065 0.005149361
## 17 18 0.02545599 0.6582352 0.02055717 0.006196558  0.1896244 0.005195552
## 18 19 0.02542914 0.6665400 0.02057135 0.005954863  0.1888456 0.004914146
## 19 20 0.02600403 0.6528360 0.02103544 0.005907427  0.1993802 0.004836106
```

---

## KNN example

```r
ggplot(tune_knn_mlb_train) + theme_bw()
```

<img src="18-np-gams_files/figure-html/plot-knn-tune-1.png" width="504" style="display: block; margin: auto;" />

---

## KNN example

```r
tune_knn_mlb_train$bestTune
```

```
##   k
## 5 6
```

```r
test_preds <- predict(tune_knn_mlb_train, test_mlb_data)
head(test_preds)
```

```
## [1] 0.3861667 0.3736667 0.3738333 0.3771667 0.3381667 0.3716667
```

```r
RMSE(test_preds, test_mlb_data$w_oba)
```

```
## [1] 0.02631488
```

---

## Kernel regression with `np`

Use the `npregbw` function to tune the bandwidth using [__generalized cross-validation__](https://bookdown.org/egarpor/NP-UC3M/kre-i-bwd.html#kre-i-bwd-cv)

```r
library(np)
mlb_bw0 <- npregbw(w_oba ~ bb_percent + k_percent + iso,
                   data = train_mlb_data)
```

Generate predictions with `npreg` using the fitted bandwidth object

```r
*mlb_test_npreg <- npreg(mlb_bw0, newdata = test_mlb_data)
RMSE(mlb_test_npreg$mean, test_mlb_data$w_oba)
```

```
## [1] 0.02107194
```
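To see how much smoothing the kernel regression applies to each predictor, you can inspect the selected bandwidths; a sketch (output not shown), assuming the `summary()` method and `bw` component described in the `np` documentation:

```r
summary(mlb_bw0)   # one bandwidth per predictor; larger bandwidths mean heavier smoothing
mlb_bw0$bw
```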