class: center, middle, inverse, title-slide # Supervised Learning ## Nonparametric regression ### Ron Yurko ### 07/06/2020 --- ## Model flexibility vs interpretability [Figure 2.7, Introduction to Statistical Learning with Applications in R (ISLR)](http://faculty.marshall.usc.edu/gareth-james/ISL/) <img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" /> __Tradeoff__ between a model's _flexibility_ (i.e. how "curvy" it is) and how __interpretable__ it is - Simpler the parametric form of the model `\(\Rightarrow\)` the easier it is to interpret - Hence why __linear regression__ is popular in practice --- ## Model flexibility vs interpretability <img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" /> - __Parametric__ models, for which we can write down a mathematical expression for `\(f(X)\)` __before observing the data__, _a priori_ (e.g. linear regression), __are inherently less flexible__ -- - __Nonparametric__ models, in which `\(f(X)\)` is __estimated from the data__ (e.g. kernel regression) --- ## K Nearest Neighbors (KNN) __In words:__ KNN examines the `\(k\)` data points closest to a location `\(x\)` and uses just those data to generate predictions. The optimal value of `\(k\)` is that which minimizes validation-set MSE (regression) or, e.g., MCR (classification). KNN straddles the boundary between fully parameterized models like linear regression and fully data-driven models like random forests. A KNN model is data-driven, but one *can* actually write down a compact parametric form for the model *a priori*: -- - For regression: $$ {\hat Y} \vert X = \frac{1}{k} \sum_{i=1}^k Y_i \,, $$ - For classification: $$ P[Y = j \vert X] = \frac{1}{k} \sum_{i=1}^k I(Y_i = j) \,, $$ where `\(I(\cdot)\)` is the indicator function: it returns 0 if the argument is false, and 1 otherwise. The summation yields the proportion of neighbors that are of class `\(j\)`. --- ## Finding the optimal number of neighbors `\(k\)` __The number of neighbors `\(k\)` is a tuning parameter__ (like `\(\lambda\)` is for ridge / lasso) -- As is the case elsewhere in statistical learning, determining the optimal value of `\(k\)` requires balancing bias and variance: - If `\(k\)` is too small, the resulting model is *too flexible*, - low bias (it is right on average...if we apply KNN to an infinite number of datasets sampled from the same parent population) - high variance (the predictions have a large spread in values when we apply KNN to our infinite data). See the panels to the left on the next slide. -- - If `\(k\)` is too large, the resulting model is *not flexible enough*, - high bias (wrong on average) and - low variance (nearly same predictions, every time). See the panels to the right on the next slide. --- ## Finding the optimal number of neighbors `\(k\)` <img src="http://www.stat.cmu.edu/~pfreeman/Fig_3.16.png" width="40%" style="display: block; margin: auto;" /> <img src="http://www.stat.cmu.edu/~pfreeman/Fig_2.16.png" width="40%" style="display: block; margin: auto;" /> (Figures 3.16 [top] and 2.16 [bottom], *Introduction to Statistical Learning* by James et al.) --- ## KNN in context Here are two quotes from ISLR to keep in mind when thinking about KNN: - "As a general rule, parametric methods [like linear regression] will tend to outperform non-parametric approaches [like KNN] when there is a small number of observations per predictor." 
This is the *curse of dimensionality*: for data-driven models, the amount of data you need to get similar model performance goes up exponentially with `\(p\)`. -- `\(\Rightarrow\)` KNN might not be a good model to learn when the number of predictor variables is very large. -- - "Even in problems in which the dimension is small, we might prefer linear regression to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression, we might be willing to forego a little bit of prediction accuracy for the sake of a simple model..." -- `\(\Rightarrow\)` KNN is not the best model to learn if inference is the goal of an analysis. --- ## KNN: two critical points to remember 1. To determine which neighbors are the nearest neighbors, pairwise Euclidean distances are computed...so we may need to scale (or standardize) the individual predictor variables so that the distances are not skewed by that one predictor that has the largest variance. -- 2. Don't blindly compute a pairwise distance matrix! For instance, if `\(n\)` = 100,000, then your pairwise distance matrix will have `\(10^{10}\)` elements, each of which uses 8 bytes in memory...resulting in a memory usage of 80 GB! Your laptop cannot handle this. It can barely handle 1-2 GB at this point. If `\(n\)` is large, you have three options: a. subsample your data, limiting `\(n\)` to be `\(\lesssim\)` 15,000-20,000; b. use a variant of KNN that works with sparse matrices (matrices that can be compressed since most values are zero); or c. make use of a "kd tree" to more effectively (but only approximately) identify nearest neighbors. The [`FNN` package in `R`](https://daviddalpiaz.github.io/r4sl/knn-reg.html) has an option to search for neighbors via the use of a kd tree. -- But instead we will use the [`caret`](http://topepo.github.io/caret/index.html) package... 
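---

## KNN by hand: a quick sketch

Before the `caret` example, here is a minimal sketch of what KNN regression computes, using `knn.reg()` from the `FNN` package linked above (which can search for neighbors via a kd tree). The data frame `toy_data`, its columns, and the choice `k = 5` are hypothetical placeholders for illustration, not objects from this lecture.

```r
library(FNN)

# toy data: two predictors on very different scales, plus a response
set.seed(2020)
toy_data <- data.frame(x1 = rnorm(500), x2 = rnorm(500, sd = 10))
toy_data$y <- toy_data$x1 + 0.1 * toy_data$x2 + rnorm(500, sd = 0.5)

# scale the predictors first so x2's larger variance does not dominate the distances
x_scaled <- scale(toy_data[, c("x1", "x2")])

# each prediction is the mean of the k = 5 nearest observed responses
knn_fit <- knn.reg(train = x_scaled, test = x_scaled, y = toy_data$y, k = 5)
head(knn_fit$pred)
```

Each predicted value is just `\(\frac{1}{k} \sum_{i=1}^k Y_i\)` averaged over that point's five nearest (scaled) neighbors.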
--- ## Example data: MLB 2019 batting statistics Downloaded MLB 2019 batting statistics leaderboard from [Fangraphs](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2019&month=0&season1=2019&ind=0) ```r library(tidyverse) mlb_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/fg_batting_2019.csv") head(mlb_data) ``` ``` ## # A tibble: 6 x 22 ## Name Team G PA HR R RBI SB `BB%` `K%` ISO BABIP AVG OBP SLG wOBA `wRC+` BsR ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Mike T… Ange… 134 600 45 110 104 11 18.3% 20.0% 0.353 0.298 0.291 0.438 0.645 0.436 180 7.1 ## 2 Alex B… Astr… 156 690 41 122 112 5 17.2% 12.0% 0.296 0.281 0.296 0.423 0.592 0.418 168 -2.1 ## 3 Christ… Brew… 130 580 44 100 97 30 13.8% 20.3% 0.342 0.355 0.329 0.429 0.671 0.442 174 8.5 ## 4 Cody B… Dodg… 156 660 47 121 115 15 14.4% 16.4% 0.324 0.302 0.305 0.406 0.629 0.415 162 1.4 ## 5 Marcus… Athl… 162 747 33 123 92 10 11.6% 13.7% 0.237 0.294 0.285 0.369 0.522 0.373 137 1.7 ## 6 Ketel … Diam… 144 628 32 97 92 10 8.4% 13.7% 0.264 0.342 0.329 0.389 0.592 0.405 150 4.2 ## # … with 4 more variables: Off <dbl>, Def <dbl>, WAR <dbl>, playerid <dbl> ``` --- ## Data cleaning - [`janitor`](http://sfirke.github.io/janitor/) package has convenient functions for data cleaning like `clean_names()` - `parse_number()` function provides easy way to convert character to numeric columns ```r library(janitor) mlb_data_clean <- clean_names(mlb_data) mlb_data_clean <- mlb_data_clean %>% mutate_at(vars(bb_percent:k_percent), parse_number) head(mlb_data_clean) ``` ``` ## # A tibble: 6 x 22 ## name team g pa hr r rbi sb bb_percent k_percent iso babip avg obp slg w_oba w_rc ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Mike… Ange… 134 600 45 110 104 11 18.3 20 0.353 0.298 0.291 0.438 0.645 0.436 180 ## 2 Alex… Astr… 156 690 41 122 112 5 17.2 12 0.296 0.281 0.296 0.423 0.592 0.418 168 ## 3 Chri… Brew… 130 580 44 100 97 30 13.8 20.3 0.342 0.355 0.329 0.429 0.671 0.442 174 ## 4 Cody… Dodg… 156 660 47 121 115 15 14.4 16.4 0.324 0.302 0.305 0.406 0.629 0.415 162 ## 5 Marc… Athl… 162 747 33 123 92 10 11.6 13.7 0.237 0.294 0.285 0.369 0.522 0.373 137 ## 6 Kete… Diam… 144 628 32 97 92 10 8.4 13.7 0.264 0.342 0.329 0.389 0.592 0.405 150 ## # … with 5 more variables: bs_r <dbl>, off <dbl>, def <dbl>, war <dbl>, playerid <dbl> ``` --- ## KNN example `caret` is a package of functions designed to simplify training, tuning, and testing statistical learning methods - first create partitions for training and test data using `createDataPartition()` ```r set.seed(1960) train_i <- createDataPartition(y = mlb_data_clean$w_oba, p = 0.7, list = FALSE) %>% as.numeric() train_mlb_data <- mlb_data_clean[train_i,] test_mlb_data <- mlb_data_clean[-train_i,] ``` -- - next [`train()`](http://topepo.github.io/caret/model-training-and-tuning.html) to find the optimal `k` on the training data with cross-validation ```r set.seed(1971) init_knn_mlb_train <- train(w_oba ~ bb_percent + k_percent + iso, data = train_mlb_data, method = "knn", trControl = trainControl("cv", number = 10), preProcess = c("center", "scale"), tuneLength = 10) ``` --- ## KNN example ```r plot(init_knn_mlb_train) ``` <img src="15-knn-kernel_files/figure-html/plot-knn-1.png" width="504" style="display: block; margin: auto;" /> --- ## KNN example Can manually create a __tuning grid__ to search over for the tuning parameter `k` ```r 
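# expand.grid(k = 2:20) builds a one-column data frame of candidate k values;
# train() fits KNN for each k with 10-fold CV and keeps the k with the lowest
# cross-validated RMSE (stored in tune_knn_mlb_train$bestTune)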
set.seed(1979) tune_knn_mlb_train <- train(w_oba ~ bb_percent + k_percent + iso, data = train_mlb_data, method = "knn", trControl = trainControl("cv", number = 10), preProcess = c("center", "scale"), * tuneGrid = expand.grid(k = 2:20)) tune_knn_mlb_train$results ``` ``` ## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD ## 1 2 0.02422240 0.5015161 0.01923169 0.003490421 0.2267108 0.003644234 ## 2 3 0.02267126 0.5484469 0.01854557 0.003701281 0.2105002 0.003496073 ## 3 4 0.02121419 0.6100641 0.01751832 0.003172789 0.1833197 0.003366532 ## 4 5 0.02131950 0.6190750 0.01755824 0.003467259 0.1737263 0.003433994 ## 5 6 0.02040682 0.6487335 0.01672642 0.003422221 0.1847779 0.003225498 ## 6 7 0.02067103 0.6538541 0.01703278 0.003584345 0.1886355 0.003490829 ## 7 8 0.02021542 0.6857249 0.01661766 0.003079793 0.1643981 0.002614104 ## 8 9 0.01989941 0.7110146 0.01628716 0.003013108 0.1743321 0.002476301 ## 9 10 0.02037611 0.7102667 0.01659815 0.003130034 0.1755448 0.002645629 ## 10 11 0.02076512 0.6918662 0.01711617 0.002900292 0.1898689 0.002430482 ## 11 12 0.02095577 0.6921144 0.01712723 0.003128792 0.1823307 0.002643407 ## 12 13 0.02109094 0.6962511 0.01704114 0.003355928 0.1681903 0.002984102 ## 13 14 0.02153544 0.6906963 0.01734505 0.002979215 0.1834400 0.002696124 ## 14 15 0.02155216 0.6991005 0.01730960 0.002851471 0.1818701 0.002486504 ## 15 16 0.02179000 0.6955016 0.01747865 0.003137918 0.1808767 0.002562037 ## 16 17 0.02180227 0.6943468 0.01761774 0.003255189 0.1829315 0.002646446 ## 17 18 0.02212422 0.6910580 0.01797719 0.003079922 0.1777369 0.002489574 ## 18 19 0.02243315 0.6782421 0.01825869 0.003112202 0.1728950 0.002460854 ## 19 20 0.02261279 0.6767660 0.01852271 0.003181720 0.1710694 0.002686753 ``` --- ## KNN example ```r plot(tune_knn_mlb_train) ``` <img src="15-knn-kernel_files/figure-html/plot-knn-tune-1.png" width="504" style="display: block; margin: auto;" /> --- ## KNN example ```r tune_knn_mlb_train$bestTune ``` ``` ## k ## 8 9 ``` ```r test_preds <- predict(tune_knn_mlb_train, test_mlb_data) head(test_preds) ``` ``` ## [1] 0.4022222 0.3657778 0.3481111 0.3603333 0.3684444 0.3778889 ``` ```r RMSE(test_preds, test_mlb_data$w_oba) ``` ``` ## [1] 0.02012412 ``` --- ## What does KNN remind you of?... <img src="https://media1.giphy.com/media/12Gyz2J1b9SjD2/200w.gif" width="40%" style="display: block; margin: auto;" /> --- ## Kernels A kernel `\(K(x)\)` is a weighting function used in estimators. Full stop. A kernel technically has only one required property: - `\(K(x) \geq 0\)` for all `\(x\)`. However, in the manner that kernels are used in statistics, there are two other properties that are usually satisfied: - `\(\int_{-\infty}^\infty K(x) dx = 1\)`; and - `\(K(-x) = K(x)\)` for all `\(x\)`. In short: a kernel is a symmetric pdf! --- ## Kernel density estimation __Goal__: estimate the PDF `\(f(x)\)` for all possible values (assuming it is continuous / smooth) -- $$ \text{Kernel density estimate: } \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) $$ -- - `\(n =\)` sample size, `\(x =\)` new point to estimate `\(f(x)\)` (does NOT have to be in dataset!) -- - `\(h =\)` __bandwidth__, analogous to histogram bin width, ensures `\(\hat{f}(x)\)` integrates to 1 - `\(x_i =\)` `\(i\)`th observation in dataset -- - `\(K_h(x - x_i)\)` is the __Kernel__ function, creates __weight__ given distance of `\(i\)`th observation from new point - as `\(|x - x_i| \rightarrow \infty\)` then `\(K_h(x - x_i) \rightarrow 0\)`, i.e. 
the further the `\(i\)`th observation is from `\(x\)`, the smaller its weight
- as __bandwidth__ `\(h \uparrow\)` weights are more evenly spread out (as `\(h \downarrow\)` weights are more concentrated around `\(x\)`)
- typically use [__Gaussian__ / Normal](https://en.wikipedia.org/wiki/Normal_distribution) kernel: `\(\propto e^{-(x - x_i)^2 / 2h^2}\)`
- `\(K_h(x - x_i)\)` is large when `\(x_i\)` is close to `\(x\)`

---

## Commonly Used Kernels

<img src="http://www.stat.cmu.edu/~pfreeman/kernels.png" width="40%" style="display: block; margin: auto;" />

A general rule of thumb: the choice of kernel will have little effect on estimation, particularly if the sample size is large! The Gaussian kernel (i.e., a normal pdf) is by far the most common choice, and is the default for many `R` functions that utilize kernels.

---

## Kernel regression

As a final note, realize that one can apply kernels in the regression setting as well as in the density estimation setting. The classic kernel regression estimator is the __Nadaraya-Watson__ estimator:

`$$\hat{y}_h(x) = \sum_{i=1}^n l_i(x) Y_i \,,$$`

where

`$$l_i(x) = \frac{K\left(\frac{x-X_i}{h}\right)}{\sum_{j=1}^n K\left(\frac{x-X_j}{h}\right)} \,.$$`

Basically, the regression estimate is a *weighted* average of the observed response values: the farther `\(x\)` is from an observation, the less weight that observation has in determining the regression estimate at `\(x\)`.

The workhorse function for kernel regression in `R` is `ksmooth()` from the base `stats` package. Tuning to find the optimal value of `\(h\)` is not necessarily simple; a bandwidth proportional to `\(n^{-1/5}\)` is a reasonable initial estimate.
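---

## Kernel regression example

A minimal sketch of the Nadaraya-Watson estimator via `ksmooth()`, using the MLB data from earlier; treating `w_oba` as the response and `iso` as the single predictor with `bandwidth = 0.05` is an illustrative setup, not part of the original analysis.

```r
# Nadaraya-Watson fit of wOBA as a function of ISO with a Gaussian kernel;
# bandwidth plays the role of the tuning parameter h
nw_fit <- ksmooth(x = mlb_data_clean$iso, y = mlb_data_clean$w_oba,
                  kernel = "normal", bandwidth = 0.05, n.points = 100)

# overlay the kernel regression curve on the scatterplot
plot(mlb_data_clean$iso, mlb_data_clean$w_oba, xlab = "ISO", ylab = "wOBA")
lines(nw_fit$x, nw_fit$y, col = "red", lwd = 2)
```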
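---

## Kernel density estimation example

A minimal sketch of the kernel density estimate from a few slides back, using base `R`'s `density()` on the `w_oba` column of the MLB data; the bandwidth value `0.01` is an illustrative choice, not a tuned one.

```r
# Gaussian KDE of wOBA; bw is the bandwidth h
woba_kde <- density(mlb_data_clean$w_oba, bw = 0.01, kernel = "gaussian")

# density() returns the grid of evaluation points (x) and the estimates (y)
plot(woba_kde, main = "Kernel density estimate of 2019 wOBA")
```

Leaving `bw` unspecified uses `R`'s default rule-of-thumb bandwidth, which shrinks at the `\(n^{-1/5}\)` rate as the sample size grows.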