class: center, middle, inverse, title-slide # Supervised Learning ## Nonparametric regression ### Ron Yurko ### 07/06/2020 --- ## Model flexibility vs interpretability [Figure 2.7, Introduction to Statistical Learning with Applications in R (ISLR)](http://faculty.marshall.usc.edu/gareth-james/ISL/) <img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" /> __Tradeoff__ between a model's _flexibility_ (i.e. how "curvy" it is) and how __interpretable__ it is - Simpler the parametric form of the model `\(\Rightarrow\)` the easier it is to interpret - Hence why __linear regression__ is popular in practice --- ## Model flexibility vs interpretability <img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" /> - __Parametric__ models, for which we can write down a mathematical expression for `\(f(X)\)` __before observing the data__, _a priori_ (e.g. linear regression), __are inherently less flexible__ -- - __Nonparametric__ models, in which `\(f(X)\)` is __estimated from the data__ (e.g. kernel regression) --- ## K Nearest Neighbors (KNN) __In words:__ KNN examines the `\(k\)` data points closest to a location `\(x\)` and uses just those data to generate predictions. The optimal value of `\(k\)` is that which minimizes validation-set MSE (regression) or, e.g., MCR (classification). KNN straddles the boundary between fully parameterized models like linear regression and fully data-driven models like random forests. A KNN model is data-driven, but one *can* actually write down a compact parametric form for the model *a priori*: -- - For regression: $$ {\hat Y} \vert X = \frac{1}{k} \sum_{i=1}^k Y_i \,, $$ - For classification: $$ P[Y = j \vert X] = \frac{1}{k} \sum_{i=1}^k I(Y_i = j) \,, $$ where `\(I(\cdot)\)` is the indicator function: it returns 0 if the argument is false, and 1 otherwise. The summation yields the proportion of neighbors that are of class `\(j\)`. --- ## Finding the optimal number of neighbors `\(k\)` __The number of neighbors `\(k\)` is a tuning parameter__ (like `\(\lambda\)` is for ridge / lasso) -- As is the case elsewhere in statistical learning, determining the optimal value of `\(k\)` requires balancing bias and variance: - If `\(k\)` is too small, the resulting model is *too flexible*, - low bias (it is right on average...if we apply KNN to an infinite number of datasets sampled from the same parent population) - high variance (the predictions have a large spread in values when we apply KNN to our infinite data). See the panels to the left on the next slide. -- - If `\(k\)` is too large, the resulting model is *not flexible enough*, - high bias (wrong on average) and - low variance (nearly same predictions, every time). See the panels to the right on the next slide. --- ## Finding the optimal number of neighbors `\(k\)` <img src="http://www.stat.cmu.edu/~pfreeman/Fig_3.16.png" width="40%" style="display: block; margin: auto;" /> <img src="http://www.stat.cmu.edu/~pfreeman/Fig_2.16.png" width="40%" style="display: block; margin: auto;" /> (Figures 3.16 [top] and 2.16 [bottom], *Introduction to Statistical Learning* by James et al.) --- ## KNN in context Here are two quotes from ISLR to keep in mind when thinking about KNN: - "As a general rule, parametric methods [like linear regression] will tend to outperform non-parametric approaches [like KNN] when there is a small number of observations per predictor." 
This is the *curse of dimensionality*: for data-driven models, the amount of data you need to get similar model performance goes up exponentially with `\(p\)`. -- `\(\Rightarrow\)` KNN might not be a good model to learn when the number of predictor variables is very large. -- - "Even in problems in which the dimension is small, we might prefer linear regression to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression, we might be willing to forego a little bit of prediction accuracy for the sake of a simple model..." -- `\(\Rightarrow\)` KNN is not the best model to learn if inference is the goal of an analysis. --- ## KNN: two critical points to remember 1. To determine which neighbors are the nearest neighbors, pairwise Euclidean distances are computed...so we may need to scale (or standardize) the individual predictor variables so that the distances are not skewed by that one predictor that has the largest variance. -- 2. Don't blindly compute a pairwise distance matrix! For instance, if `\(n\)` = 100,000, then your pairwise distance matrix will have `\(10^{10}\)` elements, each of which uses 8 bytes in memory...resulting in a memory usage of 80 GB! Your laptop cannot handle this. It can barely handle 1-2 GB at this point. If `\(n\)` is large, you have three options: a. subsample your data, limiting `\(n\)` to be `\(\lesssim\)` 15,000-20,000; b. use a variant of KNN that works with sparse matrices (matrices that can be compressed since most values are zero); or c. make use of a "kd tree" to more effectively (but only approximately) identify nearest neighbors. The [`FNN` package in `R`](https://daviddalpiaz.github.io/r4sl/knn-reg.html) has an option to search for neighbors via the use of a kd tree. -- But instead we will use the [`caret`](http://topepo.github.io/caret/index.html) package... 
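---

## KNN by hand: a quick sketch

Before the `caret` example, here is a minimal sketch of what KNN regression computes, using `knn.reg()` from the `FNN` package linked above (which can search for neighbors via a kd tree). The data frame `toy_data`, its columns, and the choice `k = 5` are hypothetical placeholders for illustration, not objects from this lecture.

```r
library(FNN)

# toy data: two predictors on very different scales, plus a response
set.seed(2020)
toy_data <- data.frame(x1 = rnorm(500), x2 = rnorm(500, sd = 10))
toy_data$y <- toy_data$x1 + 0.1 * toy_data$x2 + rnorm(500, sd = 0.5)

# scale the predictors first so x2's larger variance does not dominate the distances
x_scaled <- scale(toy_data[, c("x1", "x2")])

# each prediction is the mean of the k = 5 nearest observed responses
knn_fit <- knn.reg(train = x_scaled, test = x_scaled, y = toy_data$y, k = 5)
head(knn_fit$pred)
```

Each predicted value is just `\(\frac{1}{k} \sum_{i=1}^k Y_i\)` averaged over that point's five nearest (scaled) neighbors.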
--- ## Example data: MLB 2019 batting statistics Downloaded MLB 2019 batting statistics leaderboard from [Fangraphs](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2019&month=0&season1=2019&ind=0) ```r library(tidyverse) mlb_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/fg_batting_2019.csv") head(mlb_data) ``` ``` ## # A tibble: 6 x 22 ## Name Team G PA HR R RBI SB `BB%` `K%` ISO BABIP AVG OBP SLG wOBA `wRC+` BsR ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Mike T… Ange… 134 600 45 110 104 11 18.3% 20.0% 0.353 0.298 0.291 0.438 0.645 0.436 180 7.1 ## 2 Alex B… Astr… 156 690 41 122 112 5 17.2% 12.0% 0.296 0.281 0.296 0.423 0.592 0.418 168 -2.1 ## 3 Christ… Brew… 130 580 44 100 97 30 13.8% 20.3% 0.342 0.355 0.329 0.429 0.671 0.442 174 8.5 ## 4 Cody B… Dodg… 156 660 47 121 115 15 14.4% 16.4% 0.324 0.302 0.305 0.406 0.629 0.415 162 1.4 ## 5 Marcus… Athl… 162 747 33 123 92 10 11.6% 13.7% 0.237 0.294 0.285 0.369 0.522 0.373 137 1.7 ## 6 Ketel … Diam… 144 628 32 97 92 10 8.4% 13.7% 0.264 0.342 0.329 0.389 0.592 0.405 150 4.2 ## # … with 4 more variables: Off <dbl>, Def <dbl>, WAR <dbl>, playerid <dbl> ``` --- ## Data cleaning - [`janitor`](http://sfirke.github.io/janitor/) package has convenient functions for data cleaning like `clean_names()` - `parse_number()` function provides easy way to convert character to numeric columns ```r library(janitor) mlb_data_clean <- clean_names(mlb_data) mlb_data_clean <- mlb_data_clean %>% mutate_at(vars(bb_percent:k_percent), parse_number) head(mlb_data_clean) ``` ``` ## # A tibble: 6 x 22 ## name team g pa hr r rbi sb bb_percent k_percent iso babip avg obp slg w_oba w_rc ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Mike… Ange… 134 600 45 110 104 11 18.3 20 0.353 0.298 0.291 0.438 0.645 0.436 180 ## 2 Alex… Astr… 156 690 41 122 112 5 17.2 12 0.296 0.281 0.296 0.423 0.592 0.418 168 ## 3 Chri… Brew… 130 580 44 100 97 30 13.8 20.3 0.342 0.355 0.329 0.429 0.671 0.442 174 ## 4 Cody… Dodg… 156 660 47 121 115 15 14.4 16.4 0.324 0.302 0.305 0.406 0.629 0.415 162 ## 5 Marc… Athl… 162 747 33 123 92 10 11.6 13.7 0.237 0.294 0.285 0.369 0.522 0.373 137 ## 6 Kete… Diam… 144 628 32 97 92 10 8.4 13.7 0.264 0.342 0.329 0.389 0.592 0.405 150 ## # … with 5 more variables: bs_r <dbl>, off <dbl>, def <dbl>, war <dbl>, playerid <dbl> ``` --- ## KNN example `caret` is a package of functions designed to simplify training, tuning, and testing statistical learning methods - first create partitions for training and test data using `createDataPartition()` ```r set.seed(1960) train_i <- createDataPartition(y = mlb_data_clean$w_oba, p = 0.7, list = FALSE) %>% as.numeric() train_mlb_data <- mlb_data_clean[train_i,] test_mlb_data <- mlb_data_clean[-train_i,] ``` -- - next [`train()`](http://topepo.github.io/caret/model-training-and-tuning.html) to find the optimal `k` on the training data with cross-validation ```r set.seed(1971) init_knn_mlb_train <- train(w_oba ~ bb_percent + k_percent + iso, data = train_mlb_data, method = "knn", trControl = trainControl("cv", number = 10), preProcess = c("center", "scale"), tuneLength = 10) ``` --- ## KNN example ```r plot(init_knn_mlb_train) ``` <img src="15-knn-kernel_files/figure-html/plot-knn-1.png" width="504" style="display: block; margin: auto;" /> --- ## KNN example Can manually create a __tuning grid__ to search over for the tuning parameter `k` ```r 
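# expand.grid(k = 2:20) builds a one-column data frame of candidate k values;
# train() fits KNN for each k with 10-fold CV and keeps the k with the lowest
# cross-validated RMSE (stored in tune_knn_mlb_train$bestTune)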
set.seed(1979) tune_knn_mlb_train <- train(w_oba ~ bb_percent + k_percent + iso, data = train_mlb_data, method = "knn", trControl = trainControl("cv", number = 10), preProcess = c("center", "scale"), * tuneGrid = expand.grid(k = 2:20)) tune_knn_mlb_train$results ``` ``` ## k RMSE Rsquared MAE RMSESD RsquaredSD MAESD ## 1 2 0.02422240 0.5015161 0.01923169 0.003490421 0.2267108 0.003644234 ## 2 3 0.02267126 0.5484469 0.01854557 0.003701281 0.2105002 0.003496073 ## 3 4 0.02121419 0.6100641 0.01751832 0.003172789 0.1833197 0.003366532 ## 4 5 0.02131950 0.6190750 0.01755824 0.003467259 0.1737263 0.003433994 ## 5 6 0.02040682 0.6487335 0.01672642 0.003422221 0.1847779 0.003225498 ## 6 7 0.02067103 0.6538541 0.01703278 0.003584345 0.1886355 0.003490829 ## 7 8 0.02021542 0.6857249 0.01661766 0.003079793 0.1643981 0.002614104 ## 8 9 0.01989941 0.7110146 0.01628716 0.003013108 0.1743321 0.002476301 ## 9 10 0.02037611 0.7102667 0.01659815 0.003130034 0.1755448 0.002645629 ## 10 11 0.02076512 0.6918662 0.01711617 0.002900292 0.1898689 0.002430482 ## 11 12 0.02095577 0.6921144 0.01712723 0.003128792 0.1823307 0.002643407 ## 12 13 0.02109094 0.6962511 0.01704114 0.003355928 0.1681903 0.002984102 ## 13 14 0.02153544 0.6906963 0.01734505 0.002979215 0.1834400 0.002696124 ## 14 15 0.02155216 0.6991005 0.01730960 0.002851471 0.1818701 0.002486504 ## 15 16 0.02179000 0.6955016 0.01747865 0.003137918 0.1808767 0.002562037 ## 16 17 0.02180227 0.6943468 0.01761774 0.003255189 0.1829315 0.002646446 ## 17 18 0.02212422 0.6910580 0.01797719 0.003079922 0.1777369 0.002489574 ## 18 19 0.02243315 0.6782421 0.01825869 0.003112202 0.1728950 0.002460854 ## 19 20 0.02261279 0.6767660 0.01852271 0.003181720 0.1710694 0.002686753 ``` --- ## KNN example ```r plot(tune_knn_mlb_train) ``` <img src="15-knn-kernel_files/figure-html/plot-knn-tune-1.png" width="504" style="display: block; margin: auto;" /> --- ## KNN example ```r tune_knn_mlb_train$bestTune ``` ``` ## k ## 8 9 ``` ```r test_preds <- predict(tune_knn_mlb_train, test_mlb_data) head(test_preds) ``` ``` ## [1] 0.4022222 0.3657778 0.3481111 0.3603333 0.3684444 0.3778889 ``` ```r RMSE(test_preds, test_mlb_data$w_oba) ``` ``` ## [1] 0.02012412 ``` --- ## What does KNN remind you of?... <img src="https://media1.giphy.com/media/12Gyz2J1b9SjD2/200w.gif" width="40%" style="display: block; margin: auto;" /> --- ## Kernels A kernel `\(K(x)\)` is a weighting function used in estimators. Full stop. A kernel technically has only one required property: - `\(K(x) \geq 0\)` for all `\(x\)`. However, in the manner that kernels are used in statistics, there are two other properties that are usually satisfied: - `\(\int_{-\infty}^\infty K(x) dx = 1\)`; and - `\(K(-x) = K(x)\)` for all `\(x\)`. In short: a kernel is a symmetric pdf! --- ## Kernel density estimation __Goal__: estimate the PDF `\(f(x)\)` for all possible values (assuming it is continuous / smooth) -- $$ \text{Kernel density estimate: } \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) $$ -- - `\(n =\)` sample size, `\(x =\)` new point to estimate `\(f(x)\)` (does NOT have to be in dataset!) -- - `\(h =\)` __bandwidth__, analogous to histogram bin width, ensures `\(\hat{f}(x)\)` integrates to 1 - `\(x_i =\)` `\(i\)`th observation in dataset -- - `\(K_h(x - x_i)\)` is the __Kernel__ function, creates __weight__ given distance of `\(i\)`th observation from new point - as `\(|x - x_i| \rightarrow \infty\)` then `\(K_h(x - x_i) \rightarrow 0\)`, i.e. 
the further the `\(i\)`th observation is from `\(x\)`, the smaller its weight
- as __bandwidth__ `\(h \uparrow\)` weights are more evenly spread out (as `\(h \downarrow\)` weights are more concentrated around `\(x\)`)
- typically use [__Gaussian__ / Normal](https://en.wikipedia.org/wiki/Normal_distribution) kernel: `\(\propto e^{-(x - x_i)^2 / 2h^2}\)`
- `\(K_h(x - x_i)\)` is large when `\(x_i\)` is close to `\(x\)`

---

## Commonly Used Kernels

<img src="http://www.stat.cmu.edu/~pfreeman/kernels.png" width="40%" style="display: block; margin: auto;" />

A general rule of thumb: the choice of kernel will have little effect on estimation, particularly if the sample size is large! The Gaussian kernel (i.e., a normal pdf) is by far the most common choice, and is the default for many `R` functions that utilize kernels.

---

## Kernel regression

As a final note, realize that one can apply kernels in the regression setting as well as in the density estimation setting. The classic kernel regression estimator is the __Nadaraya-Watson__ estimator:

`$$\hat{y}_h(x) = \sum_{i=1}^n l_i(x) Y_i \,,$$`

where

`$$l_i(x) = \frac{K\left(\frac{x-X_i}{h}\right)}{\sum_{j=1}^n K\left(\frac{x-X_j}{h}\right)} \,.$$`

Basically, the regression estimate is a *weighted* average of the observed response values: the farther `\(x\)` is from an observation, the less weight that observation has in determining the regression estimate at `\(x\)`.

The workhorse function for kernel regression in `R` is `ksmooth()` from the base `stats` package. Tuning to find the optimal value of `\(h\)` is not necessarily simple; a bandwidth proportional to `\(n^{-1/5}\)` is a reasonable initial estimate.
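---

## Kernel regression example

A minimal sketch of the Nadaraya-Watson estimator via `ksmooth()`, using the MLB data from earlier; treating `w_oba` as the response and `iso` as the single predictor with `bandwidth = 0.05` is an illustrative setup, not part of the original analysis.

```r
# Nadaraya-Watson fit of wOBA as a function of ISO with a Gaussian kernel;
# bandwidth plays the role of the tuning parameter h
nw_fit <- ksmooth(x = mlb_data_clean$iso, y = mlb_data_clean$w_oba,
                  kernel = "normal", bandwidth = 0.05, n.points = 100)

# overlay the kernel regression curve on the scatterplot
plot(mlb_data_clean$iso, mlb_data_clean$w_oba, xlab = "ISO", ylab = "wOBA")
lines(nw_fit$x, nw_fit$y, col = "red", lwd = 2)
```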
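---

## Kernel density estimation example

A minimal sketch of the kernel density estimate from a few slides back, using base `R`'s `density()` on the `w_oba` column of the MLB data; the bandwidth value `0.01` is an illustrative choice, not a tuned one.

```r
# Gaussian KDE of wOBA; bw is the bandwidth h
woba_kde <- density(mlb_data_clean$w_oba, bw = 0.01, kernel = "gaussian")

# density() returns the grid of evaluation points (x) and the estimates (y)
plot(woba_kde, main = "Kernel density estimate of 2019 wOBA")
```

Leaving `bw` unspecified uses `R`'s default rule-of-thumb bandwidth, which shrinks at the `\(n^{-1/5}\)` rate as the sample size grows.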