class: center, middle, inverse, title-slide

# Supervised Learning
## From nonparametric regression to GAMs
### July 14th, 2022

---

## Model flexibility vs interpretability

[Figure 2.7, Introduction to Statistical Learning with Applications in R (ISLR)](https://www.statlearning.com/)

<img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" />

__Tradeoff__ between a model's _flexibility_ (i.e. how "curvy" it is) and how __interpretable__ it is

- The simpler the parametric form of the model `\(\Rightarrow\)` the easier it is to interpret

---

## Model flexibility vs interpretability

<img src="http://www.stat.cmu.edu/~pfreeman/flexibility.png" width="50%" style="display: block; margin: auto;" />

- __Parametric__ models, for which we can write down a mathematical expression for `\(f(X)\)` __before observing the data__, _a priori_ (e.g. linear regression), __are inherently less flexible__

--

- __Nonparametric__ models, in which `\(f(X)\)` is __estimated from the data__ (e.g. kernel regression), __are more flexible__

---

## K Nearest Neighbors (KNN)

- Find the `\(k\)` data points __closest__ to an observation `\(x\)`, use these to predict

- Need to use some measure of distance, e.g., Euclidean distance

--

- KNN is data-driven, but we can actually write down the model _a priori_

--

- Regression:

$$ {\hat Y} \vert X = \frac{1}{k} \sum_{i=1}^k Y_i \,, $$

- Classification:

$$ \hat{P}[Y = j \vert X] = \frac{1}{k} \sum_{i=1}^k 1(Y_i = j) \,, $$

- `\(1(\cdot)\)` is the indicator function: returns 1 if TRUE, and 0 otherwise.

- Summation yields the proportion of neighbors that are of class `\(j\)`

---

## Finding the optimal number of neighbors `\(k\)`

__The number of neighbors `\(k\)` is a tuning parameter__ (like `\(\lambda\)` is for ridge / lasso)

--

Determining the optimal value of `\(k\)` requires balancing bias and variance:

- If `\(k\)` is too small, the resulting model is *too flexible*:
  - low bias (it is right on average... if we apply KNN to an infinite number of datasets sampled from the same parent population)
  - high variance (the predictions have a large spread in values when we apply KNN to those infinite datasets); see the panels to the left on the next slide

--

- If `\(k\)` is too large, the resulting model is *not flexible enough*:
  - high bias (wrong on average) and
  - low variance (nearly the same predictions, every time); see the panels to the right on the next slide

---

## Finding the optimal number of neighbors `\(k\)`

<img src="http://www.stat.cmu.edu/~pfreeman/Fig_3.16.png" width="40%" style="display: block; margin: auto;" />

<img src="http://www.stat.cmu.edu/~pfreeman/Fig_2.16.png" width="40%" style="display: block; margin: auto;" />

(Figures 3.16 [top] and 2.16 [bottom], *Introduction to Statistical Learning* by James et al.)

---

## What does KNN remind you of?...

<img src="https://media1.giphy.com/media/12Gyz2J1b9SjD2/200w.gif" width="40%" style="display: block; margin: auto;" />
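---

### KNN regression by hand (sketch)

To make the averaging formula concrete, here is a minimal sketch of KNN regression in base `R` with a single predictor; `knn_predict` and the toy data are made up purely for illustration.

```r
# toy helper: predict y at x_new by averaging the k nearest responses
knn_predict <- function(x_new, x, y, k = 5) {
  nearest <- order(abs(x - x_new))[1:k]   # Euclidean distance in 1D = absolute difference
  mean(y[nearest])                        # average of the k nearest Y_i
}

set.seed(100)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
knn_predict(0.5, x, y, k = 10)
```

For classification we would replace `mean(y[nearest])` with the proportion of neighbors in each class.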
---

## Kernels

A kernel `\(K(x)\)` is a weighting function used in estimators, and technically has only one required property:

- `\(K(x) \geq 0\)` for all `\(x\)`

However, in the manner that kernels are used in statistics, there are two other properties that are usually satisfied:

- `\(\int_{-\infty}^\infty K(x) dx = 1\)`; and

- `\(K(-x) = K(x)\)` for all `\(x\)`.

In short: __a kernel is a symmetric PDF!__

---

## Kernel density estimation

__Goal__: estimate the PDF `\(f(x)\)` for all possible values (assuming it is continuous / smooth)

--

$$ \text{Kernel density estimate: } \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) $$

--

- `\(n =\)` sample size, `\(x =\)` new point at which to estimate `\(f(x)\)` (does NOT have to be in the dataset!)

--

- `\(h =\)` __bandwidth__, analogous to histogram bin width, ensures `\(\hat{f}(x)\)` integrates to 1

- `\(x_i =\)` `\(i\)`th observation in the dataset

--

- `\(K_h(x - x_i)\)` is the __kernel__ function, which creates a __weight__ given the distance of the `\(i\)`th observation from the new point

  - as `\(|x - x_i| \rightarrow \infty\)` then `\(K_h(x - x_i) \rightarrow 0\)`, i.e. the further the `\(i\)`th observation is from `\(x\)`, the smaller the weight
  - as the __bandwidth__ `\(h \uparrow\)` weights are more evenly spread out (as `\(h \downarrow\)` weights are more concentrated around `\(x\)`)
  - typically use the [__Gaussian__ / Normal](https://en.wikipedia.org/wiki/Normal_distribution) kernel: `\(\propto e^{-(x - x_i)^2 / 2h^2}\)`
  - `\(K_h(x - x_i)\)` is large when `\(x_i\)` is close to `\(x\)`

---

## Commonly Used Kernels

<img src="http://www.stat.cmu.edu/~pfreeman/kernels.png" width="40%" style="display: block; margin: auto;" />

A general rule of thumb: the choice of kernel will have little effect on estimation, particularly if the sample size is large! The Gaussian kernel (i.e., a normal PDF) is by far the most common choice, and is the default for `R` functions that utilize kernels.

---

## Kernel regression

We can apply kernels in the regression setting as well as in the density estimation setting! The classic kernel regression estimator is the __Nadaraya-Watson__ estimator:

`$$\hat{y}_h(x) = \sum_{i=1}^n w_i(x) Y_i \,,$$`

where

`$$w_i(x) = \frac{K\left(\frac{x-X_i}{h}\right)}{\sum_{j=1}^n K\left(\frac{x-X_j}{h}\right)} \,.$$`

The regression estimate is a *weighted* average of the observed response values:

- The farther `\(x\)` is from an observation `\(\Rightarrow\)` the less weight that observation has in determining the regression estimate at `\(x\)`

---

## Kernel regression

__Nadaraya-Watson kernel regression__

- given training data with explanatory variable `\(x\)` and continuous response `\(y\)`

- _bandwidth_ `\(h > 0\)`

- and a new point `\((x_{new}, y_{new})\)`:

`$$\hat{y}_{new} = \sum_{i=1}^n w_i(x_{new})\cdot y_i \,,$$`

where

`$$w_i(x) = \frac{K_h\left(|x_{new}-x_i|\right)}{\sum_{j=1}^n K_h\left(|x_{new}-x_j|\right)} \text{ with } K_h(x) = K(\frac{x}{h})$$`

This is an example of a __linear smoother__

- class of models where predictions are _weighted_ sums of the response variable

---

## Local regression

We can fit a linear model __at each point `\(x_{new}\)`__ with weights given by a kernel function centered on `\(x_{new}\)`

- we can additionally combine this with _polynomial regression_

--

Local regression of the `\(k^{th}\)` order with kernel function `\(K\)` solves the following:

`$$\hat{\beta}(x_{new}) = \underset{\beta}{\text{arg min}}\Big\{ \sum_i K_h(|x_{new} - x_i|) \cdot (y_i - \sum_{j=0}^k x_i^j \cdot \beta_j )^2 \Big\}$$`

--

__Yes, this means every single observation has its own set of coefficients__

--

Predicted value is then:

`$$\hat{y}_{new} = \sum_{j=0}^k x_{new}^j \cdot \hat{\beta}_j(x_{new})$$`

--

Smoother predictions than kernel regression but comes at __higher computational cost__

- __LOESS__ replaces the kernel with the k nearest neighbors
  - faster than local regression but discontinuities when neighbors change
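---

### Nadaraya-Watson by hand (sketch)

A minimal sketch of the weighted-average formula above using a Gaussian kernel; `nw_predict`, the toy data, and the bandwidth values are made up for illustration only. Base `R`'s `loess()` provides local polynomial regression out of the box.

```r
# toy Nadaraya-Watson estimator: weighted average of y with Gaussian kernel weights
nw_predict <- function(x_new, x, y, h = 0.1) {
  w <- dnorm((x_new - x) / h)   # weights shrink as |x_new - x_i| grows
  sum(w * y) / sum(w)
}

set.seed(100)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
nw_predict(0.5, x, y, h = 0.05)

# local polynomial regression with a nearest-neighbor span
loess_fit <- loess(y ~ x, span = 0.3, degree = 2)
predict(loess_fit, newdata = data.frame(x = 0.5))
```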
---

## Smoothing splines

Use a __smooth function__ `\(s(x)\)` to predict `\(y\)`, control smoothness directly by minimizing the __spline objective function__:

`$$\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int(s''(x))^2dx$$`

--

`$$= \text{fit data} + \text{impose smoothness}$$`

--

`$$\Rightarrow \text{model fit} = \text{likelihood} - \lambda \cdot \text{wiggliness}$$`

Estimate the __smoothing spline__ `\(\hat{s}(x)\)` that __balances the tradeoff between the model fit and wiggliness__

--

<img src="https://github.com/noamross/gams-in-r-course/blob/master/images/diffsmooth-1.png?raw=true" width="70%" style="display: block; margin: auto;" />

---

## Basis functions

Splines are _piecewise cubic polynomials_ with __knots__ (boundary points between the pieces) at every data point

--

Practical alternative: represent the spline as a linear combination of a set of __basis functions__

--

__Cubic polynomial example__: define four basis functions:

- `\(B_1(x) = 1\)`, `\(B_2(x) = x\)`, `\(B_3(x) = x^2\)`, `\(B_4(x) = x^3\)`

where the regression function `\(r(x)\)` is written as:

`$$r(x) = \sum_{j=1}^4 \beta_j B_j(x)$$`

--

- __linear in the transformed variables__ `\(B_1(x), \dots, B_4(x)\)` but __nonlinear in `\(x\)`__

--

We extend this idea to splines _piecewise_ using indicator functions, so the spline is a weighted sum (see the sketch after the next slide):

`$$s(x) = \sum_{j=1}^m \beta_j B_j(x)$$`

---

## Number of basis functions is another tuning parameter

<img src="https://github.com/noamross/gams-in-r-course/blob/master/images/diffbasis-1.png?raw=true" width="80%" style="display: block; margin: auto;" />
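---

### Basis functions in R (sketch)

A small sketch of the basis-function idea: the fit is an ordinary linear model in the transformed variables. The toy data and the choice of `df = 8` are made up for illustration; `splines::bs()` builds a B-spline (piecewise cubic) basis.

```r
library(splines)

set.seed(101)
x <- runif(100, -1, 1)
y <- sin(3 * x) + rnorm(100, sd = 0.2)

# global cubic basis: B_2(x) = x, B_3(x) = x^2, B_4(x) = x^3 (B_1 is the intercept column)
cubic_fit <- lm(y ~ x + I(x^2) + I(x^3))
coef(cubic_fit)

# piecewise version: more basis functions = more flexibility (a tuning choice)
spline_fit <- lm(y ~ bs(x, df = 8))
```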
---

## Generalized additive models (GAMs)

GAMs were created by [Trevor Hastie and Rob Tibshirani in 1986](https://projecteuclid.org/euclid.ss/1177013604) with an intuitive construction:

- relationships between individual explanatory variables and the response variable are smooth (either linear or nonlinear via basis functions)

- estimate the smooth relationships __simultaneously__ to predict the response by just adding them up

__Generalized__ like GLMs, where `\(g()\)` is the link function for the expected value of the response `\(E(Y)\)`, and __additive__ over the `\(p\)` variables:

`$$g(E(Y)) = \beta_0 + s_1(x_1) + s_2(x_2) + \dots + s_p(x_p)$$`

<img src="https://multithreaded.stitchfix.com/assets/images/blog/fig1.svg" width="70%" style="display: block; margin: auto;" />

--

- can be a convenient balance between flexibility and interpretability

- you can combine linear and nonlinear terms!

---

## Example: predicting MLB HR probability

Used the [`baseballr`](http://billpetti.github.io/baseballr/) package to scrape all batted balls from the 2022 season for the EDA project:

```r
library(tidyverse)
batted_ball_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/eda_projects/mlb_batted_balls_2022.csv") %>%
  mutate(is_hr = as.numeric(events == "home_run")) %>%
  filter(!is.na(launch_angle), !is.na(launch_speed), !is.na(is_hr))
head(batted_ball_data)
```

```
## # A tibble: 6 x 32
##   player_name   batter stand events    hc_x  hc_y hit_distance_sc launch_speed launch_angle
##   <chr>          <dbl> <chr> <chr>    <dbl> <dbl>           <dbl>        <dbl>        <dbl>
## 1 Daza, Yonath… 602074 R     force_o… 103.  150.               18         97.4           -7
## 2 Robles, Vict… 645302 R     single    58.6 120.              158         80.2           14
## 3 Hoerner, Nico 663538 R     field_o…  99.3 166.               20        101.            -9
## 4 Clemens, Kody 665019 L     field_o… 126.  191.              165         84             64
## 5 Rosario, Amed 642708 R     field_o…  97.4 170.                9         94.3          -14
## 6 Castro, Willi 650489 L     sac_fly  178.   58.9             369         96             29
## # … with 23 more variables: hit_location <dbl>, bb_type <chr>, barrel <dbl>,
## #   pitch_type <chr>, release_speed <dbl>, effective_speed <dbl>,
## #   if_fielding_alignment <chr>, of_fielding_alignment <chr>, game_date <date>,
## #   balls <dbl>, strikes <dbl>, outs_when_up <dbl>, on_1b <dbl>, on_2b <dbl>, on_3b <dbl>,
## #   inning <dbl>, inning_topbot <chr>, home_score <dbl>, away_score <dbl>,
## #   post_home_score <dbl>, post_away_score <dbl>, des <chr>, is_hr <dbl>
```

```r
table(batted_ball_data$is_hr)
```

```
## 
##    0    1 
## 6702  333
```

---

## Predict HRs with launch angle and exit velocity?

.pull-left[

```r
batted_ball_data %>%
  ggplot(aes(x = launch_speed, y = launch_angle,
             color = as.factor(is_hr))) +
  geom_point(alpha = 0.5) +
  ggthemes::scale_color_colorblind(labels = c("No", "Yes")) +
  labs(x = "Exit velocity", y = "Launch angle", color = "HR?") +
  theme_bw() +
  theme(legend.position = "bottom")
```

- HRs are relatively rare and confined to one area of this plot

]

.pull-right[

<img src="18-np-gams_files/figure-html/unnamed-chunk-10-1.png" width="504" />

]

---

## Fitting GAMs with [`mgcv`](https://cran.r-project.org/web/packages/mgcv/index.html)

First, set up the training data

```r
set.seed(2004)
batted_ball_data <- batted_ball_data %>%
  mutate(is_train = sample(rep(0:1, length.out = nrow(batted_ball_data))))
```

Next, fit the initial model using smooth functions via `s()`:

```r
library(mgcv)
*init_logit_gam <- gam(is_hr ~ s(launch_speed) + s(launch_angle),
                       data = filter(batted_ball_data, is_train == 1),
*                      family = binomial, method = "REML")
```

- Use [REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) instead of the default for a more stable solution

---

### GAM summary

```r
summary(init_logit_gam)
```

```
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## is_hr ~ s(launch_speed) + s(launch_angle)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -26.96      10.31  -2.614  0.00895 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df Chi.sq p-value    
## s(launch_speed) 1.000  1.000  151.5  <2e-16 ***
## s(launch_angle) 2.962  3.305  112.0  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.588   Deviance explained = 68.3%
## -REML = 231.49  Scale est. = 1         n = 3517
```

---

### Visualizing partial response functions with [gratia](https://gavinsimpson.github.io/gratia/index.html)

Displays the partial effect of each term in the model `\(\Rightarrow\)` they add up to the overall prediction

```r
library(gratia)
draw(init_logit_gam)
```

<img src="18-np-gams_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />

---

## Convert to probability scale with `plogis` function

```r
*draw(init_logit_gam, fun = plogis)
```

<img src="18-np-gams_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

--

- centered on an average value of 0.5 because it's the partial effect without the intercept

---

## Include intercept in plot...

```r
*draw(init_logit_gam, fun = plogis, constant = coef(init_logit_gam)[1])
```

<img src="18-np-gams_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />

Intercept reflects the relatively rare occurrence of HRs!
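A quick way to see both points on the console (a sketch using the fitted model from above; output not shown):

```r
plogis(0)                        # 0.5, i.e. the center of the partial-effect plot
plogis(coef(init_logit_gam)[1])  # baseline HR probability implied by the intercept alone: tiny
```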
---

## Model checking for number of basis functions

Use `gam.check()` to see if we need more basis functions based on an approximate test

```r
gam.check(init_logit_gam)
```

```
## 
## Method: REML   Optimizer: outer newton
## full convergence after 11 iterations.
## Gradient range [-5.632542e-05,-2.964163e-06]
## (score 231.4864 & scale 1).
## Hessian positive definite, eigenvalue range [5.631851e-05,0.8679399].
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                   k'  edf k-index p-value
## s(launch_speed) 9.00 1.00    1.05    1.00
## s(launch_angle) 9.00 2.96    0.97    0.08 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## Check the predictions?

```r
batted_ball_data <- batted_ball_data %>%
  mutate(init_gam_hr_prob = as.numeric(predict(init_logit_gam,
                                               newdata = batted_ball_data,
                                               type = "response")),
         init_gam_hr_class = as.numeric(init_gam_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == init_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.977
## 2        1   0.972
```

---

## What about the linear model?

```r
init_linear_logit <- glm(is_hr ~ launch_speed + launch_angle,
                         data = filter(batted_ball_data, is_train == 1),
                         family = binomial)
batted_ball_data <- batted_ball_data %>%
  mutate(init_glm_hr_prob = predict(init_linear_logit,
                                    newdata = batted_ball_data,
                                    type = "response"),
         init_glm_hr_class = as.numeric(init_glm_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == init_glm_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.960
## 2        1   0.951
```

__Very few situations in reality where linear regressions perform better than an additive model using smooth functions__

- especially since smooth functions can just capture linear models...

---

## Continuous interactions as a smooth surface

```r
multi_logit_gam <- gam(is_hr ~ s(launch_speed, launch_angle),
                       data = filter(batted_ball_data, is_train == 1),
                       family = binomial)
```

Plot the predicted heatmap:

```r
draw(multi_logit_gam)
```

<img src="18-np-gams_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" />

---

## Check the predictions?

```r
batted_ball_data <- batted_ball_data %>%
  mutate(multi_gam_hr_prob = as.numeric(predict(multi_logit_gam,
                                                newdata = batted_ball_data,
                                                type = "response")),
         multi_gam_hr_class = as.numeric(multi_gam_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == multi_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.976
## 2        1   0.972
```

- This has one smoothing parameter for the 2D smooth...
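Accuracy alone barely separates these fits, so one quick additional comparison on the training data is an information criterion; a sketch (output not shown), relying on the `AIC()` method that `mgcv` provides for `gam` objects:

```r
# compare the additive fit and the 2D smooth surface fit
AIC(init_logit_gam, multi_logit_gam)
```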
---

### Separate interactions from individual terms with tensor smooths

```r
tensor_logit_gam <- gam(is_hr ~ s(launch_speed) + s(launch_angle) +
*                         ti(launch_speed, launch_angle),
                        data = filter(batted_ball_data, is_train == 1),
                        family = binomial)
```

--

Check the predictions

```r
batted_ball_data %>%
  mutate(tensor_gam_hr_prob = as.numeric(predict(tensor_logit_gam,
                                                 newdata = batted_ball_data,
                                                 type = "response")),
         tensor_gam_hr_class = as.numeric(tensor_gam_hr_prob >= 0.5)) %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == tensor_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.977
## 2        1   0.972
```

--

__A more complicated model, yet it does not help!__

---

## Recap and useful resources

.pull-left[

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">140 char vrsn<br><br>1 GAMs are just GLMs<br>2 GAMs fit wiggly terms<br>3 use + s(foo) not foo in frmla<br>4 use method = "REML"<br>5 gam.check()</p>— Dr Gavin Simpson 😷🇪🇺🇩🇰 (@ucfagls) <a href="https://twitter.com/ucfagls/status/842444686513991680?ref_src=twsrc%5Etfw">March 16, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

]

.pull-right[

- [GAMs in R by Noam Ross](https://noamross.github.io/gams-in-r-course/)

- [`mgcv` course](https://eric-pedersen.github.io/mgcv-esa-workshop/)

- [Stitch Fix post: GAM: The Predictive Modeling Silver Bullet](https://multithreaded.stitchfix.com/blog/2015/07/30/gam/)

- Chapters 7 and 8 of [Advanced Data Analysis from an Elementary Point of View by Prof Cosma Shalizi](https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf)

  - I strongly recommend you download this book, and you will refer back to it for the rest of your life

]

---

class: inverse center middle

# BONUS CONTENT

---

## KNN in context

Here are two quotes from ISLR to keep in mind when thinking about KNN:

- "As a general rule, parametric methods [like linear regression] will tend to outperform non-parametric approaches [like KNN] when there is a small number of observations per predictor." This is the *curse of dimensionality*: for data-driven models, the amount of data you need to get similar model performance goes up exponentially with `\(p\)`.

--

`\(\Rightarrow\)` KNN might not be a good model to learn when the number of predictor variables is very large.

--

- "Even in problems in which the dimension is small, we might prefer linear regression to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression, we might be willing to forego a little bit of prediction accuracy for the sake of a simple model..."

--

`\(\Rightarrow\)` KNN is not the best model to learn if inference is the goal of an analysis.

---

## KNN: two critical points to remember

1. To determine which neighbors are the nearest neighbors, pairwise Euclidean distances are computed... so we may need to scale (or standardize) the individual predictor variables so that the distances are not skewed by the one predictor that has the largest variance.

--

2. Don't blindly compute a pairwise distance matrix! For instance, if `\(n\)` = 100,000, then your pairwise distance matrix will have `\(10^{10}\)` elements, each of which uses 8 bytes in memory... resulting in a memory usage of 80 GB! Your laptop cannot handle this. It can barely handle 1-2 GB at this point.
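The back-of-the-envelope arithmetic behind that 80 GB figure, as a quick sketch:

```r
n <- 1e5         # 100,000 observations
n^2 * 8 / 1e9    # n^2 pairwise distances x 8 bytes each = 80 GB
```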
If `\(n\)` is large, you have three options:

a. subsample your data, limiting `\(n\)` to be `\(\lesssim\)` 15,000-20,000;

b. use a variant of KNN that works with sparse matrices (matrices that can be compressed since most values are zero); or

c. make use of a "kd tree" to more effectively (but only approximately) identify nearest neighbors.

The [`FNN` package in `R`](https://daviddalpiaz.github.io/r4sl/knn-reg.html) has an option to search for neighbors via the use of a kd tree.

--

But instead we will use the [`caret`](http://topepo.github.io/caret/index.html) package...

---

## Example data: MLB 2021 batting statistics

Downloaded the MLB 2021 batting statistics leaderboard from [Fangraphs](https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2019&month=0&season1=2019&ind=0)

```r
library(tidyverse)
mlb_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/fg_batting_2021.csv")
head(mlb_data)
```

```
## # A tibble: 6 x 23
##   Name  Team      G    PA    HR     R   RBI    SB `BB%` `K%`    ISO BABIP   AVG   OBP   SLG
##   <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Vlad… TOR      82   354    27    66    69     2 14.4% 17.2% 0.336 0.346 0.336 0.438 0.671
## 2 Fern… SDP      68   288    27    66    58    18 12.5% 28.1% 0.395 0.333 0.302 0.385 0.698
## 3 Carl… HOU      79   347    16    61    52     0 13.5% 17.0% 0.231 0.324 0.298 0.398 0.529
## 4 Marc… TOR      82   372    21    63    54    10 8.9%  23.9% 0.256 0.329 0.286 0.349 0.542
## 5 Rona… ATL      78   342    23    67    51    16 13.2% 24.3% 0.313 0.306 0.278 0.386 0.592
## 6 Shoh… LAA      82   322    31    60    67    12 11.2% 28.0% 0.418 0.29  0.277 0.363 0.695
## # … with 8 more variables: wOBA <dbl>, xwOBA <dbl>, wRC+ <dbl>, BsR <dbl>, Off <dbl>,
## #   Def <dbl>, WAR <dbl>, playerid <dbl>
```

---

## Data cleaning

- [`janitor`](http://sfirke.github.io/janitor/) package has convenient functions for data cleaning like `clean_names()`

- `parse_number()` function provides an easy way to convert character columns to numeric

```r
library(janitor)
mlb_data_clean <- clean_names(mlb_data)
mlb_data_clean <- mlb_data_clean %>%
  mutate_at(vars(bb_percent:k_percent), parse_number)
head(mlb_data_clean)
```

```
## # A tibble: 6 x 23
##   name     team      g    pa    hr     r   rbi    sb bb_percent k_percent   iso babip   avg
##   <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl> <dbl> <dbl> <dbl>
## 1 Vladimi… TOR      82   354    27    66    69     2       14.4      17.2 0.336 0.346 0.336
## 2 Fernand… SDP      68   288    27    66    58    18       12.5      28.1 0.395 0.333 0.302
## 3 Carlos … HOU      79   347    16    61    52     0       13.5      17   0.231 0.324 0.298
## 4 Marcus … TOR      82   372    21    63    54    10        8.9      23.9 0.256 0.329 0.286
## 5 Ronald … ATL      78   342    23    67    51    16       13.2      24.3 0.313 0.306 0.278
## 6 Shohei … LAA      82   322    31    60    67    12       11.2      28   0.418 0.29  0.277
## # … with 10 more variables: obp <dbl>, slg <dbl>, w_oba <dbl>, xw_oba <dbl>, w_rc <dbl>,
## #   bs_r <dbl>, off <dbl>, def <dbl>, war <dbl>, playerid <dbl>
```

---

## KNN example

`caret` is a package of functions designed to simplify training, tuning, and testing statistical learning methods

- first create partitions for training and test data using `createDataPartition()`

```r
library(caret)
set.seed(1960)
train_i <- createDataPartition(y = mlb_data_clean$w_oba, p = 0.7, list = FALSE) %>%
  as.numeric()
train_mlb_data <- mlb_data_clean[train_i,]
test_mlb_data <- mlb_data_clean[-train_i,]
```

--

- next use [`train()`](http://topepo.github.io/caret/model-training-and-tuning.html) to find the optimal `k` on the training data with cross-validation

```r
set.seed(1971)
init_knn_mlb_train <- 
  train(w_oba ~ bb_percent + k_percent + iso, data = train_mlb_data,
        method = "knn", trControl = trainControl("cv", number = 10),
        preProcess = c("center", "scale"), tuneLength = 10)
```
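---

### Aside: direct KNN regression with `FNN` (sketch)

For larger datasets, the `FNN` package mentioned earlier can fit the same kind of KNN regression directly (with an optional kd-tree search). A sketch, with the interface (the `knn.reg()` arguments and its `$pred` component) assumed from the `FNN` documentation and `k = 10` chosen arbitrarily:

```r
library(FNN)
knn_vars <- c("bb_percent", "k_percent", "iso")

# scale the predictors using the training means / SDs (see critical point 1 above)
train_x <- scale(train_mlb_data[, knn_vars])
test_x <- scale(test_mlb_data[, knn_vars],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

fnn_preds <- knn.reg(train = train_x, test = test_x,
                     y = train_mlb_data$w_oba, k = 10)$pred
RMSE(fnn_preds, test_mlb_data$w_oba)
```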
---

## KNN example

```r
ggplot(init_knn_mlb_train) + theme_bw()
```

<img src="18-np-gams_files/figure-html/plot-knn-1.png" width="504" style="display: block; margin: auto;" />

---

## KNN example

Can manually create a __tuning grid__ to search over values of the tuning parameter `k`

```r
set.seed(1979)
tune_knn_mlb_train <- 
  train(w_oba ~ bb_percent + k_percent + iso, data = train_mlb_data,
        method = "knn", trControl = trainControl("cv", number = 10),
        preProcess = c("center", "scale"),
*       tuneGrid = expand.grid(k = 2:20))
tune_knn_mlb_train$results
```

```
##     k       RMSE  Rsquared        MAE      RMSESD RsquaredSD       MAESD
## 1   2 0.02664407 0.4918627 0.02190051 0.005738257  0.2369480 0.005007983
## 2   3 0.02396789 0.6196434 0.01953764 0.004799971  0.1591891 0.004490957
## 3   4 0.02390454 0.6207057 0.01964823 0.004899709  0.1827772 0.004354430
## 4   5 0.02336004 0.6244300 0.01938406 0.005161427  0.1949702 0.004402360
## 5   6 0.02303508 0.6335212 0.01917729 0.004972606  0.1561728 0.004147938
## 6   7 0.02366376 0.6262434 0.01946073 0.005400162  0.1666846 0.004507068
## 7   8 0.02424653 0.6101839 0.01990244 0.005227669  0.1568779 0.004293239
## 8   9 0.02421224 0.6337331 0.01979337 0.005645555  0.1611892 0.004766752
## 9  10 0.02415043 0.6448377 0.01986623 0.005677580  0.1629994 0.005095074
## 10 11 0.02464093 0.6455271 0.02023478 0.005640372  0.1534025 0.004948348
## 11 12 0.02487926 0.6445562 0.02042024 0.005492106  0.1663580 0.004889936
## 12 13 0.02504601 0.6374670 0.02045053 0.005894155  0.1922161 0.005138379
## 13 14 0.02501372 0.6494843 0.02040749 0.005674674  0.1756100 0.004930679
## 14 15 0.02486026 0.6691205 0.02037055 0.005720033  0.1546572 0.004830014
## 15 16 0.02508766 0.6651779 0.02044740 0.005831991  0.1872543 0.004866796
## 16 17 0.02533878 0.6609326 0.02061879 0.006109175  0.1821065 0.005149361
## 17 18 0.02545599 0.6582352 0.02055717 0.006196558  0.1896244 0.005195552
## 18 19 0.02542914 0.6665400 0.02057135 0.005954863  0.1888456 0.004914146
## 19 20 0.02600403 0.6528360 0.02103544 0.005907427  0.1993802 0.004836106
```

---

## KNN example

```r
ggplot(tune_knn_mlb_train) + theme_bw()
```

<img src="18-np-gams_files/figure-html/plot-knn-tune-1.png" width="504" style="display: block; margin: auto;" />

---

## KNN example

```r
tune_knn_mlb_train$bestTune
```

```
##   k
## 5 6
```

```r
test_preds <- predict(tune_knn_mlb_train, test_mlb_data)
head(test_preds)
```

```
## [1] 0.3861667 0.3736667 0.3738333 0.3771667 0.3381667 0.3716667
```

```r
RMSE(test_preds, test_mlb_data$w_oba)
```

```
## [1] 0.02631488
```

---

## Kernel regression with `np`

Use the `npregbw` function to tune the bandwidth using [__generalized cross-validation__](https://bookdown.org/egarpor/NP-UC3M/kre-i-bwd.html#kre-i-bwd-cv)

```r
library(np)
mlb_bw0 <- npregbw(w_oba ~ bb_percent + k_percent + iso,
                   data = train_mlb_data)
```

Generate predictions with `npreg` using the fitted bandwidth object

```r
*mlb_test_npreg <- npreg(mlb_bw0, newdata = test_mlb_data)
RMSE(mlb_test_npreg$mean, test_mlb_data$w_oba)
```

```
## [1] 0.02107194
```
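To see how much smoothing the kernel regression applies to each predictor, you can inspect the selected bandwidths; a sketch (output not shown), assuming the `summary()` method and `bw` component described in the `np` documentation:

```r
summary(mlb_bw0)   # one bandwidth per predictor; larger bandwidths mean heavier smoothing
mlb_bw0$bw
```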