class: center, middle, inverse, title-slide

# Supervised Learning
## Smoothing splines and GAMs
### Ron Yurko
### 07/09/2020

---

## Kernel regression

__Nadaraya-Watson kernel regression__

- given training data with explanatory variable `\(x\)` and continuous response `\(y\)`

- _bandwidth_ `\(h > 0\)`

- and a new point `\((x_{new}, y_{new})\)`:

`$$\hat{y}_{new} = \sum_{i=1}^n w_i(x_{new})\cdot y_i \,,$$`

where

`$$w_i(x_{new}) = \frac{K_h\left(|x_{new}-x_i|\right)}{\sum_{j=1}^n K_h\left(|x_{new}-x_j|\right)} \text{ with } K_h(x) = K\left(\frac{x}{h}\right)$$`

Example of a __linear smoother__

- a class of models where predictions are _weighted_ sums of the response variable

---

## Local regression

We can fit a linear model __at each point `\(x_{new}\)`__ with weights given by a kernel function centered on `\(x_{new}\)`

- we can additionally combine this with _polynomial regression_

--

Local regression of the `\(k^{th}\)` order with kernel function `\(K\)` solves the following:

`$$\hat{\beta}(x_{new}) = \underset{\beta}{\text{arg min}}\Big\{ \sum_i K_h(|x_{new} - x_i|) \cdot (y_i - \sum_{j=0}^k x_i^j \cdot \beta_j )^2 \Big\}$$`

--

__Yes, this means every single point `\(x_{new}\)` we predict at has its own set of coefficients__

--

The predicted value is then:

`$$\hat{y}_{new} = \sum_{j=0}^k x_{new}^j \cdot \hat{\beta}_j(x_{new})$$`

--

Predictions are smoother than kernel regression, but this comes at a __higher computational cost__

- __LOESS__ replaces the kernel with the k nearest neighbors

- faster than local regression, but the fit has discontinuities when the neighbor set changes
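---

## Sketch: Nadaraya-Watson weights in R

To make the weighted-sum view concrete, here is a minimal sketch of kernel regression at a single new point. The helper name `nw_predict()`, the Gaussian kernel via `dnorm()`, and the toy data are illustrative choices, not code from this lecture:

```r
# Nadaraya-Watson kernel regression at a single point x_new
nw_predict <- function(x_new, x, y, h) {
  # kernel weights K_h(|x_new - x_i|), here a Gaussian kernel
  w <- dnorm(abs(x_new - x) / h)
  # normalized weighted sum of the observed responses
  sum(w * y) / sum(w)
}

# toy data: noisy sine curve
set.seed(1)
x <- runif(100, 0, 2 * pi)
y <- sin(x) + rnorm(100, sd = 0.3)
nw_predict(x_new = pi / 2, x = x, y = y, h = 0.5)
```

Note that a larger bandwidth `\(h\)` spreads weight over more training points and gives smoother predictions.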
---

## Smoothing splines

Let `\(s(x)\)` be a __smooth function__ to predict `\(y\)`

- we control smoothness directly by minimizing the __spline objective function__:

`$$\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int (s''(x))^2 dx$$`

--

`$$= \text{fit data} + \text{impose smoothness}$$`

--

`$$\Rightarrow \text{model fit} = \text{likelihood} - \lambda \cdot \text{wiggliness}$$`

Estimate the __smoothing spline__ `\(\hat{s}(x)\)` that __balances the tradeoff between the model fit and wiggliness__

--

<img src="https://github.com/noamross/gams-in-r-course/blob/master/images/diffsmooth-1.png?raw=true" width="70%" style="display: block; margin: auto;" />

---

## Basis functions

Splines are _piecewise cubic polynomials_, with __knots__ (the points where the cubic pieces join) at every data point

--

Practical alternative: write the function as a linear combination of a set of __basis functions__

--

__Cubic polynomial example__: define four basis functions:

- `\(B_1(x) = 1\)`, `\(B_2(x) = x\)`, `\(B_3(x) = x^2\)`, `\(B_4(x) = x^3\)`

where the regression function `\(r(x)\)` is written as:

`$$r(x) = \sum_{j=1}^{4} \beta_j B_j(x)$$`

--

- __linear in the transformed variables__ `\(B_1(x), \dots, B_4(x)\)` but __nonlinear in `\(x\)`__

--

We extend this idea to splines by defining the basis functions _piecewise_ (using indicator functions for the regions between knots), so the spline is still a weighted sum:

`$$s(x) = \sum_{j=1}^{m} \beta_j B_j(x)$$`

---

## Number of basis functions is another tuning parameter

<img src="https://github.com/noamross/gams-in-r-course/blob/master/images/diffbasis-1.png?raw=true" width="80%" style="display: block; margin: auto;" />
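---

## Sketch: two ways to fit splines in R

A minimal sketch of both ideas above, on made-up data (the object names and the toy noisy sine curve are for illustration only): `splines::bs()` builds a cubic B-spline basis for use in `lm()`, while `smooth.spline()` fits the penalized smoothing spline and picks `\(\lambda\)` by (generalized) cross-validation.

```r
library(splines)

# toy data: noisy sine curve
set.seed(2)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.3)

# (1) basis-function view: linear regression on a cubic B-spline basis,
#     df controls how many basis functions are used
basis_fit <- lm(y ~ bs(x, df = 8))

# (2) penalty view: smoothing spline with lambda chosen automatically
penalty_fit <- smooth.spline(x, y)
penalty_fit$lambda  # the selected smoothing parameter

# force rougher / smoother fits by hand with the spar argument
rough_fit  <- smooth.spline(x, y, spar = 0.2)
smooth_fit <- smooth.spline(x, y, spar = 1)
```

Either route gives a fit that is linear in the basis coefficients but nonlinear in `\(x\)`, exactly as on the previous slides.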
---

## Generalized additive models (GAMs)

GAMs were introduced by [Trevor Hastie and Rob Tibshirani in 1986](https://projecteuclid.org/euclid.ss/1177013604) with an intuitive construction:

- relationships between individual explanatory variables and the response variable are smooth (either linear or nonlinear via basis functions)

- estimate the smooth relationships __simultaneously__ to predict the response by just adding them up

__Generalized__ like GLMs, where `\(g()\)` is the link function for the expected value of the response `\(E(Y)\)`, and __additive__ over the `\(p\)` variables:

`$$g(E(Y)) = \beta_0 + s_1(x_1) + s_2(x_2) + \dots + s_p(x_p)$$`

<img src="https://multithreaded.stitchfix.com/assets/images/blog/fig1.svg" width="70%" style="display: block; margin: auto;" />

--

- can be a convenient balance between flexibility and interpretability

- you can combine linear and nonlinear terms!

---

## Example: predicting MLB HR probability

Used the [`baseballr`](http://billpetti.github.io/baseballr/) package to scrape all batted balls from the 2019 season:

```r
library(tidyverse)
batted_ball_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/materials/data/mlb_2019_batted_balls.csv") %>%
  mutate(is_hr = as.numeric(events == "home_run"))
head(batted_ball_data)
```

```
## # A tibble: 6 x 4
##   events    launch_speed launch_angle is_hr
##   <chr>            <dbl>        <dbl> <dbl>
## 1 field_out           89           38     0
## 2 field_out           89           38     0
## 3 force_out           84          -13     0
## 4 field_out           89           38     0
## 5 single              90           15     0
## 6 single              90           15     0
```

```r
table(batted_ball_data$is_hr)
```

```
## 
##      0      1 
## 120661   6871
```

---

## Fitting GAMs with [`mgcv`](https://cran.r-project.org/web/packages/mgcv/index.html)

First, set up the training data:

```r
set.seed(2004)
batted_ball_data <- batted_ball_data %>%
  mutate(is_train = sample(rep(0:1, length.out = nrow(batted_ball_data))))
```

Next, fit the initial logistic GAM using smooth functions via `s()`:

```r
library(mgcv)
init_logit_gam <- gam(is_hr ~ s(launch_speed) + s(launch_angle),
                      data = filter(batted_ball_data, is_train == 1),
                      family = binomial)
```

---

## Visualizing partial response functions

```r
plot(init_logit_gam, pages = 1)
```

<img src="16-splines-gams_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />

---

## Convert to probability scale

```r
plot(init_logit_gam, pages = 1, trans = plogis)
```

<img src="16-splines-gams_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />

---

## Model checking for number of basis functions

Use `gam.check()` to see if we need more basis functions based on an approximate test (a sketch of refitting with a larger `k` is in the backup slides at the end):

```r
gam.check(init_logit_gam)
```

```
## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 10 iterations.
## Gradient range [-4.058613e-07,-5.76112e-08]
## (score -0.8744526 & scale 1).
## Hessian positive definite, eigenvalue range [5.742399e-08,0.001000449].
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                   k'  edf k-index p-value
## s(launch_speed) 9.00 1.00    1.00    0.65
## s(launch_angle) 9.00 4.43    1.01    0.81
```

---

## Check the predictions?

```r
batted_ball_data <- batted_ball_data %>%
  mutate(init_gam_hr_prob = predict(init_logit_gam, newdata = batted_ball_data,
                                    type = "response"),
         init_gam_hr_class = as.numeric(init_gam_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == init_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.971
## 2        1   0.973
```

---

## What about the linear model?

```r
init_linear_logit <- glm(is_hr ~ launch_speed + launch_angle,
                         data = filter(batted_ball_data, is_train == 1),
                         family = binomial)
batted_ball_data <- batted_ball_data %>%
  mutate(init_glm_hr_prob = predict(init_linear_logit, newdata = batted_ball_data,
                                    type = "response"),
         init_glm_hr_class = as.numeric(init_glm_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == init_glm_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.949
## 2        1   0.950
```

__There are very few situations in practice where a linear regression does better than an additive model using smooth functions__

---

## Continuous interactions as a smooth surface

```r
multi_logit_gam <- gam(is_hr ~ s(launch_speed, launch_angle),
                       data = filter(batted_ball_data, is_train == 1),
                       family = binomial)
```

Plot the predicted heatmap (a tensor product alternative via `te()` is sketched in the backup slides):

```r
plot(multi_logit_gam, scheme = 2)
```

<img src="16-splines-gams_files/figure-html/unnamed-chunk-9-1.png" width="504" style="display: block; margin: auto;" />

---

## Check the predictions?

```r
batted_ball_data <- batted_ball_data %>%
  mutate(multi_gam_hr_prob = predict(multi_logit_gam, newdata = batted_ball_data,
                                     type = "response"),
         multi_gam_hr_class = as.numeric(multi_gam_hr_prob >= 0.5))
batted_ball_data %>%
  group_by(is_train) %>%
  summarize(correct = mean(is_hr == multi_gam_hr_class))
```

```
## # A tibble: 2 x 2
##   is_train correct
##      <int>   <dbl>
## 1        0   0.971
## 2        1   0.973
```

__A more complicated model, yet it does not help!__

---

## Useful resources

- Free online course: [GAMs in R by Noam Ross](https://noamross.github.io/gams-in-r-course/)

- [Stitch Fix post: GAM: The Predictive Modeling Silver Bullet](https://multithreaded.stitchfix.com/blog/2015/07/30/gam/)

- [Gavin Simpson's `gratia` package](https://gavinsimpson.github.io/gratia/index.html) for `ggplot2`-based visualizations

- Chapters 7 and 8 of [Advanced Data Analysis from an Elementary Point of View by Prof Cosma Shalizi](https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf)

  - I strongly recommend you download this book; you will refer back to it for the rest of your life
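---

## Backup: increasing the basis dimension

This backup slide is a sketch added for illustration (the model name `bigger_k_gam` and the choice `k = 20` are arbitrary, not results we examined). If `gam.check()` had reported low p-values with `edf` close to `k'`, we could give each smooth more basis functions via the `k` argument of `s()` and re-run the check:

```r
library(mgcv)
# same structure as init_logit_gam, but with room for wigglier smooths
bigger_k_gam <- gam(is_hr ~ s(launch_speed, k = 20) + s(launch_angle, k = 20),
                    data = filter(batted_ball_data, is_train == 1),
                    family = binomial)
gam.check(bigger_k_gam)
```

The penalty `\(\lambda\)` still controls the effective wiggliness; `k` only sets an upper bound on it, so a generous `k` mostly costs extra computation.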
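---

## Backup: tensor product smooths for interactions

Another added sketch, not part of the original analysis: `s(launch_speed, launch_angle)` is an isotropic smooth, which implicitly treats its inputs as being on comparable scales. `mgcv` also offers tensor product smooths via `te()` (or `ti()` for the interaction alone), which allow different amounts of smoothness for variables measured in different units. Whether this would change the results here was not checked; the name `te_logit_gam` is just for illustration:

```r
library(mgcv)
# tensor product smooth: separate penalty for each margin
te_logit_gam <- gam(is_hr ~ te(launch_speed, launch_angle),
                    data = filter(batted_ball_data, is_train == 1),
                    family = binomial)
plot(te_logit_gam, scheme = 2)
```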