class: center, middle, inverse, title-slide

# Advanced topics

## More fun with classification

### July 25th, 2022

---

## NFL completion probability

Binary outcome model: `\(Y \in \{\text{Incomplete }(0), \text{Complete }(1) \}\)`

```r
library(tidyverse)
nfl_passing_plays <- 
  read_csv("http://www.stat.cmu.edu/cmsac/sure/2022/materials/data/sports/eda_projects/nfl_passing_plays_2021.csv") %>%
  # Only keep rows with passer and receiver information known:
  filter(!is.na(passer_player_id), !is.na(receiver_player_id),
         !is.na(epa), !is.na(air_yards), !is.na(pass_location)) %>%
  # Combine passer and receiver unique IDs:
  mutate(passer_name_id = paste0(passer_player_name, ":", passer_player_id),
         receiver_name_id = paste0(receiver_player_name, ":", receiver_player_id))
```

--

Create train and test folds based on games:

```r
set.seed(1985)
game_fold_table <- tibble(game_id = unique(nfl_passing_plays$game_id)) %>%
  mutate(game_fold = sample(rep(1:5, length.out = n()), n()))
nfl_passing_plays <- nfl_passing_plays %>%
  dplyr::left_join(game_fold_table, by = "game_id")
```

---

## Logistic regression review

Generate test predictions with a particular model:

```r
logit_cv_preds <- 
  map_dfr(unique(nfl_passing_plays$game_fold),
          function(test_fold) {
            # Separate test and training data:
            test_data <- nfl_passing_plays %>% filter(game_fold == test_fold)
            train_data <- nfl_passing_plays %>% filter(game_fold != test_fold)
            # Train model:
            logit_model <- glm(complete_pass ~ yardline_100 + shotgun + 
                                 air_yards + pass_location + qb_hit,
                               data = train_data, family = "binomial")
            # Return tibble of holdout results:
            tibble(test_pred_probs = predict(logit_model, newdata = test_data,
                                             type = "response"),
                   test_actual = test_data$complete_pass,
                   game_fold = test_fold)
          })
```

---

## Holdout performance by fold

```r
logit_cv_preds %>%
  mutate(test_pred = ifelse(test_pred_probs < .5, 0, 1)) %>%
  group_by(game_fold) %>%
  summarize(mcr = mean(test_pred != test_actual))
```

```
## # A tibble: 5 x 2
##   game_fold   mcr
##       <int> <dbl>
## 1         1 0.284
## 2         2 0.291
## 3         3 0.294
## 4         4 0.281
## 5         5 0.307
```

--

Let's think more carefully about what's going on here...

---

### Evaluating the prediction threshold

We can write our classification as a function of some cutoff `\(c\)`:

`$$\hat{Y} = \hat{C}(x)=\left\{\begin{array}{ll} 1 & \hat{p}(x)>c \\ 0 & \hat{p}(x) \leq c \end{array}\right.$$`

--

Given the classifications, we can form a confusion matrix:

<img src="https://bradleyboehmke.github.io/HOML/images/confusion-matrix.png" width="80%" style="display: block; margin: auto;" />

---

<img src="https://bradleyboehmke.github.io/HOML/images/confusion-matrix.png" width="80%" style="display: block; margin: auto;" />

We want to __maximize__ all of the following (positive means 1, negative means 0):

- __Accuracy__: How often is the classifier correct? `\(\frac{TP + TN}{total}\)`

--

- __Precision__: How often is it right for predicted positives? `\(\frac{TP}{TP + FP}\)`

--

- __Sensitivity__, aka true positive rate (TPR) or power: How often does it detect positives? `\(\frac{TP}{TP + FN}\)`

--

- __Specificity__, aka true negative rate (TNR), or 1 - false positive rate (FPR): How often does it detect negatives? `\(\frac{TN}{TN + FP}\)`

--

__So how do we handle this?__
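---

### Computing these metrics at a single cutoff

Before tackling that question, a minimal sketch (not in the original slides) of all four metrics for the logistic regression holdout predictions, at the default cutoff `\(c = 0.5\)`:

```r
# Confusion matrix for the holdout predictions at c = 0.5:
conf_matrix <- logit_cv_preds %>%
  mutate(test_pred = ifelse(test_pred_probs < .5, 0, 1)) %>%
  with(table(Predicted = test_pred, Actual = test_actual))
# Pull out the four cells (1 = positive / complete, 0 = negative / incomplete):
tp <- conf_matrix["1", "1"]; fp <- conf_matrix["1", "0"]
fn <- conf_matrix["0", "1"]; tn <- conf_matrix["0", "0"]
tibble(accuracy = (tp + tn) / sum(conf_matrix),
       precision = tp / (tp + fp),
       sensitivity = tp / (tp + fn),
       specificity = tn / (tn + fp))
```

Every one of these numbers changes as we move the cutoff `\(c\)`...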
---

### We want to balance high power with a low false positive rate

<img src="https://media1.tenor.com/images/f6ec84a0969cb05a08d8c0ae16b847af/tenor.gif?itemid=13078930" width="80%" style="display: block; margin: auto;" />

---

### Receiver Operating Characteristic (ROC) curve

Check all possible values for the cutoff `\(c\)`, and plot the power against the false positive rate

Want to maximize the __area under the curve (AUC)__

<img src="https://bradleyboehmke.github.io/HOML/02-modeling-process_files/figure-html/modeling-process-roc-1.png" width="60%" style="display: block; margin: auto;" />

---

### [`plotROC`](https://cran.r-project.org/web/packages/plotROC/vignettes/examples.html) and holdout AUC

.pull-left[

- `d` stands for disease status (the outcome)

- `m` stands for marker (the prediction)

```r
library(plotROC)
logit_cv_preds %>%
  ggplot() + 
* geom_roc(aes(d = test_actual, 
*              m = test_pred_probs),
           labelround = 4) + 
  style_roc() + 
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "gray") +
  labs(color = "Test fold")

with(logit_cv_preds, 
*    MLmetrics::AUC(test_pred_probs, test_actual))
```

]

.pull-right[

<img src="22-More-classification_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

```
## [1] 0.6957778
```

]

---

### [`plotROC`](https://cran.r-project.org/web/packages/plotROC/vignettes/examples.html) and holdout AUC by test fold

.pull-left[

```r
logit_cv_preds %>%
  ggplot() + 
  geom_roc(aes(d = test_actual, 
               m = test_pred_probs,
*              color = as.factor(game_fold)),
*          n.cuts = 0) + 
  style_roc() + 
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "gray") +
  ggthemes::scale_color_colorblind() +
  labs(color = "Test fold") + 
  theme(legend.position = "bottom")

logit_cv_preds %>%
  group_by(game_fold) %>%
  summarize(auc = MLmetrics::AUC(test_pred_probs, test_actual))
```

_There is definitely room for improvement..._

]

.pull-right[

<img src="22-More-classification_files/figure-html/unnamed-chunk-9-1.png" width="504" style="display: block; margin: auto;" />

```
## # A tibble: 5 x 2
##   game_fold   auc
##       <int> <dbl>
## 1         1 0.707
## 2         2 0.686
## 3         3 0.699
## 4         4 0.700
## 5         5 0.690
```

]

---

### Add in varying intercepts with `glmer`?

```r
library(lme4)
glmer_cv_preds <- 
  map_dfr(unique(nfl_passing_plays$game_fold),
          function(test_fold) {
            # Separate test and training data - scale variables:
            test_data <- nfl_passing_plays %>% filter(game_fold == test_fold) %>%
*             mutate(yardline_100 = scale(yardline_100), air_yards = scale(air_yards))
            train_data <- nfl_passing_plays %>% filter(game_fold != test_fold) %>%
*             mutate(yardline_100 = scale(yardline_100), air_yards = scale(air_yards))
            # Train model:
            glmer_model <- glmer(complete_pass ~ yardline_100 + shotgun + 
                                   air_yards + pass_location + qb_hit + 
*                                  (1|passer_name_id) + (1|receiver_name_id),
                                 data = train_data, family = "binomial")
            # Return tibble of holdout results:
            tibble(test_pred_probs = predict(glmer_model, newdata = test_data,
                                             type = "response", 
*                                            allow.new.levels = TRUE),
                   test_actual = test_data$complete_pass,
                   game_fold = test_fold)
          })
```
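---

### What do the varying intercepts look like?

A hypothetical follow-up (not in the original slides): refit the multilevel model on the full data, then use `ranef()` from `lme4` to see which passers and receivers shift completion probability the most.

```r
# Assumed follow-up, not the holdout pipeline from the previous slide:
# refit on all of the data, scaled as before
nfl_scaled <- nfl_passing_plays %>%
  mutate(yardline_100 = scale(yardline_100), air_yards = scale(air_yards))
glmer_full <- glmer(complete_pass ~ yardline_100 + shotgun + air_yards +
                      pass_location + qb_hit +
                      (1|passer_name_id) + (1|receiver_name_id),
                    data = nfl_scaled, family = "binomial")
# Estimated receiver intercepts, largest first:
receiver_effects <- ranef(glmer_full)$receiver_name_id
head(receiver_effects[order(-receiver_effects$`(Intercept)`), , drop = FALSE])
```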
---

### Add in varying intercepts with `glmer`?

.pull-left[

```r
glmer_cv_preds %>%
  ggplot() + 
  geom_roc(aes(d = test_actual, 
               m = test_pred_probs,
*              color = as.factor(game_fold)),
*          n.cuts = 0) + 
  style_roc() + 
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "gray") +
  ggthemes::scale_color_colorblind() +
  labs(color = "Test fold") + 
  theme(legend.position = "bottom")

glmer_cv_preds %>%
  group_by(game_fold) %>%
  summarize(auc = MLmetrics::AUC(test_pred_probs, test_actual))
```

_Looks like player-level effects do not help!_

]

.pull-right[

<img src="22-More-classification_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />

```
## # A tibble: 5 x 2
##   game_fold   auc
##       <int> <dbl>
## 1         1 0.708
## 2         2 0.687
## 3         3 0.696
## 4         4 0.700
## 5         5 0.693
```

]

---

## Tree-based approach?

We first need to convert the categorical variables into dummy indicators:

```r
model_data <- nfl_passing_plays %>%
  mutate(play_id = 1:n(),
         complete_pass = as.factor(complete_pass)) %>%
  dplyr::select(play_id, complete_pass, yardline_100, shotgun, air_yards,
                qb_hit, game_fold, pass_location) %>%
  mutate(pass_location_val = 1) %>%
  pivot_wider(id_cols = play_id:game_fold,
              names_from = pass_location, values_from = pass_location_val,
              values_fill = 0) %>%
  dplyr::select(-play_id)
```

---

## Random forests using probability forest

For each tree, compute the class proportions in the terminal node, then average them across all trees

```r
library(ranger)
rf_prob_cv_preds <- 
  map_dfr(unique(model_data$game_fold),
          function(test_fold) {
            # Separate test and training data:
            test_data <- model_data %>% filter(game_fold == test_fold)
            train_data <- model_data %>% filter(game_fold != test_fold)
            rf_prob_model <- ranger(complete_pass ~ ., 
                                    data = dplyr::select(train_data, -game_fold),
                                    probability = TRUE)
            # Return tibble of holdout results:
            tibble(test_pred_probs = 
                     as.numeric(predict(rf_prob_model, data = test_data,
*                                       type = "response")$predictions[,2]),
*                  test_actual = as.numeric(test_data$complete_pass) - 1,
                   game_fold = test_fold)
          })
```

---

### Random forests using probability forest

.pull-left[

```r
rf_prob_cv_preds %>%
  ggplot() + 
  geom_roc(aes(d = test_actual, 
               m = test_pred_probs,
*              color = as.factor(game_fold)),
*          n.cuts = 0) + 
  style_roc() + 
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "gray") +
  ggthemes::scale_color_colorblind() +
  labs(color = "Test fold") + 
  theme(legend.position = "bottom")

rf_prob_cv_preds %>%
  group_by(game_fold) %>%
  summarize(auc = MLmetrics::AUC(test_pred_probs, test_actual))
```

_Looks like just a modest improvement_

]

.pull-right[

<img src="22-More-classification_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" />

```
## # A tibble: 5 x 2
##   game_fold   auc
##       <int> <dbl>
## 1         1 0.717
## 2         2 0.705
## 3         3 0.707
## 4         4 0.708
## 5         5 0.704
```

]
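---

### Which variables is the forest using?

A quick sketch (not in the original slides): `ranger` tracks impurity-based variable importance if we ask for it at fit time, via a hypothetical refit on the full data.

```r
# Hypothetical refit on all of the data with importance tracking:
rf_imp_model <- ranger(complete_pass ~ ., 
                       data = dplyr::select(model_data, -game_fold),
                       probability = TRUE, importance = "impurity")
# Importance scores, largest first:
sort(rf_imp_model$variable.importance, decreasing = TRUE)
```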
---

## XGBoost!

```r
library(xgboost)
xgb_cv_preds <- 
  map_dfr(unique(model_data$game_fold),
          function(test_fold) {
            # Separate test and training data:
            test_data <- model_data %>% filter(game_fold == test_fold)
            test_data_x <- as.matrix(dplyr::select(test_data, -complete_pass, -game_fold))
            train_data <- model_data %>% filter(game_fold != test_fold)
            train_data_x <- as.matrix(dplyr::select(train_data, -complete_pass, -game_fold))
            train_data_y <- as.numeric(train_data$complete_pass) - 1
*           xgb_model <- xgboost(data = train_data_x, label = train_data_y,
*                                nrounds = 100, max_depth = 3, eta = 0.3,
*                                gamma = 0, colsample_bytree = 1, min_child_weight = 1,
*                                subsample = 1, nthread = 1,
*                                objective = 'binary:logistic', eval_metric = 'auc',
*                                verbose = 0)
            # Return tibble of holdout results:
            tibble(test_pred_probs = 
                     as.numeric(predict(xgb_model, newdata = test_data_x,
                                        type = "response")),
                   test_actual = as.numeric(test_data$complete_pass) - 1,
                   game_fold = test_fold)
          })
```

---

### XGBoost

.pull-left[

```r
xgb_cv_preds %>%
  ggplot() + 
  geom_roc(aes(d = test_actual, 
               m = test_pred_probs,
*              color = as.factor(game_fold)),
*          n.cuts = 0) + 
  style_roc() + 
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "gray") +
  ggthemes::scale_color_colorblind() +
  labs(color = "Test fold") + 
  theme(legend.position = "bottom")

xgb_cv_preds %>%
  group_by(game_fold) %>%
  summarize(auc = MLmetrics::AUC(test_pred_probs, test_actual))
```

_Should actually tune this more..._

]

.pull-right[

<img src="22-More-classification_files/figure-html/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" />

```
## # A tibble: 5 x 2
##   game_fold   auc
##       <int> <dbl>
## 1         1 0.712
## 2         2 0.708
## 3         3 0.711
## 4         4 0.708
## 5         5 0.697
```

]

---

## All together now...

.pull-left[

```r
bind_rows(
  mutate(logit_cv_preds, type = "logit"),
  mutate(glmer_cv_preds, type = "glmer"),
  mutate(rf_prob_cv_preds, type = "RF"),
  mutate(xgb_cv_preds, type = "XGBoost")) %>%
  ggplot() + 
  geom_roc(aes(d = test_actual, 
               m = test_pred_probs,
*              color = type),
*          n.cuts = 0) + 
  style_roc() + 
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed", color = "gray") +
  ggthemes::scale_color_colorblind() +
  labs(color = "Model") + 
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="22-More-classification_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" />

Pretty similar performance across all models...

]
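---

### Overall holdout AUC by model

A short sketch (not in the original slides) to put one number on that comparison, reusing the stacked holdout predictions from the previous slide:

```r
# Overall cross-validated AUC for each model type:
bind_rows(
  mutate(logit_cv_preds, type = "logit"),
  mutate(glmer_cv_preds, type = "glmer"),
  mutate(rf_prob_cv_preds, type = "RF"),
  mutate(xgb_cv_preds, type = "XGBoost")) %>%
  group_by(type) %>%
  summarize(auc = MLmetrics::AUC(test_pred_probs, test_actual))
```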
---

### Explaining predictions with [SHAP-values](https://christophm.github.io/interpretable-ml-book/shap.html)

SHAP-values are based on [Shapley values](https://christophm.github.io/interpretable-ml-book/shapley.html) (an idea from game theory) and are used to measure the contribution of each feature in the model to the prediction for an individual observation

The Shapley value `\(\phi_i^j\)` for feature `\(j\)` and observation `\(i\)` can be interpreted as follows:

- the value of feature `\(j\)` contributed `\(\phi_i^j\)` to the prediction of observation `\(i\)` compared to the average prediction for the dataset

- linear regression coefficients function in the same way

--

Can use them in multiple ways:

- View total importance: `\(\frac{1}{n}\sum_{i=1}^n |\phi_i^j|\)`

- View distribution of `\(\phi_i^j\)` for each feature

- Plot `\(\phi_i^j\)` against feature value for partial dependence (see the sketch after the next slide)

---

### [`SHAPforxgboost`](https://cran.r-project.org/web/packages/SHAPforxgboost/readme/README.html)

Fit the model on the full data, then [extract SHAP-values with `SHAPforxgboost`](https://www.kaggle.com/rikdifos/shap-make-algorithms-interpretable)

```r
train_data_x <- as.matrix(dplyr::select(model_data, -complete_pass, -game_fold))
train_data_y <- as.numeric(model_data$complete_pass) - 1
xgb_model <- xgboost(data = train_data_x, label = train_data_y,
                     nrounds = 100, max_depth = 3, eta = 0.3,
                     gamma = 0, colsample_bytree = 1, min_child_weight = 1,
                     subsample = 1, nthread = 1,
                     objective = 'binary:logistic', eval_metric = 'auc',
                     verbose = 0)

library(SHAPforxgboost)
shap_value_list <- shap.values(xgb_model, X_train = train_data_x)
shap.plot.summary.wrap1(xgb_model, X = train_data_x)
```

<img src="22-More-classification_files/figure-html/unnamed-chunk-18-1.png" width="504" style="display: block; margin: auto;" />
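---

### SHAP-values against feature values

A sketch (not in the original slides) of the partial-dependence view from the earlier list: `shap.prep()` returns the SHAP-values in long format, and `shap.plot.dependence()` plots them against the feature values.

```r
# Assumes xgb_model and train_data_x from the previous slide:
shap_long <- shap.prep(xgb_model = xgb_model, X_train = train_data_x)
# SHAP-values for air yards, colored by whether the QB was hit:
shap.plot.dependence(data_long = shap_long, x = "air_yards",
                     color_feature = "qb_hit")
```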
---

## Multinomial classification with XGBoost

Use the same NFL play-by-play dataset as before, but get it ready for XGBoost...

```r
nfl_ep_model_data <- read_rds(url("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/model_pbp_data.rds")) %>%
  mutate(Next_Score_Half = fct_relevel(Next_Score_Half, "No_Score", "Safety",
                                       "Field_Goal", "Touchdown", "Opp_Safety",
                                       "Opp_Field_Goal", "Opp_Touchdown"),
         next_score_label = as.numeric(Next_Score_Half) - 1)
model_variables <- c("half_seconds_remaining", "yardline_100", "down", "ydstogo")
```

XGBoost requires the multinomial categories to be numeric, starting at 0

---

### Leave-one-season-out cross-validation

```r
xgb_loso_cv_preds <- 
  map_dfr(unique(nfl_ep_model_data$season),
          function(x) {
            # Separate test and training data:
            test_data <- nfl_ep_model_data %>% filter(season == x)
            test_data_x <- as.matrix(dplyr::select(test_data, all_of(model_variables)))
            train_data <- nfl_ep_model_data %>% filter(season != x)
            train_data_x <- as.matrix(dplyr::select(train_data, all_of(model_variables)))
            train_data_y <- train_data$next_score_label
            xgb_model <- xgboost(data = train_data_x, label = train_data_y,
                                 nrounds = 100, max_depth = 3, eta = 0.3,
                                 gamma = 0, colsample_bytree = 1, min_child_weight = 1,
                                 subsample = 1, nthread = 1,
*                                objective = 'multi:softprob', num_class = 7,
                                 eval_metric = 'mlogloss', verbose = 0)
            xgb_preds <- matrix(predict(xgb_model, test_data_x), ncol = 7, byrow = TRUE) %>%
              as_tibble()
            colnames(xgb_preds) <- c("No_Score", "Safety", "Field_Goal", "Touchdown",
                                     "Opp_Safety", "Opp_Field_Goal", "Opp_Touchdown")
            xgb_preds %>%
              mutate(Next_Score_Half = test_data$Next_Score_Half,
                     season = x)
          })
```

---

### Calibration results for each scoring event

```r
ep_cv_loso_calibration_results <- xgb_loso_cv_preds %>%
  pivot_longer(No_Score:Opp_Touchdown,
               names_to = "next_score_type",
               values_to = "pred_prob") %>%
  mutate(bin_pred_prob = round(pred_prob / 0.05) * .05) %>%
  group_by(next_score_type, bin_pred_prob) %>%
  summarize(n_plays = n(),
            n_scoring_event = length(which(Next_Score_Half == next_score_type)),
            bin_actual_prob = n_scoring_event / n_plays,
*           bin_se = sqrt((bin_actual_prob * (1 - bin_actual_prob)) / n_plays)) %>%
  ungroup() %>%
* mutate(bin_upper = pmin(bin_actual_prob + 2 * bin_se, 1),
*        bin_lower = pmax(bin_actual_prob - 2 * bin_se, 0))
```

---

### Calibration results for each scoring event

```r
ep_cv_loso_calibration_results %>%
  mutate(next_score_type = fct_relevel(next_score_type, "Opp_Safety", "Opp_Field_Goal",
                                       "Opp_Touchdown", "No_Score", "Safety",
                                       "Field_Goal", "Touchdown"),
         next_score_type = fct_recode(next_score_type,
                                      "-Field Goal (-3)" = "Opp_Field_Goal",
                                      "-Safety (-2)" = "Opp_Safety",
                                      "-Touchdown (-7)" = "Opp_Touchdown",
                                      "Field Goal (3)" = "Field_Goal",
                                      "No Score (0)" = "No_Score",
                                      "Touchdown (7)" = "Touchdown",
                                      "Safety (2)" = "Safety")) %>%
  ggplot(aes(x = bin_pred_prob, y = bin_actual_prob)) +
  geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed") +
  geom_smooth(se = FALSE) +
  geom_point(aes(size = n_plays)) +
  geom_errorbar(aes(ymin = bin_lower, ymax = bin_upper)) + #coord_equal() + 
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1)) + 
  labs(size = "Number of plays",
       x = "Estimated next score probability", 
       y = "Observed next score probability") + 
  theme_bw() + 
  theme(strip.background = element_blank(), 
        axis.text.x = element_text(angle = 90),
        legend.position = c(1, .05),
        legend.justification = c(1, 0)) +
  facet_wrap(~ next_score_type, ncol = 4)
```

---

### Calibration results for each scoring event

<img src="22-More-classification_files/figure-html/unnamed-chunk-20-1.png" width="864" style="display: block; margin: auto;" />
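---

### From probabilities to expected points

A closing sketch (not in the original slides): with calibrated next-score probabilities, a natural next step is the weighted sum of the point values shown in the facet labels above, i.e. each play's expected points.

```r
# Point values match the labels on the previous slide (an assumption here):
xgb_loso_cv_preds %>%
  mutate(expected_points = 
           7 * Touchdown + 3 * Field_Goal + 2 * Safety -
           2 * Opp_Safety - 3 * Opp_Field_Goal - 7 * Opp_Touchdown) %>%
  dplyr::select(expected_points, Next_Score_Half, season) %>%
  head()
```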