class: center, middle, inverse, title-slide .title[ # Big Data Bowl Workshop ] .subtitle[ ## Sponsored by SumerSports ] .author[ ### Ron Yurko ] .author[ ### Assistant Teaching Professor ] .author[ ### Department of Statistics & Data Science, Carnegie Mellon University ] .date[ ### 10/28/2022 ] --- ## What is the Big Data Bowl? .pull-left[ - The premier sports analytics competition [hosted by the NFL on Kaggle](https://www.kaggle.com/c/nfl-big-data-bowl-2023) - Every year the NFL releases a sample a player-tracking data with competition theme, e.g., this year's theme is linemen on passing plays - Pipeline for careers in sports analytics! - Competition tracks for undergraduates, creating a metric, and creating a presentation designed for coaches - $100,000 in total prize money ] .pull-right[ <img src="https://operations.nfl.com/media/4196/logo_bdb.svg?mode=max&width=200" width="80%" /> ] --- ## What are the different [datasets](https://www.kaggle.com/competitions/nfl-big-data-bowl-2023/data) provided to you? ### Contextual Information 1. __`games.csv`__ : info about teams playing in game, IDs: `gameId` 2. __`plays.csv`__ : play-level context, e.g., down, field position, PFF play context, IDs: `gameId`, `playId` 3. __`players.csv`__ : player-level information, e.g., position, height, weight, IDs: `nflId` 4. __`pffScoutingData.csv`__ : PFF charted info about players in plays, e.g., player's role, outcome responsibility, IDs: `gameId`, `playId`, `nflId` -- ### Tracking data __week[X].csv__ : contains player-tracking data for week `X` for weeks 1-8 during 2021 NFL season, + IDs: `gameId`, `playId`, `nflId`, `frameId` + contains locations, speed, direction, orientation, etc. of every player and ball at 10Hz --- ## What does player-tracking data look like? ```r library(tidyverse) week1_raw_data <- read_csv("data/nfl-big-data-bowl-2023/week1.csv") head(week1_raw_data) ``` ``` ## # A tibble: 6 × 16 ## gameId playId nflId frameId time jerse…¹ team playD…² x ## <dbl> <dbl> <dbl> <dbl> <dttm> <dbl> <chr> <chr> <dbl> ## 1 2.02e9 97 25511 1 2021-09-10 00:26:31 12 TB right 37.8 ## 2 2.02e9 97 25511 2 2021-09-10 00:26:31 12 TB right 37.8 ## 3 2.02e9 97 25511 3 2021-09-10 00:26:31 12 TB right 37.8 ## 4 2.02e9 97 25511 4 2021-09-10 00:26:31 12 TB right 37.7 ## 5 2.02e9 97 25511 5 2021-09-10 00:26:31 12 TB right 37.7 ## 6 2.02e9 97 25511 6 2021-09-10 00:26:31 12 TB right 37.6 ## # … with 7 more variables: y <dbl>, s <dbl>, a <dbl>, dis <dbl>, o <dbl>, ## # dir <dbl>, event <chr>, and abbreviated variable names ¹jerseyNumber, ## # ²playDirection ``` --- <img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3258%2F820e86013d48faacf33b7a32a15e814c%2FIncreasing%20Dir%20and%20O.png?generation=1572285857588233&alt=media" width="100%" /> --- ## Standard pre-processing steps - Positional information is currently absolute, need to standardize with respect to the direction of the offense ```r week1_adj_data <- week1_raw_data %>% # Flip positional values mutate(x = ifelse(playDirection == "left", 120 - x, x), y = ifelse(playDirection == "left", 160 / 3 - y, y)) ``` -- Similarly need to flip player direction and orientation: ```r week1_adj_data <- week1_adj_data %>% mutate(dir = ifelse(playDirection == "left", dir + 180, dir), dir = ifelse(dir > 360, dir - 360, dir), o = ifelse(playDirection == "left", o + 180, o), o = ifelse(o > 360, o - 360, o)) ``` --- ## You need to visualize the data! Let's pick a random example play... ```r week1_plays <- read_csv("data/nfl-big-data-bowl-2023/plays.csv") %>% filter(gameId %in% unique(week1_adj_data$gameId)) set.seed(2010) example_play <- week1_plays %>% dplyr::select(gameId, playId, playDescription) %>% sample_n(1) example_play ``` ``` ## # A tibble: 1 × 3 ## gameId playId playDescription ## <dbl> <dbl> <chr> ## 1 2021091211 2327 (3:14) (Shotgun) J.Winston pass short middle to C.Hogan for… ``` -- Use `inner_join` to get the example play tracking data: ```r example_play_tracking_data <- week1_adj_data %>% inner_join(example_play, by = c("gameId", "playId")) ``` --- ## Figure out what visualization / animation works for you... [Vanilla Lopez](https://github.com/ryurko/Big-Data-Bowl) version starts with setting up a background... .pull-left[ ```r # General field boundaries xmin <- 0 xmax <- 160/3 hash_right <- 38.35 hash_left <- 12 hash_width <- 3.3 # Specific boundaries for a given play ymin <- max(round(min(example_play_tracking_data$x, na.rm = TRUE) - 10, -1), 0) ymax <- min(round(max(example_play_tracking_data$x, na.rm = TRUE) + 10, -1), 120) # Hash marks df_hash <- expand.grid(x = c(0, 23.36667, 29.96667, xmax), y = (10:110)) %>% filter(!(floor(y %% 5) == 0), y < ymax, y > ymin) ``` ] .pull-right[ ```r field_base <- ggplot() + annotate("text", x = df_hash$x[df_hash$x < 55/2], y = df_hash$y[df_hash$x < 55/2], label = "_", hjust = 0, vjust = -0.2) + annotate("text", x = df_hash$x[df_hash$x > 55/2], y = df_hash$y[df_hash$x > 55/2], label = "_", hjust = 1, vjust = -0.2) + annotate("segment", x = xmin, y = seq(max(10, ymin), min(ymax, 110), by = 5), xend = xmax, yend = seq(max(10, ymin), min(ymax, 110), by = 5)) + annotate("text", x = rep(hash_left, 11), y = seq(10, 110, by = 10), label = c("G ", seq(10, 50, by = 10), rev(seq(10, 40, by = 10)), " G"), angle = 270, size = 4) + annotate("text", x = rep((xmax - hash_left), 11), y = seq(10, 110, by = 10), label = c(" G", seq(10, 50, by = 10), rev(seq(10, 40, by = 10)), "G "), angle = 90, size = 4) + annotate("segment", x = c(xmin, xmin, xmax, xmax), y = c(ymin, ymax, ymax, ymin), xend = c(xmin, xmax, xmax, xmin), yend = c(ymax, ymax, ymin, ymin), color = "black") ``` ] --- ### Animate using [`gganimate`](https://gganimate.com/index.html) - useful for debugging! .pull-left[ ```r library(gganimate) play_animation <- field_base + geom_point(data = example_play_tracking_data, aes(x = (xmax - y), y = x, shape = team, fill = team, group = nflId, size = team, color = team), alpha = 0.7) + geom_text(data = example_play_tracking_data, aes(x = (xmax-y), y = x, label = jerseyNumber), color = "white", vjust = 0.36, size = 3.5) + scale_size_manual(values = c(4, 6, 6), guide = FALSE) + scale_shape_manual(values = c(16, 21, 21), guide = FALSE) + scale_fill_manual(values = c("brown", "#FFB612", "#101820"), guide = FALSE) + scale_color_manual(values = c("brown", "#FFB612", "#101820"), guide = FALSE) + ylim(ymin, ymax) + coord_fixed() + cowplot::theme_nothing() + theme(plot.title = element_text()) + transition_time(frameId) + ease_aes('linear') + NULL ex_play_length <- length(unique(example_play_tracking_data$frameId)) animate(play_animation, fps = 10, nframe = ex_play_length) ``` ] .pull-right[ <img src="ex_play_animation.gif" width="100%" style="display: block; margin: auto;" /> ] --- ## Example project idea: modeling sack probability __GOAL__: Explore modeling the probability of a sack in continuous-time #### What are my steps? -- 1. Identify / create the response variable of interest -- 2. Filter data to frames of interest, i.e., focus on moments of time between snap to QB decision / sack -- 3. __Feature engineering!__ What summaries can I compute with the tracking that can be useful for modeling my outcome variable of interest? -- 4. Build and evaluate models __out-of-sample__! Want to understand contributions of constructed features and different levels of model flexibility -- 5. Create player/team metrics based on models -- 6. Create presentation of results: what's the story, key takeaway messages, limitations --- ## Creating the response variable Need to identify for each play whether or not it ended in a sack -- Can use the `passResult` columns from the `plays.csv` dataset: ```r table(week1_plays$passResult) ``` ``` ## ## C I IN R S ## 654 379 18 54 70 ``` -- Make a table with the IDs plus binary indicator for sacks: ```r sack_outcome_table <- week1_plays %>% dplyr::select(gameId, playId, passResult) %>% mutate(is_sack = as.numeric(passResult == "S")) sack_outcome_table %>% slice(1:2) ``` ``` ## # A tibble: 2 × 4 ## gameId playId passResult is_sack ## <dbl> <dbl> <chr> <dbl> ## 1 2021090900 97 I 0 ## 2 2021090900 137 C 0 ``` --- ## How do we filter tracking data to frames of interest? Decide what frames to use based on `event` column: ```r table(week1_adj_data$event) ``` ``` ## ## autoevent_ballsnap autoevent_passforward autoevent_passinterrupted ## 13455 12627 690 ## ball_snap first_contact fumble ## 26956 230 46 ## fumble_offense_recovered handoff huddle_break_offense ## 23 23 23 ## lateral line_set man_in_motion ## 23 552 598 ## None pass_arrived pass_forward ## 1028537 1150 24127 ## pass_outcome_caught pass_outcome_incomplete pass_tipped ## 69 138 299 ## play_action qb_sack qb_strip_sack ## 5382 1334 184 ## run shift ## 1219 437 ``` --- ## Start with passer only data... Need to identify the passer on each play - can use `pffScoutingData` for this! ```r pff_data <- read_csv("data/nfl-big-data-bowl-2023/pffScoutingData.csv") play_passers <- pff_data %>% filter(pff_role == "Pass") %>% dplyr::select(gameId, playId, nflId) %>% mutate(is_passer = 1) ``` -- Join to tracking data and filter to only passers: ```r passer_tracking_data <- week1_adj_data %>% left_join(play_passers, by = c("gameId", "playId", "nflId")) %>% filter(!is.na(is_passer)) passer_tracking_data %>% slice(1:3) ``` ``` ## # A tibble: 3 × 17 ## gameId playId nflId frameId time jerse…¹ team playD…² x ## <dbl> <dbl> <dbl> <dbl> <dttm> <dbl> <chr> <chr> <dbl> ## 1 2.02e9 97 25511 1 2021-09-10 00:26:31 12 TB right 37.8 ## 2 2.02e9 97 25511 2 2021-09-10 00:26:31 12 TB right 37.8 ## 3 2.02e9 97 25511 3 2021-09-10 00:26:31 12 TB right 37.8 ## # … with 8 more variables: y <dbl>, s <dbl>, a <dbl>, dis <dbl>, o <dbl>, ## # dir <dbl>, event <chr>, is_passer <dbl>, and abbreviated variable names ## # ¹jerseyNumber, ²playDirection ``` --- ## Identify starting and ending frames of interest First mark what could be the start and end based on `event`: ```r passer_tracking_data <- passer_tracking_data %>% mutate(is_start = as.numeric(event %in% c("autoevent_ballsnap", "ball_snap")), # Now all of the options for the end of the ball carrier sequence: is_end = as.numeric(event %in% c("fumble", "handoff", "lateral", "autoevent_passforward", "pass_forward", "qb_sack", "qb_strip_sack", "run"))) ``` -- Create table with the first observed start and end frames within each play: ```r frame_seq_info <- passer_tracking_data %>% group_by(gameId, playId) %>% mutate(any_start = any(is_start == 1), any_end = any(is_end == 1)) %>% filter(any_start, any_end) %>% summarize(start_frame = frameId[which(is_start == 1)[1]], end_frame = frameId[which(is_end == 1 & frameId > start_frame)[1]], .groups = "drop") frame_seq_info %>% slice(1) ``` ``` ## # A tibble: 1 × 4 ## gameId playId start_frame end_frame ## <dbl> <dbl> <dbl> <dbl> ## 1 2021090900 97 6 38 ``` --- ## Join this information and filter tracking data ```r passer_tracking_seq_data <- passer_tracking_data %>% left_join(frame_seq_info, by = c("gameId", "playId")) %>% filter(!is.na(start_frame), !is.na(end_frame), frameId >= start_frame, frameId <= end_frame) ``` -- Create info about the passer for each frame: ```r passer_info <- passer_tracking_seq_data %>% dplyr::select(gameId, playId, frameId, nflId, x, y) %>% rename(passer_nflId = nflId, passer_x = x, passer_y = y) passer_info %>% slice(1:2) ``` ``` ## # A tibble: 2 × 6 ## gameId playId frameId passer_nflId passer_x passer_y ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2021090900 97 6 25511 37.6 24.3 ## 2 2021090900 97 7 25511 37.6 24.3 ``` --- ## Create features about pass blockers and rushers I have information about the passer - __what about their protection and opposition__? -- Similar to before: identify using `pffScoutingData`: ```r play_block_rush <- pff_data %>% filter(pff_role %in% c("Pass Block", "Pass Rush")) %>% dplyr::select(gameId, playId, nflId, pff_role) %>% mutate(pff_role = str_remove(pff_role, "Pass ")) ``` -- Join to tracking data along with passer info: ```r block_rush_tracking_data <- week1_adj_data %>% left_join(play_block_rush, by = c("gameId", "playId", "nflId")) %>% filter(!is.na(pff_role)) %>% left_join(passer_info, by = c("gameId", "playId", "frameId")) %>% filter(!is.na(passer_x)) ``` My `passer_info` table already captures my frames of interest, so I can just filter out missing values from this join --- ## Create features about closest pass blocker and rusher First need to compute the distance between each player with the passer: ```r long_block_rush_info <- block_rush_tracking_data %>% mutate(dist_to_passer = sqrt((x - passer_x)^2 + (y - passer_y)^2)) %>% group_by(gameId, playId, frameId, pff_role) %>% arrange(dist_to_passer) %>% mutate(player_dist_rank = 1:n()) %>% ungroup() ``` -- For today, keep things simple and get the closest blocker and rusher: ```r long_closest_block_rush_info <- long_block_rush_info %>% filter(player_dist_rank == 1) %>% dplyr::select(gameId, playId, frameId, pff_role, dist_to_passer, x, y, s) long_closest_block_rush_info %>% slice(1:3) ``` ``` ## # A tibble: 3 × 8 ## gameId playId frameId pff_role dist_to_passer x y s ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 2021091203 765 53 Rush 0 11.4 25.2 1.4 ## 2 2021091201 1446 45 Rush 0.0400 53.4 18.8 2.95 ## 3 2021091207 3828 41 Rush 0.0400 69.2 26.3 2.36 ``` --- <img src="https://media2.giphy.com/media/oCjCwnuLpiWbfMb1UA/giphy.gif" width="120%" style="display: block; margin: auto;" /> --- ## Create a wide model dataset Use `pivot_wider` to convert dataset from long to wide (separate columns for nearest blocker and rusher): ```r wide_closest_block_rush_info <- long_closest_block_rush_info %>% pivot_wider(names_from = pff_role, values_from = dist_to_passer:s, names_glue = "{pff_role}_{.value}") wide_closest_block_rush_info %>% slice(1:2) ``` ``` ## # A tibble: 2 × 11 ## gameId playId frameId Rush_…¹ Block…² Rush_x Block_x Rush_y Block_y Rush_s ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2021091203 765 53 0 0.440 11.4 11.0 25.2 25.5 1.4 ## 2 2021091201 1446 45 0.0400 1.96 53.4 55.3 18.8 18.6 2.95 ## # … with 1 more variable: Block_s <dbl>, and abbreviated variable names ## # ¹Rush_dist_to_passer, ²Block_dist_to_passer ``` -- Join everything together to make model dataset! ```r sack_model_dataset <- passer_info %>% inner_join(wide_closest_block_rush_info, by = c("gameId", "playId", "frameId")) %>% left_join(sack_outcome_table, by = c("gameId", "playId")) ``` --- ## MACHINE LEARNING TIME??? -- <img src="https://media.tenor.com/X5oXG4jhV4YAAAAC/not-yet-mace.gif" width="120%" style="display: block; margin: auto;" /> --- ## Always perform basic EDA before modeling! .pull-left[ View nearest rush distance as a function of time: ```r sack_model_dataset <- sack_model_dataset %>% group_by(gameId, playId) %>% mutate(adj_frame = frameId - min(frameId)) %>% ungroup() ``` New variable which could be predictive itself of sack probability... Beneficial to view features as function of time within play + e.g., plot the average distance of the nearest pass rusher (+/- standard error) ] -- .pull-right[ ```r sack_model_dataset %>% group_by(adj_frame, is_sack) %>% summarize(ave_rush_dist = mean(Rush_dist_to_passer), n_plays = n(), se = sd(Rush_dist_to_passer) / sqrt(n_plays), .groups = "drop") %>% filter(n_plays > 1) %>% mutate(ave_upper = ave_rush_dist + se, ave_lower = pmax(0, ave_rush_dist - se)) %>% ggplot(aes(x = adj_frame, y = ave_rush_dist, color = as.factor(is_sack))) + geom_point(alpha = 0.5) + geom_errorbar(aes(ymin = ave_lower, ymax = ave_upper), alpha = 0.5) + ggthemes::scale_color_colorblind() + labs(x = "Frame within play", y = "Ave distance from passer for nearest pass rusher", color = "Sack indicator") + theme_bw() + theme(legend.position = "bottom") ``` ] --- <img src="slides_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Onto model building... Binary outcome model `\(Y \in\)` {Not sacked (0), Sacked (1)} Want to evaluate models out of sample, so I will assign games to different test folds: ```r set.seed(1985) game_fold_table <- tibble(gameId = unique(sack_model_dataset$gameId)) %>% mutate(game_fold = sample(rep(1:4, length.out = n()), n())) game_fold_table %>% slice(1:3) ``` ``` ## # A tibble: 3 × 2 ## gameId game_fold ## <dbl> <int> ## 1 2021090900 2 ## 2 2021091200 2 ## 3 2021091201 3 ``` -- Join back to modeling dataset: ```r sack_model_dataset <- sack_model_dataset %>% dplyr::left_join(game_fold_table, by = "gameId") ``` --- ## Generate baseline predictions without features ```r baseline_cv_preds <- map_dfr(unique(sack_model_dataset$game_fold), function(test_fold) { # Separate test and training data: test_data <- sack_model_dataset %>% filter(game_fold == test_fold) train_data <- sack_model_dataset %>% filter(game_fold != test_fold) # Train model: logit_model <- glm(is_sack ~ 1, data = train_data, family = "binomial") # Return tibble of holdout results: tibble(test_pred_probs = predict(logit_model, newdata = test_data, type = "response"), test_actual = test_data$is_sack, adj_frame = test_data$adj_frame, game_fold = test_fold) }) %>% mutate(model_type = "baseline") ``` --- ## Simple [GAM](https://noamross.github.io/gams-in-r-course/) based on smooth function of nearest pass rusher distance ```r library(mgcv) gam_cv_preds <- map_dfr(unique(sack_model_dataset$game_fold), function(test_fold) { # Separate test and training data: test_data <- sack_model_dataset %>% filter(game_fold == test_fold) train_data <- sack_model_dataset %>% filter(game_fold != test_fold) # Train model: logit_gam <- gam(is_sack ~ s(Rush_dist_to_passer), data = train_data, family = "binomial") # Return tibble of holdout results: tibble(test_pred_probs = predict(logit_gam, newdata = test_data, type = "response"), test_actual = test_data$is_sack, adj_frame = test_data$adj_frame, game_fold = test_fold) }) %>% mutate(model_type = "GAM: nearest rusher distance") ``` --- ## Random forests using probability forest in [`ranger`](https://github.com/imbs-hl/ranger) ```r library(ranger) rf_prob_cv_preds <- map_dfr(unique(sack_model_dataset$game_fold), function(test_fold) { test_data <- sack_model_dataset %>% filter(game_fold == test_fold) train_data <- sack_model_dataset %>% filter(game_fold != test_fold) rf_prob_model <- ranger(is_sack ~ ., data = dplyr::select(train_data, -gameId, -playId, -frameId, -passResult, -passer_nflId, -game_fold), probability = TRUE) # Return tibble of holdout results: tibble(test_pred_probs = as.numeric(predict(rf_prob_model, data = test_data, type = "response")$predictions[,2]), test_actual = test_data$is_sack, adj_frame = test_data$adj_frame, game_fold = test_fold) }) %>% mutate(model_type = "Random Forests: kitchen sink") ``` --- ## Evaluate predictions across time into play .pull-left[ ```r bind_rows(baseline_cv_preds, gam_cv_preds, rf_prob_cv_preds) %>% group_by(model_type, adj_frame) %>% summarize(brier_score = mean((test_actual - test_pred_probs)^2), n_plays = n(), .groups = "drop") %>% filter(n_plays > 1) %>% ggplot(aes(x = adj_frame, y = brier_score, alpha = n_plays)) + geom_bar(stat = "identity", fill = "darkred") + facet_wrap(~model_type, ncol = 1) + labs(x = "Frame within play", y = "Holdout Brier Score") + theme_bw() + theme(legend.position = "bottom") ``` ] -- .pull-right[ <img src="slides_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ### How can I create a player / team metric based on this work? -- ### What would a coach be interested to see from this? -- #### Think about finding example plays / players to focus on and emphasize your takeaway messages -- _Limitations?_ + I only used one week of data! Helps to process and build modeling datasets one week at a time, then concatenate together (e.g., using [`purrr`](https://purrr.tidyverse.org/) functions) + So many options for the types of features to create - I went with very simple distance based features... + What about my response variable? + What about time??? --- ## Useful resources to check out! .pull-left[ - [`ngscleanR`](https://github.com/guga31bb/ngscleanR) package: basically performs the initial clean and pre-processing steps considered here - Various tutorials by [Tom Bliss](https://twitter.com/DataWithBliss) and [Mike Lopez](https://twitter.com/StatsbyLopez) on Kaggle: + [2021 Tutorial](https://www.kaggle.com/code/tombliss/tutorial) + [NFL tracking: wrangling, Voronoi, and sonars](https://www.kaggle.com/code/statsbymichaellopez/nfl-tracking-wrangling-voronoi-and-sonars) + [Analyzing Place Kicker Offset from Center - R](https://www.kaggle.com/code/tombliss/analyzing-place-kicker-offset-from-center-r) + [Pass Rushing Edge Get Off Speed](https://www.kaggle.com/code/tombliss/pass-rushing-edge-get-off-speed) - [My example project from 2021 with ghosting](https://github.com/ryurko/bdb-pass-defense) - [Deep Learning? Pass coverage example by Ben Baldwin](https://www.opensourcefootball.com/posts/2021-05-31-computer-vision-in-r-using-torch/) - And look up winning posts! ] -- .pull-right[ <img src="https://media4.giphy.com/media/a3phMYtwN7syY/200w.gif" width="80%" style="display: block; margin: auto;" /> ]