Big Data Bowl Workshop

class: center, middle, inverse, title-slide

.title[
# Big Data Bowl Workshop
]
.subtitle[
## Sponsored by SumerSports
]
.author[
### Ron Yurko
]
.author[
### Assistant Teaching Professor
]
.author[
### Department of Statistics & Data Science, Carnegie Mellon University
]
.date[
### 10/28/2022
]

---

## What is the Big Data Bowl?

.pull-left[

- The premier sports analytics competition [hosted by the NFL on Kaggle](https://www.kaggle.com/c/nfl-big-data-bowl-2023)

- Every year the NFL releases a sample a player-tracking data with competition theme, e.g., this year's theme is linemen on passing plays

- Pipeline for careers in sports analytics!

- Competition tracks for undergraduates, creating a metric, and creating a presentation designed for coaches

- $100,000 in total prize money

]
.pull-right[

]

---

## What are the different [datasets](https://www.kaggle.com/competitions/nfl-big-data-bowl-2023/data) provided to you?

### Contextual Information

1. __`games.csv`__ : info about teams playing in game, IDs: `gameId`

2. __`plays.csv`__ : play-level context, e.g., down, field position, PFF play context, IDs: `gameId`, `playId`

3. __`players.csv`__ : player-level information, e.g., position, height, weight, IDs: `nflId`

4. __`pffScoutingData.csv`__ : PFF charted info about players in plays, e.g., player's role, outcome responsibility, IDs: `gameId`, `playId`, `nflId`

### Tracking data

__week[X].csv__ : contains player-tracking data for week `X` for weeks 1-8 during 2021 NFL season,

+ IDs: `gameId`, `playId`, `nflId`, `frameId`

+ contains locations, speed, direction, orientation, etc. of every player and ball at 10Hz

---

## What does player-tracking data look like?

```r
library(tidyverse)
week1_raw_data <- read_csv("data/nfl-big-data-bowl-2023/week1.csv")
head(week1_raw_data)
```

```
## # A tibble: 6 × 16
##      gameId playId nflId frameId time                jerse…¹ team  playD…²     x
##       <dbl>  <dbl> <dbl>   <dbl> <dttm>                <dbl> <chr> <chr>   <dbl>
## 1    2.02e9     97 25511       1 2021-09-10 00:26:31      12 TB    right    37.8
## 2    2.02e9     97 25511       2 2021-09-10 00:26:31      12 TB    right    37.8
## 3    2.02e9     97 25511       3 2021-09-10 00:26:31      12 TB    right    37.8
## 4    2.02e9     97 25511       4 2021-09-10 00:26:31      12 TB    right    37.7
## 5    2.02e9     97 25511       5 2021-09-10 00:26:31      12 TB    right    37.7
## 6    2.02e9     97 25511       6 2021-09-10 00:26:31      12 TB    right    37.6
## # … with 7 more variables: y <dbl>, s <dbl>, a <dbl>, dis <dbl>, o <dbl>,
## #   dir <dbl>, event <chr>, and abbreviated variable names ¹jerseyNumber,
## #   ²playDirection
```

---

---

## Standard pre-processing steps

- Positional information is currently absolute, need to standardize with respect to the direction of the offense

```r
week1_adj_data <- week1_raw_data %>%
  # Flip positional values
  mutate(x = ifelse(playDirection == "left", 120 - x, x),
         y = ifelse(playDirection == "left", 160 / 3 - y, y))
```

Similarly need to flip player direction and orientation:

```r
week1_adj_data <- week1_adj_data %>%
  mutate(dir = ifelse(playDirection == "left", dir + 180, dir),
         dir = ifelse(dir > 360, dir - 360, dir),
         o = ifelse(playDirection == "left", o + 180, o),
         o = ifelse(o > 360, o - 360, o))
```

---

## You need to visualize the data!

Let's pick a random example play...

```r
week1_plays <- read_csv("data/nfl-big-data-bowl-2023/plays.csv") %>%
  filter(gameId %in% unique(week1_adj_data$gameId))

set.seed(2010)
example_play <- week1_plays %>%
  dplyr::select(gameId, playId, playDescription) %>% 
  sample_n(1)
example_play
```

```
## # A tibble: 1 × 3
##       gameId playId playDescription                                             
##        <dbl>  <dbl> <chr>                                                       
## 1 2021091211   2327 (3:14) (Shotgun) J.Winston pass short middle to C.Hogan for…
```

Use `inner_join` to get the example play tracking data:

```r
example_play_tracking_data <- week1_adj_data %>%
  inner_join(example_play, by = c("gameId", "playId"))
```

---

## Figure out what visualization / animation works for you...

[Vanilla Lopez](https://github.com/ryurko/Big-Data-Bowl) version starts with setting up a background...

.pull-left[

```r
# General field boundaries
xmin <- 0
xmax <- 160/3
hash_right <- 38.35
hash_left <- 12
hash_width <- 3.3

# Specific boundaries for a given play
ymin <- max(round(min(example_play_tracking_data$x, 
                      na.rm = TRUE) - 10, -1), 0)
ymax <- min(round(max(example_play_tracking_data$x, 
                      na.rm = TRUE) + 10, -1), 120)

# Hash marks
df_hash <- 
  expand.grid(x = c(0, 23.36667, 29.96667, xmax), 
                       y = (10:110)) %>% 
  filter(!(floor(y %% 5) == 0), y < ymax, y > ymin)
```

]

.pull-right[

```r
field_base <- ggplot() +
  annotate("text", x = df_hash$x[df_hash$x < 55/2],
           y = df_hash$y[df_hash$x < 55/2], label = "_", hjust = 0, vjust = -0.2) +
  annotate("text", x = df_hash$x[df_hash$x > 55/2],
           y = df_hash$y[df_hash$x > 55/2], label = "_", hjust = 1, vjust = -0.2) +
  annotate("segment", x = xmin, y = seq(max(10, ymin), min(ymax, 110), by = 5),
           xend =  xmax, yend = seq(max(10, ymin), min(ymax, 110), by = 5)) +
  annotate("text", x = rep(hash_left, 11), y = seq(10, 110, by = 10),
           label = c("G   ", seq(10, 50, by = 10), rev(seq(10, 40, by = 10)), "   G"),
           angle = 270, size = 4) +
  annotate("text", x = rep((xmax - hash_left), 11), y = seq(10, 110, by = 10),
           label = c("   G", seq(10, 50, by = 10), rev(seq(10, 40, by = 10)), "G   "),
           angle = 90, size = 4) +
  annotate("segment", x = c(xmin, xmin, xmax, xmax), y = c(ymin, ymax, ymax, ymin),
           xend = c(xmin, xmax, xmax, xmin), yend = c(ymax, ymax, ymin, ymin), color = "black")
```

]

---

### Animate using [`gganimate`](https://gganimate.com/index.html) - useful for debugging!

.pull-left[

```r
library(gganimate)
play_animation <- field_base +
  geom_point(data = example_play_tracking_data, 
             aes(x = (xmax - y), y = x,
                 shape = team, fill = team,
                 group = nflId, size = team, color = team),
             alpha = 0.7) +
  geom_text(data = example_play_tracking_data, 
            aes(x = (xmax-y), y = x, label = jerseyNumber), 
            color = "white", vjust = 0.36, size = 3.5) +
  scale_size_manual(values = c(4, 6, 6), guide = FALSE) +
  scale_shape_manual(values = c(16, 21, 21), guide = FALSE) +
  scale_fill_manual(values = c("brown", "#FFB612", "#101820"), guide = FALSE) +
  scale_color_manual(values = c("brown", "#FFB612", "#101820"), guide = FALSE) +
  ylim(ymin, ymax) + coord_fixed() +
  cowplot::theme_nothing() + theme(plot.title = element_text()) +
  transition_time(frameId) + ease_aes('linear') + NULL
ex_play_length <- length(unique(example_play_tracking_data$frameId))
animate(play_animation, fps = 10, nframe = ex_play_length)
```

]

.pull-right[
<img src="ex_play_animation.gif" width="100%" style="display: block; margin: auto;" />
]

---

## Example project idea: modeling sack probability

__GOAL__: Explore modeling the probability of a sack in continuous-time

#### What are my steps?

1. Identify / create the response variable of interest

2. Filter data to frames of interest, i.e., focus on moments of time between snap to QB decision / sack

3. __Feature engineering!__ What summaries can I compute with the tracking that can be useful for modeling my outcome variable of interest?

4. Build and evaluate models __out-of-sample__! Want to understand contributions of constructed features and different levels of model flexibility

5. Create player/team metrics based on models

6. Create presentation of results: what's the story, key takeaway messages, limitations

---

## Creating the response variable

Need to identify for each play whether or not it ended in a sack

Can use the `passResult` columns from the `plays.csv` dataset:

```r
table(week1_plays$passResult)
```

```
## 
##   C   I  IN   R   S 
## 654 379  18  54  70
```

Make a table with the IDs plus binary indicator for sacks:

```r
sack_outcome_table <- week1_plays %>%
  dplyr::select(gameId, playId, passResult) %>%
  mutate(is_sack = as.numeric(passResult == "S"))
sack_outcome_table %>% slice(1:2)
```

```
## # A tibble: 2 × 4
##       gameId playId passResult is_sack
##        <dbl>  <dbl> <chr>        <dbl>
## 1 2021090900     97 I                0
## 2 2021090900    137 C                0
```

---

## How do we filter tracking data to frames of interest?

Decide what frames to use based on `event` column:

```r
table(week1_adj_data$event)
```

```
## 
##        autoevent_ballsnap     autoevent_passforward autoevent_passinterrupted 
##                     13455                     12627                       690 
##                 ball_snap             first_contact                    fumble 
##                     26956                       230                        46 
##  fumble_offense_recovered                   handoff      huddle_break_offense 
##                        23                        23                        23 
##                   lateral                  line_set             man_in_motion 
##                        23                       552                       598 
##                      None              pass_arrived              pass_forward 
##                   1028537                      1150                     24127 
##       pass_outcome_caught   pass_outcome_incomplete               pass_tipped 
##                        69                       138                       299 
##               play_action                   qb_sack             qb_strip_sack 
##                      5382                      1334                       184 
##                       run                     shift 
##                      1219                       437
```

---

## Start with passer only data...

Need to identify the passer on each play - can use `pffScoutingData` for this!

```r
pff_data <- read_csv("data/nfl-big-data-bowl-2023/pffScoutingData.csv")
play_passers <- pff_data %>% 
  filter(pff_role == "Pass") %>%
  dplyr::select(gameId, playId, nflId) %>%
  mutate(is_passer = 1)
```

Join to tracking data and filter to only passers:

```r
passer_tracking_data <- week1_adj_data %>%
  left_join(play_passers, by = c("gameId", "playId", "nflId")) %>%
  filter(!is.na(is_passer))
passer_tracking_data %>% slice(1:3)
```

```
## # A tibble: 3 × 17
##      gameId playId nflId frameId time                jerse…¹ team  playD…²     x
##       <dbl>  <dbl> <dbl>   <dbl> <dttm>                <dbl> <chr> <chr>   <dbl>
## 1    2.02e9     97 25511       1 2021-09-10 00:26:31      12 TB    right    37.8
## 2    2.02e9     97 25511       2 2021-09-10 00:26:31      12 TB    right    37.8
## 3    2.02e9     97 25511       3 2021-09-10 00:26:31      12 TB    right    37.8
## # … with 8 more variables: y <dbl>, s <dbl>, a <dbl>, dis <dbl>, o <dbl>,
## #   dir <dbl>, event <chr>, is_passer <dbl>, and abbreviated variable names
## #   ¹jerseyNumber, ²playDirection
```

---

## Identify starting and ending frames of interest

First mark what could be the start and end based on `event`:

```r
passer_tracking_data <- passer_tracking_data %>%
  mutate(is_start = as.numeric(event %in% c("autoevent_ballsnap", "ball_snap")),
         # Now all of the options for the end of the ball carrier sequence:
         is_end = as.numeric(event %in% c("fumble", "handoff", "lateral",
                                          "autoevent_passforward", "pass_forward",
                                          "qb_sack", "qb_strip_sack", "run")))
```

Create table with the first observed start and end frames within each play:

```r
frame_seq_info <- passer_tracking_data %>% group_by(gameId, playId) %>%
  mutate(any_start = any(is_start == 1), any_end = any(is_end == 1)) %>%
  filter(any_start, any_end) %>%
  summarize(start_frame = frameId[which(is_start == 1)[1]],
            end_frame = frameId[which(is_end == 1 & frameId > start_frame)[1]], .groups = "drop")
frame_seq_info %>% slice(1)
```

```
## # A tibble: 1 × 4
##       gameId playId start_frame end_frame
##        <dbl>  <dbl>       <dbl>     <dbl>
## 1 2021090900     97           6        38
```

---

## Join this information and filter tracking data

```r
passer_tracking_seq_data <- passer_tracking_data %>%
  left_join(frame_seq_info, by = c("gameId", "playId")) %>%
  filter(!is.na(start_frame), !is.na(end_frame),
         frameId >= start_frame, frameId <= end_frame)
```

Create info about the passer for each frame:

```r
passer_info <- passer_tracking_seq_data %>%
  dplyr::select(gameId, playId, frameId, nflId, x, y) %>%
  rename(passer_nflId = nflId, passer_x = x, passer_y = y)
passer_info %>% slice(1:2)
```

```
## # A tibble: 2 × 6
##       gameId playId frameId passer_nflId passer_x passer_y
##        <dbl>  <dbl>   <dbl>        <dbl>    <dbl>    <dbl>
## 1 2021090900     97       6        25511     37.6     24.3
## 2 2021090900     97       7        25511     37.6     24.3
```

---

## Create features about pass blockers and rushers

I have information about the passer - __what about their protection and opposition__?

Similar to before: identify using `pffScoutingData`:

```r
play_block_rush <- pff_data %>% 
  filter(pff_role %in% c("Pass Block", "Pass Rush")) %>%
  dplyr::select(gameId, playId, nflId, pff_role) %>%
  mutate(pff_role = str_remove(pff_role, "Pass "))
```

Join to tracking data along with passer info:

```r
block_rush_tracking_data <- week1_adj_data %>%
  left_join(play_block_rush, by = c("gameId", "playId", "nflId")) %>%
  filter(!is.na(pff_role)) %>%
  left_join(passer_info, by = c("gameId", "playId", "frameId")) %>%
  filter(!is.na(passer_x))
```

My `passer_info` table already captures my frames of interest, so I can just filter out missing values from this join

---

## Create features about closest pass blocker and rusher

First need to compute the distance between each player with the passer:

```r
long_block_rush_info <- block_rush_tracking_data %>%
  mutate(dist_to_passer = sqrt((x - passer_x)^2 + (y - passer_y)^2)) %>%
  group_by(gameId, playId, frameId, pff_role) %>%
  arrange(dist_to_passer) %>%
  mutate(player_dist_rank = 1:n()) %>%
  ungroup()
```

For today, keep things simple and get the closest blocker and rusher:

```r
long_closest_block_rush_info <- long_block_rush_info %>%
  filter(player_dist_rank == 1) %>%
  dplyr::select(gameId, playId, frameId, pff_role,
                dist_to_passer, x, y, s)
long_closest_block_rush_info %>% slice(1:3)
```

```
## # A tibble: 3 × 8
##       gameId playId frameId pff_role dist_to_passer     x     y     s
##        <dbl>  <dbl>   <dbl> <chr>             <dbl> <dbl> <dbl> <dbl>
## 1 2021091203    765      53 Rush             0       11.4  25.2  1.4 
## 2 2021091201   1446      45 Rush             0.0400  53.4  18.8  2.95
## 3 2021091207   3828      41 Rush             0.0400  69.2  26.3  2.36
```

---

---

## Create a wide model dataset

Use `pivot_wider` to convert dataset from long to wide (separate columns for nearest blocker and rusher):

```r
wide_closest_block_rush_info <- long_closest_block_rush_info %>%
  pivot_wider(names_from = pff_role, values_from = dist_to_passer:s,
              names_glue = "{pff_role}_{.value}")
wide_closest_block_rush_info %>% slice(1:2)
```

```
## # A tibble: 2 × 11
##       gameId playId frameId Rush_…¹ Block…² Rush_x Block_x Rush_y Block_y Rush_s
##        <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
## 1 2021091203    765      53  0        0.440   11.4    11.0   25.2    25.5   1.4 
## 2 2021091201   1446      45  0.0400   1.96    53.4    55.3   18.8    18.6   2.95
## # … with 1 more variable: Block_s <dbl>, and abbreviated variable names
## #   ¹Rush_dist_to_passer, ²Block_dist_to_passer
```

Join everything together to make model dataset!

```r
sack_model_dataset <- passer_info %>%
  inner_join(wide_closest_block_rush_info,
             by = c("gameId", "playId", "frameId")) %>%
  left_join(sack_outcome_table, by = c("gameId", "playId"))
```

---

## MACHINE LEARNING TIME???

---

## Always perform basic EDA before modeling!

.pull-left[

View nearest rush distance as a function of time:

```r
sack_model_dataset <- sack_model_dataset %>%
  group_by(gameId, playId) %>%
  mutate(adj_frame = frameId - min(frameId)) %>%
  ungroup()
```

New variable which could be predictive itself of sack probability...

Beneficial to view features as function of time within play

+ e.g., plot the average distance of the nearest pass rusher (+/- standard error)

]

.pull-right[

```r
sack_model_dataset %>%
  group_by(adj_frame, is_sack) %>%
  summarize(ave_rush_dist = mean(Rush_dist_to_passer),
            n_plays = n(),
            se = sd(Rush_dist_to_passer) / sqrt(n_plays),
            .groups = "drop") %>%
  filter(n_plays > 1) %>%
  mutate(ave_upper = ave_rush_dist + se, 
         ave_lower = pmax(0, ave_rush_dist - se)) %>%
  ggplot(aes(x = adj_frame, y = ave_rush_dist, 
             color = as.factor(is_sack))) +
  geom_point(alpha = 0.5) +
  geom_errorbar(aes(ymin = ave_lower, ymax = ave_upper),
                alpha = 0.5) +
  ggthemes::scale_color_colorblind() +
  labs(x = "Frame within play", 
       y = "Ave distance from passer for nearest pass rusher",
       color = "Sack indicator") +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

---

---

## Onto model building...

Binary outcome model `$Y \in$` {Not sacked (0), Sacked (1)}

Want to evaluate models out of sample, so I will assign games to different test folds:

```r
set.seed(1985)
game_fold_table <- tibble(gameId = unique(sack_model_dataset$gameId)) %>%
  mutate(game_fold = sample(rep(1:4, length.out = n()), n()))
game_fold_table %>% slice(1:3)
```

```
## # A tibble: 3 × 2
##       gameId game_fold
##        <dbl>     <int>
## 1 2021090900         2
## 2 2021091200         2
## 3 2021091201         3
```

Join back to modeling dataset:

```r
sack_model_dataset <- sack_model_dataset %>% 
  dplyr::left_join(game_fold_table, by = "gameId")
```

---

## Generate baseline predictions without features

```r
baseline_cv_preds <- 
  map_dfr(unique(sack_model_dataset$game_fold), 
          function(test_fold) {
            # Separate test and training data:
            test_data <- sack_model_dataset %>%
              filter(game_fold == test_fold)
            train_data <- sack_model_dataset %>%
              filter(game_fold != test_fold)
            # Train model:
            logit_model <- glm(is_sack ~ 1, 
                               data = train_data, family = "binomial")
            # Return tibble of holdout results:
            tibble(test_pred_probs = predict(logit_model, newdata = test_data,
                                             type = "response"),
                   test_actual = test_data$is_sack,
                   adj_frame = test_data$adj_frame,
                   game_fold = test_fold) 
          }) %>%
  mutate(model_type = "baseline")
```

---

## Simple [GAM](https://noamross.github.io/gams-in-r-course/) based on smooth function of nearest pass rusher distance

```r
library(mgcv)
gam_cv_preds <- 
  map_dfr(unique(sack_model_dataset$game_fold), 
          function(test_fold) {
            # Separate test and training data:
            test_data <- sack_model_dataset %>% filter(game_fold == test_fold)
            train_data <- sack_model_dataset %>% filter(game_fold != test_fold)
            # Train model:
            logit_gam <- gam(is_sack ~ s(Rush_dist_to_passer), 
                               data = train_data, family = "binomial")
            # Return tibble of holdout results:
            tibble(test_pred_probs = predict(logit_gam, newdata = test_data,
                                             type = "response"),
                   test_actual = test_data$is_sack,
                   adj_frame = test_data$adj_frame,
                   game_fold = test_fold) 
          }) %>%
  mutate(model_type = "GAM: nearest rusher distance")
```

---

## Random forests using probability forest in [`ranger`](https://github.com/imbs-hl/ranger)

```r
library(ranger)
rf_prob_cv_preds <- 
  map_dfr(unique(sack_model_dataset$game_fold), 
          function(test_fold) {
            test_data <- sack_model_dataset %>% filter(game_fold == test_fold)
            train_data <- sack_model_dataset %>% filter(game_fold != test_fold)
            
            rf_prob_model <-
              ranger(is_sack ~ ., 
                     data = dplyr::select(train_data, -gameId, -playId, -frameId,
                                          -passResult, -passer_nflId, -game_fold),
                     probability = TRUE)
            # Return tibble of holdout results:
            tibble(test_pred_probs = 
                     as.numeric(predict(rf_prob_model, data = test_data,
                                        type = "response")$predictions[,2]),
                   test_actual = test_data$is_sack,
                   adj_frame = test_data$adj_frame,
                   game_fold = test_fold) 
          }) %>%
  mutate(model_type = "Random Forests: kitchen sink")
```

---

## Evaluate predictions across time into play

.pull-left[

```r
bind_rows(baseline_cv_preds, 
          gam_cv_preds,
          rf_prob_cv_preds) %>%
  group_by(model_type, adj_frame) %>%
  summarize(brier_score = 
              mean((test_actual - 
                      test_pred_probs)^2),
            n_plays = n(),
            .groups = "drop") %>%
  filter(n_plays > 1) %>%
  ggplot(aes(x = adj_frame, y = brier_score, 
             alpha = n_plays)) +
  geom_bar(stat = "identity", fill = "darkred") +
  facet_wrap(~model_type, ncol = 1) +
  labs(x = "Frame within play", 
       y = "Holdout Brier Score") +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-36-1.png" width="100%" style="display: block; margin: auto;" />

]

---

### How can I create a player / team metric based on this work?

### What would a coach be interested to see from this?

#### Think about finding example plays / players to focus on and emphasize your takeaway messages

_Limitations?_

+ I only used one week of data! Helps to process and build modeling datasets one week at a time, then concatenate together (e.g., using [`purrr`](https://purrr.tidyverse.org/) functions)

+ So many options for the types of features to create - I went with very simple distance based features...

+ What about my response variable? 
  
  + What about time???

---

## Useful resources to check out!

.pull-left[

- [`ngscleanR`](https://github.com/guga31bb/ngscleanR) package: basically performs the initial clean and pre-processing steps considered here

- Various tutorials by [Tom Bliss](https://twitter.com/DataWithBliss) and [Mike Lopez](https://twitter.com/StatsbyLopez) on Kaggle:

+ [2021 Tutorial](https://www.kaggle.com/code/tombliss/tutorial)
  
  + [NFL tracking: wrangling, Voronoi, and sonars](https://www.kaggle.com/code/statsbymichaellopez/nfl-tracking-wrangling-voronoi-and-sonars)
  
  + [Analyzing Place Kicker Offset from Center - R](https://www.kaggle.com/code/tombliss/analyzing-place-kicker-offset-from-center-r)
  + [Pass Rushing Edge Get Off Speed](https://www.kaggle.com/code/tombliss/pass-rushing-edge-get-off-speed)
  
- [My example project from 2021 with ghosting](https://github.com/ryurko/bdb-pass-defense)

- [Deep Learning? Pass coverage example by Ben Baldwin](https://www.opensourcefootball.com/posts/2021-05-31-computer-vision-in-r-using-torch/)

- And look up winning posts!

]

.pull-right[

]