Run, Run, then Run Again

Introduction

Playcalling is among the greatest challenges in sports coaching, particularly in American football. Poor calls – e.g. the Seahawks’ decision to pass rather than run with 1 yard to go in the closing seconds of Super Bowl XLIX – are snap decisions that can cost franchises critical games and millions of dollars. To avoid such outcomes, coaches must be shrewd about what kinds of plays to call. In this project we tackle the age-old question of which play a football coach should call, given the types of the previous three plays in the drive (run or pass) along with contextual information: field position in terms of yards-to-go until a touchdown, the offensive team’s passing and rushing strength, and the score differential. We find that our fitted models perform strongly on both in-distribution and out-of-distribution data compared to a naive baseline model.

Data

We obtained our data from the R packages nflreadr and espnscrapeR. The main feature of nflreadr that we used was its NFL play-by-play data, which details what happened on each play of an NFL game. We used play-by-play data from the 2023 NFL season for our analysis, and while the dataset contains 372 features for each play, we only used the subset we judged relevant to our research question, such as yards-to-go and point differential. We used espnscrapeR to obtain the passing and rushing rankings of each team, which we determined using the total number of passing and rushing yards each team had at the end of the 2023 season.

With these two datasets, we created our own dataset through four pre-processing steps.

  1. In nflreadr, we redefine a successful play as having a positive Expected Points Added (EPA) on first and second down, and as gaining the yards needed for a first down on third and fourth down.

  2. With the dataset from nflreadr, we take the \(k\) plays preceding each play in the drive and parse out their play types. We then augment the dataset with \(k\) columns, where the \(i\)th added column records the type of the \(i\)th previous play (we use \(k = 3\)).

  3. We perform a left join with the espnscrapeR dataset: for the team with possession, we add its pass and rush rankings derived from espnscrapeR, adding two more columns to our dataset.

  4. We filter out all plays that were not successes, due to the difficulty of reasoning about counterfactuals, as well as all plays that were not a rush or a pass. Thus, our dataset only contains successful rush or pass plays.

This results in the main dataset that we use to train and build our models.
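For concreteness, the steps above could be implemented roughly as follows. This is a minimal sketch rather than our exact code: columns such as play_type, down, epa, yards_gained, ydstogo, and posteam are standard nflreadr play-by-play fields, while team_ranks stands in for a small table of pass/rush rankings built from espnscrapeR.

```r
library(nflreadr)
library(dplyr)

# Play-by-play data for the 2023 season (372 columns; we keep a subset).
pbp <- load_pbp(seasons = 2023)

# `team_ranks` is assumed to be a data frame with one row per team and
# columns team, pass_rank, rush_rank (derived from total passing and
# rushing yards over the 2023 season via espnscrapeR).

plays <- pbp %>%
  filter(play_type %in% c("run", "pass")) %>%
  group_by(game_id, drive) %>%
  arrange(play_id, .by_group = TRUE) %>%
  mutate(
    # Step 1: success = positive EPA on downs 1-2, converting on downs 3-4.
    success_flag = if_else(down <= 2, epa > 0, yards_gained >= ydstogo),
    # Step 2: types of the k = 3 previous plays within the drive.
    prev_play_1 = lag(play_type, n = 1, default = "first_play"),
    prev_play_2 = lag(play_type, n = 2, default = "first_play"),
    prev_play_3 = lag(play_type, n = 3, default = "first_play")
  ) %>%
  ungroup() %>%
  # Step 3: attach the possession team's pass and rush rankings.
  left_join(team_ranks, by = c("posteam" = "team")) %>%
  # Step 4: keep only successful run or pass plays.
  filter(success_flag) %>%
  # Binary response used by the models below: 1 = pass, 0 = run (assumed coding).
  mutate(is_pass = as.integer(play_type == "pass"))
```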

However, it would be unwise to blindly fit regressions without first looking for patterns in our data, and our exploratory data analysis revealed some things that were intuitive and some that were interesting.

The first thing we looked at was the relationship between pass ranking and number of pass attempts, as well as rush ranking and number of rush attempts, based on the espnscrapeR data. At a glance, these scatterplots seem intuitive: the better a team ranks in a particular aspect of the game, the more attempts it makes in that aspect. This is particularly true for rushing, where the scatterplot looks extremely linear and also makes sense qualitatively. Things get more interesting for the relationship between pass attempts and pass rankings. Though this scatterplot is still relatively linear, it is noticeably more scattered than the rushing plot, which may indicate that the quality of a passing offense dictates how much a team actually passes less than the rushing equivalent does. What is particularly interesting is that the top rushing teams – the Ravens, 49ers, and Dolphins – all seem to have fewer pass attempts than expected, even though the pass rankings for the Dolphins and 49ers are actually quite high.
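For reference, a sketch of how one of these scatterplots could be drawn with ggplot2; the data frame team_ranks and the columns pass_att and team are assumed names for our espnscrapeR-derived rankings rather than our exact code.

```r
library(ggplot2)

# Pass attempts vs. pass ranking; the rushing version is analogous.
# `team_ranks`, `pass_att`, and `team` are assumed names.
ggplot(team_ranks, aes(x = pass_rank, y = pass_att, label = team)) +
  geom_point() +
  geom_text(nudge_y = 10, size = 3) +
  labs(x = "Pass ranking (1 = most passing yards)",
       y = "Pass attempts (2023 season)")
```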

We then looked at the types of plays occurring on each down. Recall that the goal on each down is either to gain a first down or to score, and a team that fails to convert on fourth down turns the ball over to its opponent. Thus, from first to third down, it makes sense that the proportion of pass plays increases significantly, as teams grow more desperate to cover the yards needed for a first down. Fourth down is special in that there is some selection bias in these plays – NFL teams only really go for it on fourth down when (1) not doing so would essentially mean forfeiting the game, or (2) the number of yards needed is small enough that the team is willing to risk turning the ball over for a chance to extend the drive. Nevertheless, it is still interesting that teams are more inclined to pass rather than rush on fourth down.

Lastly, we looked at the relationship between the number of yards to go and the pass/rush attempts at each distance. With fewer than 3 yards to go, the choice between a rush and a pass is generally a coin toss, but as the distance increases up to 30 yards, teams become much more likely to pass than rush, since they are desperate to cover more yards per play. Past 30 yards, however, it looks like a coin toss once again: the distance is so large that teams may either opt for safer plays that will not result in turnovers (rushes) or still try to cover the needed yards and extend the drive (passes).

Methods

To connect our data to our question about play-calling strategy, we focused on four methods: Logistic Regression, Generalized Additive Modeling, Multinomial Modeling, and Multilevel Modeling. We use misclassification rate to compare and evaluate the models, and 5-fold cross-validation to account for uncertainty.

Logistic Regression

Since our question amounts to classifying the next play to call, and we initially assume a linear relationship between our covariates and the log-odds of that call, logistic regression was a relatively straightforward first choice.

\(y_i \sim f(y_i \mid \eta_i, \ldots), \quad \eta_i = \beta_0 + \beta_1 p_{i-1} + \beta_2 p_{i-2} + \beta_3 p_{i-3} + \beta_4\,\mathrm{ydstogo} + \beta_5\,\mathrm{scorediff} + \beta_6\,\mathrm{passrank} + \beta_7\,\mathrm{rushrank}\)
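A minimal sketch of how this model could be fit in R, assuming the pre-processed dataset plays and the is_pass coding (1 = pass, 0 = run) from the Data section:

```r
# Sketch of the logistic regression fit; `plays` and `is_pass` come from
# the (assumed) pre-processing sketch in the Data section.
fit_glm <- glm(
  is_pass ~ as.factor(prev_play_1) + as.factor(prev_play_2) +
    as.factor(prev_play_3) + ydstogo + score_differential +
    pass_rank + rush_rank,
  family = binomial(),
  data = plays
)
summary(fit_glm)
```

The fitted coefficients are summarized below.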

Term                          Estimate    Std. Error   z value    Pr(>|z|)
(Intercept)                    0.9143      0.07784      11.75     7.423e-32
as.factor(prev_play_1)pass    -0.294       0.06234      -4.716    2.403e-06
as.factor(prev_play_1)run     -0.3718      0.06271      -5.929    3.039e-09
as.factor(prev_play_2)pass    -0.5452      0.06517      -8.365    6.017e-17
as.factor(prev_play_2)run     -0.4015      0.06431      -6.243    4.299e-10
as.factor(prev_play_3)pass     0.3159      0.05615       5.626    1.846e-08
as.factor(prev_play_3)run      0.5934      0.0576       10.3      6.885e-25
ydstogo                       -0.1193      0.005033    -23.7      3.763e-124
score_differential             0.01975     0.00177      11.16     6.217e-29
pass_rank                      0.01432     0.00193       7.421    1.163e-13
rush_rank                     -0.01443     0.001882     -7.669    1.737e-14

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 20195 on 14959 degrees of freedom
Residual deviance: 19222 on 14949 degrees of freedom

While all of our coefficient estimates are significant, the play types of the three previous plays had the most influence on the outcome in terms of magnitude. Specifically, the 3rd previous play yielded the largest change in log-odds: relative to the base condition (first_play), the change was 0.3159 when the 3rd previous play was a pass and 0.5934 when it was a run. Since the log-odds for the 1st and 2nd previous plays are negative, we convert them to the odds-ratio scale and find that for the 1st previous play, a pass had a higher odds ratio (0.7452765) than a run (approximately 0.689). For the 2nd previous play, however, a pass produced a lower odds ratio (0.5797258) than a run (0.6693153). Interestingly, the odds ratios for both play types do not monotonically increase or decrease as \(k\) changes. Moreover, yards-to-go, score differential, pass_rank, and rush_rank are also strongly significant – yards-to-go has the largest z-value of any predictor – which may signify that we need to work with more intricate relationships and compare these results with those of more complex models.

Generalized Additive Model

Our logistic regression produced promising results, but it did not account for the possibility of non-linear relationships between our predictors and the outcome. By smoothing over predictors that may have non-linear relationships, generalized additive models let us flexibly capture these more intricate relationships.

\(g(E[Y_i]) = \beta_0 + f_1(p_{i-1}) + f_2(p_{i-2}) + f_3(p_{i-3}) + f_4(\mathrm{ydstogo}) + f_5(\mathrm{scorediff}) + f_6(\mathrm{passrank}) + f_7(\mathrm{rushrank})\)
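A sketch of the corresponding fit using the mgcv package, under the same assumed dataset and column names as before; in this sketch the smooths are placed on the continuous covariates while the previous-play indicators enter as factors.

```r
library(mgcv)

# GAM sketch: smooth terms on the continuous covariates; the categorical
# previous-play indicators enter as ordinary factors.
fit_gam <- gam(
  is_pass ~ as.factor(prev_play_1) + as.factor(prev_play_2) +
    as.factor(prev_play_3) + s(ydstogo) + s(score_differential) +
    s(pass_rank) + s(rush_rank),
  family = binomial(),
  data = plays
)
summary(fit_gam)
```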

Surprisingly, after smoothing over score_differential and ydstogo, the 1st previous play lost its significance while the odds ratios for both of its play types increased substantially. Additionally, a run now had a higher odds ratio than a pass for the 1st previous play – the exact opposite of our previous model. However, we obtained results similar to the logistic regression for the 2nd and 3rd previous play types. We hypothesize that the significant change in the 1st previous play stems from ydstogo and score differential having a sizable interaction with it, which makes sense: ydstogo and score differential measure the current conditions of the game, and the most recent play contributes more to those conditions than plays further back in time.

Multinomial Regression

We next experimented with multinomial modeling, which lets the model work at a finer granularity while still maintaining the form of logistic regression. Specifically, the data distinguish short/deep and left/middle/right passes as well as left/middle/right runs, so a multinomial model is suitable for interpreting the effect of the predictors on an outcome with multiple categories. We first filter out the NAs in the pass and run detail columns, as these more granular labels contain more missing values.

\(\Pr(Y_i = c \mid x_i) = \frac{e^{x_i^\top \beta_c}}{\sum_{j=1}^{C} e^{x_i^\top \beta_j}}\)
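A sketch of how the nine-category outcome could be constructed and fit with nnet::multinom; the label format (e.g. "pass_short_middle") and the use of the nflreadr columns pass_length, pass_location, and run_location are assumptions about our exact coding.

```r
library(nnet)
library(dplyr)

# Build a 9-level outcome: {short, deep} x {left, middle, right} passes
# plus {left, middle, right} runs, then drop plays with missing detail.
plays_multi <- plays %>%
  mutate(play_detail = if_else(
    play_type == "pass",
    paste("pass", pass_length, pass_location, sep = "_"),
    paste("run", run_location, sep = "_")
  )) %>%
  filter(!grepl("NA", play_detail))  # paste() turns missing fields into "NA"

fit_multi <- multinom(
  play_detail ~ as.factor(prev_play_1) + as.factor(prev_play_2) +
    as.factor(prev_play_3) + ydstogo + score_differential +
    pass_rank + rush_rank,
  data = plays_multi
)
```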

Pass short middle was the best strategy in terms of odds ratio for all the previous plays, with the exception of the 3rd previous play. We do not see much difference among run directions for the previous plays or for the other predictors, but we do note that a middle run has lower log-odds than the other directions for score_differential and ydstogo. This is reasonable – while a middle run may bring you closer to the other side of the field, the linebackers and nose tackles on defense often clog up the middle lane, which limits its scoring potential. Middle runs also have a lower odds ratio more often than not across the previous plays. Analysis from other football analytics writers has also shown that [short passes to the middle generate the most EPA](https://sumersports.com/the-zone/hitting-the-hard-shots-why-the-middle-of-the-field-is-the-most-effective-throw-in-football-despite-the-best-quarterbacks-succeeding-elsewhere/), but they are generally difficult to execute because of how long the ball must stay in the QB's hands to generate such a play, as well as defensive schemes often lurking near the middle.

Multilevel Regression

Since our effects may not be constant across teams, we relax the fixed-effects-only assumption of the previous models and build a multilevel model. Specifically, we take advantage of the model's suitability for nested structures to account for dependencies within the same cluster. We include random intercepts for the offensive team, the defensive team, and their interaction. The idea is to capture not only the differences among offensive teams, but also the influence of which team is on defense and how it interacts with the offense. For example, the Dolphins and the Jets have very different run tendencies, and each would also exhibit different play-calling behavior when playing the Browns versus the Seahawks.

\(log(\frac{p}{1-p}) = \alpha_0 + \alpha_1 (p_{i-1}) + \alpha_2(p_{i-2}) + \alpha_3(p_{i-3}) + \alpha_4(ydstogo) + \alpha_5(scorediff) + u_p + u_d + u_{pd}\)
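A sketch of this model using lme4::glmer, under the same assumed dataset; the exact set of fixed effects follows the equation and the table below, and convergence settings may need tuning in practice.

```r
library(lme4)

# Multilevel sketch: random intercepts for the offense (posteam), the
# defense (defteam), and their interaction.
fit_mlm <- glmer(
  is_pass ~ as.factor(prev_play_1) + as.factor(prev_play_2) +
    as.factor(prev_play_3) + ydstogo + score_differential + pass_rank +
    (1 | posteam) + (1 | defteam) + (1 | posteam:defteam),
  family = binomial(),
  data = plays
)
summary(fit_mlm)
```

The fixed-effect estimates and random-effect variances are summarized below.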

Term                          Estimate    Std. Error   z value    Pr(>|z|)
(Intercept)                    0.7073      0.09269       7.631    2.335e-14
as.factor(prev_play_1)pass    -0.2892      0.06274      -4.609    4.052e-06
as.factor(prev_play_1)run     -0.3614      0.06306      -5.731    9.987e-09
as.factor(prev_play_2)pass    -0.5473      0.06546      -8.361    6.213e-17
as.factor(prev_play_2)run     -0.3956      0.0647       -6.114    9.718e-10
as.factor(prev_play_3)pass     0.2939      0.05638       5.213    1.86e-07
as.factor(prev_play_3)run      0.5766      0.05803       9.937    2.861e-23
ydstogo                       -0.1193      0.005069    -23.54     1.563e-122
pass_rank                      0.01021     0.003471      2.942    0.003262

Random effects
Term              Variance
posteam:defteam   0.05676
defteam           0.01376
posteam           0.01819

The multilevel model generally mirrors the logistic regression in that the 3rd previous play had the largest effect and the 2nd previous play the smallest. However, the log-odds increase slightly for the 1st previous play and decrease in magnitude for the 3rd. In terms of random effects, the interaction between offensive and defensive teams had a variance of 0.05676, which suggests moderate variability in the intercepts across different offense-defense pairings. There was slightly more variability among offensive teams than among defensive teams. On the other hand, correlations among the fixed effects were more prominent between the previous-play indicators and yards-to-go than among the \(k\) previous-play indicators themselves.

Altogether, our four models each brought varying insights into the underlying relationship between our data and our goal of predicting the best play given the \(k\) previous plays. We now aim to test the validity of each of these models.

Results

Model                            5-fold CV Misclassification Rate
Model 1 (Logistic Regression)    0.3606
Model 2 (GAM)                    0.3511
Model 3 (Multinomial)            0.7818
Model 4 (Multilevel)             0.3654

With the four models we trained, we evaluated each using misclassification rate, since our problem is one of classification (labeling each play as either a run or a pass for three of our models, and classifying each play by its length and location for the multinomial model). To account for the uncertainty of our models' performance on out-of-sample data, we performed 5-fold cross-validation on each model and report the average misclassification rates shown above. Our second model, the GAM, performs best with a 5-fold CV misclassification rate of 0.35, meaning that on average we expect the GAM to incorrectly predict run/pass about 35% of the time. Our first and last models (the GLM and GLMER models) perform very similarly, but ever so slightly worse (predicting incorrectly about 36% of the time). Our third model, the multinomial model, performs worst with a misclassification rate of about 78%. This is likely because its classification task is more granular: it must choose among nine classes instead of the two used elsewhere.
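As an illustration, here is a minimal sketch of the 5-fold cross-validation loop for the logistic regression; the other models would be swapped in analogously, and the fold assignment and seed are assumptions.

```r
set.seed(42)  # assumed seed, for a reproducible fold assignment

k <- 5
folds <- sample(rep(1:k, length.out = nrow(plays)))

cv_misclass <- sapply(1:k, function(i) {
  train <- plays[folds != i, ]
  test  <- plays[folds == i, ]
  fit <- glm(
    is_pass ~ as.factor(prev_play_1) + as.factor(prev_play_2) +
      as.factor(prev_play_3) + ydstogo + score_differential +
      pass_rank + rush_rank,
    family = binomial(), data = train
  )
  # Classify as "pass" when the predicted probability exceeds 0.5.
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred != test$is_pass)
})

mean(cv_misclass)  # average 5-fold misclassification rate
```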

Discussion

Overall, it was interesting to see how similarly our models performed given the breadth of our features. This may suggest that our features were not distinct enough from each other, or that we did not select enough features. That is plausible, since play-calling is a complex process representing the battle between offensive and defensive coaching minds along with the players on the field, not to mention field position. One limitation of our approach may therefore simply be that we did not vet our chosen features thoroughly enough. In the future we could add more features, such as the ranking of the defensive team against the pass or the run (currently no defensive statistics are considered) or which team is home versus away, or tune our models on different numbers of prior plays.

Secondly, the shape of our training data is something we could also change. By nature, our data obscure counterfactual outcomes: the offensive team applies a treatment in the form of a play call, which causes some outcome in the form of yards gained, so we never observe what would have happened had a different play been run. As a result, we chose not to include ‘failed’ plays in our training process, and so we modeled the probability that a given play was called given that it was successful. This wastes data: no information from the failed plays enters our trained models. Expanding our analysis into a causal-inference question, in which we attempt to model which play calls could reverse or alter the outcomes of failed plays, could be an interesting future research direction.