36-315 Final Project: Statistical Analysis on NFL Play-by-Play Data

Introduction:

The NFL is a league in which analytics and statistics play a central role. It is important for each team to understand the numbers and probabilities behind each play they choose and each decision they make. In this project, we attempt to answer questions about NFL play-by-play data from multiple seasons. We used an open source database called NFLFastR to gather our data. While the database itself contains millions of data points, we subsetted the data separately for each question.

Data Source Description:

The NFLFastR database contains play-by-play data spanning 1999 to 2021. The dataset includes hundreds of variables covering topics such as winning probabilities by play, player names, variables that pertain only to the offense or the defense, and variables that pertain only to the regular season or the postseason. Each column of the dataset is a specific variable and each row is a specific play. Some variables are missing or NA prior to 2006, when the NFL’s Next Level data started collecting more curated variables such as the Completion Percentage over Expectation (explained further in one of our main questions). Due to the scope of the dataset, each question uses a different subset of variables, so the data description for each question is given within that question below.

Question 1:

In this question, we wanted to learn how quarterback play has changed over the years. We gathered data from 2006 to 2021, as 2006 is the year the NFL started tracking the variables of interest described next. Specifically, we looked at the following quarterback metrics available in our dataset:

Expected Points Added (EPA): the difference between the expected points at the start and at the end of a play. For a quarterback, this assigns a value to each play, measuring its contribution to the success of the team.
Completion Percentage over Expectation (CPOE): measures the success of a pass relative to the difficulty of the throw.
Air Yards: the number of yards the ball travels in the air.
QB Scramble Fraction: the fraction of plays on which a QB scrambled and turned the play into a run. This can be a proxy for whether QBs are more inclined to run than to throw the ball away when a play does not develop well.
Winning Probability: the probability that the quarterback’s team wins the game, evaluated after every play.
Interceptions: the number of interceptions the quarterback throws within a season.
Shotgun Formation: the fraction of plays where the quarterback was in shotgun formation rather than under center.

Other variables:

Number of Dropbacks: Number of dropbacks the quarterback had in a season
Number of Plays: Number of plays the quarterback participated in within a season
Team played for: The team the quarterback belongs to
ID: Player ID
Name: Player name
Season: Regular season year
Season Type: Regular or postseason

Furthermore, we aggregated the data by each QB’s performance within a season (e.g. fractions within a season, totals within a season). We also kept only quarterbacks with at least 100 dropbacks and 100 plays, since there are many instances in the data where non-QBs threw the ball during a play, for example on a called “trick play”. In total, our filtered dataset has 15 variables with 721 observations across the 16 seasons.
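The aggregation and filtering step described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the report. The record layout (`player`, `season`, `dropback`, `scramble`), the function name `season_summaries`, and the use of total plays as the denominator of the scramble fraction are all assumptions made for the example, and the plays are made up.

```python
from collections import defaultdict

def season_summaries(plays, min_dropbacks=100, min_plays=100):
    """Aggregate plays per (player, season) and keep only high-volume passers."""
    totals = defaultdict(lambda: {"plays": 0, "dropbacks": 0, "scrambles": 0})
    for p in plays:
        t = totals[(p["player"], p["season"])]
        t["plays"] += 1
        t["dropbacks"] += p["dropback"]
        t["scrambles"] += p["scramble"]

    kept = {}
    for key, t in totals.items():
        # The thresholds mirror the "at least 100 dropbacks and 100 plays" rule.
        if t["dropbacks"] >= min_dropbacks and t["plays"] >= min_plays:
            # Assumption: scramble fraction = scrambles / total plays.
            kept[key] = {**t, "scramble_fraction": t["scrambles"] / t["plays"]}
    return kept

# Hypothetical data: one full-time QB and one receiver who threw twice on trick plays.
plays = [
    {"player": "QB A", "season": 2021, "dropback": 1, "scramble": 1 if i < 6 else 0}
    for i in range(120)
] + [
    {"player": "WR B", "season": 2021, "dropback": 1, "scramble": 0}
    for _ in range(2)
]

summaries = season_summaries(plays)
for key, row in summaries.items():
    print(key, row)
```

In this toy input, only the quarterback survives the filter, which is the behavior the thresholds are meant to produce.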

To answer this question, we first conducted a Principal Component Analysis (PCA) with the variables described above. The following graph shows our analysis:

PCA Plot

	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8
Season	0.0822032	-0.6485330	0.1898906	0.1655011	0.0826158	0.5392643	0.4599572	0.0025500
EPA	0.5150193	-0.0838377	-0.2772661	0.0379278	-0.1082418	-0.3940321	0.3752213	-0.5845022
CPOE	0.4652750	0.0006655	-0.2901843	0.0511665	-0.6542776	0.3549893	-0.2809028	0.2523704
Air Yards	0.4791674	0.0848220	0.3957570	-0.1704420	0.0802810	-0.3290205	0.2998493	0.6106723
Scramble Fraction	-0.0801454	-0.2909970	-0.2383247	-0.9226084	-0.0076460	0.0255394	0.0057474	0.0121946
Winning Probability	0.4459289	0.0173567	-0.3122543	0.0349732	0.7366334	0.2205564	-0.3300349	0.0438559
Interceptions	0.2670293	0.2666117	0.6488233	-0.2732122	-0.0388350	0.3177540	-0.2043403	-0.4666543
Shotgun Percentage	0.0730415	-0.6396239	0.2663858	0.1114809	-0.0523332	-0.4115773	-0.5728188	-0.0435989

First, we see that the first 3 principal components account for 75% of the variability within our dataset. The percentage of explained variance per component is further broken down in the table. The biplot conveys our PCA in a two-dimensional space, where orthogonal (perpendicular) arrows indicate uncorrelated variables. One thing to note is that Shotgun Percentage appears to have a strong positive correlation with season. However, these two variables do not correlate strongly with winning metrics such as winning probability, completion percentage over expectation, and expected points added. The scramble fraction also appears to be positively correlated with season, and season in turn shows a slight negative correlation with the number of interceptions. In terms of our original question, since the scramble fraction is correlated with the later seasons, we want to know whether this has affected the number of yards the ball travels in the air and whether it has any relation to winning probability. Because a two-dimensional biplot can hide or distort such relationships, linear regression analysis may shed more light on these questions.
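As a rough numerical companion to reading angles off the biplot, the sketch below computes the cosine of the angle between pairs of variables using only their PC1 and PC2 loadings from the table above. It is written in Python for illustration and is not part of the original R analysis. Treating this cosine as a stand-in for correlation is an approximation that ignores the remaining components, so it should be read as a heuristic only.

```python
import math

# (PC1, PC2) loadings copied from the loading table above.
loadings = {
    "Season": (0.0822032, -0.6485330),
    "Shotgun Percentage": (0.0730415, -0.6396239),
    "Scramble Fraction": (-0.0801454, -0.2909970),
    "Interceptions": (0.2670293, 0.2666117),
    "Winning Probability": (0.4459289, 0.0173567),
}

def cosine(a, b):
    """Cosine of the angle between two 2-D loading vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))

# Compare every other variable's arrow with the Season arrow.
cos_with_season = {
    name: cosine(loadings["Season"], vec)
    for name, vec in loadings.items()
    if name != "Season"
}
for name, value in cos_with_season.items():
    print(f"Season vs {name}: cos = {value:.3f}")
```

A cosine near 1 corresponds to arrows pointing the same way, near 0 to perpendicular arrows, and a negative value to arrows pointing in opposing directions.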

Statistical Tests for PCA

## 
## Call:
## lm(formula = wp ~ qb_scramble, data = qbs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30549 -0.07351  0.00285  0.07358  0.33169 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.451188   0.007061   63.90   <2e-16 ***
## qb_scramble -0.214897   0.164003   -1.31    0.191    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1106 on 719 degrees of freedom
## Multiple R-squared:  0.002382,   Adjusted R-squared:  0.0009948 
## F-statistic: 1.717 on 1 and 719 DF,  p-value: 0.1905

## 
## Call:
## lm(formula = air_yards ~ qb_scramble, data = qbs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2737.7 -1362.2   250.5  1230.1  3468.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3378.70      93.16  36.267  < 2e-16 ***
## qb_scramble -8074.50    2163.79  -3.732 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1459 on 719 degrees of freedom
## Multiple R-squared:  0.019,  Adjusted R-squared:  0.01763 
## F-statistic: 13.93 on 1 and 719 DF,  p-value: 0.0002052

As we see above, on average, there is no statistically significant relationship between the QB scramble fraction and winning probability (p = 0.191). This suggests that the shift to running QBs has not been accompanied by an increase in a team’s likelihood of winning. However, we do see that on average from 2006 to 2021, a 0.01 increase in the QB scramble fraction predicts a decrease of about 80.7 air yards in a season (p = 0.0002). Both tests are conducted at an alpha level of 0.05.
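The figures quoted from the two regression summaries can be reproduced with a few lines of arithmetic. The sketch below (Python, for illustration only) recomputes each t value as estimate divided by standard error and rescales the slopes to a 0.01 change in scramble fraction. All numbers are copied from the output above, and the variable names are ours.

```python
# Slope estimates and standard errors copied from the lm() summaries above.
wp_est, wp_se = -0.214897, 0.164003      # wp ~ qb_scramble
ay_est, ay_se = -8074.50, 2163.79        # air_yards ~ qb_scramble

# A coefficient's t value is its estimate divided by its standard error.
t_wp = wp_est / wp_se
t_ay = ay_est / ay_se

# The slopes are per unit change in a fraction (0 to 1), so a more
# interpretable step is a 0.01 (one percentage point) change.
step = 0.01
air_yards_change = ay_est * step
wp_change = wp_est * step

print(f"t (wp) = {t_wp:.2f}, t (air yards) = {t_ay:.3f}")
print(f"0.01 more scrambling: {air_yards_change:.2f} air yards, {wp_change:.4f} win probability")
```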

While these tests may shed more light on whether quarterback play has changed throughout the years, they still do not explain why quarterback scrambles have become more prevalent as the years go by. One hypothesis is that this is due to the increased adoption and implementation of the run-pass-option offensive scheme (RPO). The RPO became especially popular with the drafting of Robert Griffin III in 2012 by the team now named the Washington Commanders.

To test this hypothesis, we first plotted trends in the QB scramble fraction, the shotgun fraction (used heavily in the RPO scheme), and the number of interceptions per season.

Time Series Plot

As we see from our time series plots above, the total number of interceptions visually appears to have decreased from 2006 to 2021. In the context of our hypothesis, this makes sense, as many RPO plays are quick passes that are less prone to being intercepted than passes in a conventional scheme. Furthermore, within this time frame we also see that the QB scramble fraction has increased over time, along with the shotgun fraction. While these trends are visually clear, our next task is to conduct t-tests on whether each variable differs on average between the pre- and post-RPO-adoption periods.

Statistical Tests for Time Series

## 
##  Welch Two Sample t-test
## 
## data:  interceptions by rpo_era
## t = 2.0783, df = 554.63, p-value = 0.03814
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.04473348 1.58592292
## sample estimates:
## mean in group 0 mean in group 1 
##       10.120567        9.305239

## 
##  Welch Two Sample t-test
## 
## data:  qb_scramble by rpo_era
## t = -3.1931, df = 618.89, p-value = 0.001479
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.009732769 -0.002320149
## sample estimates:
## mean in group 0 mean in group 1 
##      0.03130148      0.03732793

## 
##  Welch Two Sample t-test
## 
## data:  shotgun by rpo_era
## t = -26.269, df = 399.37, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.3263638 -0.2809162
## sample estimates:
## mean in group 0 mean in group 1 
##       0.4598334       0.7634734

As we see above, the mean number of interceptions is statistically significantly lower during the years 2011 and onward compared to 2006 to 2010. Furthermore, both the QB scramble fraction and the shotgun fraction are statistically significantly greater during the years 2011 and onward compared to 2006 to 2010. All three tests are conducted at an alpha level of 0.05.
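A quick consistency check on the three Welch outputs above: the difference in sample means should sit at the midpoint of each reported confidence interval, and dividing that difference by the t statistic recovers the standard error. The sketch below (Python, illustration only) does this with numbers copied from the output. Reading group 0 as 2006 to 2010 and group 1 as 2011 onward is our assumption based on the surrounding text.

```python
# Sample means, confidence intervals and t statistics copied from the output above.
# Assumption: group 0 = 2006-2010, group 1 = 2011 onward.
tests = {
    "interceptions": {"mean0": 10.120567, "mean1": 9.305239,
                      "ci": (0.04473348, 1.58592292), "t": 2.0783},
    "qb_scramble":   {"mean0": 0.03130148, "mean1": 0.03732793,
                      "ci": (-0.009732769, -0.002320149), "t": -3.1931},
    "shotgun":       {"mean0": 0.4598334, "mean1": 0.7634734,
                      "ci": (-0.3263638, -0.2809162), "t": -26.269},
}

checks = {}
for name, r in tests.items():
    diff = r["mean0"] - r["mean1"]      # group 0 minus group 1, as R reports it
    midpoint = sum(r["ci"]) / 2         # the interval is symmetric around diff
    std_err = diff / r["t"]             # since t = diff / standard error
    checks[name] = (diff, midpoint, std_err)
    print(f"{name}: diff = {diff:.5f}, CI midpoint = {midpoint:.5f}, SE = {std_err:.5f}")
```

A positive difference means the earlier period has the larger mean (interceptions), and a negative one means the later period does (scramble and shotgun fractions).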

Conclusion

Based on the PCA graph, we see that in more recent seasons quarterbacks scramble more often, use the shotgun formation more, and throw fewer interceptions. However, based on our linear regression analysis, we find that the QB scramble fraction is not significantly associated with winning probability on average, but is significantly associated with a decrease in air yards per season at the 0.05 level. We further tested whether these shifts in quarterback play line up with the adoption of the RPO scheme, which we take to have become widely accepted in 2011. Based on our time series plots and t-tests, we find that the QB scramble fraction and the shotgun fraction, both of which feature heavily in the RPO scheme, are on average greater in 2011 and onward than in 2006 to 2010. Finally, a t-test indicates that the mean number of interceptions decreased in 2011 and onward relative to 2006 to 2010. All of these tests were done at an alpha level of 0.05. These tests tell us that the shift in QB play may possibly be associated with the adoption of the RPO. However, more work, including graph analysis, has to be done in order to concretely determine whether this is a causal relationship.

Question 2

One thing we aimed to examine is the relationship between defensive performances and points allowed in the 2021 NFL season and playoffs. To do this, we compiled five defensive variables along with the points allowed by each team.

ints: Total number of interceptions by every team’s defense in every game of the season.
sacks: Total number of sacks by every team’s defense in every game of the season.
fumbles: Total number of fumbles recovered by every team’s defense in every game of the season.
incompletions: Total number of incomplete passes that occurred, accidentally or forced, counted for every team’s defense in every game of the season.
yds.allowed: Total number of yards given up by a team’s defense for every team in every game of the season.
pts.let.up: Total number of points allowed by a team’s defense for every team in every game of the season.

With these data, we wanted to see whether there were any patterns between each variable individually and points allowed, and whether the defensive statistics have any confounding relationships among themselves.

Pairs Plots

This initial graph serves as basic EDA of the variables and also lets us see whether there are any notable relationships among the defensive statistics (ignoring the pts.let.up row and column). Starting with the relationships between pts.let.up and the rest of the variables, it seems that all of the variables have a trend with pts.let.up. As yds.allowed increases, pts.let.up seems to generally increase, and as the rest of the variables increase, pts.let.up seems to generally decrease, though at varying slopes and with some, notably fumbles, being less clear than others. All of this seems to mesh with common sense, which says that better defensive performances lead to giving up fewer points.

As for confounding between the defensive statistics, there appears to be little. Fumbles especially appear to have no discernible relationship with any of the other variables. Sacks and interceptions appear to have a slight negative relationship with yards allowed and not much of a relationship with incompletions. Yards allowed and incompletions also appear to have little to no relationship. This information is good to know, because if there were relationships between these variables, then the overall relationships with points allowed might not be entirely due to the individual defensive statistics, and we might misinterpret the overall relationships as a result.

PCA Plot

This second graph supports the bulk of the analysis we’re interested in. It shows a principal component analysis of all of the variables of interest. It shows that as the number of sacks increases, the first two principal components tend to decrease; as the number of interceptions, fumbles, and incompletions increase, the first principal component decreases and the second increases; and as yards and points allowed increase, both of the first two principal components increase. These groupings also suggest the variables’ correlations. Notably, it appears that points allowed is positively correlated with yards allowed and negatively correlated with the rest of the variables. This seems to match what we noticed in the first graph.

## 
## Call:
## lm(formula = pts.let.up ~ ints + sacks + yds.allowed + incompletions + 
##     fumbles, data = nfl.important)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3093  -4.9334  -0.3727   4.1950  24.8220 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.659107   1.610118   4.757 2.49e-06 ***
## ints          -2.278332   0.306000  -7.446 3.54e-13 ***
## sacks         -0.479636   0.185439  -2.586  0.00994 ** 
## yds.allowed    0.071606   0.003537  20.247  < 2e-16 ***
## incompletions -0.516369   0.065057  -7.937 1.09e-14 ***
## fumbles       -1.263565   0.410283  -3.080  0.00217 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.758 on 576 degrees of freedom
## Multiple R-squared:  0.561,  Adjusted R-squared:  0.5572 
## F-statistic: 147.2 on 5 and 576 DF,  p-value: < 2.2e-16

To confirm the suspicions we had about the associations between the defensive variables and points allowed, we ran a linear model with pts.let.up as the response and ints, sacks, yds.allowed, incompletions, and fumbles as predictors. Looking at the summary of the model, all of the variables are significant predictors of points allowed at the 0.05 level. Also, based on the output coefficients, yards allowed is positively associated with points allowed and the rest are negatively associated, holding the other predictors fixed. In terms of magnitude, an extra interception predicts the largest decrease in points allowed, followed by an extra fumble, an extra incompletion, and then an extra sack. Allowing one fewer yard is associated with a small decrease in points allowed.
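To make the coefficient magnitudes concrete, the sketch below plugs the fitted coefficients from the summary above into the linear predictor for a single game. It is written in Python for illustration. The stat line is hypothetical, chosen only to show the arithmetic, and `predict_points` is a name introduced for this example.

```python
# Coefficients copied from the lm() summary above.
coef = {
    "(Intercept)": 7.659107,
    "ints": -2.278332,
    "sacks": -0.479636,
    "yds.allowed": 0.071606,
    "incompletions": -0.516369,
    "fumbles": -1.263565,
}

def predict_points(game):
    """Fitted points allowed for one game: intercept + sum of coefficient * value."""
    return coef["(Intercept)"] + sum(coef[name] * value for name, value in game.items())

# Hypothetical defensive stat line for one game (not taken from the data).
game = {"ints": 1, "sacks": 2, "yds.allowed": 350, "incompletions": 12, "fumbles": 1}

base = predict_points(game)
with_extra_int = predict_points({**game, "ints": game["ints"] + 1})

# How many yards allowed carry the same predicted cost as one interception saves?
yards_per_int = -coef["ints"] / coef["yds.allowed"]

print(f"predicted points allowed: {base:.2f}")
print(f"with one more interception: {with_extra_int:.2f}")
print(f"one interception offsets about {yards_per_int:.1f} yards allowed")
```

For this stat line the model predicts roughly 22 points allowed, and one additional interception is worth about as much as allowing 32 fewer yards. These are model-based associations, not causal effects.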

Conclusion

Given the above analyses of the graphs and linear model, we can conclude that there is a statistically significant relationship between points given up by a team and various defensive statistics. More interceptions, sacks, incompletions, and fumbles recovered are associated with giving up fewer points in a game, and more yards allowed is associated with giving up more points in a game. Since there is little relationship among these predictor variables, each one matters individually for its contribution to lessening points given up. Interceptions seem to matter the most when trying to limit points given up, and fumbles are also fairly meaningful in preventing points scored. The other variables matter somewhat less in terms of magnitude of change per event.

All of this matches common sense. Stopping your opponent from gaining yards or making the opponent lose yards is good, but doing that and taking possession of the ball in the process is even better, since then the opposing team cannot go on to score points on that drive. Looking towards future research on this subject, we think the biggest area to explore is causal inference on these variables. Here we only analyzed an association, but establishing a causal relationship between the variables would be very useful for actual NFL teams. Doing so would require measuring the confounding better, which in turn would require analysis of more variables. It would also be interesting to look at the offensive side of the ball in the same scenario.

Question 3

There are many interesting questions regarding what affects a play call in the NFL. We will be looking at how the time of game affects the type of play that is called, and how the score of the game affects the play that is called.

To answer these questions we will be looking at a few variables from the NFL dataset. The variables used and their explanations are below:

play_type: The type of play that is called at a given time in a given NFL game in a given season. We will be looking at all of the run and pass plays from every time in every game in every available season. We focus on run and pass plays as these are the main offensive play types
game_seconds_remaining: The number of seconds left in the NFL game when a particular play was called
game_id: The unique id that allows a specific game to be recognized and allows the sorting of all plays from a specific game
desc: The description of a given play from the broadcasters
home_score: The score of the home team when the current play was called
away_score: The score of the away team when the current play was called
home_team: The home team in a specific game
away_team: The away team in a specific game

Time Series Graph

Here we have taken all of the plays from a year in football and the time of the game at which they occurred. We can then see if there is a trend in which play is more likely to be chosen depending on how much time is left in the game. We see a very strong uptick in pass plays at the end of the first half, as teams make last-ditch efforts before halftime. We also see a slight uptick in run plays at this time, as some teams may be running the ball if they are close to the end zone. We also see an uptick in pass plays at the end of the game, while we see a downtick in run plays at the end of the game. This makes sense because some teams may run the ball at the end of the first half, but few teams would run the ball if they are trailing near the end of the game. This is when teams attempt a last Hail Mary and hope they can win, which explains the uptick in pass plays at the end of the game. We also see many run plays at the beginning of games, as most teams start out with a few run plays to test the waters and get the offense warmed up before the QB starts throwing and possibly gives up an interception due to early-game jitters.

We are interested in seeing how time of game affects the play type that is called. We want to look further at how the standing of a team (winning or losing), together with time of game, affects the play type. We also want to broaden the play types we look at in the future.
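One way to turn the time-of-game discussion into counts is to bin each play by quarter using `game_seconds_remaining` and tally run and pass plays per bin. The sketch below is a minimal Python illustration with made-up plays, not the code behind the plot. It assumes a 3600-second regulation game split into four 900-second quarters and ignores overtime, and the helper names are ours.

```python
from collections import Counter

def quarter_of(game_seconds_remaining):
    """Map seconds remaining in regulation (3600..0) to a quarter number 1..4."""
    secs = max(int(game_seconds_remaining), 1)   # treat 0 as the final second of Q4
    return 4 - (secs - 1) // 900

def count_by_quarter(plays):
    """Count run and pass plays per quarter; other play types are skipped."""
    counts = Counter()
    for play in plays:
        if play["play_type"] in ("run", "pass"):
            counts[(quarter_of(play["game_seconds_remaining"]), play["play_type"])] += 1
    return counts

# Hypothetical plays, for illustration only.
plays = [
    {"play_type": "run",  "game_seconds_remaining": 3550},
    {"play_type": "pass", "game_seconds_remaining": 1850},
    {"play_type": "pass", "game_seconds_remaining": 1820},
    {"play_type": "punt", "game_seconds_remaining": 1805},
    {"play_type": "pass", "game_seconds_remaining": 30},
]

counts = count_by_quarter(plays)
for (quarter, play_type), n in sorted(counts.items()):
    print(f"Q{quarter} {play_type}: {n}")
```

Finer bins (for example, the last two minutes of each half) would follow the same pattern with a different binning function.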

Statistical Tests for Time Series

Now we are going to test whether the changes that were seen in the number of pass plays and run plays were significant.

## 
##  Welch Two Sample t-test
## 
## data:  data_plays_pass1$n_play and data_plays_pass2$n_play
## t = -6.4487, df = 1387.6, p-value = 0.0000000001554
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.931307 -1.030372
## sample estimates:
## mean of x mean of y 
##  4.802945  6.283784

## 
##  Welch Two Sample t-test
## 
## data:  data_plays_pass3$n_play and data_plays_pass4$n_play
## t = -8.1803, df = 1408.7, p-value = 0.0000000000000006258
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.5441810 -0.9468349
## sample estimates:
## mean of x mean of y 
##  4.842760  6.088268

## 
##  Welch Two Sample t-test
## 
## data:  data_plays_run1$n_play and data_plays_run2$n_play
## t = 0.38607, df = 1692.3, p-value = 0.6995
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2225947  0.3317017
## sample estimates:
## mean of x mean of y 
##  3.849425  3.794872

## 
##  Welch Two Sample t-test
## 
## data:  data_plays_run3$n_play and data_plays_run4$n_play
## t = -1.2655, df = 1360, p-value = 0.2059
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.43098317  0.09298007
## sample estimates:
## mean of x mean of y 
##  3.813657  3.982659

In the first test, we test whether the uptick in pass plays in the second quarter is significant. The p-value is approximately 1.6e-10, which is less than 0.05, so we have sufficient evidence to reject the null hypothesis that the mean number of pass plays is the same in the first and second quarters. Together with the sample means (4.80 versus 6.28), this indicates a statistically significant increase in pass plays from the first quarter to the second quarter.

In the second test, we test whether the uptick in pass plays in the fourth quarter is significant. The p-value is approximately 6.3e-16, which is less than 0.05, so we have sufficient evidence to reject the null hypothesis that the mean number of pass plays is the same in the third and fourth quarters. Together with the sample means (4.84 versus 6.09), this indicates a statistically significant increase in pass plays from the third quarter to the fourth quarter.

In the third test, we test whether the change in run plays in the second quarter is significant. The p-value is 0.6995, which is not less than 0.05, so we do not have sufficient evidence to reject the null hypothesis that the mean number of run plays is the same in the first and second quarters. We therefore find no significant change in run plays from the first quarter to the second quarter.

In the fourth and final test, we test whether the uptick in run plays in the fourth quarter is significant. The p-value is 0.2059, which is not less than 0.05, so we do not have sufficient evidence to reject the null hypothesis that the mean number of run plays is the same in the third and fourth quarters. We therefore find no significant increase in run plays from the third quarter to the fourth quarter.
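The decision rule applied to the four tests can be written out explicitly. The sketch below (Python, illustration only) compares each p-value with alpha = 0.05 and, as a cross-check, confirms that the null is rejected exactly when the 95% confidence interval for the difference in means excludes zero. The p-values and intervals are copied from the output above, and the labels follow the quarter pairings described in the text.

```python
ALPHA = 0.05

# (label, p-value, 95% confidence interval) copied from the four Welch t-test outputs.
results = [
    ("pass Q1 vs Q2", 1.554e-10, (-1.931307, -1.030372)),
    ("pass Q3 vs Q4", 6.258e-16, (-1.5441810, -0.9468349)),
    ("run Q1 vs Q2", 0.6995, (-0.2225947, 0.3317017)),
    ("run Q3 vs Q4", 0.2059, (-0.43098317, 0.09298007)),
]

decisions = {}
for label, p_value, (low, high) in results:
    reject = p_value < ALPHA                      # two-sided test at level ALPHA
    ci_excludes_zero = not (low <= 0.0 <= high)   # equivalent view via the 95% CI
    decisions[label] = (reject, ci_excludes_zero)
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"{label}: p = {p_value:.3g}, CI = ({low:.3f}, {high:.3f}) -> {verdict}")
```

Because four tests are run on related data, a stricter threshold (for example, a Bonferroni-adjusted 0.05 / 4) could also be considered. The two pass-play results would remain significant under that adjustment.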

In conclusion, we see that there is a significant increase in pass plays at the end of each half, while there is no significant change in run plays across the same periods. This indicates that the time in a football game affects the type of play that is called: when time is winding down at the end of either half, teams choose more pass plays while the number of run plays remains about the same.

Word Cloud Graphs

These two word clouds were constructed from the broadcasting text associated with each play in a given football game. The first word cloud was constructed from text recorded during any regular part of the game, excluding the end of the first half and the end of the second half. The second word cloud was constructed solely from text recorded during the ending of the first and second halves. In this way, we can observe whether the language describing plays changes depending on the point of the game.

We see an interesting result here. Surprisingly, the words “run” and “rush” do not seem to appear in either of the word clouds. This is not ideal, as we intended to compare the relative sizes of words associated with running and passing plays and see how that relationship changes at different points in the game. However, we can still reach some interesting conclusions by looking at the words associated with passing in both word clouds.

We see the words “pass” and “shotgun” increase by a large amount from the first word cloud to the second. “Shotgun” refers to a formation that allows longer pass plays and is generally used when a team wants to pass. Two other words that may be overlooked are “incomplete” and “deep”, which also signal more passing. “Incomplete” increases substantially from the first cloud to the second, and “deep” also increases. An incompletion can only occur on a pass play, so more incompletions further point to more pass plays at the end of the halves. The increase in “deep” gives a similar conclusion: it refers to a deep pass play, which suggests that teams attempt deep passes more often near the end of the halves. In the end, while the lack of mention of run plays is surprising, we can clearly see the trend that pass plays are described increasingly often near the ends of halves.

Bar Plot

We see an interesting result: pass plays are chosen much more often than run plays within a certain interval of score difference. If a team is losing by a very large amount, or winning by a very large amount, it is likely to have a more even ratio of run to pass plays. If a team is losing by a little and still has a chance to score and take the lead, it will use pass plays to try to get back in the game. We also see that if a team is winning by a small amount, it may look to increase its lead and call more pass plays to try to score more points quickly. It is interesting to see the ratio of pass plays to run plays even out as a team starts winning by a large amount. This can be explained by the fact that once teams are winning by a lot of points, they try to slow down the game and run a lot, passing only when necessary and playing a more balanced strategy. Teams in this position may also put in backup players, which further increases the likelihood of a more balanced ratio of play types and a more basic strategy.

KS Test

We want to see if the distribution of run plays and pass plays when considering score in the game are significantly different from each other. We know that there is a natural increase in run and pass plays in the middle interval of our differences in scores, but we want to see if the increase seen in pass plays is significantly bigger than the increase in run plays.

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  data_run$n_play and data_pass$n_play
## D = 0.28788, p-value = 0.008425
## alternative hypothesis: two-sided

We see that we have a p-value of 0.008425, which is less than 0.05. Therefore, we have sufficient evidence to reject the null hypothesis that these two distributions are the same. The KS test by itself only tells us that the distributions differ; combined with the bar plot, this is consistent with a bigger increase in pass plays than in run plays during the middle interval of score differences. This supports the claim that the score of the game does affect the type of plays chosen by a given team. Specifically, if a team is trailing by a small amount or winning by a small amount, the team will choose more pass plays; otherwise the team will choose more evenly between pass and run plays.
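For reference, the D statistic reported by the two-sample Kolmogorov-Smirnov test is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes it in plain Python on made-up samples. It illustrates the definition only. It is not the routine that produced the output above, and it does not compute a p-value.

```python
from bisect import bisect_right

def ks_statistic(x, y):
    """Two-sample KS statistic: max |ECDF_x(v) - ECDF_y(v)| over all observed v."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in sorted(set(xs) | set(ys)):
        ecdf_x = bisect_right(xs, v) / len(xs)   # share of x values <= v
        ecdf_y = bisect_right(ys, v) / len(ys)   # share of y values <= v
        d = max(d, abs(ecdf_x - ecdf_y))
    return d

# Hypothetical play counts per score-difference bin (not taken from the data).
run_counts = [1, 2, 3, 4]
pass_counts = [3, 4, 5, 6]

d_shifted = ks_statistic(run_counts, pass_counts)
d_same = ks_statistic(run_counts, run_counts)
print(d_shifted, d_same)
```

A larger D means the two empirical distributions are further apart somewhere along their range; identical samples give D = 0.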

Conclusion

As the time winds down in the first and second halves, we saw a statistically significant increase in pass plays over run plays. We also analyzed the text from the announcers from the end of the halves vs the text from all other points in the game. We saw that the text from the end of the halves follows the pattern of more pass plays over run plays at the end of the halves. This pattern can be seen from the differences in the verbiage and differences in the frequencies of certain words between the two word clouds. Specifically, we see an increase in words related to pass plays. Finally, we see that the score in a game does have an effect on whether a run or pass play is called, as there is a significantly bigger increase in pass plays than run plays when a team is closely trailing or closely winning.

Question 4

In this question, we explore whether or not ‘home field advantage’ has a statistical foundation within the NFL and, if so, why that might be the case. For our purposes, we will be looking at data from 1999 through 2021 to see how score distributions have evolved over time. Each observation in our dataset reflects a specific team’s performance over an entire game. Further, the following variables will be explicitly used in our analysis:

home_score: The total score that the home team achieved in a given game
away_score: The total score that the away team achieved in a given game
score: The total score that the observed team achieved in a given game
h_pos: Binary indicator that displays if the observed team is playing at their home location
extra_point: The count of “Extra point” plays that occur within a given game for the given team
field_goal: The count of “Field goal” plays that occur within a given game for the given team
kickoff: The count of “Kickoff” plays that occur within a given game for the given team
no_play: The count of “No play” plays that occur within a given game for the given team
pass: The count of “Pass” plays that occur within a given game for the given team
punt: The count of “Punt” plays that occur within a given game for the given team
qb_kneel: The count of “QB Kneel” plays that occur within a given game for the given team
qb_spike: The count of “QB Spike” plays that occur within a given game for the given team
run: The count of “Run” plays that occur within a given game for the given team

There are variables not mentioned above that were used strictly for identification purposes of each observation (i.e. game ID). In total, we maintain a dataset with 26 variables over 12,268 observations. Later in our analysis, we only consider games within the 2021 season which drops our number of observations to 570.

Below is a summary of these variables for 2021.

home_score	away_score	extra_point	field_goal	kickoff
Min. : 0	Min. : 0.00	Min. :0.000	Min. :0.000	Min. : 1.000
1st Qu.:17	1st Qu.:15.00	1st Qu.:1.000	1st Qu.:1.000	1st Qu.: 4.000
Median :23	Median :22.00	Median :2.000	Median :2.000	Median : 5.000
Mean :24	Mean :22.07	Mean :2.318	Mean :1.888	Mean : 5.109
3rd Qu.:31	3rd Qu.:30.00	3rd Qu.:3.000	3rd Qu.:3.000	3rd Qu.: 6.000
Max. :56	Max. :51.00	Max. :8.000	Max. :7.000	Max. :10.000

no_play	pass	punt	qb_kneel	qb_spike	run
Min. : 0.000	Min. : 3.00	Min. : 0.000	Min. :0.0000	Min. :0.0000	Min. : 7.00
1st Qu.: 3.000	1st Qu.:30.00	1st Qu.: 3.000	1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.:20.00
Median : 4.000	Median :37.00	Median : 4.000	Median :0.0000	Median :0.0000	Median :25.00
Mean : 4.518	Mean :36.86	Mean : 3.837	Mean :0.7158	Mean :0.1298	Mean :25.89
3rd Qu.: 6.000	3rd Qu.:43.00	3rd Qu.: 5.000	3rd Qu.:1.0000	3rd Qu.:0.0000	3rd Qu.:31.00
Max. :16.000	Max. :68.00	Max. :11.000	Max. :4.0000	Max. :2.0000	Max. :48.00

Initial Inspection

The above plot shows the total score that a team achieved in any given game, segmented by whether or not that team was playing at its home location. It is further faceted by season to show how, if at all, home field advantage has changed over time.

The graphic implies that, at least some of the time, home field advantage does exist. In nearly every season from 1999 to 2018, the away teams' score distribution peaks (reaches its mode) at a lower score than the home teams' distribution. Interestingly, 2019 and 2020 show almost identical score distributions for the two groups. Further, 2021 inverts the pattern, with away teams tending to perform better. Thus there appears to be a historical foundation for home field advantage, but one that has been subverted in recent years. For simplicity and relevance, from here on we consider only the most recent season, 2021, to analyze what might be going on and whether there is a statistically significant foundation for what we see above.

Statistical Tests

The above plot is an enlarged version of what we saw earlier, restricted to 2021. It represents the essence of our question: is there statistical backing for the implied difference in score distributions by team location? To assess this, we performed a two-sample t-test with the null hypothesis that both distributions have the same mean. The associated p-value was 0.025, so at the 5% level we reject the null hypothesis that the difference in means is zero. Read together with the density plot, this would suggest that away teams are outperforming home teams, although the test is two-sided and the direction of the difference has to be read from the sample means (the 2021 summary above reports a mean of 24 for home_score and 22.07 for away_score). However, we also performed a two-sample Kolmogorov-Smirnov test to determine whether there is a statistically significant difference between the two distributions as a whole. Interestingly, the p-value for this test was 0.311, which means we cannot conclude that the two distributions are truly different. The two tests do not measure exactly the same thing (the t-test compares means, while the Kolmogorov-Smirnov test compares entire distributions), but they still point to somewhat conflicting conclusions, so we need to delve deeper.
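The two tests can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the code behind the reported p-values: it uses Welch's unequal-variance form of the t statistic, a normal approximation for the t-test p-value (reasonable only for large samples), and the asymptotic Kolmogorov series for the KS p-value. The score lists are invented.

```python
from bisect import bisect_right
from math import exp, sqrt
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic with a large-sample two-sided p-value."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Assumption: normal approximation to the t distribution (large samples).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


def ks_two_sample(a, b):
    """Two-sample KS statistic D with the asymptotic Kolmogorov p-value."""
    sa, sb = sorted(a), sorted(b)
    # Largest gap between the two empirical CDFs, checked at every observed value.
    d = max(abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
            for x in sa + sb)
    lam = d * sqrt(len(sa) * len(sb) / (len(sa) + len(sb)))
    if lam == 0:
        return d, 1.0
    series = sum((-1) ** (k - 1) * exp(-2 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2 * series))


# Invented scores, for illustration only.
home_scores = [24, 31, 17, 27, 20, 35, 13, 23]
away_scores = [21, 28, 10, 24, 17, 30, 16, 20]
t_stat, t_p = welch_t(home_scores, away_scores)
ks_d, ks_p = ks_two_sample(home_scores, away_scores)
```

The sign of t_stat follows the order of the arguments, so a positive value here means the first sample has the larger mean. The KS statistic carries no direction at all, which is one reason the two tests can disagree.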

Play Breakdowns

A natural next step beyond analyzing just the scores is to consider what influences the score. The play counts in our dataset make this possible. Thus we consider the question: do home teams tend to use different strategies than away teams?

The above principal component analysis biplot visually represents exactly what we want to know. Each point represents a team’s performance in a game, plotted along dimensions derived solely from the play counts for that team. The data is segmented by group, with away teams represented by black circles and home teams represented by yellow triangles. Furthermore, the correspondingly colored ellipses show the overall grouping by team location, and each blue arrow represents how the corresponding variable influences the value along each dimension. Overall, we can see that home and away teams tend to perform the same plays. The one exception lies towards the bottom right of the graph, which indicates that the home team tends to perform more ‘QB Kneel’ and ‘Extra Point’ plays. It remains to be seen whether this is a significant difference.
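For readers unfamiliar with how the biplot axes arise, the sketch below computes two principal components from play-count columns. It assumes the columns are standardized first (the report does not state whether scaling was applied) and uses plain power iteration with deflation as a stand-in for a library PCA routine. The toy counts are invented.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(rows):
    """Center each column and scale it to unit variance (no constant columns)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]


def top_components(c, k=2, iters=500):
    """Leading k (eigenvalue, loading vector) pairs via power iteration."""
    p = len(c)
    c = [row[:] for row in c]
    pairs = []
    for _ in range(k):
        # Uneven start vector, so it is unlikely to be orthogonal to the answer.
        v = [float(i + 1) for i in range(p)]
        for _step in range(iters):
            w = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(c[i][j] * v[j] for j in range(p)) for i in range(p))
        pairs.append((lam, v))
        # Deflate: remove the component just found before looking for the next.
        c = [[c[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return pairs


def scores(z, pairs):
    """Coordinates of each standardized row along each component."""
    return [[sum(x * w for x, w in zip(r, v)) for _, v in pairs] for r in z]


# Toy (pass, run) counts for four team-games, invented for illustration.
toy = [[30, 25], [42, 18], [35, 22], [28, 31]]
z = standardize(toy)
pairs = top_components(correlation_matrix(z), k=2)
points = scores(z, pairs)
```

In a biplot, the entries of points are the plotted coordinates of each team-game, and each variable's arrow is drawn from its loadings on the two components. The sign of a component is arbitrary, so a mirrored plot carries the same information.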

Model Inference

The exploratory data analysis so far has given us conflicting results and leaves open the question of whether the differences we observe reflect true differences in the underlying distributions. The final portion of our analysis uses two linear models to assess the relationship between our features and a home field advantage.

The first linear model tests whether we can predict if a given team is the home team from its plays. That is, we predict h_pos from extra_point, field_goal, kickoff, no_play, pass, punt, qb_kneel, qb_spike, and run. If any coefficient in this model had a low p-value, we could conclude that there is a statistically significant difference between the types of plays home and away teams perform. None of these features had a p-value below 0.05 when predicting h_pos, so, as one might expect, we find no evidence that teams alter their strategy based on where they are playing.

The second linear model tests whether we can predict the score of a given team if we know its play counts and whether it is playing at its home field. That is, we predict score from extra_point, field_goal, kickoff, no_play, pass, punt, qb_kneel, qb_spike, and run, as well as the interaction term between each of these variables and h_pos. The model revealed significant relationships with some of the base variables (e.g., more field goal plays tend to be associated with a higher score), but this is to be expected. More importantly, none of the interaction terms had a significant p-value, so we find no significant difference in the scoring power of home and away teams. To illustrate what a significant interaction would have meant: had the interaction term for extra_point been significant, this might imply that the referees give preference to the home team. Similarly, had the interaction term for field_goal been significant, this might imply that home teams have a higher chance of successfully performing the play on their home turf than when they are away. Again, this was not the case for any of the interaction terms.
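To make the role of the interaction terms concrete, the sketch below builds a design matrix with one play-count column, h_pos, and their product, then fits it by least squares. It is a minimal illustration under assumptions: the data are synthetic, only coefficients are computed (no standard errors or p-values), and the normal equations are solved directly where a statistics package would use a more stable method.

```python
def design_row(counts, h_pos):
    """[intercept, counts..., h_pos, counts * h_pos...] for one team-game."""
    return [1.0] + list(counts) + [float(h_pos)] + [c * h_pos for c in counts]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Synthetic data: score depends on field goals and h_pos, with no interaction.
data = [(fg, h) for fg in range(4) for h in (0, 1)]
X = [design_row([fg], h) for fg, h in data]
y = [10 + 3 * fg + 2 * h for fg, h in data]
intercept, b_fg, b_home, b_fg_x_home = ols(X, y)
```

Here b_fg is the score change per field goal play for away teams, and b_fg + b_fg_x_home is the same quantity for home teams. An interaction coefficient near zero therefore means the play type is worth about the same regardless of location, which is the pattern the model above reports.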

Conclusion

While the density curves initially suggested differences in score distribution by team location, the other indicators fail to show that this is statistically significant. Our statistical tests fail to conclusively show a difference in team performance. Our linear models fail to show that teams perform different plays based on their location, and further fail to show that home teams are better than away teams at converting their plays into points. Altogether, we cannot conclude that there is a statistically significant foundation for ‘home field advantage’ in the 2021 season based on our dataset.

With regard to future work, it is possible that confounding variables not measured here would reveal a true distributional difference. Exploration with expanded features is recommended but would require additional data collection.

Final Conclusions Summary

Question 1

More recent seasons have seen quarterbacks scramble more often, use the shotgun formation more, and throw fewer interceptions.
We find that the increase in the QB scramble fraction is not correlated with an increase in winning probability on average but is significantly correlated with a decrease in air yards per season at the 95% confidence level.
We find that the QB scramble and shotgun fractions, both of which are heavily used in the RPO scheme, are on average greater from 2011 onward than from 2006 to 2010.
Finally, a t-test indicates that the total number of interceptions decreased from 2011 onward relative to 2006 to 2010.

Question 2

There is a definitive and significant relationship between points given up by a team and various defensive statistics.
More interceptions, sacks, incompletions, and fumbles forced and allowed are associated with giving up fewer points in a game, and more yards allowed is associated with giving up more points in a game.

Question 3

As time winds down in the first and second halves, we saw a statistically significant increase in pass plays over run plays.
We saw that the play description text from the end of each half follows the same pattern, with more pass plays than run plays.
We see that the score in a game does affect whether a run or pass play is called, as there is a significantly larger increase in pass plays than run plays when a team is closely trailing or closely winning.

Question 4

Initial exploratory data analysis indicates potential differences between home team scores and away team scores.
The usage of statistical tests and linear models to ascertain the nature of this relationship failed to conclusively show a difference in team performance.
We cannot conclude that there is a statistically significant foundation for ‘home field advantage’ in the 2021 season based on our dataset.

Skyler Mason, Eli Cohen, Jacob Riviere, Verne Garin

5/2/2022
