The purpose of this project is to construct a model that automatically predicts the coverage type of a defensive player from tracking data. A critical component in developing such a model is labeled data, against which the model can be evaluated and refined. To that end, a website has also been created to crowdsource information about defensive coverage types.
This project builds upon the work of Dutta et al., who originally built a model for the same purpose of identifying defensive coverage types. This project attempts to advance those ideas by using a different clustering method, incorporating labeled information, and generalizing to positions other than cornerbacks.
Publicly available statistics for defensive players are mostly limited to play-level information about tackles, interceptions, and sacks. While the NFL does release innovative metrics through NextGen Stats, these statistics cover only passing, rushing, and receiving. Private companies, such as Pro Football Focus, do release detailed analysis of defensive players (including their coverage type), but this information is not available to the public and it is not generated instantaneously.
A model that automatically predicts defensive coverage types for many players on every play provides unique information that is not otherwise publicly available. This information can be used as a cost-effective and quick way for teams to understand their opponent’s defensive play style. It can also be used by external researchers. Research on topics such as the efficacy of different defensive coverage styles has been limited by the lack of a large, publicly accessible database of labeled play coverage types, and this project can help advance the study of defensive strategy.
The website is important for a variety of reasons. Since it requires users to log their confidence with each prediction, it provides a repository of soft labels. This can be used to identify plays with unconventional coverage and to create a more nuanced validation metric (since mixture models provide soft predictions which can be compared with the soft labels).
Another reason the website is important is that it is a relatively rare method of collecting data. As mentioned above, much of the labeling work needed to produce insightful statistics is done by private companies that don’t release their data for free. The website, if successful, can show the potential of public crowdsourcing for the creation of new publicly available statistics, and others can easily adopt similar methods for research into their own inventive metrics.
The data used in this project is tracking data provided by the NFL for the first six weeks (91 games) of the 2017-18 NFL season. The tracking data provides information about the location (in x,y coordinates), speed, and direction of a particular player every tenth of a second for the duration of the play.
The data includes 5,776 unique passing plays. As there are often multiple cornerbacks on each play, there are 15,483 cornerback observations in the data. The distribution of the time until throw for each play is shown below (note that one second is ten frames):
There are two main types of pass coverage: man coverage and zone coverage. In man coverage, a defensive player defends an offensive player for the duration of the play. In zone coverage, a defensive player defends a zone during the play.
As such, the coverage type of a defensive player is determined by their relation to other players on the field and the features are constructed to represent this.
While features have been computed for cornerbacks, linebackers, and safeties, this project mostly focuses on predicting the coverage of cornerbacks. All graphs in this report (unless mentioned otherwise) are constructed from data exclusively from cornerbacks. However, the clustering methods applied to cornerbacks are easily generalizable to other positions.
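The features used in this project are exactly the same as the ones used by Dutta et al. and broadly fall into two categories: summary features, which describe a player’s own movement across the field, and relational features, which describe the player’s position and direction relative to nearby players.
This project also generated new features not present in the original paper. However, none of the new features led to better model performance, so they are not used.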
The features used are as follows:
Feature Name | Description |
---|---|
var_x | Variance in the x-coordinate |
var_y | Variance in the y-coordinate |
var_s | Variance in speed |
mean_off_dist | Mean distance to nearest offensive player |
var_off_dist | Variance in distance to nearest offensive player |
mean_def | Mean distance to nearest defensive player |
var_def | Variance in distance to nearest defensive player |
mean_dir_diff | Mean difference in direction between the player and the nearest offensive player |
var_dir_diff | Variance of the difference in direction between the player and the nearest offensive player |
rat_mean | Mean of the ratio of the distance to the nearest offensive player and the distance of that offensive player to the nearest defensive player |
rat_var | Variance in the ratio of the distance to the nearest offensive player and the distance of that offensive player to the nearest defensive player |
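As a rough illustration of how features like those in the table could be computed, the sketch below derives a subset of them for one defender on one play. The input layout (per-frame dictionaries with `x`, `y`, `s`, and `dir` keys), the use of population variance, and the wrapping of direction differences to the 0 to 180 degree range are all assumptions made for this example, not details taken from the project. The defensive-distance and ratio features are omitted for brevity.

```python
import math
from statistics import fmean, pvariance


def heading_diff(a, b):
    """Smallest absolute difference between two headings in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def coverage_features(defender, offense):
    """Compute a subset of the table's features for one defender on one play.

    defender: list of per-frame dicts with keys "x", "y", "s", "dir" (assumed layout).
    offense:  list with one entry per frame, each a list of such dicts for the
              offensive players on the field in that frame (assumed layout).
    """
    off_dist, dir_diff = [], []
    for frame, opponents in zip(defender, offense):
        # Nearest offensive player in this frame by Euclidean distance.
        nearest = min(
            opponents,
            key=lambda o: math.hypot(o["x"] - frame["x"], o["y"] - frame["y"]),
        )
        off_dist.append(math.hypot(nearest["x"] - frame["x"], nearest["y"] - frame["y"]))
        dir_diff.append(heading_diff(frame["dir"], nearest["dir"]))
    return {
        "var_x": pvariance([f["x"] for f in defender]),
        "var_y": pvariance([f["y"] for f in defender]),
        "var_s": pvariance([f["s"] for f in defender]),
        "mean_off_dist": fmean(off_dist),
        "var_off_dist": pvariance(off_dist),
        "mean_dir_diff": fmean(dir_diff),
        "var_dir_diff": pvariance(dir_diff),
    }
```

A defender who shadows a receiver at a constant distance would produce a `var_off_dist` near zero under this sketch, which matches the intuition for man coverage.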
The features attempt to represent the information a human would use when classifying the coverage type of a player: the motion of the player across the field and the relative position and direction between the player and the closest offensive or defensive player.
To assist with the EDA process, the coverage types of dozens of players were manually labeled as either man or zone coverage. These labels were used to identify how specific features differ under man and zone coverage. Labels were generated for 88 cornerbacks.
The distribution of features on the manually labeled data (colored by the coverage type) is as follows:
The differences in the distributions seem to correspond with what one would expect intuitively. For example, on average, defenders in man coverage are closer to offensive players. Furthermore, the man coverage distribution almost always has a mode at or very close to zero, which shows that in man coverage the defenders follow the receivers closely. On the other hand, in zone coverage the mode is not often at zero, which indicates that the defenders mimic the actions of the receivers less in zone coverage.
The clear distinction between the distributions shows that these features are able to discern some underlying difference between man and zone coverage and suggests that “man coverage” and “zone coverage” clusters may exist in the data.
It is also important to note that many of the features seem heavily right-skewed, which could make the high-dimensional clustering process more difficult.
Each point on the PCA visualization shows a two-dimensional representation of all the features for each player on a specific play.
The contribution measure in the graph is a scaled representation of the square of the correlation between the component axes and the variable. According to the graph, directional measures and the mean distance to the nearest defensive player provide a significantly lower contribution to the PCA.
The visual shows two distinct clusters for players that were manually labeled to be in man and zone coverage. The presence of these clusters further supports the idea that the features used in this project have the potential to reliably distinguish between coverage types (or at least approach human-level performance in the task).
The PCA also shows that a high degree of correlation exists between the features. Many of the vectors point in similar directions to one another; in fact, approximately half of the vectors point upward and to the left, while the other half point upward and to the right. Interestingly, the left half is exclusively summary features while the right half is exclusively relational features.
Since the variables are highly correlated, a small number of components could account for a lot of the variation in the data. This is exemplified in the following scree plot, which shows how much variance in the data can be explained by principal components.
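To make the link between correlated features and explained variance concrete, the sketch below computes the proportion of variance explained by the leading principal components of standardized data. It uses power iteration with deflation in plain Python, which is a stand-in chosen for self-containment and not the method used in this project. The function names are hypothetical, and the code assumes no feature is constant (a zero standard deviation would cause a division by zero).

```python
import math
from statistics import fmean, pstdev


def standardize(rows):
    """Center and scale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # assumes no constant column
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def covariance(rows):
    """Covariance matrix of already centered rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(p)] for i in range(p)]


def top_eigen(matrix, iters=500):
    """Largest eigenvalue and its eigenvector by power iteration."""
    p = len(matrix)
    v = [1.0 / (i + 1) for i in range(p)]  # asymmetric start vector
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v  # no variance left to explain
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p))
    return value, v


def explained_variance_ratio(rows, n_components):
    """Share of total variance captured by each of the first n_components."""
    cov = covariance(standardize(rows))
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    ratios = []
    for _ in range(n_components):
        value, v = top_eigen(cov)
        ratios.append(value / total)
        # Deflate: remove the component just found before finding the next.
        cov = [[cov[i][j] - value * v[i] * v[j] for j in range(p)] for i in range(p)]
    return ratios
```

When two features are perfectly correlated, one component absorbs the variance of both, so fewer components than features are needed to account for all of the variation.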
A GMM is a probabilistic model that assumes the data come from an underlying distribution that can be expressed as a mixture of \(K\) Gaussian components, where component \(k\) has parameters \(\theta_{k}\) and mixing proportion \(\pi_{k}\).
\(f(x)=\sum_{k=1}^{K} \pi_{k} f_{k}\left(x ; \theta_{k}\right)\)
The mixing proportion is the weight of each component in the final model. For a 1D case, each individual Gaussian component is expressed as follows:
\(f_{k}\left(x ; \theta_{k}\right)=N\left(x ; \mu_{k}, \sigma_{k}^{2}\right)=\frac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \exp \left(-\frac{\left(x-\mu_{k}\right)^{2}}{2 \sigma_{k}^{2}}\right)\)
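The two formulas above can be evaluated directly for the 1D case. The sketch below computes the component density, the mixture density, and the posterior probability that a point belongs to each component, which is the kind of soft prediction a mixture model provides. The function names and the example parameter values are illustrative assumptions, and the parameters are supplied by hand, not estimated from data.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def mixture_pdf(x, weights, means, variances):
    """f(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given x (sums to 1)."""
    parts = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(parts)
    return [p / total for p in parts]


# Hypothetical two-component mixture on a single feature.
weights, means, variances = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]
soft_prediction = responsibilities(1.5, weights, means, variances)
```

A point close to the first mean receives most of its posterior mass from the first component, while a point midway between two equally weighted, equally spread components is split evenly.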
As shown above, many of the features are heavily right-skewed. This is problematic because Gaussian mixture modelling relies on the assumption that the underlying distribution of a feature can be represented as a mixture of Gaussian components. While a right-skewed distribution can be approximated as a mixture of Gaussian components, it could be difficult for a model to approximate many of them.
There are a variety of methods to mitigate this problem. The data could be log transformed to make the features more Gaussian and the model could be fit on the transformed data. Or the model could be fit on the principal components of the data. Alternatively, a mixture of multivariate t distributions can be used to approximate the underlying distribution in the data instead of a GMM.
Consequently, four distinct models were used to create a clustering solution for identifying defensive coverage types:
Like the GMM, the multivariate t mixture model assumes that the data comes from a parametric distribution that can be approximated as a mixture of components. However, in this case the components are multivariate t distributions. A multivariate t distribution is parameterized by a mean vector \(\mu_{k}\), a scale matrix \(\Sigma_{k}\), and \(\nu_{k}\) degrees of freedom (Andrews and McNicholas). So, the mixture of \(K\) multivariate t components is represented by:
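As a small illustration of the log-transform option, the sketch below measures sample skewness before and after the transform. The `offset` of 1 is an assumption added here so that zero-valued features (for example, a variance of exactly zero) remain defined; the source does not state how zeros were handled. The sample values are invented for the example.

```python
import math
from statistics import fmean


def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


def log_transform(values, offset=1.0):
    """log(v + offset); the offset is an assumed guard against log(0)."""
    return [math.log(v + offset) for v in values]


# Invented right-skewed feature values.
raw = [0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 8.0]
before = skewness(raw)
after = skewness(log_transform(raw))
```

For right-skewed values like these, `after` is smaller than `before`, meaning the transformed feature is closer to symmetric, although it need not become exactly Gaussian.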
\(f(x)=\sum_{k=1}^{K} \pi_{k} f_{k}\left(x ; \mu_{k}, \Sigma_{k}, \nu_{k}\right)\)
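The multivariate t density \(f_{k}\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}, \nu_{k}\right)\) is expressed as:
\(f_{k}\left(\mathbf{x} ; \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k}, \nu_{k}\right)=\frac{\Gamma\left(\frac{\nu_{k}+p}{2}\right)\left|\boldsymbol{\Sigma}_{k}\right|^{-\frac{1}{2}}}{\left(\pi \nu_{k}\right)^{\frac{p}{2}} \Gamma\left(\frac{\nu_{k}}{2}\right)\left[1+\frac{\delta\left(\mathbf{x}, \boldsymbol{\mu}_{k} \mid \boldsymbol{\Sigma}_{k}\right)}{\nu_{k}}\right]^{\frac{\nu_{k}+p}{2}}}\)
where \(p\) is the number of features and \(\delta\left(\mathbf{x}, \boldsymbol{\mu}_{k} \mid \boldsymbol{\Sigma}_{k}\right)=\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{\top} \boldsymbol{\Sigma}_{k}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\) is the squared Mahalanobis distance between \(\mathbf{x}\) and \(\boldsymbol{\mu}_{k}\).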
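The density above can be evaluated numerically as in the sketch below, which works on the log scale for stability and takes \(\delta\) to be the squared Mahalanobis distance \((\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\), with \(p\) the number of features. A Cholesky factorization supplies both the determinant and the distance; this is an implementation choice for the example, not something stated in the source. The scale matrix is assumed to be symmetric and positive definite.

```python
import math


def cholesky(a):
    """Lower-triangular factor L with L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def mvt_logpdf(x, mu, sigma, nu):
    """Log density of a multivariate t with mean mu, scale sigma, and nu d.o.f."""
    p = len(x)
    low = cholesky(sigma)
    # Forward substitution: solve low * z = x - mu, so that delta = ||z||^2.
    z = []
    for i in range(p):
        s = (x[i] - mu[i]) - sum(low[i][k] * z[k] for k in range(i))
        z.append(s / low[i][i])
    delta = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(p))
    return (
        math.lgamma((nu + p) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * p * math.log(math.pi * nu)
        - 0.5 * log_det
        - 0.5 * (nu + p) * math.log1p(delta / nu)
    )
```

With one dimension, unit scale, and one degree of freedom, this reduces to the standard Cauchy density, which gives a convenient sanity check.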
Leave-one-week-out cross validation (LOWO CV) for weeks 1-6, using the adjusted Rand index (ARI) as a metric, can be used to evaluate the clustering models.
The ARI is a measure of similarity between two clusterings. It looks at the ratio of pair agreements to the total number of pairs and is adjusted for chance (Dutta et al.).
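For reference, the ARI between two label vectors can be computed from pair counts as in the sketch below. This is the standard contingency-table formulation written in plain Python; the function name is chosen for this example. Returning 1.0 for degenerate inputs (fewer than two observations, or both clusterings trivial) is a convention assumed here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-adjusted agreement between two clusterings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumed convention for degenerate input
    # Pairs of items that are grouped together in both clusterings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = 0.5 * (pairs_a + pairs_b)
    if maximum == expected:
        return 1.0  # both clusterings are trivial and identical in structure
    return (together - expected) / (maximum - expected)
```

Because the index depends only on which items are grouped together, it is unaffected by how the clusters are named: `[0, 0, 1, 1]` and `[1, 1, 0, 0]` are in perfect agreement.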
LOWO CV for held-out week k and a model with g components is as follows:
1. Fit a model \(m_{train}\) with g components on data from all weeks except k.
2. Fit a model \(m_{test}\) with g components on data from week k and log the model’s classifications as \(p_{test}\).
3. Compute \(p_{train}\), the predictions of \(m_{train}\) on the data from week k.
4. Compute the ARI between \(p_{train}\) and \(p_{test}\).
5. Repeat the process for every held-out week k and for all values of g.
Another model evaluation method is to compute the accuracy of the model on the labeled data. The accuracy of a model can be computed as follows:
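The LOWO CV steps can be sketched as a short loop. The interface below is hypothetical: `fit(rows, g)` stands for any routine that fits a g-component clustering model and returns a function mapping rows to cluster labels, and `ari(a, b)` stands for any ARI implementation. Neither name comes from the project.

```python
def lowo_cv(data_by_week, g, fit, ari):
    """Leave-one-week-out ARI for a g-component model.

    data_by_week: dict mapping week -> list of feature rows (assumed layout).
    fit(rows, g): hypothetical callable returning a predict(rows) function.
    ari(a, b):    hypothetical callable returning the adjusted Rand index.
    Returns a dict mapping each held-out week to its ARI.
    """
    scores = {}
    for week, held_out in data_by_week.items():
        # All rows from every week except the held-out one.
        train_rows = [
            row for w, rows in data_by_week.items() if w != week for row in rows
        ]
        predict_train = fit(train_rows, g)  # m_train
        predict_test = fit(held_out, g)     # m_test
        p_test = predict_test(held_out)
        p_train = predict_train(held_out)
        scores[week] = ari(p_train, p_test)
    return scores


def lowo_cv_grid(data_by_week, component_counts, fit, ari):
    """Repeat LOWO CV for every candidate number of components."""
    return {g: lowo_cv(data_by_week, g, fit, ari) for g in component_counts}
```

A high ARI for a given week means that a model trained on the other weeks partitions that week's players in much the same way as a model trained on that week alone, which indicates a stable clustering.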
The five most important features are as follows.
The importance of feature f is measured as the difference between the LOWO CV ARI obtained with all features and the LOWO CV ARI obtained with all features except f.
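This leave-one-feature-out measure can be sketched as below. The callable `lowo_ari(features)` is hypothetical and stands for any routine that returns the LOWO CV ARI for a model fit on the given feature names. The sign convention, in which a positive value means the ARI drops when the feature is removed, is an assumption made for this example.

```python
def feature_importance(features, lowo_ari):
    """ARI with all features minus ARI with each feature left out in turn.

    features: list of feature names.
    lowo_ari: hypothetical callable mapping a list of feature names to an ARI.
    """
    baseline = lowo_ari(features)
    return {
        f: baseline - lowo_ari([c for c in features if c != f])
        for f in features
    }


def top_features(importance, n=5):
    """Names of the n features with the largest importance values."""
    return sorted(importance, key=importance.get, reverse=True)[:n]
```

Under this convention, a feature whose removal leaves the ARI unchanged has an importance of zero.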
The accuracy for the two-cluster solution (even though it is not optimal according to the ARI) is 78%. The baseline accuracy (accuracy value when predicting every player is in zone coverage) is 51%.
For all models, the highest ARI value is the value for two clusters. This supports the notion that there are two main coverage types in the NFL: man and zone coverage.
The multivariate t mixture model is the best performing of the four models, both in terms of ARI and accuracy. This suggests that multivariate t distributions are more robust than multivariate Gaussian distributions at representing summary statistics of a player’s movement across the field as well as the relations between a player and other players on the field.
The high ARI and accuracy values of the multivariate t mixture model also indicate that it is an effective model to predict man and zone coverage in the NFL. Furthermore, the fact that the top 5 most important features are all measures of variance suggests that variance is more important than the mean when classifying a coverage type.
The website contains an introduction page (which provides information on how to identify man and zone coverage), a page to label plays, and an about us page. In the labeling interface, a user enters a username, watches an animation of a play, and labels the coverage type of a given player (identified by the jersey number) as well as the confidence in that label. This information, along with information about the play, is saved in a Google Sheet.
Since the confidence of the users in their predictions is measured on a 0-100 scale, the website essentially logs the probability each cornerback is in man or zone coverage (as determined by the human user). Since mixture models are probabilistic models that also provide a probability for each classification, the website can be used to provide much more intricate comparisons between the model’s predictions and a human’s predictions.
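One possible form of such a comparison is sketched below. It converts a label and a 0-100 confidence into a probability of man coverage, then measures the mean squared difference between the model's probabilities and the human ones, in the spirit of a Brier score. Both the conversion rule (confidence read as the probability of the chosen label) and the choice of metric are assumptions made for illustration; the project does not specify either.

```python
def human_man_probability(label, confidence):
    """Probability of man coverage implied by a label and a 0-100 confidence.

    Assumes the confidence is the probability (in percent) of the chosen label.
    """
    p = confidence / 100.0
    return p if label == "man" else 1.0 - p


def mean_squared_difference(model_probs, human_probs):
    """Average squared gap between model and human probabilities of man coverage."""
    return sum((m - h) ** 2 for m, h in zip(model_probs, human_probs)) / len(model_probs)


# Invented example: three labeled cornerbacks and hypothetical model posteriors.
human = [
    human_man_probability("man", 90),
    human_man_probability("zone", 80),
    human_man_probability("man", 55),
]
model = [0.85, 0.30, 0.40]
score = mean_squared_difference(model, human)
```

A score of zero would mean the model's probabilities match the human ones exactly, and plays where the two disagree sharply could be flagged for review as candidates for unconventional coverage.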
The website can be found at: https://nflclustering.shinyapps.io/nflclustering/
This project gives further evidence that, for a player, there are two primary types of defense. It presents an effective model that can predict man and zone coverage and it provides a website that can be used as a source of data to develop further iterations of the model. This project also shows that a mixture of multivariate t distributions is a better choice than a GMM on the features constructed from NFL tracking data. Additionally, the model is generalizable to positions other than cornerbacks, but a three-cluster solution is preferable in such a case.
However, there are many limitations in this project. The model validation approaches primarily relied on hard classifications and did not take the model’s confidence for a particular prediction into account. The accuracy of the model is computed on 88 data points, which may be too small a sample. Additionally, human-generated labels were taken as the ground truth in the accuracy computation and it is possible that there were a few mistakes in the labeling process. Finally, a high ARI value does not necessarily signify that the model’s classifications correspond to man and zone coverage as a human would see and understand them.
There are also many avenues for further research. Further analysis can be done on the generalized version of the model. The model’s behavior on different frames within a play can be examined to determine how its predictions change over the duration of a play and the impact of time on the model. In fact, to better understand how model predictions change over time, a different model such as a hidden Markov model can be used. The model can also be further analyzed to determine the situations in which particular players or teams employ man and zone coverage. Additionally, it could be used to compute the effectiveness of different players and teams in man and zone coverage. In fact, one theoretically could use the model to reconstruct a team’s defensive playbook (at a primitive level).
Thank you to Prof. Ron Yurko and Prof. Rebecca Nugent for all the feedback and help in developing the model and for providing the idea for the website. Dr. Mike Lopez also gave extremely helpful insights with regard to how such a model could be used and the next steps for this project.
Dutta, Rishav, et al. “Unsupervised Methods for Identifying Pass Coverage among Defensive Backs with NFL Player Tracking Data.” ArXiv.org, 14 Apr. 2020, https://arxiv.org/abs/1906.11373.
Andrews, J. L., and P. D. McNicholas. “Model-Based Clustering, Classification, and Discriminant Analysis with the Multivariate t-Distribution: The tEIGEN Family.” Statistics and Computing, 22(5), 1021–1029.