Our main research question is how to predict the annual salary of a National Hockey League (NHL) player from their statistical output. The goal of our research is to produce a usable app that displays past and present hockey seasons along with the estimated “worth” of each player. Our model is intended to help league officials determine a reasonable contract offer for a player, so that a team can maximize its talent while staying within the salary cap.
In the National Hockey League, the top professional hockey league in the world, every team must abide by a salary cap. This dollar amount is set by league officials and is the maximum total a team can pay its entire roster. Each team dresses 20 players (18 skaters and 2 goaltenders), and for the upcoming 2022-23 season a team can pay these 20 players a maximum of $82,500,000. Unlike the caps in some other sports, it is a “hard” cap: a team cannot exceed the designated amount, and there is no luxury tax it can pay to get around it. The cap varies from year to year and is calculated as a designated percentage of previous league revenue, as agreed in the collective bargaining agreement.
We were motivated to do this project because we wanted to create a model that predicts an NHL player's cap hit as accurately as possible. We are very interested in how players are paid versus how much they are “worth”. This is a difficult question to answer because so much goes into a player's worth, including things we might not have access to or that cannot be quantified. That said, it is still interesting to see which aspects of a player's measurable performance are associated with their cap hit, and we are curious to see what our model predicts for some fan-favorite players.
We used three publicly available data sets throughout our research, taken from MoneyPuck, CapFriendly, and fastRhockey.
Our dataset contains 1,513 individual players who played in the NHL between the 2010-11 and 2021-22 seasons. There are roughly 20 players per team, but the number of teams in the league varies by year. For example, the league went from 30 to 31 teams in the 2017-18 season and to 32 in the 2021-22 season.
A quick overview of the variables we used in our research: we examined situational data (e.g. 5on4, 5on5), goals, assists, blocks, different types of shots (e.g. low danger, high danger), and various other offensive and defensive statistics. To choose these variables, we went through each data set and manually picked out what we deemed to be important potential explanatory variables. Note that our data is formatted to show within-team statistics, not overall league statistics.
player | position | team | games played | cap hit |
---|---|---|---|---|
Connor McDavid | F | EDM | 80 | 12500000 |
Artemi Panarin | F | NYR | 75 | 11642857 |
Auston Matthews | F | TOR | 73 | 11640250 |
Erik Karlsson | D | SJS | 50 | 11500000 |
Drew Doughty | D | LAK | 39 | 11000000 |
John Tavares | F | TOR | 79 | 11000000 |
We first wanted to look at the distribution of cap hit. For this we modeled a single season rather than all seasons together, to avoid having repeated player values in the model.
We can see that the majority of players (a little over 50%) are paid around the $1,000,000 mark.
Next, we want to determine the minimum number of games played to subset our data on. We used the empirical cumulative distribution function (ECDF) of games played to choose this cutoff.
We found that 20 games was a good cutoff, because roughly 75% of our data falls within that limit.
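The ECDF step can be sketched as follows. This is a minimal Python illustration, not the code used in the project; the games-played counts are hypothetical and the helper name `ecdf` is our own. It shows how an ECDF translates a candidate cutoff into the share of players that would be kept.

```python
from bisect import bisect_right

def ecdf(values):
    """Return a function F where F(x) is the share of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F

# Hypothetical games-played counts for eight players in one season.
games_played = [3, 8, 20, 41, 55, 67, 79, 82]
F = ecdf(games_played)

for cutoff in (10, 20, 40):
    # Games are whole numbers, so F(cutoff - 1) is the share below the cutoff.
    share_below = F(cutoff - 1)
    print(f"cutoff {cutoff}: {1 - share_below:.1%} of players retained")
```

Reading the curve this way makes the trade-off explicit: a higher cutoff removes more small-sample players but also shrinks the data available to the model.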
We considered and tested several modeling techniques: linear regression, random forest, ridge regression, lasso, an intercept-only model, and a cubist regression model. We chose these models because our motivation is prediction, and all of them are predictive modeling techniques. We first examined linear regression because it is the most interpretable of these models. We used ridge and lasso to check whether collinearity was present and adversely affecting our model. We used a random forest because we wanted to see how variables are weighted and how they affect the model as a whole. The cubist regression model is also tree-based, like random forest, but it approaches predictions differently. Lastly, we used the intercept-only model as a baseline to confirm that our data did in fact contain useful predictor variables.
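To illustrate the role of the intercept-only baseline, here is a small Python sketch under stated assumptions: the data are hypothetical, only one predictor is used, and the error is computed in-sample for brevity (a real comparison would use held-out data). The function names are our own.

```python
from math import sqrt
from statistics import mean

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))

def fit_simple_linear(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical season points and cap hits (in millions of dollars).
points = [10, 25, 40, 60, 90]
cap_hit_millions = [0.9, 2.0, 3.5, 6.0, 9.5]

# Intercept-only model: predict the mean cap hit for every player.
baseline_pred = [mean(cap_hit_millions)] * len(points)

# One-variable linear regression on points.
intercept, slope = fit_simple_linear(points, cap_hit_millions)
linear_pred = [intercept + slope * p for p in points]

baseline_rmse = rmse(cap_hit_millions, baseline_pred)
linear_rmse = rmse(cap_hit_millions, linear_pred)
print(f"baseline RMSE: {baseline_rmse:.3f}, linear RMSE: {linear_rmse:.3f}")
```

If a model with predictors cannot beat the intercept-only RMSE, the predictors are adding little; a clear gap supports the claim that the statistics carry signal about cap hit.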
We were advised to examine cubist regression for modeling players' projected cap hit. Compared with the other models we tested, the cubist model performed the best by far.
First, a quick overview of what a cubist model is. A cubist regression model is a rule-based model built from predictive trees, where each path, or “branch”, of a tree is a set of rules leading to a final leaf node. Each node holds a linear regression model, and these models are combined to create a prediction. The model also uses a boosting-like scheme in which several models are fit in sequence and their predictions are averaged; in the cubist model these iterations are known as committees, and we can tune how many to use.
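The prediction structure described above can be sketched in Python. This is a toy illustration, not the Cubist algorithm itself: the rules, coefficients, and variable names (`points`, `ice_time`) are hypothetical and written by hand rather than learned, and each rule set uses mutually exclusive rules so that exactly one linear model applies.

```python
def predict_rule_set(rule_set, player):
    """Apply the linear model of the first rule whose condition the player meets."""
    for condition, intercept, coefs in rule_set:
        if condition(player):
            return intercept + sum(coefs[name] * player[name] for name in coefs)
    raise ValueError("no rule matched this player")

def predict_committees(committees, player):
    """Average the predictions of all committee members (rule sets)."""
    preds = [predict_rule_set(rule_set, player) for rule_set in committees]
    return sum(preds) / len(preds)

# Hypothetical committee members: (condition, intercept, coefficients).
committee_1 = [
    (lambda p: p["points"] > 50, 2.0, {"points": 0.08}),
    (lambda p: p["points"] <= 50, 0.5, {"points": 0.04}),
]
committee_2 = [
    (lambda p: p["ice_time"] > 18, 1.0, {"points": 0.09, "ice_time": 0.05}),
    (lambda p: p["ice_time"] <= 18, 0.4, {"points": 0.05}),
]

player = {"points": 60, "ice_time": 20}
prediction = predict_committees([committee_1, committee_2], player)
print(f"predicted cap hit (millions): {prediction:.2f}")
```

The point of the sketch is the shape of the computation: rules pick a linear model, the linear model produces a value, and committee members are averaged into a final prediction.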
Our tuning parameter in this model is the number of “neighbors”. This setting essentially determines how smooth our model is going to be, with 0 being the smoothest and 9 being the most closely fitted. So, we first want to determine which neighbor value is best suited for our model.
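One way to picture what the neighbor setting does is the simplified Python sketch below. It assumes an unweighted correction with a single predictor and absolute distance; the Cubist package's actual distance measure and weighting may differ, and the model and training values here are hypothetical.

```python
def adjust_with_neighbors(model, train_x, train_y, new_x, k):
    """Shift the model's prediction using the k nearest training points.

    Each neighbor contributes its observed value plus the difference between
    the model's prediction for the new point and for that neighbor.
    """
    base = model(new_x)
    if k == 0:
        return base  # no neighbor-based adjustment
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - new_x))
    nearest = order[:k]
    corrected = [train_y[i] + (base - model(train_x[i])) for i in nearest]
    return sum(corrected) / len(corrected)

# Hypothetical fitted model and training data (cap hit in millions).
model = lambda x: 0.5 + 0.1 * x
train_x = [10, 30, 50, 70]
train_y = [1.2, 3.9, 5.0, 8.1]

for k in (0, 1, 3):
    adjusted = adjust_with_neighbors(model, train_x, train_y, 48, k)
    print(f"neighbors={k}: prediction {adjusted:.2f}")
```

With zero neighbors the prediction is the model's own output; with more neighbors it is pulled towards what was actually observed for similar training players.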
neighbor | rmse | r_squared |
---|---|---|
0 | 0.00105 | 0.775 |
1 | 0.000179 | 0.703 |
2 | 0.000389 | 0.742 |
3 | 0.000237 | 0.755 |
4 | 0.000276 | 0.765 |
5 | 0.000302 | 0.771 |
6 | 0.000222 | 0.771 |
7 | 0.000159 | 0.774 |
8 | 0.000131 | 0.776 |
9 | 0.0000924 | 0.776 |
We can see that a neighbor value of 9 is the best, since it has the lowest RMSE and, tied with 8, the highest \(R^2\) value. We also computed the optimal number of committees and found it to be 78. With these parameters, we can now produce a cubist regression model.
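The selection rule used here, taking the neighbor value with the lowest RMSE, can be written out explicitly. The Python sketch below uses the values reported in the tuning results; the variable names are our own, and it is only an illustration of the rule, not the project's tuning code.

```python
# (neighbor, rmse, r_squared) rows as reported in the tuning results.
tuning_results = [
    (0, 0.00105, 0.775),
    (1, 0.000179, 0.703),
    (2, 0.000389, 0.742),
    (3, 0.000237, 0.755),
    (4, 0.000276, 0.765),
    (5, 0.000302, 0.771),
    (6, 0.000222, 0.771),
    (7, 0.000159, 0.774),
    (8, 0.000131, 0.776),
    (9, 0.0000924, 0.776),
]

# Pick the row with the smallest RMSE.
best_neighbor, best_rmse, best_r2 = min(tuning_results, key=lambda row: row[1])
print(f"best neighbor: {best_neighbor} (RMSE {best_rmse}, R^2 {best_r2})")
```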
Cubist regression references: https://cran.r-project.org/web/packages/Cubist/vignettes/cubist.html#ensembles-by-committees
Comparing RMSE values, we can see that the cubist model performs better than the other 5 models.
Through the use of our cubist regression model we were able to showcase the comparison between predicted and actual salary.
We created an interactive Shiny app to show each player plotted above in more depth, along with attributes such as percent cap hit, team savings, age, and predicted cap hit.
Using the Shiny app and model, we can see that the top 5 most overpaid players are:
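As a rough illustration of how such attributes can be derived, the Python sketch below computes a cap hit percentage and the gap between actual and predicted cap hit, then ranks players by that gap. The players and predictions are hypothetical, the field names are our own, and using the 2022-23 cap of $82,500,000 as the denominator is an assumption; the app may define these quantities differently.

```python
SALARY_CAP = 82_500_000  # 2022-23 cap quoted earlier; assumed denominator

# Hypothetical players with actual and model-predicted cap hits (dollars).
players = [
    {"name": "Player A", "cap_hit": 8_000_000, "predicted": 6_500_000},
    {"name": "Player B", "cap_hit": 1_000_000, "predicted": 2_750_000},
    {"name": "Player C", "cap_hit": 4_125_000, "predicted": 4_000_000},
]

for p in players:
    p["pct_cap"] = p["cap_hit"] / SALARY_CAP          # share of the cap used
    p["difference"] = p["cap_hit"] - p["predicted"]   # positive means paid above prediction

# Largest positive gap first (most overpaid), largest negative gap first (most underpaid).
most_overpaid = sorted(players, key=lambda p: p["difference"], reverse=True)
most_underpaid = sorted(players, key=lambda p: p["difference"])

for p in most_overpaid:
    print(f'{p["name"]}: {p["pct_cap"]:.1%} of cap, difference {p["difference"]:+,}')
```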
And our top 5 most underpaid players are:
Some limitations we faced in our research and data include not having data on physical and intangible attributes such as speed, agility, strength, and leadership. The NHL does not provide puck tracking data, so we might not be able to find out the outcome of “50-50 battles”. Few defensive metrics are available, so the model could be skewed towards offensive players. We had to omit players with missing (NA) or impossible values for certain statistics, which left some players out of our model. We wanted this application to be accessible to everyone, so we were limited to publicly available data. To produce residual plots, we would have to create one for every individual node (33 plots in our final tree), which was not feasible in the time we had. Furthermore, we were unable to determine which assumptions, if any, our model makes, so even with residual plots we would not know which assumptions to check. Overall, the cubist regression model is very new to us, so there is a lot we could expand upon in the future.
We also have a few things we would like to look into in the future. The first doubles as both a limitation and a potential next step: to our knowledge, the number of committees (iterations, or trees produced) is capped at 100, and we would be curious to see how the model changes if a larger value were possible. We would like to categorize our cap hit results as “underpaid”, “on-par”, or “overpaid” rather than simply displaying the dollar difference between predicted and observed cap hit. We would also be curious to see what the model looks like without any data transformations (i.e. no standardization or normalization). We would like to try a different route and use our own calculated expected values in the predictive model. For the Shiny app, a nice feature would be a “fill in the blank” section that lets users input potential values for different player statistics. Lastly, we would be interested in seeing how this modeling technique could be used for salary predictions in other sports.
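The proposed categorization could look something like the Python sketch below. It is only one possible design: the 10% tolerance band and the function name `classify_contract` are hypothetical choices, not something we have implemented or validated.

```python
def classify_contract(cap_hit, predicted, tolerance=0.10):
    """Label a contract by how far the actual cap hit is from the prediction.

    `tolerance` is a hypothetical band: gaps within +/-10% of the predicted
    value count as "on-par".
    """
    relative_gap = (cap_hit - predicted) / predicted
    if relative_gap > tolerance:
        return "overpaid"
    if relative_gap < -tolerance:
        return "underpaid"
    return "on-par"

# Hypothetical (actual, predicted) cap hits in dollars.
examples = [(8_000_000, 6_500_000), (1_000_000, 2_750_000), (4_125_000, 4_000_000)]
for cap_hit, predicted in examples:
    print(cap_hit, predicted, classify_contract(cap_hit, predicted))
```

Using a relative rather than absolute gap keeps the labels comparable between entry-level and star contracts, though the right band width would need to be chosen with the model's error in mind.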
We would like to thank everyone who helped us throughout our research project: Dr. Ron Yurko; our advisors, Ms. Katerina Wu and Mr. Caleb Pena; our TAs, Meg Ellingwood, Nick Kissel, YJ Chloe, Wanshan Li, and Kenta Takatsu; and lastly, our cohort members.