In the dynamic ecosystem of professional basketball, a player’s true value often lies at the intersection of on-court performance and the economics of the sport. By venturing beyond traditional evaluation metrics, we seek to shed light on the nuances of the game and offer insights that could inform team strategy. Our investigation examines whether player compensation aligns with performance, revealing potential hidden gems. Recognizing that basketball is a team sport, we also aim to identify optimal player combinations for a hypothetical team. This challenge has been tackled previously with methods developed by Maymin [1] and Kuehn [2], but we seek to combine the two team goals of positive surplus value and complementary playstyles in a single R Shiny App. Through this project, we offer a demonstration of a salary surplus model for an individual player and an expected points model for a five-man lineup, both of which can be expanded with more data.
Motivation
Teams want to optimize their cap space, so it is helpful to know how much they “should” pay a player.
Salary for the latest season is perhaps the best proxy for how much a player will demand for the upcoming season.
Teams want a way to see if a player is over-valued based on current output-based determinants of salary.
We also need to account for a team’s unique needs.
Our goals for this project are listed below, and are implemented in a Shiny App:
- Estimate which players are over or under-valued by predicting salary surplus in dollars.
- Evaluate how players fit with each other by predicting the probabilities of events that lead to an increase or decrease in expected points.
- Provide a list of recommended players to add to a four-man lineup based on salary surplus and how much a hypothetical team is willing to pay a player, whilst accounting for complementary playstyles.
A more expansive version of this tool could help teams decide which players to select based on their budget and unique needs.
Our project can be understood to be tackling two problems that each demanded different data.
The first problem is a salary surplus prediction model. A player’s salary surplus can be expressed as \(\delta_S = \hat{S} - S\), where \(\hat{S}\) is the player’s predicted salary, and \(S\) is the player’s true salary for the 2022-2023 season. Therefore, we need to predict \(S\) and use the residuals to evaluate a player’s surplus value: a positive \(\delta_S\) means the player is paid less than the model predicts, and a negative \(\delta_S\) means he is paid more.
Response Variable: Salary
Luckily, salaries for all players for the 2022-2023 season are publicly available. Once we extracted this information from Basketball Reference [3], we merged it with individual basic and advanced statistics, also from Basketball Reference [4]. The predictors of interest are listed below:
This data was used to create a model predicting salary based on conventional measures of a player’s value.
The second problem is accounting for a team’s specific needs. To make it more approachable, we considered a simple question: if you had a team of four players and had to choose one more to complete the squad, who would it be? To answer this question, we need play-by-play data that records the outcome of each possession for a given offensive and defensive lineup. This data was acquired from Ramiro Bentes’ GitHub [5].
Response Variable: Outcome of Possession
For possession-level data we need:
This data can be used to create an event tree of models that measures which players complement each other’s playstyles by influencing the probabilities of certain outcomes.
Cleaning the NBA salary/stats dataset:
Cleaning the NBA play-by-play dataset:
Before modeling, the distributions of variables within the NBA salary/stats dataset were examined to better understand the relationships that are present:
Salary Decomposition:
Salary Distribution by Position:
Salary Against Usage Percentage:
To predict salary and compute the surplus value of individual players, we used the Random Forest algorithm for regression, whose advantage lies in averaging the outputs of many decision trees to improve prediction accuracy and control over-fitting.
After eliminating irrelevant explanatory variables and shrinking the number of variables involved to minimize the influence of multi-collinearity, the random forest model was trained based on the following predominantly performance-based statistics:
To provide a range of plausible values and reflect uncertainty in the salary prediction, we modified the model to produce centered interval estimates, with a width of 1/4 standard errors for each tree split. To interpret the results, we reversed the log transformation by exponentiating the standard errors of our predictions. Each player’s true salary was then subtracted from the predicted salary to compute surplus values for potential contracts.
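As a rough illustration of the back-transformation and surplus calculation described above, the following Python sketch (the project itself was built in R) assumes the model predicts log salary and that the interval is formed on the log scale before exponentiating. The function name, the `width` argument, and the example numbers are hypothetical and are not the authors’ code or data.

```python
import math

def salary_surplus(pred_log_salary, se_log, actual_salary, width=0.25):
    """Return (surplus, lower, upper) in dollars.

    Assumption: the model predicts log(salary), and the interval is
    pred +/- width * se on the log scale, exponentiated back to dollars.
    """
    predicted = math.exp(pred_log_salary)
    lower = math.exp(pred_log_salary - width * se_log)
    upper = math.exp(pred_log_salary + width * se_log)
    surplus = predicted - actual_salary  # delta_S = S_hat - S
    return surplus, lower, upper

# Hypothetical player: predicted salary of $12M, actual salary of $9M.
surplus, low, high = salary_surplus(math.log(12_000_000), 0.4, 9_000_000)
print(f"surplus = ${surplus:,.0f}, interval = (${low:,.0f}, ${high:,.0f})")
```

Because the interval is built on the log scale, it is asymmetric around the predicted salary once converted back to dollars.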
For interpretability, we constructed player archetypes by comparing performance-based statistics and categorizing NBA players into clusters. This procedure allows us to investigate whether there exist significant gaps in contract and surplus values among different player archetypes.
To account for potential ambiguity in the categorization results, we utilized a Gaussian Mixture Model (GMM). This model enabled us to provide probabilistic or “soft” assignments when grouping the players into clusters.
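The “soft” assignment of a fitted mixture model follows from Bayes’ rule: the probability that a player belongs to a cluster is proportional to that cluster’s mixing weight times its density at the player’s statistics. The Python sketch below shows this for a one-dimensional mixture with already-fitted parameters; the one-dimensional simplification and all parameter values are assumptions made for illustration, not the fitted model from this project.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def soft_assignments(x, weights, means, sds):
    """Posterior probability of each cluster for one observation x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical mixture with three components.
probs = soft_assignments(x=1.0, weights=[0.5, 0.3, 0.2],
                         means=[0.0, 2.0, 5.0], sds=[1.0, 1.0, 1.0])
print([round(p, 3) for p in probs])
```

An observation halfway between two component means is split between them in proportion to their mixing weights, which is exactly the ambiguity a hard clustering would hide.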
GMM results:
To get an initial idea of how unseen lineups might perform, we chose to plot a proposed lineup against the 1000 most frequently occurring lineups in the 2022-2023 season. We then computed the pairwise distances between the proposed lineup and each of these lineups, summed them, and returned the 20 most similar lineups, i.e. those with the smallest summed distance.
This was done in three dimensions that decompose a player’s play style into parts using per-100-possession stats:
1. Offensive Contribution: a single-number stat that sums typical and advanced offensive stats, comprising assists, turnovers, FGA, offensive rebounds, and screen assists.
2. Defensive Contribution: a single-number stat that sums typical and advanced defensive stats, comprising deflections, number of contested 2-pt shots, charges drawn, blocks, steals, and defensive rebounds.
3. Spacing: a single-number stat that identifies players who are known to stretch the floor, i.e. players who shoot frequently from 3-pt range and are threats, comprising 3-pt attempts and 3P%.
This idea was inspired by Todd Whitehead’s project featured on FanSided [6].
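The comparison can be sketched in a few lines of Python. The sketch assumes each player is a point (offense, defense, spacing) and that the “summed pairwise distance” is the sum of Euclidean distances over every pair of players drawn from the two lineups; this is one possible reading, and the authors may have matched players differently. The lineup names and coordinates are hypothetical, and two-player lineups are used only to keep the example short.

```python
import math

def lineup_distance(lineup_a, lineup_b):
    """Sum of Euclidean distances over all player pairs across two lineups."""
    return sum(math.dist(p, q) for p in lineup_a for q in lineup_b)

def most_similar(proposed, candidates, top_n=20):
    """Rank candidate lineups by summed distance to the proposed lineup."""
    scored = [(lineup_distance(proposed, players), name)
              for name, players in candidates.items()]
    return [name for _, name in sorted(scored)[:top_n]]

# Hypothetical (offense, defense, spacing) coordinates per player.
proposed = [(1.0, 0.5, 2.0), (0.2, 1.5, 0.1)]
candidates = {
    "Lineup A": [(1.1, 0.4, 1.9), (0.3, 1.4, 0.2)],
    "Lineup B": [(3.0, 3.0, 3.0), (4.0, 4.0, 4.0)],
}
print(most_similar(proposed, candidates, top_n=1))
```

Because the three dimensions are sums of different statistics, standardizing each one before computing distances would keep any single dimension from dominating the ranking.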
A model predicting surplus value may not take into account how well a certain player may fit with four other players. To address this, we sought to create a model that predicted net expected points (\(NEP\)) for a possession given a lineup, which can be expressed as \(NEP = OEP - DEP\), where \(OEP\) and \(DEP\) are expected points while on offense and defense respectively. To measure interaction between players, we need a model that does not merely predict the probability of good or bad outcomes. Our model attempts to predict the probability of an event such as a 2-Point Field Goal Attempt, and then predicts the probability that the event will actually lead to a positive outcome: the shot being made, for example.
First, we have to predict the probabilities of each of these events. For events at the same level, we can model the probability of each event occurring with Multinomial Logistic Ridge Regression. We therefore create three Multinomial Logit models. The first predicts what event happens at the start of a possession given the players on the court. For example, we can predict whether a 2-Point FG attempt is taken based on which of the 250 players with the most possessions are on offense or defense. This method is similar to what was done by Maymin [1].
\[ \log \text{odds}(Event \mid Lineup) = \sum_{Player=1}^{250} \left( \beta_{off, Player} \cdot OnOffense_{Player} + \beta_{def, Player} \cdot OnDefense_{Player} \right) \]
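To show how such coefficients turn into event probabilities, the Python sketch below sums the offensive and defensive coefficients of the players on the court for each event and applies a softmax across events. The softmax over all events (rather than a reference-category form), the absence of an intercept, and every coefficient value are assumptions made for illustration; the function and player names are hypothetical.

```python
import math

def event_probabilities(offense, defense, beta_off, beta_def):
    """Softmax over per-event linear predictors built from on-court indicators.

    beta_off[event][player] and beta_def[event][player] are coefficients;
    players without a coefficient contribute 0.
    """
    scores = {}
    for event in beta_off:
        scores[event] = (sum(beta_off[event].get(p, 0.0) for p in offense)
                         + sum(beta_def[event].get(p, 0.0) for p in defense))
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {e: math.exp(s - peak) for e, s in scores.items()}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}

# Hypothetical coefficients for one offensive and one defensive player.
beta_off = {"2PFGA": {"A": 0.2}, "3PFGA": {"A": -0.1}, "TOV": {"A": 0.0}}
beta_def = {"2PFGA": {"X": -0.3}, "3PFGA": {"X": 0.1}, "TOV": {"X": 0.2}}
probs = event_probabilities(["A"], ["X"], beta_off, beta_def)
print({e: round(p, 3) for e, p in probs.items()})
```

Since each player enters only through an on-offense or on-defense indicator, swapping one player in a lineup changes the score of every event at once, which is how the model captures fit between players.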
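We can model the immediate points (IP) earned by a lineup before the restart of a possession (Defensive Non-Shooting Foul or Offensive Rebound) as shown below. The multipliers follow from the assumed 75% free throw rate: for example, a made and fouled 2-pointer is worth \(2 + 0.75 = 2.75 = 2 \cdot 1.375\) expected points, and a missed and fouled 3-pointer is worth \(3 \cdot 0.75 = 2.25\) expected points.
\[ \begin{aligned} IP = {} & 2 \cdot P(2PFGA) \cdot \{ P(Make + Foul \mid 2PFGA) \cdot 1.375 + P(Make \mid 2PFGA) + P(Miss + Foul \mid 2PFGA) \cdot 0.75 \} \\ & + 3 \cdot P(3PFGA) \cdot \{ P(Make + Foul \mid 3PFGA) \cdot 1.25 + P(Make \mid 3PFGA) + P(Miss + Foul \mid 3PFGA) \cdot 0.75 \} \end{aligned} \]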
Then we can model the probability of a second chance (SC), which arises from a Defensive Non-Shooting Foul or an Offensive Rebound, as shown below:
\[ \begin{aligned} Pr(SC) = {} & P(DNSF) + P(ORB) \cdot \{ \\ & P(2PFGA) \cdot [ P(Make + Foul \mid 2PFGA) \cdot (1 - P(Make \mid FT)) + P(Miss + Foul \mid 2PFGA) \cdot (1 - P(Make \mid FT)) + P(Miss \mid 2PFGA) ] \\ & + P(3PFGA) \cdot [ P(Make + Foul \mid 3PFGA) \cdot (1 - P(Make \mid FT)) + P(Miss + Foul \mid 3PFGA) \cdot (1 - P(Make \mid FT)) + P(Miss \mid 3PFGA) ] \} \end{aligned} \]
A possession could hypothetically go on forever if a team kept getting rebounds or getting fouled, but we cut it off at 3 second chances: \[ EP = \sum_{x=0}^{\infty} IP \cdot Pr(SC)^x \approx \sum_{x=0}^{3} IP \cdot Pr(SC)^x = IP + IP \cdot Pr(SC) + IP \cdot Pr(SC)^2 + IP \cdot Pr(SC)^3 \]
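The event tree can be evaluated numerically once the probabilities are known. The Python sketch below computes immediate points, the second-chance probability, and the truncated expected points. It derives the free-throw values from the assumed 75% rate rather than hard-coding multipliers, and every probability in the example is hypothetical; the dictionary keys are illustrative names, not the authors’ variable names.

```python
def immediate_points(p, ft=0.75):
    """Expected points before any restart; p maps event names to probabilities."""
    total = 0.0
    for shot, value in (("2", 2), ("3", 3)):
        and_one = p[f"make_foul_{shot}"] * (value + ft)   # basket plus one free throw
        clean = p[f"make_{shot}"] * value                 # basket, no foul
        fouled = p[f"miss_foul_{shot}"] * value * ft      # free throws only
        total += p[f"{shot}PFGA"] * (and_one + clean + fouled)
    return total

def second_chance_prob(p, ft=0.75):
    """Probability the possession restarts (DNSF or offensive rebound)."""
    rebound_chance = 0.0
    for shot in ("2", "3"):
        # A rebound is available after a missed shot or a missed last free throw.
        live_miss = ((p[f"make_foul_{shot}"] + p[f"miss_foul_{shot}"]) * (1 - ft)
                     + p[f"miss_{shot}"])
        rebound_chance += p[f"{shot}PFGA"] * live_miss
    return p["DNSF"] + p["ORB"] * rebound_chance

def expected_points(ip, sc, max_second_chances=3):
    """Truncated geometric series: IP * (1 + SC + ... + SC^max)."""
    return sum(ip * sc ** x for x in range(max_second_chances + 1))

# Hypothetical probabilities for one offensive/defensive lineup pairing.
p = {"2PFGA": 0.5, "3PFGA": 0.3, "DNSF": 0.05, "ORB": 0.25,
     "make_foul_2": 0.03, "make_2": 0.50, "miss_foul_2": 0.07, "miss_2": 0.40,
     "make_foul_3": 0.01, "make_3": 0.35, "miss_foul_3": 0.02, "miss_3": 0.62}
ip, sc = immediate_points(p), second_chance_prob(p)
ep = expected_points(ip, sc)
print(f"IP={ip:.4f}  Pr(SC)={sc:.4f}  EP={ep:.4f}  untruncated={ip / (1 - sc):.4f}")
```

The untruncated series has the closed form \(IP / (1 - Pr(SC))\), so the printout shows how little is lost by stopping at three second chances. Evaluating the same functions with a lineup on offense and then on defense gives \(OEP\) and \(DEP\), and hence \(NEP = OEP - DEP\).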
Building on the prediction of which action starts each possession, we fit another Multinomial Logistic Ridge Regression model for the probability of each of four outcomes given a 2-Point FG attempt: made and fouled, made and not fouled, missed and fouled, or missed and not fouled. The same modeling procedure was applied to 3-Point FG attempts. To account for rebounds, we used a Logistic Ridge Regression model, and for free throw attempts we assumed a 75% success rate.
After tuning tree complexity, the splitting rule, and the sampling scheme, the Random Forest regression model has an out-of-bag prediction error of 0.4486799 and an R-squared value of 0.6405087. The following visualizations show the ranking of the 18 variables involved by their importance to prediction, and the goodness of fit between observed and predicted salary.
Variable Importance:
Model Fit:
Having estimated each player’s salary from performance statistics, we can assess cost-effectiveness by computing a surplus value, measured in dollars, as the difference between a player’s predicted salary and current contract salary. The NBA players with the highest surplus values in the 2022-23 season are shown below:
Player List Preview:
## player g x3p x2p ft orb drb ast stl blk tov pts ts_percent
## 1 Kelly Oubre Jr. 48 3.3 7.5 4.9 2.0 5.7 1.7 2.1 0.6 2.0 29.9 0.534
## 2 Jordan Clarkson 61 3.7 7.3 4.8 1.7 4.2 6.5 0.8 0.3 4.5 30.5 0.558
## 3 Lauri Markkanen 66 4.2 7.8 7.3 2.7 9.2 2.6 0.9 0.8 2.7 35.5 0.640
## 4 Brook Lopez 78 2.7 6.9 3.0 3.2 7.3 2.0 0.7 3.9 2.2 24.9 0.630
## 5 Kyle Kuzma 64 3.5 7.7 3.8 1.2 8.9 5.2 0.8 0.6 4.1 29.5 0.544
## 6 Domantas Sabonis 79 0.5 9.5 5.7 4.4 12.6 10.0 1.1 0.7 4.0 26.4 0.668
## player g x3p x2p ft orb drb ast stl blk tov pts ts_percent
## 1 Dennis Schroder 66 1.8 4.7 5.2 0.5 3.4 7.1 1.2 0.2 2.7 19.8 0.545
## 2 Kelly Oubre Jr. 48 3.3 7.5 4.9 2.0 5.7 1.7 2.1 0.6 2.0 29.9 0.534
## 3 Kris Dunn 22 1.4 8.3 3.4 0.8 7.7 10.4 2.1 0.8 2.9 24.4 0.606
## 4 Jordan Clarkson 61 3.7 7.3 4.8 1.7 4.2 6.5 0.8 0.3 4.5 30.5 0.558
## 5 Royce O'Neale 76 3.3 1.3 1.0 1.1 6.7 5.7 1.3 1.0 2.3 13.6 0.538
## 6 Kevin Porter Jr. 59 3.4 6.0 5.0 1.8 5.7 8.1 2.0 0.4 4.5 27.1 0.565
## player g x3p x2p ft orb drb ast stl blk tov pts ts_percent
## 1 DeMar DeRozan 74 0.8 11.1 8.3 0.6 5.6 6.8 1.5 0.7 2.8 33.0 0.592
## 2 Kelly Oubre Jr. 48 3.3 7.5 4.9 2.0 5.7 1.7 2.1 0.6 2.0 29.9 0.534
## 3 Lauri Markkanen 66 4.2 7.8 7.3 2.7 9.2 2.6 0.9 0.8 2.7 35.5 0.640
## 4 Jordan Clarkson 61 3.7 7.3 4.8 1.7 4.2 6.5 0.8 0.3 4.5 30.5 0.558
## 5 Domantas Sabonis 79 0.5 9.5 5.7 4.4 12.6 10.0 1.1 0.7 4.0 26.4 0.668
## 6 Jalen Brunson 68 2.8 9.4 6.8 0.8 4.2 8.7 1.3 0.3 3.0 33.9 0.597
Based on the three clusters returned upon utilizing the Gaussian Mixture Model, we are able to assign play-style archetypes to these groups by comparing average values of each performance-based parameter and integrating a certain level of basketball knowledge.
Descriptions of the three archetypes we built are displayed below:
These three archetypes enable us to conduct a series of cross-comparisons of salary and surplus distributions between NBA players with different play-styles. The following visualizations help identify potential divergence in salary and surplus values:
Salary Distribution by Archetypes:
Surplus Distribution by Archetypes:
Model uncertainty:
While the process of clustering contributes to the classification of player archetypes, it also comes with its own level of uncertainty. In the Gaussian Mixture Model we introduced, uncertainty is defined as \(1 - max(p_i)\), where \(p_i\) are the corresponding probabilities for a player to be assigned to each of the 3 clusters. The plots below indicate the 3 players in each cluster who had the lowest/highest cluster assignment uncertainty.
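The uncertainty measure is simple to compute from the membership probabilities. The Python sketch below applies \(1 - max(p_i)\) to three made-up players and ranks them from least to most uncertain; the player labels and probabilities are hypothetical.

```python
def assignment_uncertainty(probs):
    """Uncertainty of a soft cluster assignment: 1 - max(p_i)."""
    return 1 - max(probs)

# Hypothetical membership probabilities for three clusters.
players = {
    "Player A": [0.98, 0.01, 0.01],
    "Player B": [0.40, 0.35, 0.25],
    "Player C": [0.70, 0.20, 0.10],
}
ranked = sorted(players, key=lambda name: assignment_uncertainty(players[name]))
print(ranked)  # ordered from least to most uncertain
```

With three clusters the measure ranges from 0 (a certain assignment) to 2/3 (all three clusters equally likely).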
The results led to some interesting lineup comparisons. In the example shown below, the proposed lineup of Stephen Curry, Chris Paul, Andrew Wiggins, Draymond Green, and Klay Thompson is most similar to a former Hornets lineup composed of LaMelo Ball, Terry Rozier, PJ Washington, Mason Plumlee, and Kelly Oubre Jr.
Top Ten Most Similar Lineups:
1. Kelly Oubre Jr, LaMelo Ball, Mason Plumlee, PJ Washington, Terry Rozier
2. Chimezie Metu, De’Aaron Fox, Malik Monk, Terence Davis, Trey Lyles
3. Al Horford, Derrick White, Jaylen Brown, Jayson Tatum, Marcus Smart
4. Chimezie Metu, De’Aaron Fox, Keegan Murray, Malik Monk, Terence Davis
5. Al Horford, Derrick White, Jaylen Brown, Malcolm Brogdon, Marcus Smart
6. Jalen McDaniels, Kelly Oubre Jr, LaMelo Ball, Mason Plumlee, Terry Rozier
7. Chimezie Metu, De’Aaron Fox, Harrison Barnes, Malik Monk, Terence Davis
8. Al Horford, Derrick White, Jaylen Brown, Jayson Tatum, Malcolm Brogdon
9. Chris Paul, Deandre Ayton, Devin Booker, Josh Okogie, Terrence Ross
10. Aaron Gordon, DeAndre Jordan, Jamal Murray, Kentavious Caldwell-Pope, Michael Porter Jr
Last Thoughts:
It is important to note that this tool was also able to find similar lineups for an entirely new five-man lineup in which no two players had ever played with each other. While this tool gave us an initial sense of how a new lineup might perform (one can then see how well the most similar lineup did across various metrics), we are aware of its limitations. Results will vary depending on how many lineups are included and which statistics are used to quantify play style. As such, we were motivated to use more precise and more predictive measures to assess novel lineup performance.
The reliability of our final prediction for net expected points was dependent on the accuracy of each constituent model. Here is how each ridge regression performed in terms of Mean Squared Error.
Below are the four offensive players with the highest coefficients for the predicted log odds of each outcome in the first multinomial logit model (predicting what happens after the start of the possession). If a player has a high coefficient for Turnovers, he is clearly contributing to his team negatively in that regard, as a turnover always results in 0 points for a possession. If a player has a high coefficient for Defensive Non-Shooting Fouls, he is gaining chances for more possessions. However, if he has a high coefficient for 2-Point Field Goal Attempts or 3-Point Field Goal Attempts, whether or not this is a positive depends on his effect on the probability that a shot is made or a rebound is grabbed, as well as who he is playing with.
Similarly, below are the four defensive players with the highest coefficients for the predicted log odds of each outcome in the first multinomial logit model. If a player has a high coefficient for Turnovers, he is clearly contributing to his team positively on the defensive end in some way (Turnover = 0 points). If a player has a high coefficient for Defensive Non-Shooting Fouls, he is giving the opponent another chance to score. However, if he increases the likelihood of 2-Point or 3-Point Field Goal Attempts, it could be a good or bad thing depending on how likely the opponent is to make the shot or get the rebound.
Our app takes five inputs: an interval of salary as a hard-cap and four players as members of a hypothetical 5-man lineup.
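One way such a recommendation step could work is sketched below in Python. It assumes each candidate already has a salary, a predicted surplus, and a net expected points value for the four-man lineup plus that candidate, and it ranks by fit first and surplus second. That ordering, the function name, the dictionary keys, and all values are assumptions for illustration; the app’s actual ranking logic is not described here.

```python
def recommend(candidates, lineup, salary_range, top_n=5):
    """Filter by salary and rank by lineup fit, then by surplus.

    candidates: list of dicts with keys name, salary, surplus, nep, where
    nep is the net expected points of the four-man lineup plus that player.
    """
    low, high = salary_range
    eligible = [c for c in candidates
                if c["name"] not in lineup and low <= c["salary"] <= high]
    eligible.sort(key=lambda c: (c["nep"], c["surplus"]), reverse=True)
    return [c["name"] for c in eligible[:top_n]]

# Hypothetical candidate pool and four-man lineup.
candidates = [
    {"name": "P1", "salary": 5_000_000, "surplus": 2_000_000, "nep": 0.04},
    {"name": "P2", "salary": 30_000_000, "surplus": -1_000_000, "nep": 0.09},
    {"name": "P3", "salary": 8_000_000, "surplus": 1_000_000, "nep": 0.06},
    {"name": "P4", "salary": 7_000_000, "surplus": 3_000_000, "nep": 0.06},
]
lineup = ["L1", "L2", "L3", "L4"]
picks = recommend(candidates, lineup, salary_range=(4_000_000, 10_000_000))
print(picks)
```

Here the best-fitting player (P2) is excluded because his salary falls outside the interval, which is the trade-off the salary input is meant to expose.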
The methods in this study could prove to be very useful for NBA teams to inform trade decisions. This project provides a potential way for teams to 1) evaluate if a player is currently over or undervalued and 2) see if a player is a good fit for the team. Based on the traditional player position metric, we cannot easily conclude whether players are disproportionally paid depending on their play-styles. Therefore, the visualizations on salary & surplus distribution given constructed archetypes are noteworthy as they reveal a significant divergence in salary rates but barely any difference in surplus values among different player archetypes.
We extend our gratitude to Carnegie Mellon’s Statistics & Data Science Department for providing us with the opportunity to work on an exciting sports analytics project. Special thanks to Dr. Ron Yurko, Meg Ellingwood, Shamindra Shrotriya, and the TAs for their invaluable guidance in the SURE Program and CMSAC. Additionally, we are thankful to Maksim Horowitz, Director of Basketball Strategy Analytics for the Atlanta Hawks, for his insightful advice. We’d also like to thank Dr. Joseph Kuehn, professor at California Polytechnic State University, for taking the time to help us implement the complementary playstyle event tree method. Working with everyone, including our fellow students, has been a rewarding and enriching experience. Thanks to everyone who made this project possible.
[1] Maymin, A., Maymin, P., & Shen, E. (2013). NBA chemistry: Positive and negative synergies in basketball. International Journal of Computer Science in Sport, December.
[2] Kuehn, J. (2017). Accounting for complementary skill sets: evaluating individual marginal value to a team in the National Basketball Association. Economic Inquiry, 55(3), 1556-1578.
[3] https://www.basketball-reference.com/contracts/players.html
[4] https://www.basketball-reference.com/leagues/NBA_2023_per_poss.html
[5] https://github.com/ramirobentes/NBA-in-R/blob/master/2022_23/regseason/pbp/pbp_lineups.R
[6] https://fansided.com/2020/09/16/nylon-calculus-nba-lineup-comparisons/#1977339-comment-replies
University of Connecticut, mathew.chandy@uconn.edu
Carnegie Mellon University, weiqianc@andrew.cmu.edu
University of California, Berkeley, lolo0213@berkeley.edu