Introduction:

In the dynamic ecosystem of professional basketball, a player’s true value often lies at the intersection of on-court performance and the economics of the sport. By looking beyond traditional evaluation metrics, we seek to shed light on the nuances of the game and offer insights that could inform team strategy. Our investigation examines whether player compensation aligns with performance, revealing potential hidden gems. Recognizing that basketball is a team sport, we also aim to identify optimal player combinations for a hypothetical team. This challenge has previously been tackled with methods developed by Maymin [1] and Kuehn [2], but we seek to combine the two team goals of positive surplus value and complementary playstyles in a single R Shiny App. Through this project, we offer a demonstration of a salary surplus model for an individual player and an expected points model for a five-man lineup, both of which can be expanded with more data.

Motivation

Teams want to optimize their cap space, so it is helpful to know how much they “should” pay a player.
Salary for the latest season is perhaps the best proxy for how much a player will demand for the upcoming season.
Teams want a way to see if a player is over-valued based on current output-based determinants of salary.
We also need to account for a team’s unique needs.

Our goals for this project are listed below, and are implemented in a Shiny App:

Estimate which players are over or under-valued by predicting salary surplus in dollars.

Evaluate how players fit with each other by predicting the probabilities of events that lead to an increase or decrease in expected points.

Provide a list of recommended players to add to a four-man lineup based on salary surplus and how much a hypothetical team is willing to pay a player, whilst accounting for complementary playstyles.

A more expansive version of this tool could help teams decide which players to select based on their budget and unique needs.

Data:

Our project can be understood to be tackling two problems that each demanded different data.

The first problem is a salary surplus prediction model. A player’s salary surplus can be expressed as \(\delta_S = \hat{S} - S\), where \(\hat{S}\) is the player’s predicted salary, and \(S\) is the player’s true salary for the 2022-2023 season. Therefore, we need to predict \(S\) and use the residuals to evaluate a player’s surplus value.

Response Variable: Salary

Luckily, salaries for all players for the 2022-2023 season are publicly available. Once we extracted this information from Basketball Reference [3], we merged it with individual basic and advanced statistics data, also from Basketball Reference [4]. The predictors of interest are listed below:

Player information
- Position
- Age
- Years in League
How much a player plays
- Games
- Minutes Played
- Minutes Per Game
Shooting per 100 possessions
- Two-point Field-goals
- Three-point Field-goals
- Free Throws
- Points Per 100
Other basic stats
- Offensive and Defensive Rebounds
- Assists and Turnovers
- Steals and Blocks
- Personal Fouls
Advanced stats
- True Shooting Percentage
- Assist Percentage
- Usage Percentage
- Offensive Win Shares
- Defensive Win Shares
- Offensive Box-Plus-Minus
- Defensive Box-Plus-Minus
- Value Over Replacement Player

This data was used to create a model predicting salary based on conventional measures of a player’s value.

The second problem is accounting for a team’s specific needs. To make it more approachable, we considered a simple question: if you had a team of four players and had to choose one more to complete the squad, who would it be? To answer this question, we need play-by-play data that records the outcome of each possession for a given offensive and defensive lineup. This data was acquired from Ramiro Bentes’ GitHub [5].

Response Variable: Outcome of Possession

For possession-level data we need:

What the initial event for the possession was
- Defensive Non-Shooting Foul
- Turnover
- 2-Point Field Goal Attempt
- 3-Point Field Goal Attempt
If a shot was taken, did they make it and were they fouled
If they were fouled, how many shots did they make
If a shot was missed, who got the rebound
Indicator variables for who was on the court, separated into…
- Players on Offense
- Players on Defense

This data can be used to create an event tree of models that measures which players complement each other’s playstyles by influencing the probabilities of certain outcomes.
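As a rough illustration of the on-court indicator variables described above, the sketch below encodes one possession as a 0/1 vector with an offense block followed by a defense block. It is written in Python for illustration rather than in the R used for the project, and the helper name `encode_lineups` and the four-player toy pool are assumptions, not part of the original pipeline.

```python
def encode_lineups(player_pool, offense, defense):
    """Return a 0/1 vector: one offense flag per player, then one defense flag per player."""
    index = {name: i for i, name in enumerate(player_pool)}
    n = len(player_pool)
    row = [0] * (2 * n)
    for name in offense:
        if name in index:              # players outside the pool get no indicator
            row[index[name]] = 1
    for name in defense:
        if name in index:
            row[n + index[name]] = 1
    return row


# Hypothetical toy pool; the project kept the 250 players with the most possessions.
pool = ["Player A", "Player B", "Player C", "Player D"]
row = encode_lineups(pool, offense=["Player A", "Player C"], defense=["Player D"])
print(row)  # [1, 0, 1, 0, 0, 0, 0, 1]
```

Keeping separate offense and defense blocks lets a model learn a different coefficient for the same player on each end of the floor.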

Cleaning the NBA salary/stats dataset:

Gathered salary and performance statistics data from the 2022 - 2023 season
Removed redundant rows ensuing from Basketball-Reference’s practice of listing all teams a player has been associated with
- If a player played with multiple teams over the course of a season, we retained their statistics from teams on which they clocked the most minutes
Excluded players who averaged < 12 minutes per game to omit those with marginal on-court time
- Roughly 33.5% of players were removed based on this approach
Log-transformed player salary to normalize the skewed distribution and reduce the influence of outliers
- Salaries often have a wide range, from rookie contracts to multimillion-dollar superstar deals
In the end: 317 observations
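The cleaning steps above can be sketched as follows. This is a minimal Python illustration rather than the project's R code: the function name `clean_salary_rows`, the field names, and the toy rows are all hypothetical, and it assumes minutes per game is computed from the retained team's row.

```python
import math


def clean_salary_rows(rows, min_mpg=12.0):
    """Keep one row per player (the team with the most minutes), drop
    players below min_mpg minutes per game, and add a log-salary column."""
    best = {}
    for row in rows:
        current = best.get(row["player"])
        if current is None or row["minutes"] > current["minutes"]:
            best[row["player"]] = row
    cleaned = []
    for row in best.values():
        mpg = row["minutes"] / row["games"]
        if mpg >= min_mpg:             # players averaging < 12 MPG are excluded
            cleaned.append({**row, "mpg": mpg, "log_salary": math.log(row["salary"])})
    return cleaned


# Hypothetical toy rows: P1 appears for two teams, P2 plays too few minutes.
rows = [
    {"player": "P1", "team": "AAA", "games": 30, "minutes": 900, "salary": 5_000_000},
    {"player": "P1", "team": "BBB", "games": 20, "minutes": 300, "salary": 5_000_000},
    {"player": "P2", "team": "CCC", "games": 40, "minutes": 200, "salary": 1_500_000},
]
cleaned = clean_salary_rows(rows)
print([(r["player"], r["team"], r["mpg"]) for r in cleaned])
```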

Cleaning the NBA play-by-play dataset:

Gathered play-by-play data from the 2022 - 2023 season
Excluded players who were not in the top 250 players with the most possessions
Got rid of “garbage time” possessions
Limited free-throw situations to those where a shooting foul was drawn
Used subsets of data to answer conditional probability questions
In the end: 60,401 observations

Exploratory Data Analysis (EDA)

Before modeling, the distributions of variables within the NBA salary/stats dataset were examined to better understand the relationships that are present:

Salary Decomposition:

Salary Distribution by Position:

Salary Against Usage Percentage:

Methods:

1. Random Forest Prediction Interval

To predict salary and compute the surplus value of individual players, we used the Random Forest algorithm for regression, whose advantage lies in averaging the outputs of many decision trees to improve prediction accuracy and control over-fitting.

After eliminating irrelevant explanatory variables and shrinking the number of variables involved to minimize the influence of multi-collinearity, the random forest model was trained based on the following predominantly performance-based statistics:

Games
Minutes Per Game
Years in League
Two-point and Three-point Field-goals
Free Throws
Offensive and Defensive Rebounds
Assists and Turnovers
Steals and Blocks
Points Per 100
True Shooting Percentage
Assist Percentage
Usage Percentage
Defensive Win Shares
Offensive Box-Plus-Minus

To provide a range of plausible values and reflect potential uncertainty in salary prediction, we modified the model to produce centered interval estimates, with a width of 1/4 standard error for each tree split. To interpret the predictions in dollars, we reversed the log transformation by exponentiating the predictions and the bounds built from their standard errors. The true salary of each player was then subtracted from the predicted salary to compute surplus values in potential contracts.
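The back-transformation and surplus calculation can be illustrated with a short Python sketch. It rests on assumptions the text does not spell out: that the interval is formed on the log scale as the prediction plus or minus 1/4 of a standard error and then exponentiated. The function name `surplus_interval` and the example numbers are hypothetical.

```python
import math


def surplus_interval(log_pred, log_se, salary, half_width=0.25):
    """Back-transform a log-salary prediction and return the
    (lower, point, upper) surplus in dollars.

    Assumption: the interval is log_pred +/- half_width * log_se on the
    log scale, exponentiated before the true salary is subtracted.
    """
    point = math.exp(log_pred) - salary
    lower = math.exp(log_pred - half_width * log_se) - salary
    upper = math.exp(log_pred + half_width * log_se) - salary
    return lower, point, upper


# Hypothetical player: predicted $10M, paid $8M, log-scale standard error 0.4.
lower, point, upper = surplus_interval(math.log(10_000_000), 0.4, 8_000_000)
print(round(lower), round(point), round(upper))
```

Because exponentiation is applied after the interval is formed, the dollar-scale interval is not symmetric around the point estimate.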

2. Clustering:

For interpretability, we constructed player archetypes by comparing performance-based statistics and grouping NBA players into clusters. This procedure allows us to investigate whether there are significant gaps in contract and surplus values among different player archetypes.

To account for potential ambiguity in the categorization results, we utilized a Gaussian Mixture Model (GMM). This model enabled us to provide probabilistic or “soft” assignments when grouping the players into clusters.

GMM results:

A VVE (ellipsoidal, equal orientation) model is selected with 3 components
Cluster size : 64, 90, 163

3. Scatter Plots to Assess Similarity between Novel and Existing Lineups:

To get an initial idea of how unseen lineups might perform, we chose to plot a proposed lineup against the 1,000 most frequently occurring lineups in the 2022-2023 season. We then computed the pairwise distances between the proposed lineup and each existing lineup, summed them, and returned the 20 most similar lineups, i.e. those with the smallest summed distance.

This was done in three dimensions that decompose a player’s play style using per 100 possession stats:

1. Offensive Contribution - A single-number stat that sums typical offensive stats and advanced offensive stats - Comprised of Assists, Turnovers, FGA, Offensive Rebounds, Screen Assists

2. Defensive Contribution - A single-number stat that sums typical defensive stats and advanced defensive stats - Comprised of Deflections, # of Contested 2Pt Shots, Charges Drawn, Blocks, Steals, Defensive Rebounds

3. Spacing - A single-number stat that identifies players who are known to stretch the floor, i.e. players who shoot frequently from 3-pt range and are threats - Comprised of 3-pt attempts and 3P%

This idea was inspired by Todd Whitehead’s project featured on FanSided [6].
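One way to read the similarity search is sketched below in Python. The text does not say how players are paired across lineups, so this sketch assumes the summed Euclidean distance over every cross-lineup pair of players in the three dimensions (offense, defense, spacing). The function names and toy coordinates are hypothetical, and the toy lineups have two players each only to keep the example short.

```python
import math


def lineup_distance(lineup_a, lineup_b):
    """Sum of Euclidean distances over every cross-lineup pair of players.

    Each player is an (offense, defense, spacing) tuple. Pairing all
    players against all players is an assumption of this sketch.
    """
    return sum(math.dist(p, q) for p in lineup_a for q in lineup_b)


def most_similar(proposed, candidates, top_n=20):
    """Return the names of the top_n candidate lineups with the smallest summed distance."""
    ranked = sorted(candidates.items(), key=lambda kv: lineup_distance(proposed, kv[1]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical coordinates for a proposed lineup and two existing lineups.
proposed = [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
candidates = {
    "near": [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)],
    "far": [(5.0, 5.0, 5.0), (6.0, 6.0, 6.0)],
}
print(most_similar(proposed, candidates, top_n=1))  # ['near']
```

A different pairing rule (for example, matching each player to his closest counterpart) would change the ranking, which is one reason results depend on how the distance is defined.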

4. Multinomial Logit Model Tree for Expected Points per Possession

A model predicting surplus value may not take into account how well a certain player may fit with four other players. To address this, we sought to create a model that predicted net expected points (\(NEP\)) for a possession given a lineup, which can be expressed as \(NEP = OEP - DEP\), where \(OEP\) and \(DEP\) are expected points while on offense and defense respectively. To measure interaction between players, we need a model that does not merely predict the probability of good or bad outcomes. Our model attempts to predict the probability of an event such as a 2-Point Field Goal Attempt, and then predicts the probability that the event will actually lead to a positive outcome: the shot being made, for example.

First, we have to predict the probabilities of each of these events. For events at the same level, we can model the probability of each event occurring with Multinomial Logistic Ridge Regression. Therefore, we can create three Multinomial Logit models. The first predicts what event happens at the start of a possession given the players on the court. For example, we can predict whether a 2-Point FG attempt is taken based on which of the 250 players with the most possessions are on offense or defense. This method is similar to the one used by Maymin [1].

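\[ Pr(Event|Lineup) = \frac{e^{\eta_{Event}}}{\sum_{Event'} e^{\eta_{Event'}}}, \qquad \eta_{Event} = \sum_{Player=1}^{250} \left( \beta_{off, Player}^{Event} * On Offense_{Player} + \beta_{def, Player}^{Event} * On Defense_{Player} \right) \]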
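The multinomial logit step can be sketched in Python as a softmax over per-event linear predictors, each built by summing the offensive and defensive coefficients of the players on the court. This is an illustration only: the function name `event_probabilities`, the coefficient values, and the omission of an intercept are assumptions, and the ridge penalty affects how coefficients are fitted rather than how probabilities are computed.

```python
import math


def event_probabilities(coefs, offense, defense):
    """Multinomial logit: softmax over per-event linear predictors."""
    scores = {}
    for event, beta in coefs.items():
        eta = sum(beta["off"].get(p, 0.0) for p in offense)
        eta += sum(beta["def"].get(p, 0.0) for p in defense)
        scores[event] = eta
    peak = max(scores.values())        # subtract the max for numerical stability
    exps = {e: math.exp(s - peak) for e, s in scores.items()}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


# Hypothetical coefficients for one offensive player (A) and one defender (B).
coefs = {
    "2PFGA": {"off": {"A": 0.2}, "def": {}},
    "3PFGA": {"off": {"A": -0.1}, "def": {"B": 0.3}},
    "TOV": {"off": {}, "def": {}},
    "DNSF": {"off": {}, "def": {}},
}
probs = event_probabilities(coefs, offense=["A"], defense=["B"])
print(probs)
```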

We can model the immediate points (IP) earned by a line-up before the restart of a possession (Defensive Non-Shooting Foul or Offensive Rebound) as shown below:

\[ \begin{aligned} IP ={} & 2 * P(2PFGA) * \{P(Make+Foul|2PFGA) * 1.375 + P(Make|2PFGA) + P(Miss+Foul|2PFGA) * 0.75\} \\ & + 3 * P(3PFGA) * \{P(Make+Foul|3PFGA) * 1.375 + P(Make|3PFGA) + P(Miss+Foul|3PFGA) * 1.125\} \end{aligned} \]
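To make the immediate-points formula concrete, the Python sketch below evaluates it for one set of made-up probabilities. The multipliers are copied as printed in the formula, the three-point term is assumed to be weighted by the probability of a 3-point attempt, and the function name `immediate_points` and all input probabilities are hypothetical.

```python
def immediate_points(p_2pfga, p_3pfga, given_2, given_3,
                     and_one_2=1.375, miss_foul_2=0.75,
                     and_one_3=1.375, miss_foul_3=1.125):
    """Immediate points for a possession; multipliers are copied from the formula as printed."""
    two = 2 * p_2pfga * (given_2["make_foul"] * and_one_2
                         + given_2["make"]
                         + given_2["miss_foul"] * miss_foul_2)
    three = 3 * p_3pfga * (given_3["make_foul"] * and_one_3
                           + given_3["make"]
                           + given_3["miss_foul"] * miss_foul_3)
    return two + three


# Hypothetical conditional outcome probabilities given each shot type.
given_2 = {"make_foul": 0.04, "make": 0.46, "miss_foul": 0.06}
given_3 = {"make_foul": 0.01, "make": 0.35, "miss_foul": 0.02}
ip = immediate_points(0.5, 0.3, given_2, given_3)
print(round(ip, 6))
```

Exposing the multipliers as parameters makes it easy to test how sensitive the result is to the assumed free-throw rate.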

Then we can model the probability of a second chance (SC) (Defensive Non-Shooting Foul or Offensive Rebound) like below:

\[ \begin{aligned} Pr(SC) ={} & P(DNSF) + P(ORB) * \{ \\ & \quad P(2PFGA) * \{P(Make+Foul|2PFGA) * (1 - P(Make|FT)) + P(Miss+Foul|2PFGA) * (1 - P(Make|FT)) + P(Miss|2PFGA)\} \\ & \quad + P(3PFGA) * \{P(Make+Foul|3PFGA) * (1 - P(Make|FT)) + P(Miss+Foul|3PFGA) * (1 - P(Make|FT)) + P(Miss|3PFGA)\} \; \} \end{aligned} \]

A possession could hypothetically go on forever if a team kept getting rebounds or getting fouled, but we cut it off at 3 second chances: \[ EP = \sum_{x=0}^{\infty} IP * Pr(SC)^x \approx IP + IP*Pr(SC) + IP*Pr(SC)^2 + IP*Pr(SC)^3 \]
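The effect of cutting the series off at three second chances can be checked numerically. The Python sketch below compares the truncated sum with the closed form of the infinite geometric series, IP / (1 - Pr(SC)), which holds whenever Pr(SC) is below 1. The function names and the example values of IP and Pr(SC) are hypothetical.

```python
def expected_points(ip, p_sc, max_second_chances=3):
    """Truncated series: IP * (1 + Pr(SC) + ... + Pr(SC)^max_second_chances)."""
    return sum(ip * p_sc ** x for x in range(max_second_chances + 1))


def expected_points_limit(ip, p_sc):
    """Closed form of the infinite geometric series, valid for Pr(SC) < 1."""
    return ip / (1.0 - p_sc)


# Hypothetical inputs: 0.9 immediate points and a 20% second-chance probability.
truncated = expected_points(0.9, 0.2)
limit = expected_points_limit(0.9, 0.2)
print(round(truncated, 4), round(limit, 4))
```

With a second-chance probability this small, the truncated sum is already within a fraction of a percent of the infinite series, so little is lost by stopping at three second chances.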

Building on the prediction of which action starts each possession, we fit another Multinomial Logistic Ridge Regression model for the probability of each of the four outcomes of a 2-Point FG attempt: made and fouled, made and not fouled, missed and fouled, or missed and not fouled. The same modeling procedure was applied to 3-Point FG attempts. For rebounds, we chose a Logistic Ridge Regression model, and for free throw attempts, we assumed a 75% success rate.

Results:

1. Random Forest Prediction Interval

After tuning for tree complexity, the splitting rule, and the sampling scheme, the Random Forest algorithm returns a regression model with an out-of-bag prediction error of 0.4486799 and an R-squared value of 0.6405087. The following visualizations show the importance ranking of the 18 variables involved and the goodness of fit between observed and predicted salary.

Variable Importance:

Model Fit:

Using the salary predicted from performance statistics, we can assess players’ cost-effectiveness by computing a surplus value, measured in dollars, for each player as the difference between predicted salary and current contract salary. The NBA players with the highest surplus values in the 2022-23 season are shown below:

Player List Preview:

Ranked by contract surplus:

##             player  g x3p x2p  ft orb  drb  ast stl blk tov  pts ts_percent
## 1  Kelly Oubre Jr. 48 3.3 7.5 4.9 2.0  5.7  1.7 2.1 0.6 2.0 29.9      0.534
## 2  Jordan Clarkson 61 3.7 7.3 4.8 1.7  4.2  6.5 0.8 0.3 4.5 30.5      0.558
## 3  Lauri Markkanen 66 4.2 7.8 7.3 2.7  9.2  2.6 0.9 0.8 2.7 35.5      0.640
## 4      Brook Lopez 78 2.7 6.9 3.0 3.2  7.3  2.0 0.7 3.9 2.2 24.9      0.630
## 5       Kyle Kuzma 64 3.5 7.7 3.8 1.2  8.9  5.2 0.8 0.6 4.1 29.5      0.544
## 6 Domantas Sabonis 79 0.5 9.5 5.7 4.4 12.6 10.0 1.1 0.7 4.0 26.4      0.668

Ranked by contract surplus lower bound:

##             player  g x3p x2p  ft orb drb  ast stl blk tov  pts ts_percent
## 1  Dennis Schroder 66 1.8 4.7 5.2 0.5 3.4  7.1 1.2 0.2 2.7 19.8      0.545
## 2  Kelly Oubre Jr. 48 3.3 7.5 4.9 2.0 5.7  1.7 2.1 0.6 2.0 29.9      0.534
## 3        Kris Dunn 22 1.4 8.3 3.4 0.8 7.7 10.4 2.1 0.8 2.9 24.4      0.606
## 4  Jordan Clarkson 61 3.7 7.3 4.8 1.7 4.2  6.5 0.8 0.3 4.5 30.5      0.558
## 5    Royce O'Neale 76 3.3 1.3 1.0 1.1 6.7  5.7 1.3 1.0 2.3 13.6      0.538
## 6 Kevin Porter Jr. 59 3.4 6.0 5.0 1.8 5.7  8.1 2.0 0.4 4.5 27.1      0.565

Ranked by contract surplus upper bound:

##             player  g x3p  x2p  ft orb  drb  ast stl blk tov  pts ts_percent
## 1    DeMar DeRozan 74 0.8 11.1 8.3 0.6  5.6  6.8 1.5 0.7 2.8 33.0      0.592
## 2  Kelly Oubre Jr. 48 3.3  7.5 4.9 2.0  5.7  1.7 2.1 0.6 2.0 29.9      0.534
## 3  Lauri Markkanen 66 4.2  7.8 7.3 2.7  9.2  2.6 0.9 0.8 2.7 35.5      0.640
## 4  Jordan Clarkson 61 3.7  7.3 4.8 1.7  4.2  6.5 0.8 0.3 4.5 30.5      0.558
## 5 Domantas Sabonis 79 0.5  9.5 5.7 4.4 12.6 10.0 1.1 0.7 4.0 26.4      0.668
## 6    Jalen Brunson 68 2.8  9.4 6.8 0.8  4.2  8.7 1.3 0.3 3.0 33.9      0.597

2. Clustering

Based on the three clusters returned upon utilizing the Gaussian Mixture Model, we are able to assign play-style archetypes to these groups by comparing average values of each performance-based parameter and integrating a certain level of basketball knowledge.

Descriptions of the three archetypes we built are displayed below:

TRADITIONAL BIGS
- Rebounder & rim-protector
  - Highest total and offensive rebounds
  - Highest blocks
  - Highest true shooting percentage
  - Examples: Myles Turner, Clint Capela
PRIMARY SCORERS/INITIATORS
- Superstar & shot creator
  - Offensively skilled self-creators
  - Defensive versatility
  - Highest points, assists, usage, free throw attempts
  - Highest defensive and offensive contribution
  - Examples: Stephen Curry, LeBron James
ROLEPLAYERS
- Versatile wings
  - Efficient shooting
  - Highest game attendance
  - Examples: Tobias Harris, Andrew Wiggins

These three archetypes enable us to conduct a series of cross-comparisons of salary and surplus distributions between NBA players with different play-styles. The following visualizations help identify potential divergence in salary and surplus values:

Salary Distribution by Archetypes:

Surplus Distribution by Archetypes:

Model uncertainty:

While the process of clustering contributes to the classification of player archetypes, it also comes with its own level of uncertainty. In the Gaussian Mixture Model we introduced, uncertainty is defined as \(1 - max(p_i)\), where \(p_i\) are the corresponding probabilities for a player to be assigned to each of the 3 clusters. The plots below indicate the 3 players in each cluster who had the lowest/highest cluster assignment uncertainty.
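The uncertainty measure is simple enough to show directly. In the Python sketch below, each player is represented by a list of three cluster membership probabilities. The function names and the example probabilities are hypothetical.

```python
def assignment_uncertainty(probs):
    """Uncertainty of a soft cluster assignment: 1 - max(p_i)."""
    return 1.0 - max(probs)


def hard_assignment(probs):
    """Index of the most probable cluster."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical membership probabilities for two players across 3 clusters.
confident = [0.70, 0.20, 0.10]
ambiguous = [0.34, 0.33, 0.33]
print(hard_assignment(confident), round(assignment_uncertainty(confident), 2))
print(hard_assignment(ambiguous), round(assignment_uncertainty(ambiguous), 2))
```

With three clusters the measure ranges from 0 (a certain assignment) to 2/3 (all clusters equally likely).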

3. Scatter Plots to Assess Similarity Between Novel and Existing Lineups

The results led to some interesting lineup comparisons. In the example shown below, the proposed lineup of Stephen Curry, Chris Paul, Andrew Wiggins, Draymond Green, and Klay Thompson is most similar to a former Hornets lineup composed of LaMelo Ball, Terry Rozier, PJ Washington, Mason Plumlee, and Kelly Oubre Jr.

Top Ten Most Similar Lineups:

1. Kelly Oubre Jr, LaMelo Ball, Mason Plumlee, PJ Washington, Terry Rozier
2. Chimezie Metu, De’Aaron Fox, Malik Monk, Terence Davis, Trey Lyles
3. Al Horford, Derrick White, Jaylen Brown, Jayson Tatum, Marcus Smart
4. Chimezie Metu, De’Aaron Fox, Keegan Murray, Malik Monk, Terence Davis
5. Al Horford, Derrick White, Jaylen Brown, Malcolm Brogdon, Marcus Smart
6. Jalen McDaniels, Kelly Oubre Jr, LaMelo Ball, Mason Plumlee, Terry Rozier
7. Chimezie Metu, De’Aaron Fox, Harrison Barnes, Malik Monk, Terence Davis
8. Al Horford, Derrick White, Jaylen Brown, Jayson Tatum, Malcolm Brogdon
9. Chris Paul, Deandre Ayton, Devin Booker, Josh Okogie, Terrence Ross
10. Aaron Gordon, DeAndre Jordan, Jamal Murray, Kentavious Caldwell-Pope, Michael Porter Jr

Last Thoughts:

It is important to note that this tool was also able to find similar lineups for an entirely new five-man lineup in which no two players had ever played with each other before. While this tool gave us an initial sense of how a new lineup might perform (as you can then see how well the most similar lineup did across various metrics), we are aware of its limitations. Results will vary depending on how many lineups you choose to include and which statistics you choose to quantify play style. As such, we were motivated to use more precise and more predictive measures to assess novel lineup performance.

4. Multinomial Logit Model Tree for Expected Points per Possession

The reliability of our final prediction for net expected points was dependent on the accuracy of each constituent model. Here is how each ridge regression performed in terms of Mean Squared Error.

Below are the four offensive players with the highest coefficients for the predicted log odds of each outcome in the first multinomial logit model (predicting what happens after the start of the possession). If a player has a high coefficient for Turnovers, he is clearly contributing to his team negatively in that regard, as a turnover always results in 0 points for a possession. If a player has a high coefficient for Defensive Non-Shooting Fouls, he is gaining chances for more possessions. However, if he has a high coefficient for 2-Point Field Goal Attempts or 3-Point Field Goal Attempts, whether or not this is a positive depends on his effect on the probability that a shot is made or a rebound is grabbed, as well as who he is playing with.

Similarly, below are the four defensive players with the highest coefficients for the predicted log odds of each outcome in the first multinomial logit model. If a player has a high coefficient for Turnovers, he is clearly contributing to his team positively on the defensive end in some way (Turnover = 0 points). If a player has a high coefficient for Defensive Non-Shooting Fouls, he is giving the opponent another chance to score. However, if he increases the likelihood of 2-Point or 3-Point Field Goal Attempts, it could be a good or bad thing depending on how likely the opponent is to make the shot or get the rebound.

5. Player Recommender Shiny App

Our app takes five inputs: an interval of salary as a hard-cap and four players as members of a hypothetical 5-man lineup.

First, the app filters out players whose 2022-2023 salaries fall outside the suggested salary interval.
Then it sorts players by the predicted surplus value of their contract in descending order and keeps the players ranked in the top 20.
- All three surplus columns are displayed: a point estimate, a lower bound, and an upper bound
Finally, the expected points model is applied to the remaining players to select those with the highest expected points given that the four input players are members of the five-man squad.
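The three steps above can be sketched as a small pipeline. This Python version is illustrative only (the app itself is an R Shiny App): the function name `recommend`, the `top_n` parameter, the exclusion of players already in the lineup, and the stand-in scoring function are assumptions, with salaries given in millions for brevity.

```python
def recommend(players, salary_range, lineup_of_four, expected_points_fn,
              top_surplus=20, top_n=5):
    """Filter by salary, keep the top surplus players, then rank by expected points."""
    low, high = salary_range
    # Step 1: salary filter (skipping players already in the lineup is an assumption).
    in_budget = [p for p in players
                 if low <= p["salary"] <= high and p["name"] not in lineup_of_four]
    # Step 2: sort by predicted surplus, descending, and keep the top slice.
    by_surplus = sorted(in_budget, key=lambda p: p["surplus"], reverse=True)[:top_surplus]
    # Step 3: rank the survivors by expected points of the completed five-man lineup.
    ranked = sorted(by_surplus,
                    key=lambda p: expected_points_fn(lineup_of_four + [p["name"]]),
                    reverse=True)
    return [p["name"] for p in ranked[:top_n]]


# Hypothetical candidates and a stand-in for the expected points model.
players = [
    {"name": "A", "salary": 5, "surplus": 3},
    {"name": "B", "salary": 8, "surplus": 1},
    {"name": "C", "salary": 20, "surplus": 9},
    {"name": "D", "salary": 6, "surplus": 2},
]
toy_scores = {"A": 0.1, "B": 0.3, "D": 0.2}
picks = recommend(players, (4, 10), ["W", "X", "Y", "Z"],
                  lambda lineup: toy_scores[lineup[-1]], top_surplus=2, top_n=2)
print(picks)  # ['D', 'A']
```

Note that player B has the best fit score in this toy data but is dropped at the surplus step, which shows how the order of the filters shapes the final list.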

Shiny App

Discussion:

The methods in this study could help NBA teams inform trade decisions. This project provides a potential way for teams to 1) evaluate whether a player is currently over- or undervalued and 2) see whether a player is a good fit for the team. Based on traditional player positions alone, we cannot easily conclude whether players are disproportionately paid depending on their play-styles. The visualizations of salary and surplus distribution by constructed archetype are therefore noteworthy, as they reveal a significant divergence in salaries but barely any difference in surplus values among player archetypes.

Limitations

Only trained on data from the 2022-2023 NBA Season
For the clustering process, we did not establish a clear dividing line between archetypes; the assignment generally relies on comparing average values and applying basketball knowledge.
The data we were working with did not have shot distances for each shot taken, which meant we were losing a key predictor of the probability of making a shot.
Additionally, although we accounted for collinearity somewhat with ridge regression, it is still possible that our model will misassign contributions if a player was always on the court with other players.
This is especially an issue because our model does not take into account which player got fouled, turned over the ball, or shot the ball.
We only considered five man lineups and a salary for one player, without regard for the bigger picture for a basketball organization’s cap space and entire roster.

Next steps

Involve more complexity and variability in modeling by training on data from various seasons
Employ a better refined clustering model to construct more precise player archetypes
Attempt to implement Kuehn’s [2] method, which predicts outcomes based on the propensities of individual players to commit an action.

Acknowledgements:

We extend our gratitude to Carnegie Mellon’s Statistics & Data Science Department for providing us with the opportunity to work on an exciting sports analytics project. Special thanks to Dr. Ron Yurko, Meg Ellingwood, Shamindra Shrotriya, and the TAs for their invaluable guidance in the SURE Program and CMSAC. Additionally, we are thankful to Maksim Horowitz, Director of Basketball Strategy Analytics for the Atlanta Hawks, for his insightful advice. We’d also like to thank Dr. Joseph Kuehn, professor at California Polytechnic State University, for taking the time to help us implement the complementary playstyle event tree method. Working with everyone, including our fellow students, has been a rewarding and enriching experience. Thanks to everyone who made this project possible.

References:

[1] Maymin, A., Maymin, P., & Shen, E. (2013). NBA chemistry: Positive and negative synergies in basketball. International Journal of Computer Science in Sport, December.

[2] Kuehn, J. (2017). Accounting for complementary skill sets: evaluating individual marginal value to a team in the National Basketball Association. Economic Inquiry, 55(3), 1556-1578.

[3] https://www.basketball-reference.com/contracts/players.html

[4] https://www.basketball-reference.com/leagues/NBA_2023_per_poss.html

[5] https://github.com/ramirobentes/NBA-in-R/blob/master/2022_23/regseason/pbp/pbp_lineups.R

[6] https://fansided.com/2020/09/16/nylon-calculus-nba-lineup-comparisons/#1977339-comment-replies

¹ University of Connecticut, mathew.chandy@uconn.edu
² Carnegie Mellon University, weiqianc@andrew.cmu.edu
³ University of California, Berkeley, lolo0213@berkeley.edu

Deal or No Deal? An NBA Recommender System for Team Composition and Salary Optimization

Mathew Chandy¹, Leo Cheng², Lauren Okamoto³

July 27, 2023
