The National Basketball Association (NBA) is one of the world’s best-known professional sports leagues, bringing together players from a variety of backgrounds to showcase extraordinary athletic ability. NBA player data offers a unique chance to study trends, evaluate performance indicators, and identify patterns that influence on-court success.
This report examines player performance across numerous NBA seasons, looking at scoring efficiency, clustering of players on advanced metrics, and the relationship between draft decisions and later performance. By studying these factors, we aim to provide meaningful insights into player growth, team dynamics, and broader trends in professional basketball.
Our analysis is mostly built around the following three core research questions:
How do player performance metrics (e.g., points, rebounds, assists) vary across different ages and teams?
Can players be grouped based on performance metrics such as net_rating and ts_pct, and what insights can these clusters provide?
How does a player’s draft year or draft round correlate with their overall performance in subsequent seasons?
We believe that the project’s findings will not only shed light on individual and team performance, but will also serve as a basis for future sports analytics studies. We hope to create a cohesive story using visual and statistical analysis, supported by data-driven conclusions.
The dataset used in this project includes NBA player statistics spanning several seasons, providing a comprehensive picture of both individual and team performance. It consists of a variety of attributes, which we have divided into three main categories:
player_name: Name of the player.
team_abbreviation: The team the player was affiliated with.
age: Age of the player during the season.
player_height and player_weight: Physical attributes of the player.
college and country: Educational and geographical background of the player.
gp: Games played, the number of games played by a player during the season.
pts: Points, the total number of points scored by a player in a season.
reb: Rebounds, the total number of times a player retrieves the ball after a missed field goal or free throw attempt.
ast: Assists, the number of times a player passes the ball leading directly to a score by a teammate.
net_rating: Net rating of the player.
oreb_pct and dreb_pct: Offensive/defensive rebound percentage, the proportion of available rebounds a player successfully grabbed.
usg_pct, ts_pct and ast_pct: Usage percentage, true shooting percentage, and assist percentage, representing involvement in plays, scoring efficiency, and playmaking contributions.
draft_year: The year in which the player was selected during the NBA Draft.
draft_round: The round in which the player was selected during the NBA Draft.
draft_number: The specific number at which a player was picked within their draft round.
season: The specific NBA season for the data entry.
The dataset contains 12,845 observations and 22 variables, enough to support both exploratory and more advanced analyses. It covers more than two decades of player performance, enabling longitudinal studies. This diverse dataset is a valuable resource for investigating the dynamics of NBA player performance and identifying patterns that affect success at both the individual and team levels.
The boxplot illustrates the distribution of points scored (pts) across age groups in the dataset. Ages are binned into five-year intervals running from 18 to 43. Each box shows the interquartile range, the median, and the outliers for its age group. A separate category labelled NA collects observations with missing or unspecified age values.
The median points scored increases slightly with age, peaking in the 28-33 age group, which suggests that players’ abilities and efficiency improve as they gain experience. After this peak, scoring falls in the older age groups (33-43), most likely because of reduced physical ability or playing time. Younger players show more variability in scoring, indicating a mix of emerging stars and limited contributors, whereas older players show less variability, potentially because they form a smaller, more specialized pool. Outliers in every age category mark exceptional performers who exceed typical expectations.
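The binning behind this boxplot can be sketched in a few lines. The snippet below is a minimal illustration in plain Python (an illustrative choice, not the report's own tooling). The left-closed interval convention, the helper names, and the sample rows are all assumptions made for the example. The interval convention matters: if the bins were instead right-closed without including the lowest edge, players aged exactly 18 would also land in the NA group.

```python
from collections import defaultdict
from statistics import median

# Assumed bin layout: [18, 23), [23, 28), ..., [38, 43)
BIN_START, BIN_WIDTH, BIN_END = 18, 5, 43

def age_group(age):
    """Return a label such as '28-33', or 'NA' for missing or out-of-range ages."""
    if age is None or not (BIN_START <= age < BIN_END):
        return "NA"
    lo = BIN_START + ((int(age) - BIN_START) // BIN_WIDTH) * BIN_WIDTH
    return f"{lo}-{lo + BIN_WIDTH}"

def median_pts_by_age_group(rows):
    """Group rows by age bin and return the median of pts in each bin."""
    groups = defaultdict(list)
    for row in rows:
        groups[age_group(row["age"])].append(row["pts"])
    return {label: median(values) for label, values in groups.items()}

# Made-up rows for illustration only (not taken from the dataset).
sample = [
    {"age": 19, "pts": 4.0},
    {"age": 22, "pts": 9.0},
    {"age": 29, "pts": 12.0},
    {"age": 31, "pts": 15.0},
    {"age": None, "pts": 6.0},
]
print(median_pts_by_age_group(sample))
```

Comparing the per-bin medians is the numerical counterpart of reading the centre lines of the boxes.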
The heatmap shows the joint distribution of usage percentage (usg_pct) and true shooting percentage (ts_pct) for the players in the dataset. Each cell in the heatmap represents a bin, with color intensity reflecting the number of players falling into that range of usg_pct and ts_pct. Darker cells indicate more players, while lighter cells indicate fewer players.
The heatmap’s densest region shows that the majority of players operate at moderate usage and true shooting percentage levels, concentrated around usg_pct values between 0.15–0.30 and ts_pct values between 0.30–0.65. Few players have both high usg_pct (greater than 0.35) and high ts_pct (greater than 0.75), as seen in the sparse bins in the upper-right corner. This underlines how rare it is for a player to be heavily involved in team play while maintaining high scoring efficiency, and these bins may identify top-performing players. The lightly shaded bins in the lower-left quadrant indicate that few players combine low usage with low efficiency, as such players are less likely to contribute meaningfully in games.
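The counts that drive the heatmap's color scale come from a two-dimensional binning step, sketched below in plain Python (an illustrative choice, not the report's own tooling). The bin width of 0.05, the function names, and the sample pairs are assumptions; the report does not state the bin size it used.

```python
from collections import Counter
from math import floor

def bin_index(value, width):
    """Index of the bin containing value; rounding guards against float noise at bin edges."""
    return floor(round(value / width, 9))

def bin_counts_2d(pairs, width=0.05):
    """Count (usg_pct, ts_pct) pairs per square bin, keyed by (x_index, y_index)."""
    counts = Counter()
    for usg_pct, ts_pct in pairs:
        counts[(bin_index(usg_pct, width), bin_index(ts_pct, width))] += 1
    return counts

# Made-up (usg_pct, ts_pct) pairs for illustration only.
sample = [(0.18, 0.52), (0.19, 0.54), (0.22, 0.55), (0.36, 0.78)]
counts = bin_counts_2d(sample)

(ux, ty), n = counts.most_common(1)[0]
print(f"densest bin: usg_pct {ux * 0.05:.2f}-{(ux + 1) * 0.05:.2f}, "
      f"ts_pct {ty * 0.05:.2f}-{(ty + 1) * 0.05:.2f}, n={n}")
```

The densest cell of the heatmap corresponds to the key with the largest count, and the sparse upper-right corner to keys with high indices on both axes and small counts.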
The time series plot compares the average points (‘avg_pts’) and average rebounds (‘avg_reb’) recorded by NBA players over time. The x-axis represents the seasons, while the y-axis represents the average value of each metric. The teal line shows average points per season and the coral line shows average rebounds per season.
The average points per season show a slight upward trend with some fluctuations, particularly in recent years. This suggests that offensive scoring has increased over time, which might be due to shifts in gameplay strategy that favor faster-paced, higher-scoring games. In contrast, the average rebounds per season are fairly consistent over time, with minimal fluctuations. This consistency might suggest that rebounding, as a performance statistic, is influenced more by physical attributes such as height and by positioning than by tactical changes. The minor fluctuations in rebounds do not follow the clear upward trend seen in points. This may indicate that while scoring opportunities have increased, rebounding opportunities have remained relatively unchanged, likely because they are tied to the number of missed shots or to defensive setups.
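The two series in the plot are per-season means. A minimal sketch of that aggregation in plain Python follows (an illustrative choice, not the report's own tooling); the season labels and values are made up, and the output keys simply reuse the avg_pts and avg_reb names from the plot.

```python
from collections import defaultdict
from statistics import mean

def season_averages(rows):
    """Return {season: {'avg_pts': ..., 'avg_reb': ...}} ordered by season label."""
    by_season = defaultdict(lambda: {"pts": [], "reb": []})
    for row in rows:
        by_season[row["season"]]["pts"].append(row["pts"])
        by_season[row["season"]]["reb"].append(row["reb"])
    return {
        season: {"avg_pts": mean(v["pts"]), "avg_reb": mean(v["reb"])}
        for season, v in sorted(by_season.items())
    }

# Made-up rows for illustration only; the season label format is an assumption.
sample = [
    {"season": "1996-97", "pts": 10.0, "reb": 4.0},
    {"season": "1996-97", "pts": 6.0, "reb": 2.0},
    {"season": "1997-98", "pts": 12.0, "reb": 5.0},
]
for season, stats in season_averages(sample).items():
    print(season, stats)
```

Note that an unweighted mean treats every player-season equally, so players with very few games count as much as full-time starters.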
Can players be grouped based on performance metrics such as net_rating and ts_pct, and what insights can these clusters provide?
The hierarchical clustering dendrogram shows how NBA players can be grouped based on their performance metrics, such as net_rating and ts_pct. The lower the height at which two branches merge, the more similar the players are in terms of these metrics. Players who perform similarly in efficiency and scoring are grouped together, while those with differing performance profiles merge only higher up the tree. This indicates that players can indeed be clustered on key metrics like net_rating and ts_pct, allowing teams to identify groups of players who share similar performance characteristics.
Cutting the dendrogram into 4 clusters reveals performance-based groupings. At the top of the tree, we see players with high net_rating and ts_pct, representing the most efficient and impactful players. These players are likely to be well-rounded contributors, excelling in scoring and overall team impact. In contrast, players further down the dendrogram, with lower net_rating and ts_pct, form distinct clusters of less efficient performers. These players may contribute in certain areas (e.g., high scoring volume), but they may struggle with efficiency or have a more limited impact on the game.
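To make the grouping procedure concrete, the sketch below implements a naive agglomerative clustering in plain Python (an illustrative choice, not the report's own tooling). The report does not state its distance measure or linkage rule, so Euclidean distance on z-scored values with average linkage is an assumption, as are the function names and the four made-up players. Stopping the merges at k clusters corresponds to cutting a dendrogram at the height that leaves k branches.

```python
from math import dist
from statistics import mean, pstdev

def standardize(points):
    """Z-score each column so net_rating and ts_pct are on comparable scales."""
    cols = list(zip(*points))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid division by zero for constant columns
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sds)) for p in points]

def agglomerate(points, k):
    """Merge the two closest clusters (average linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return mean(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up (net_rating, ts_pct) pairs for illustration only.
players = [(8.0, 0.62), (7.5, 0.60), (-6.0, 0.45), (-5.5, 0.47)]
clusters = agglomerate(standardize(players), k=2)
print(clusters)  # lists of row indices, one list per cluster
```

Standardizing first matters because net_rating spans tens of points while ts_pct lies between 0 and 1; without it, net_rating would dominate every distance.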
The PCA plot reveals that players can be grouped based on performance metrics like net_rating and ts_pct, with PC1 capturing the overall efficiency of players. Players on the right side of the plot, with higher PC1 values, tend to be more efficient, demonstrating high net_rating and ts_pct. Conversely, players on the left side are less efficient, possibly with high volume but low efficiency. PC2 further separates players based on secondary factors, such as scoring style or role on the team.
The clusters in the PCA plot suggest that teams can identify high-efficiency players (right side of PC1) who contribute across multiple areas, and contrast them with less efficient players (left side of PC1) who may need improvement. The second principal component helps differentiate players based on their style, such as scoring versus all-around contributions.
In practice, teams can use this information to balance their roster, targeting efficient players for overall impact or addressing areas where certain players may need to improve.
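For intuition about what PC1 and PC2 represent, consider the simplest case of a PCA on just two standardized variables. This is an assumption made for the example; the report's PCA may include more metrics. With correlation r between the two variables, the principal axes are the sum and the difference of the z-scores, and PC1 explains a share (1 + |r|) / 2 of the total variance. The plain-Python sketch below (an illustrative choice, not the report's own tooling) uses hypothetical function names and made-up values.

```python
from math import sqrt
from statistics import mean, pstdev

def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def pca_two_variables(x, y):
    """PCA on the 2x2 correlation matrix [[1, r], [r, 1]], using its closed-form solution."""
    zx, zy = zscores(x), zscores(y)
    r = mean(a * b for a, b in zip(zx, zy))  # Pearson correlation
    sign = 1.0 if r >= 0 else -1.0
    pc1 = [(a + sign * b) / sqrt(2) for a, b in zip(zx, zy)]  # eigenvalue 1 + |r|
    pc2 = [(a - sign * b) / sqrt(2) for a, b in zip(zx, zy)]  # eigenvalue 1 - |r|
    explained = (1 + abs(r)) / 2  # share of total variance captured by PC1
    return pc1, pc2, explained

# Made-up net_rating and ts_pct values for illustration only.
net_rating = [8.0, 7.5, -6.0, -5.5]
ts_pct = [0.62, 0.60, 0.45, 0.47]
pc1, pc2, explained = pca_two_variables(net_rating, ts_pct)
print([round(v, 2) for v in pc1], round(explained, 3))
```

When the two metrics are positively correlated, a high PC1 score means a player is above average on both, which matches reading the right side of the plot as the high-efficiency region.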
This scatterplot examines the relationship between a player’s draft year and their net rating in subsequent seasons, with each point representing a player and a red regression line indicating the overall trend. The majority of net ratings cluster around 0, with a slight downward slope in the trendline suggesting a marginal decline in average net rating for players drafted in more recent years. Significant outliers highlight that exceptional or underperforming players can emerge in any draft year. This visualization is informative as it directly addresses the correlation between draft year and performance, showing the overall trend while emphasizing the variability in individual outcomes. It suggests that while draft year has a slight influence, other factors likely play a more substantial role in determining player success.
##
## Call:
## lm(formula = net_rating ~ draft_year, data = nba_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -247.391   -3.990    0.841    5.148  152.815
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135.76470   23.66219   5.738 9.87e-09 ***
## draft_year   -0.06867    0.01181  -5.816 6.20e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.89 on 10484 degrees of freedom
## Multiple R-squared:  0.003216, Adjusted R-squared:  0.003121
## F-statistic: 33.83 on 1 and 10484 DF,  p-value: 6.2e-09
The regression model investigates the relationship between a player’s draft year and their net rating in subsequent seasons. The results show a statistically significant negative relationship, with a coefficient of -0.06867 for draft_year (p-value < 0.001): each additional draft year is associated with a decrease of about 0.07 in average net rating. The effect is small in practical terms, however. The multiple R-squared of 0.003216 means that draft year explains roughly 0.3% of the variance in net rating, and the residual standard error of 10.89 is far larger than the fitted trend. The intercept of 135.76 is the extrapolated net rating for a player hypothetically drafted in year zero and has no practical interpretation.
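To put the coefficients on a more intuitive scale, the fitted line can be evaluated at a few draft years. The sketch below does this in plain Python (an illustrative choice, not the report's own tooling), using only the estimates printed in the regression output. The chosen years and the function name are assumptions for the example, and because the printed coefficients are rounded, the predictions are approximate.

```python
# Estimates copied from the regression output above (rounded as printed).
INTERCEPT = 135.76470   # (Intercept)
SLOPE = -0.06867        # draft_year
RESIDUAL_SE = 10.89     # residual standard error
R_SQUARED = 0.003216    # multiple R-squared

def predicted_net_rating(draft_year):
    """Fitted net rating for a given draft year under the reported linear model."""
    return INTERCEPT + SLOPE * draft_year

# Illustrative draft years, chosen for the example.
for year in (1990, 2000, 2020):
    print(year, round(predicted_net_rating(year), 2))

gap = predicted_net_rating(1990) - predicted_net_rating(2020)
print(f"fitted difference over 30 draft years: {gap:.2f}")
print(f"residual standard error: {RESIDUAL_SE}")
print(f"share of variance explained: {R_SQUARED:.2%}")
```

The fitted difference across three decades of draft classes is about two points of net rating, a fraction of the residual standard error, which is why the relationship is statistically detectable yet of limited practical importance.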
The boxplot shows how a player’s draft round correlates with their net rating in subsequent seasons. First-round picks exhibit higher median net ratings and less variability, indicating more consistent performance. Later-round picks show greater variability, with wider ranges and more outliers, suggesting less predictable outcomes. The plot also highlights that standout or underperforming players can emerge from any round, though they are more common in earlier rounds. This visualization addresses the question directly by comparing performance across draft rounds, emphasizing the value of early picks while acknowledging potential in later rounds.
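The quantities a boxplot displays (median, interquartile range, outliers) can be computed directly for each draft round. The plain-Python sketch below is illustrative only and not the report's own tooling. It assumes the common convention of flagging points more than 1.5 times the IQR beyond the quartiles as outliers, which the report does not confirm, and it uses a hypothetical function name and made-up net ratings.

```python
from statistics import quantiles

def box_summary(values):
    """Median, IQR, and outliers under the assumed 1.5 * IQR rule."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "median": med,
        "iqr": iqr,
        "outliers": [v for v in values if v < lo or v > hi],
    }

# Made-up net ratings for one draft round, for illustration only.
ratings = [-2, -1, 0, 1, 2, 30]
print(box_summary(ratings))
```

Running the same summary per draft round and comparing the median and iqr entries gives a numerical check on the visual impression of which rounds are more consistent.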
This faceted plot illustrates the relationship between a player’s draft year and net rating for each draft round. Each facet corresponds to a specific draft round, displaying how net rating trends have evolved over the years for players selected in that round. The red trendlines, derived from linear regression, show the general direction of performance over time within each round. The plot reveals that players in earlier rounds (e.g., Round 1) exhibit more consistent and higher net ratings compared to later rounds, where performance becomes more variable. The clustering of data in recent years highlights the increasing number of players drafted in later rounds, while outliers across rounds and years emphasize the variability in player success. This visualization is particularly informative as it simultaneously addresses draft year, draft round, and performance, allowing comparisons across rounds and trends within each round, thus offering a comprehensive view of the factors influencing player performance.
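Each facet's trendline is an ordinary least-squares fit restricted to one draft round. A minimal sketch of that per-group fit in plain Python follows (an illustrative choice, not the report's own tooling); the function names and sample rows are assumptions, and each group needs at least two distinct draft years for the slope to be defined.

```python
from collections import defaultdict

def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def trend_by_round(rows):
    """Fit net_rating ~ draft_year separately for each draft_round."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = groups[row["draft_round"]]
        xs.append(row["draft_year"])
        ys.append(row["net_rating"])
    return {rnd: ols_line(xs, ys) for rnd, (xs, ys) in groups.items()}

# Made-up rows for illustration only.
sample = [
    {"draft_round": 1, "draft_year": 2000, "net_rating": 2.0},
    {"draft_round": 1, "draft_year": 2002, "net_rating": 1.0},
    {"draft_round": 1, "draft_year": 2004, "net_rating": 0.0},
    {"draft_round": 2, "draft_year": 2000, "net_rating": -1.0},
    {"draft_round": 2, "draft_year": 2004, "net_rating": -3.0},
]
for rnd, (intercept, slope) in trend_by_round(sample).items():
    print(f"round {rnd}: slope {slope:+.2f} per draft year")
```

Comparing slopes across rounds shows whether the trend over draft years differs by round, while comparing the fitted levels at a common year shows which round sits higher overall.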
This analysis provides key insights into NBA player performance, highlighting several actionable takeaways for teams. We found that scoring peaks in the late 20s to early 30s, emphasizing the value of players in their prime. Teams should prioritize younger players for development, while experienced players are valuable for peak performance contributions. Clustering analysis revealed that players can be grouped based on efficiency, with top performers combining high scoring efficiency with strong overall impact. The draft analysis suggests that first-round picks generally perform better, while draft year on its own explains very little of the variation in net rating. Significant potential also exists in later rounds, offering opportunities for teams to uncover hidden talent. Overall, these insights underscore the importance of strategic player evaluation, focusing on age, efficiency, and role-based contributions to build well-balanced rosters and optimize long-term success.
While this analysis provides valuable insights, there are several areas for future research. One important direction is incorporating game-specific data, such as home vs. away games or playoff performance, to better understand how players’ metrics vary under different conditions. Additionally, coaching influence on player performance is another area worth exploring. Analyzing how different coaching strategies impact player development could reveal important trends in player efficiency and growth.
Further, integrating advanced game metrics like player impact estimate (PIE) or box plus-minus (BPM) would provide a more comprehensive view of player performance beyond basic stats like points and rebounds. Including health and injury data could also enhance the analysis, as injuries often affect long-term player performance. Lastly, examining team-specific factors such as team chemistry and offensive/defensive systems could offer a better understanding of how individual players fit within their teams.