Over the past 20 years, basketball has grown steadily more popular, not only nationally but globally. With this rise in popularity, there has been much talk about how the way the game is played has evolved, especially in the NBA, the highest level of basketball.
For our report, we will be using the NBA Games dataset from Kaggle, which has statistics on every player from every game played from 2003 to 2022. Within this data, we have access to different player statistics, such as points scored, rebounds grabbed, or assists handed out, along with team statistics, such as shooting percentages and more.
Our report aims to answer the question of how the NBA has evolved over the past ~20 years. Specifically, we will be focusing on a few main questions:
To take a good look at how play style in the NBA has evolved, we will be looking at a few statistics that may be able to point us to some reasons or trends. When discussing the evolution of the NBA in recent years, the most common topic brought up is the rate at which teams score and, with that, shooting.
ggplot(pts, aes(x = as.factor(Season), y = Points)) + geom_boxplot(fill="slateblue", alpha=0.2) + labs(title = "Distribution of Points Scored from 2003-2022", y = "Points Scored", x = "Season")
As we can see from the boxplots above, which show points scored by teams since 2003, scoring has increased drastically within the past 20 years. In 2003, the median for points scored was 93; it rose to 100 in 2013 and reached 110 in 2021.
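The season medians quoted above could be computed along the following lines. This is a minimal sketch, assuming a `pts` data frame with the `Season` and `Points` columns used in the plot; the rows below are made-up toy values, not figures from the dataset.

```r
# Toy stand-in for the `pts` data frame used in the boxplot.
# The values are invented for illustration only.
pts <- data.frame(
  Season = c(2003, 2003, 2003, 2013, 2013, 2013),
  Points = c(88, 95, 101, 97, 102, 110)
)

# Median of Points within each Season (one row per season).
season_medians <- aggregate(Points ~ Season, data = pts, FUN = median)
print(season_medians)
```

Base R's `aggregate()` is used here only to keep the sketch dependency-free; a `dplyr` `group_by()` and `summarise()` pipeline would give the same result.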
ggplot(totalThrees, aes(x = as.factor(SEASON), y = totFG3A)) +
geom_boxplot(fill = "pink", alpha = 0.2) +
geom_line(data = sznAvg, aes(x = as.factor(SEASON), y = avg * 100, group = 1), color = "blue", alpha = 0.3) +
geom_point(data = sznAvg, aes(x = as.factor(SEASON), y = avg * 100), color = "blue", alpha = 0.5, size = 3) +
scale_y_continuous("Threes Attempted", sec.axis = sec_axis(~ . / 100, name = "FG3 Percentage")) +
labs(x = "Season", title = "Three Pointers Attempted and Three Point Percentage from 2003-2022")
Looking at threes specifically, with the pink boxplots, we notice that the number of three-pointers attempted per game has gone up significantly, from a median of just 23 in 2003 to 70.5 in 2021. However, looking at the blue points in the graph, which map 3-point field goal percentage, we notice that while there is a small increase, it has generally stayed around the same level even as the number of three-pointers attempted has increased.
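The two inputs to this plot, and the rescaling that lets a 0-1 percentage share an axis with attempt counts, might be prepared roughly as follows. This is a sketch under stated assumptions: the game-level column names (`FG3A_home`, `FG3A_away`, `FG3_PCT_home`, `FG3_PCT_away`) and all values are hypothetical, and we assume `totFG3A` is the combined attempts of both teams in a game.

```r
# Hypothetical game-level rows; column names and values are assumptions.
games <- data.frame(
  SEASON       = c(2003, 2003, 2021, 2021),
  FG3A_home    = c(12, 14, 35, 38),
  FG3A_away    = c(10, 13, 33, 36),
  FG3_PCT_home = c(0.34, 0.36, 0.37, 0.35),
  FG3_PCT_away = c(0.33, 0.35, 0.36, 0.38)
)

# Combined three-point attempts by both teams in each game (boxplot variable).
totalThrees <- transform(games, totFG3A = FG3A_home + FG3A_away)

# Season-level mean three-point percentage on a 0-1 scale (blue line).
games$fg3_pct <- (games$FG3_PCT_home + games$FG3_PCT_away) / 2
sznAvg <- aggregate(fg3_pct ~ SEASON, data = games, FUN = mean)
names(sznAvg)[2] <- "avg"

# Multiply by 100 so the percentage is drawn on the attempts axis;
# sec_axis(~ . / 100) in the plot reverses this for the right-hand labels.
sznAvg$avg_scaled <- sznAvg$avg * 100
```

Because the secondary axis is just a relabelling of the primary one, the factor of 100 must appear in both places (`avg * 100` and `~ . / 100`) for the right-hand labels to read as true percentages.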
Another way to look at how the game has evolved is to look at more auxiliary statistics, such as rebounds and assists.
ggplot(season_stats) +
geom_line(aes(x = SEASON, y = Average_Assists, color = "Assists")) +
geom_line(aes(x = SEASON, y = Average_Rebounds, color = "Rebounds")) +
theme_minimal() +
labs(title = "Average Assists and Rebounds Per Game Over NBA Seasons",
x = "Season", y = "Average Per Game",
color = "Statistic")
As we can see, for each player, rebound averages (the blue line) have been relatively similar since 2003, hovering slightly above 4 rebounds per game. However, assists have risen slightly, from right around 2 assists per game, to 2.25 assists per game, which is not a lot, but still a 12.5% increase. Perhaps a better way to look at how assists and rebounds have evolved within the past 20 years could be through heatmaps.
ggplot(team_season_assists, aes(x = as.factor(SEASON), y = NICKNAME, fill = Average_Assists)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red") +
theme_minimal() +
labs(
title = "Heatmap of Average Assists Per Game by Team Across Seasons",
x = "Season",
y = "Team",
fill = "Average Assists"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.text.y = element_text(size = 8, margin = margin(t = 5, b = 5)))
ggplot(team_season_rebounds, aes(x = as.factor(SEASON), y = NICKNAME, fill = Average_Rebounds)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red") +
theme_minimal() +
labs(
title = "Average Rebounds Per Player by Team Across Seasons",
x = "Season",
y = "Team",
fill = "Average Rebounds"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
axis.text.y = element_text(size = 8, margin = margin(t = 5, b = 5)))
For both heatmaps graphing rebounds and assists, blue represents lower averages, and red represents higher averages. We can immediately notice that the average number of rebounds per player has not increased drastically, or even at all, across teams from 2003 to 2022. However, for assists, we can notice that in recent years, the average assists per player per team has increased, indicated by a noticeable shift from blue to red as we go left to right.
Overall, we notice an increase in points, assists, and threes attempted. This is consistent with an increase in the overall pace of play in the NBA, leading to higher outputs in the counting statistics. However, this is not reflected in rebounds, which should also increase with an increase in pace of play, so further exploration is needed to explain why this is the case.
Another common talking point among sports fans, especially in the NBA, is the idea of “load management”. Load management refers to players sitting out games when not fully injured out of caution, or simply resting to maintain peak body performance for later (and oftentimes, more important) games.
ggplot(monthlyInjuries, aes(x = paste(Year, Month), y = Percentage, group = Year, color = as.factor(Year))) +
geom_line() +
geom_point() +
labs(x = "Year-Month", y = "Percentage of Player Injured/Rested", title = "Monthly Percentage of Injured/Rested Players from 2003-2022") +
theme_minimal() +
scale_x_discrete(breaks = unique(paste(monthlyInjuries$Year, monthlyInjuries$Month))[c(TRUE, rep(FALSE, length(unique(monthlyInjuries$Year)) - 1))]) +
guides(color = "none")
Looking at the monthly percentage of players across all teams that did not play either due to injuries or rest, we notice that from the early 2000s up until the early 2010s, the percentage stays around the same. However, starting around 2015, we notice that a lot more players are out due to either injury or rest, but surprisingly it went back down in late 2021/2022. To explore this further, we can run a statistical test. Specifically, we can run a z-test to see if there is a difference in proportions of rested/injured players across two years.
prop.test(c, t, alternative = "two.sided")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c out of t
## X-squared = 32.544, df = 1, p-value = 1.165e-08
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.005426933 0.011295454
## sample estimates:
## prop 1 prop 2
## 0.04222072 0.03385953
Looking at the years 2004 and 2021 specifically, we notice that when we run our 2-sample test, we get a p-value of 1.165e-08, which is less than our standard alpha value of 0.05. This means that there is statistically significant evidence of a difference in proportions of rested/injured players between the years 2004 and 2021.
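The link between `prop.test` and a two-proportion z-test can be sketched as follows. The counts here are hypothetical (made up for illustration, not the report's data), and the variable names `missed` and `total` are our own. Note that the output above uses the default continuity correction, whereas this sketch turns it off so that the chi-squared statistic equals the squared z statistic exactly.

```r
# Hypothetical counts: player-games missed out of all player-games
# in two seasons. These numbers are invented for illustration.
missed <- c(420, 340)
total  <- c(10000, 10000)

# Chi-squared form of the test, without continuity correction.
res <- prop.test(missed, total, alternative = "two.sided", correct = FALSE)

# Equivalent pooled two-proportion z statistic.
p_hat  <- missed / total
pooled <- sum(missed) / sum(total)
z <- (p_hat[1] - p_hat[2]) / sqrt(pooled * (1 - pooled) * sum(1 / total))

# Two-sided p-value from the normal distribution.
p_value <- 2 * pnorm(-abs(z))

print(c(x_squared = unname(res$statistic), z_squared = z^2))
print(c(prop_test_p = res$p.value, z_test_p = p_value))
```

Using descriptive names such as `missed` and `total` also avoids shadowing the base R functions `c()` and `t()`.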
While this statistical test and graph may not be enough to point to a specific cause, it is noteworthy that the share of players out due to rest or injury differs significantly between the two seasons.
The last question we want to tackle is whether home court advantage really still is an advantage to this day. Although home court advantage undoubtedly exists, the game has changed drastically in recent years. Notably, the game has gotten more global, which means games draw more casual basketball fans, not just local fans of a team. Additionally, there have also been complaints about ticket prices increasing. For example, the Golden State Warriors moved from Oracle Arena in Oakland to Chase Center in San Francisco in the 2019-2020 season. A lot of local fans complained that ticket prices increased drastically, and that many lifelong, diehard fans from Oakland can no longer afford to go to games, while wealthier Silicon Valley locals can attend instead.
First, let’s take a look at how scoring has changed generally between home and away teams between 2003 and 2022.
# Side-by-side boxplots of home and away points, with rotated and smaller x-axis labels
ggplot(pts_combined, aes(x = as.factor(Season), y = Points, fill = Team)) +
geom_boxplot(alpha = 0.2) +
xlab("Season") +
ylab("Points") +
labs(title = "Points Scored at Home and Away Per Season") +
facet_wrap(~ Team, scales = "free_y") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = rel(0.75)))
We notice immediately that the points for both home and away teams follow the same trend we saw earlier: they have increased. However, there is no noticeable difference in the rate of change in points scored between home and away teams. Another way to look at this would be to look at the point differential between the home team and away team.
# Plotting the connected line graph with a text label
ggplot(pts_differential, aes(x = as.factor(SEASON), y = avg_pts_differential, group = 1)) +
geom_line(col = "gold") +
geom_point(col = "purple") +
geom_text(data = filter(pts_differential, SEASON == 2020),
aes(x = as.factor(SEASON), y = avg_pts_differential, label = "A < 0.4 avg. point differential observed \n amidst the COVID-19 Lockdown Season \n known as 'The Bubble'"),
vjust = -1, hjust = 0.5, color = "black", size = 2) +
xlab("Season") +
ylab("Points Differential (ptsHome - ptsAway)") +
labs(title = "Average Points Differential Per Season") +
theme_minimal()
Here, we can notice that in 2003, home court advantage results in the home team scoring around 3.6 more points per game. This number hovers around 3 for a while, and dips to as low as around 2.3 in 2014. Most notably, we notice that this number dips significantly in 2020. However, this can be explained, because when COVID hit the NBA, they moved to “The Bubble”, which was a neutral court, with no fans. Similarly, in 2021, although teams were back in their home stadiums, there were no fans at games, likely decreasing home court advantage slightly. In 2022, we see that the point differential went back up.
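The per-season differential plotted above could be derived as in the sketch below. We assume a game-level data frame with `SEASON`, `PTS_home`, and `PTS_away` columns; these names and the toy values are assumptions for illustration, not taken from the report's code.

```r
# Hypothetical game-level rows; column names and values are assumptions.
games <- data.frame(
  SEASON   = c(2019, 2019, 2020, 2020),
  PTS_home = c(110, 104, 108, 99),
  PTS_away = c(105, 101, 109, 98)
)

# Per-game differential from the home team's perspective.
games$pts_diff <- games$PTS_home - games$PTS_away

# Average differential within each season.
pts_differential <- aggregate(pts_diff ~ SEASON, data = games, FUN = mean)
names(pts_differential)[2] <- "avg_pts_differential"
print(pts_differential)
```

A positive average means home teams outscored visitors on average in that season; a value near zero is what we would expect on a neutral court.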
# Plotting a line chart with NBA season on the x-axis and win percentage on the y-axis
ggplot(win_percentage_long, aes(x = as.factor(SEASON), y = Win_Percentage, color = Team, group = Team)) +
geom_line() +
geom_point() +
labs(x = "Season", y = "Win Percentage", title = "Average Home and Away Team Win Percentage Over the Years") +
scale_color_manual(values = c("gold", "purple")) +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = rel(0.75)))
Home and away win percentages follow a pattern similar to what we saw earlier: generally there is a home court advantage, and this went away with COVID.
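One way the long-format win percentage table might be built is sketched here. It assumes a game-level `HOME_TEAM_WINS` indicator (1 if the home team won, 0 otherwise) and a 0-1 scale for `Win_Percentage`; both the column name and the toy rows are assumptions.

```r
# Hypothetical game-level rows; 1 means the home team won.
games <- data.frame(
  SEASON         = c(2019, 2019, 2019, 2019, 2020, 2020),
  HOME_TEAM_WINS = c(1, 1, 1, 0, 1, 0)
)

# Share of games won by the home team in each season.
home <- aggregate(HOME_TEAM_WINS ~ SEASON, data = games, FUN = mean)

# Every game has exactly one winner, so the away share is the complement.
win_percentage_long <- rbind(
  data.frame(SEASON = home$SEASON, Team = "Home",
             Win_Percentage = home$HOME_TEAM_WINS),
  data.frame(SEASON = home$SEASON, Team = "Away",
             Win_Percentage = 1 - home$HOME_TEAM_WINS)
)
print(win_percentage_long)
```

Since the two lines are mirror images around 0.5, the gap between them in any season is twice the home team's edge over an even split.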
Overall, there is not much to suggest that home court advantage has really changed drastically. It likely still has a similar effect as it did in the past. It is worth noting, however, that the average points differential between the home team and away team has gone down slightly.
Our findings suggest that NBA games have evolved in the past two decades, specifically in the realms of play style, organization-wide adjustments in player availability to strategize injury prevention, and lastly the changing climate of home court advantage.
The data illustrate a noticeable shift in play style: the median points scored per team has risen consistently (from 93 in 2003 to 110 in 2021), which may suggest a combination of an increase in offensive output and a decrease in effective defense. Three-point attempts per game have surged, from a median of 23 in 2003 to 70.5 in 2021. Three-point field goal percentage, however, has remained roughly constant, so the added volume suggests a more perimeter-oriented game (a focus on three-point shooting over two-pointers) than in previous years. Furthermore, subtle changes are observed in assists and rebounds: assists per game have increased modestly in recent years, and the heat map supports this, suggesting an uptick in assists on average across the NBA, while rebounds have stayed roughly flat.
Our analysis supports the idea that player availability, whether due to rest or injury, has seen a change in recent years. We observed that from 2015 onwards, there has been an increase in the percentage of players sitting out games. Our statistical analysis finds a significant difference between 2004 and 2021 in player availability, which is consistent with a change in the way NBA teams approach load management, with an emphasis on resting players more frequently since 2015.
Home court advantage has always been an influential part of winning games for NBA teams. Our exploration of home and away performance reveals that home court advantage continues to impact the game: teams perform better on their home court on average, and this has been consistent and has not changed drastically across the NBA in recent years. Outside of the rare “removal” of home court advantage during the 2019-2020 NBA season, we conclude that teams have a better shot at winning with a home court advantage.
Although our findings suggest areas in which professional basketball in the NBA has evolved and progressed, there is more to uncover about the “why” in how the NBA is changing, and to what extent and at what rate it is changing. At the organization-wide scale, we’ve reached several conclusions that concern the average across all NBA teams, season by season. However, there is so much changing within each NBA team, season by season, that is missing from the picture here. There are several questions that we think are interesting, and would require a deeper exploration of data and statistical analysis:
A more elaborate NBA dataset with more advanced statistics, along with more advanced statistical analyses (e.g. removing confounding variables, achieving data with covariate balance, identifying causal relationships to explain changes in the NBA play style) as mentioned before, would allow us to gain a deeper understanding of how the game has evolved, with more precise findings and numbers that tell that story.