Introduction

For our final project, we created several visualizations of NBA player statistics for the 2021 to 2022 regular season. The dataset can be found here: https://www.kaggle.com/datasets/vivovinco/nba-player-stats

This dataset includes over 500 NBA players who were active in the league during the 2021 to 2022 season. For each player, the dataset includes the following variables, which describe their performance along with basic information such as their age and the team they play on.

Rk : Rank, Player : Player’s name, Pos : Position, Age : Player’s age, Tm : Team, G : Games played, GS : Games started, MP : Minutes played per game, FG : Field goals per game, FGA : Field goal attempts per game, FG% : Field goal percentage, 3P : 3-point field goals per game, 3PA : 3-point field goal attempts per game, 3P% : 3-point field goal percentage, 2P : 2-point field goals per game, 2PA : 2-point field goal attempts per game, 2P% : 2-point field goal percentage, eFG% : Effective field goal percentage, FT : Free throws per game, FTA : Free throw attempts per game, FT% : Free throw percentage, ORB : Offensive rebounds per game, DRB : Defensive rebounds per game, TRB : Total rebounds per game, AST : Assists per game, STL : Steals per game, BLK : Blocks per game, TOV : Turnovers per game, PF : Personal fouls per game, PTS : Points per game
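Of the variables above, eFG% is the least self-explanatory. The sketch below uses the standard basketball definition, (FG + 0.5 * 3P) / FGA, which gives a made three-pointer 1.5 times the weight of a made two-pointer. The function name and the choice to return 0 when a player has no attempts are assumptions made for illustration, and the example inputs are made-up numbers rather than values from the dataset.

```python
def effective_fg_pct(fg, three_p, fga):
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA.

    Returning 0.0 for zero attempts is an assumption made here to avoid
    dividing by zero; the dataset's own convention is not confirmed.
    """
    if fga == 0:
        return 0.0
    return (fg + 0.5 * three_p) / fga


# Made-up per-game numbers: 8 made shots, 2 of them threes, on 16 attempts.
print(effective_fg_pct(fg=8, three_p=2, fga=16))
```

Because the dataset stores rounded per-game values, recomputing eFG% this way may differ slightly from the eFG% column.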

Research Questions

Given this information, we use visualizations to check whether some of our assumptions are supported by the data. Our first question is (1) how does a player’s position in the NBA relate to their performance? Our second question is (2) whether there is an association between NBA players’ average points per game and their effective field goal percentages. Our third question is (3) how are the average points scored per game affected by the players’ age?

Exploration and Visualization

Question 1

First, let’s apply principal component analysis (PCA) to 8 different quantitative variables from the data set: FG (Field goals per game), FT (Free throws per game), ORB (Offensive rebounds per game), DRB (Defensive rebounds per game), TRB (Total rebounds per game), AST (Assists per game), STL (Steals per game), and TOV (Turnovers per game). The principal components are linear combinations of the 8 original variables. So, there is a linear relationship between the 8 original variables and the 2 principal components that we plotted below, with the first principal component on the x-axis and the second on the y-axis, grouped by players’ position.

The graph above shows that as FG, FT, TOV, STL, and AST increase, both principal components tend to decrease; thus, the coefficients for those variables are negative. On the other hand, as DRB, TRB, and ORB increase, component 1 tends to decrease while component 2 tends to increase. Additionally, the length of each line indicates how strongly the corresponding variable is related to the principal components. So, we can conclude that AST has a larger negative coefficient than STL. The graph also implies that players who play center or power forward tend to have higher DRB, TRB, and ORB, while players who play shooting guard, small forward, or point guard are more associated with AST, STL, TOV, FT, and FG.
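To make the idea of principal components as linear combinations concrete, here is a minimal Python sketch restricted to two variables, where the answer has a closed form: after standardizing, the correlation matrix has eigenvalues 1 + |r| and 1 - |r|, and the loadings are (1, ±1)/√2. This is not the 8-variable analysis from the report; the function names are hypothetical and the rebound values are made up for illustration.

```python
import math


def zscores(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pca_two_variables(x, y):
    """PCA of two standardized variables via the closed-form 2x2 solution."""
    zx, zy = zscores(x), zscores(y)
    n = len(x)
    # Sample correlation between the two variables.
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
    sign = 1.0 if r >= 0 else -1.0
    s = 1.0 / math.sqrt(2.0)
    # Each loading vector holds the coefficients of one linear combination.
    loadings = [(s, sign * s), (s, -sign * s)]
    eigenvalues = [1.0 + abs(r), 1.0 - abs(r)]
    # A player's score on a component is the weighted sum of their z-scores.
    scores = [
        tuple(w1 * a + w2 * b for (w1, w2) in loadings)
        for a, b in zip(zx, zy)
    ]
    return loadings, eigenvalues, scores


# Made-up per-game values for five hypothetical players.
orb = [1.0, 2.0, 3.0, 4.0, 5.0]
drb = [2.0, 1.0, 4.0, 3.0, 5.0]
loadings, eigenvalues, scores = pca_two_variables(orb, drb)
print(loadings[0], eigenvalues)
```

With more than two variables the loadings no longer have this simple form and are normally obtained from a numerical eigendecomposition, but each component is still a weighted sum of the standardized variables.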

We can also create an elbow plot, shown below, for the same variables as the plot above, with the component number (dimension) on the x-axis and the proportion of variation on the y-axis.

We can see that the first principal component accounts for nearly a majority of the variation in those 8 quantitative variables, since it has the highest proportion of variance. In general, the variation accounted for by each component adds up to the total variation in the selected variables, 100%. The proportion of variation starts to flatten at k = 2. Thus, we should use the first two principal components in our other analyses so that a good amount of variation is captured. Note that these components are linear combinations of all 8 variables, not two of the original variables. Our visual above plots only the first two principal components, so we are neglecting the variation that the remaining components would capture.
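The quantities behind an elbow plot can be sketched in a few lines of Python: each component's proportion of variation is its eigenvalue divided by the sum of all eigenvalues, and the cumulative sum reaches 100%. The eigenvalues below are hypothetical numbers chosen to sum to 8 (as they would for 8 standardized variables); they are not the values from our analysis.

```python
def variance_proportions(eigenvalues):
    """Return per-component and cumulative proportions of variation."""
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


# Hypothetical eigenvalues for 8 components, for illustration only.
example_eigenvalues = [3.6, 2.0, 1.0, 0.6, 0.4, 0.2, 0.15, 0.05]
proportions, cumulative = variance_proportions(example_eigenvalues)
for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"component {k}: {p:.3f} of variation, {c:.3f} cumulative")
```

Reading the cumulative column alongside the elbow makes the trade-off explicit: keeping k components retains the cumulative share at k and discards the rest.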

Just out of curiosity, let’s take a look at what the most popular first and last names of NBA players are as well.

Here we create two word clouds of NBA players’ first and last names, sized by the number of times they appear. The red one shows first names while the blue one shows last names. A couple of players we can name from these first names are Josh Hart, James Harden, Kevin Durant, Jalen Brown, etc. Looking at the last names, a couple of players we can think of are Draymond Green, Jrue Holiday, Jalen Brown, J.R. Smith, etc.
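The sizes in a word cloud come from simple frequency counts. As a rough Python sketch, assuming each name is a plain "First Last" string, the counts can be built as follows. The list is a short illustrative sample and not the dataset, and the split rule is an assumption that would mislabel suffixes such as "Jr." as last names.

```python
from collections import Counter

# Short illustrative sample of player names, not the full dataset.
names = [
    "Josh Hart", "James Harden", "Kevin Durant", "Draymond Green",
    "Jrue Holiday", "LeBron James", "Josh Green", "Jalen Green",
]

# Assumption: the first token is the first name, the last token the last name.
first_counts = Counter(name.split()[0] for name in names)
last_counts = Counter(name.split()[-1] for name in names)

print(first_counts.most_common(3))
print(last_counts.most_common(3))
```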

Now, we will examine players’ average field goal percentages, and then their free throw percentages, by position. We know that the closer to the basket one shoots, the more likely one is to make the shot. Thus, we hypothesize that centers have the highest field goal percentage in the league, followed by power forwards, small forwards, and so forth. As shown below, we plot box plots of field goal percentages by player’s position.

Here, we see that our assumption is indeed supported. From the box plot, we can see that the effective field goal percentage is highest for centers and lowest for point guards, who usually take many of their shots as three-pointers, which statistically have a lower chance of going in than two-point shots. It is also worth noting that small forwards and shooting guards have very similar field goal percentages. This occurs because these two positions are considered the most flexible positions on the court, where players can move around the court and take a variety of shots. Therefore, it is not surprising to see that their field goal percentages are very similar.
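The numbers a box plot draws for each position (minimum, quartiles, median, maximum) can be computed directly. The sketch below is a minimal Python version using made-up shooting percentages for two positions; the function name and the "inclusive" quartile method are assumptions, and plotting software may use a slightly different quartile rule.

```python
import statistics
from collections import defaultdict


def summarize_by_position(records):
    """Group (position, value) pairs and return box-plot summaries."""
    groups = defaultdict(list)
    for position, value in records:
        groups[position].append(value)
    summary = {}
    for position, values in groups.items():
        # quantiles() needs at least two values per group.
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[position] = {
            "min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values),
        }
    return summary


# Made-up values for illustration only.
records = [
    ("C", 0.52), ("C", 0.58), ("C", 0.60), ("C", 0.64), ("C", 0.70),
    ("PG", 0.38), ("PG", 0.41), ("PG", 0.43), ("PG", 0.45), ("PG", 0.50),
]
summary = summarize_by_position(records)
for position, stats in summary.items():
    print(position, stats)
```

Comparing medians and the overlap of the interquartile ranges is a quick way to check whether two positions really differ or only appear to in the plot.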

With this assumption addressed, let’s explore another interesting assumption that NBA viewers usually have: the bigger the position one plays, the worse one is at shooting free throws. For example, just think about how horrendous Shaquille O’Neal, a center, was at the free throw line compared to someone like Steve Nash, a point guard. Very similar to our last plot, we will again make a box plot conditioned on player’s position, but this time for free throw percentage.

Here, we see that our assumption holds true as well. From the biggest to the smallest positions (center, power forward, small forward, shooting guard, point guard), we see a gradual increase in the average free throw percentage. Some speculate that this decrease in shooting ability with increasing size is due to the lowered view of the rim compared to shorter players, but we have seen a rise of NBA big men who are also great shooters.

Question 2

With our basic assumptions addressed and some basic knowledge of the league in hand, let’s analyze how NBA players’ average points per game and their effective field goal percentages are associated with each other. From watching years of NBA, we suspect that the best scorers in the league must have decent field goal percentages as well, but let’s now test whether this is true.

In regard to effective field goal percentage and average points per game, we created a contour plot, conditioned on position, to examine where NBA players stand on these metrics. Immediately, we see that aside from a set of contour lines centered at 0% effective field goal percentage and 0 points per game, there is an even larger set of contour lines concentrated around a mode at an effective field goal percentage of 52% and 7 points per game. Moreover, within this set of contour lines, players who have a lower effective field goal percentage seem to score fewer points per game (as evidenced by the points near 27% effective field goal percentage and 2 points). Interestingly, we also observed that the top scorers seem to have an effective field goal percentage around 52%, and that those with even higher effective field goal percentages do not necessarily score more. In addition, we observed that point guards (PG) and shooting guards (SG) tend to be on the left side of the contour lines, whereas centers (C) are mostly on the right side, again strengthening our assumption that those who shoot closer to the basket are more effective at scoring.
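Contour lines of this kind are typically level curves of a two-dimensional density estimate. As a hedged illustration, the Python sketch below evaluates a Gaussian product-kernel density at two locations; the kernel, the hand-picked bandwidths, and the data points are all assumptions for illustration and may differ from what the plotting software used for our figure.

```python
import math


def gaussian_kde_2d(points, x, y, bandwidth_x, bandwidth_y):
    """Estimate the density at (x, y) with a Gaussian product kernel."""
    total = 0.0
    for px, py in points:
        zx = (x - px) / bandwidth_x
        zy = (y - py) / bandwidth_y
        total += math.exp(-0.5 * (zx * zx + zy * zy))
    # Normalize so the estimate integrates to 1 over the plane.
    return total / (len(points) * 2.0 * math.pi * bandwidth_x * bandwidth_y)


# Made-up (eFG%, points per game) pairs, for illustration only.
points = [
    (0.50, 6.0), (0.52, 7.0), (0.54, 8.0), (0.53, 7.5),
    (0.51, 6.5), (0.27, 2.0), (0.60, 20.0),
]
near_mode = gaussian_kde_2d(points, 0.52, 7.0, bandwidth_x=0.05, bandwidth_y=2.0)
far_away = gaussian_kde_2d(points, 0.27, 2.0, bandwidth_x=0.05, bandwidth_y=2.0)
print(near_mode, far_away)
```

A mode in the contour plot is simply a location where this estimate peaks, so a tight cluster of players produces closely nested contour lines around it.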

Question 3

We are also interested in examining how the average points scored per game is affected by the players’ age. For this, we created a heat map comparing the two variables. We further amended this heat map to include the effective field goal percentage variable to add another dimension to the relationship.

We see that NBA players are concentrated around 24 years old and roughly 3 points per game. From the heat map, we also see that for players 25 years old and above, the shade of purple is darker in the band of 0 to 10 points per game than in the band of 10 to 30 points, which indicates that older players do not necessarily score more points. To this end, we note that LeBron James is a potential outlier, as he scores more and is older than most NBA players. Lastly, we see that the effective field goal percentage has a wider range (from 0% to 100%) for younger players, while older players tend to have moderate effective field goal percentages (from 25% to 50%).
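The shading in a heat map like this reflects how many players fall into each cell of an age by points grid. A minimal Python sketch of that binning is shown below, assuming 2-year age bins and 5-point scoring bins; the bin widths and the records are made up for illustration and are not taken from our figure.

```python
from collections import Counter


def bin_counts(records, age_width=2, pts_width=5):
    """Count (age, points per game) pairs per grid cell.

    Each cell is labeled by its lower edges, e.g. (24, 0) covers
    ages 24 to 25 and 0 to just under 5 points per game.
    """
    counts = Counter()
    for age, pts in records:
        age_bin = int(age // age_width) * age_width
        pts_bin = int(pts // pts_width) * pts_width
        counts[(age_bin, pts_bin)] += 1
    return counts


# Made-up (age, points per game) records, for illustration only.
records = [(24, 3.1), (24, 2.5), (25, 4.0), (23, 8.0), (30, 12.0), (36, 28.0)]
counts = bin_counts(records)
print(counts.most_common(1))
```

The darkest cell corresponds to the most common (age, points) combination, and an isolated cell with a count of one is how an outlier appears.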

Everyone gets old though, even LeBron, so it is natural for us to wonder whether the age of NBA players affects their performance or position. Therefore, we calculated the birth year of each player and plotted a time series with birth year on the x-axis and the number of players born in that year on the y-axis.

## [1] 1981 2003
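The dataset provides age rather than birth year, so the birth year has to be derived. One simple approach, sketched in Python below, is to subtract the age from a reference year. The reference year of 2022 is an assumption; it is consistent with the Joe Johnson row shown later in this report (Age 40, birthyear 1982). The result is approximate, because it ignores where a player’s birthday falls within the year.

```python
from collections import Counter


def birth_year(age, reference_year=2022):
    """Approximate birth year; reference_year is an assumed constant."""
    return reference_year - age


# Made-up ages for illustration; the real input would be the Age column.
ages = [40, 37, 37, 24, 24, 24, 19]
years = [birth_year(a) for a in ages]

# Players per birth year, i.e. the values plotted in the time series.
players_per_year = Counter(years)
print(min(years), max(years))
print(sorted(players_per_year.items()))
```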

The above plot tells us that a majority of the players were born between 1995 and 2000. This leads us to wonder whether younger players tend to perform better than older players. We proceed to investigate the correlation between birth year and both average points per game and field goal percentage.

After examining the graph of average points per game by player’s birth year, we conclude that most players score an average of about 10 points per game. There does seem to be a weak positive correlation between points per game and age, as younger players tend to score less than older players. However, further investigation would be needed to determine whether it is significant. One more interesting pattern we noticed is that players born in 1985 score many more points per game than players born in other years. We are interested to see who those players are and why they might score so much more, so we create the following table:

##              Player  PTS
## 18  Carmelo Anthony 13.3
## 369    LeBron James 30.3
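The strength of the association between age and points per game noted above can be quantified with the Pearson correlation coefficient, and a first check of significance uses the statistic t = r * sqrt((n - 2) / (1 - r^2)). The Python sketch below uses made-up values, not the dataset, and hypothetical function names.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing zero correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Made-up ages and points per game, for illustration only.
ages = [22, 24, 26, 28, 30]
pts = [4.0, 2.0, 8.0, 6.0, 10.0]
r = pearson_r(ages, pts)
t = t_statistic(r, len(ages))
print(r, t)
```

Turning t into a p-value requires the t distribution with n - 2 degrees of freedom, which is left out here. A handful of high scorers in a small birth-year group, such as the two players in the table above, can noticeably shift r.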

From here, we see that LeBron James scores an average of 30.3 points per game while the average is around 10. Considering also that only two players were born in 1985, we conclude that this value is an outlier in the data. We also conclude that although there are more younger players in general, the older NBA players are not less efficient than the younger players. To confirm whether this conclusion holds for other indicators, we create the following graph of effective field goal percentage vs. birth year.

From this graph, we see that there is no significant correlation between players’ mean effective field goal percentage and birth year. Therefore, we conclude that older players are just as efficient at scoring as younger players. We also suspect that the peak at 1982 is an outlier, which is confirmed by the following table: Joe Johnson is the only player born in 1982, and he has played only one game so far.

##      Rk      Player Pos Age  Tm G GS MP FG FGA FG. X3P X3PA X3P. X2P X2PA X2P.
## 383 285 Joe Johnson  SG  40 BOS 1  0  2  1   1   1   0    0    0   1    1    1
##     eFG. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS birthyear
## 383    1  0   0   0   0   0   0   0   0   0   0  0   2      1982
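The row above shows why a per-year mean can mislead: a group with one player, and one game, produces an extreme value. A minimal Python sketch of computing group means while flagging small groups follows. The 1982 record uses the eFG value of 1 from the table above; the other records, the function name, and the threshold of 2 players are made up for illustration.

```python
from collections import defaultdict


def mean_by_year(records, min_players=2):
    """Return {year: (mean, n)} and the years with fewer than min_players."""
    groups = defaultdict(list)
    for year, value in records:
        groups[year].append(value)
    means = {y: (sum(v) / len(v), len(v)) for y, v in groups.items()}
    flagged = sorted(y for y, (_, n) in means.items() if n < min_players)
    return means, flagged


# (birth year, eFG) pairs; only the 1982 entry comes from the table above.
records = [
    (1982, 1.0),
    (1995, 0.52), (1995, 0.50), (1995, 0.54),
    (1998, 0.51), (1998, 0.53),
]
means, flagged = mean_by_year(records)
print(means)
print("small groups:", flagged)
```

Reporting the group size next to each mean, or weighting by games played, keeps single-player years from being read as real peaks.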

Apart from efficiency, we are also interested in whether players’ birth year affects which positions they play on the team. Therefore, we created the plot above, illustrating the number of NBA players by birth year, grouped by position.

From this, we see that the curves show similar trends, as each of them is skewed to the left. We conclude that there tend to be more younger players than older players in each position, and that the differences between positions are insignificant.

Conclusion

Through the visualizations shown above, we have explored a variety of interesting assumptions about the possible relationships between variables, as well as other information in the data set. We have learned that, counter-intuitively, player efficiency does not decrease with age. More intuitively, we found that those who shoot closer to the basket score more efficiently, that players in positions that allow them to move more on offense are better at shooting free throws, that players who score a lot are also decently efficient scorers, and finally, that there are more younger players than older players in the league. In the process, we explored the relationship between performance and position and between performance and age, and even explored interesting things like the most common first and last names of players in the NBA.

Besides our three research questions, there are additional questions that could be asked of this data set. One area for further investigation would be to look more deeply at correlations among the variables by position and to understand why the data may be skewed for different positions. We briefly touched upon this with the PCA. Another topic of interest would be examining how turnovers and personal fouls per game affect minutes played, conditioned on the players’ position.