36-315 Final Project

Jeffrey Key, Nathan Yeager, Peyton Moffrat, Sukwoo Kwon

INTRODUCTION

The NBA is a game of statistics. We want to analyze how measurable variables, such as player traits, player backgrounds, and changes in the NBA itself, relate to how players perform in NBA games. Furthermore, given the vast amount of analytical data and expertise available, we want to see whether scouts and sports analysts have done a good job of predicting good players through their draft picks.

METHODS

The data set contains data on each player who has been part of an NBA team’s roster. It includes demographic variables, biographical details, and basic box score statistics, consisting of 22 variables and a total of 12305 samples. In our study, we consider only a subset of the most important variables, listed below.

Categorical variables (Qualitative)

  • player_name: Name of the player
  • team_abbreviation: Abbreviated name of the team the player played for
  • college: Name of the college the player attended
  • country: Name of the country the player was born in
  • draft_year: The year the player was drafted
  • draft_round: The draft round in which the player was picked
  • draft_number: The number at which the player was picked in his draft round
  • season: NBA season

Numerical variables (Quantitative)

  • age: Age of the player
  • player_height: Height of the player (in centimeters)
  • player_weight: Weight of the player (in kilograms)
  • gp: Games played throughout the season
  • pts: Average number of points scored per game
  • reb: Average number of rebounds grabbed per game
  • ast: Average number of assists distributed per game

Out of the 22 variables in the dataset, our research uses ast, reb, pts, gp, player_height, age, draft_number, draft_round, draft_year, country, and college. To analyze the data further, we added two derived variables to the dataset.

Modification

  • height_range: player_height grouped into ranges of 10 cm, from the 160s to the 230s
  • age_range: age grouped into ranges of 5 years, covering ages 18 to 44
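
As an illustration of how these two derived variables could be computed, here is a minimal Python sketch. It is not the code used for this report. The function names and the label format are hypothetical, and the exact age cut points are an assumption: the report mentions the 26-30 and 41-44 groups, so we assume the remaining groups are 18-20, 21-25, 31-35, and 36-40.

```python
# Assumed age groups: only 26-30 and 41-44 are named in the report,
# the other cut points are a guess that keeps two groups per decade.
AGE_BINS = [(18, 20), (21, 25), (26, 30), (31, 35), (36, 40), (41, 44)]


def height_range(height_cm: float) -> str:
    """Label a height by its 10 cm band, e.g. 198.12 -> '190s'."""
    band_start = int(height_cm // 10) * 10
    return f"{band_start}s"


def age_range(age: float) -> str:
    """Label an age by its group, e.g. 27 -> '26-30'."""
    whole_years = int(age)
    for low, high in AGE_BINS:
        if low <= whole_years <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside the assumed 18-44 range")


print(height_range(198.12), age_range(27))
```

Turning the two numeric variables into labeled groups is what allows them to be used as the grouping variable in side-by-side boxplots.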

CHARACTERISTICS OF NBA PLAYERS VS PERFORMANCE

To understand the relationship between player height and performance more thoroughly, we examined side-by-side boxplots of height_range against each performance statistic (gp, pts, reb, ast). height_range is the derived variable that groups player height into ranges of 10 cm, from the 160s to the 230s.

Side-by-side boxplots let us focus on the relationship between height_range and each response variable and compare the height groups directly. As seen from the plot, the average number of rebounds increases clearly with height_range, while the average number of assists decreases clearly with height_range. Points and games played, however, show no clear relationship with height_range. Thus, we conclude that the player height group has a strong positive relationship with rebounds and a strong negative relationship with assists, which helps us understand how a player’s profile affects their performance statistics.
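
The comparison that a side-by-side boxplot makes visually can also be sketched numerically by computing the median of a statistic within each height group. The Python example below is only a sketch: the rows are made-up values for illustration, not values from the dataset, and the helper name medians_by_group is hypothetical.

```python
from collections import defaultdict
from statistics import median

# Made-up (height_range, reb, ast) rows for illustration only.
rows = [
    ("180s", 2.1, 5.0), ("180s", 2.9, 6.2), ("180s", 3.0, 4.1),
    ("200s", 5.5, 2.0), ("200s", 6.1, 1.5), ("200s", 4.8, 2.6),
    ("210s", 8.0, 1.0), ("210s", 7.2, 1.4), ("210s", 9.1, 0.8),
]


def medians_by_group(rows, column):
    """Return {height_range: median of the given column}, sorted by group label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row[column])
    return {label: median(values) for label, values in sorted(groups.items())}


reb_medians = medians_by_group(rows, 1)  # column 1 holds rebounds
ast_medians = medians_by_group(rows, 2)  # column 2 holds assists
print(reb_medians)
print(ast_medians)
```

In this toy data the rebound medians rise and the assist medians fall from one height group to the next, which is the pattern a boxplot would show as boxes shifting up or down across groups.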

We next investigate the relationship between the same four performance statistics and age. We split age into groups of 5 years, so that each decade contains 2 age groups:

From the plots above, we can see that every performance statistic peaks in the 26-30 age range. The shape is very similar in each graph: games played, points, rebounds, and assists all increase very slowly up to the 26-30 age range, then decrease slowly, and finally drop suddenly in the 41-44 age range.

ORIGIN OF NBA PLAYERS VS PERFORMANCE

To understand how player origins affect player performance, we examined the categorical variables country and college. Each word cloud is built from a single variable, with the goal of finding common backgrounds that might be linked to player performance.

A word cloud shows the most frequent values of a variable, drawing more frequent values in larger type. The origin country word cloud shows that the USA is by far the most common country of origin, with no other country standing out much. The origin country word cloud without the USA shows that Canada and France are the most common countries, with Serbia, Turkey, Spain, Australia, Brazil, and Slovenia standing out after that.
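
Since a word cloud is driven by frequency counts, the underlying numbers can be sketched with a simple tally. The Python example below uses made-up records, and the helper country_counts is hypothetical. Counting each player once, rather than once per season, is our assumption for this sketch and not necessarily how the word clouds in this report were built.

```python
from collections import Counter

# Made-up (player_name, country) records for illustration only.
# A player appears once per season, so names can repeat.
records = [
    ("Player A", "USA"), ("Player A", "USA"),
    ("Player B", "USA"),
    ("Player C", "Canada"),
    ("Player D", "France"), ("Player D", "France"),
    ("Player E", "USA"),
]


def country_counts(records, exclude=()):
    """Count players per country, counting each player name once."""
    country_of = dict(records)  # keeps a single country per player name
    return Counter(c for c in country_of.values() if c not in exclude)


all_counts = country_counts(records)
without_usa = country_counts(records, exclude={"USA"})
print(all_counts.most_common())
print(without_usa.most_common())
```

Excluding the dominant value, as in the second call, is what lets the smaller categories become visible in the second word cloud.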

We next make a word cloud of colleges that the players come from to see if any of them stand out.

We see that the most common background for NBA players is having no college at all. After that, a few bigger names stand out in producing NBA players, and potentially in producing better-performing athletes: North Carolina, UCLA, Duke, Kentucky, Arizona, Connecticut, and Kansas. Overall, however, it does not seem that further investigation would find much of a relationship between college and performance, because of the large diversity of colleges and the fact that so many athletes do not come from a college.

NBA CHANGES THROUGHOUT THE YEARS

As the graph shows, all three statistics (points, rebounds, and assists) follow the same general trend over the 26 seasons: when points go up, assists and rebounds do as well, and vice versa. In recent years, from around 2018 onward, the statistics rise sharply, and players are performing better than they ever have. The two seasons with a sharp drop in all three statistics, 1999 and 2012, were both shortened by lockouts, to 50 games and 66 games respectively.
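
A season-by-season trend of this kind is built from per-season averages of each statistic. The following Python sketch shows one way to compute them. The rows are made-up values, not values from the dataset, and the helper season_means is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up (season, pts, reb, ast) rows for illustration only.
rows = [
    ("1997-98", 8.0, 4.0, 2.0),
    ("1997-98", 10.0, 3.0, 1.0),
    ("1998-99", 7.0, 3.0, 1.5),
    ("1998-99", 8.0, 3.5, 1.5),
]


def season_means(rows):
    """Return {season: (mean pts, mean reb, mean ast)}, sorted by season."""
    by_season = defaultdict(list)
    for season, pts, reb, ast in rows:
        by_season[season].append((pts, reb, ast))
    return {
        season: tuple(round(mean(column), 2) for column in zip(*stats))
        for season, stats in sorted(by_season.items())
    }


means = season_means(rows)
print(means)
```

Plotting the three averages against the season on shared axes then makes it easy to see whether they rise and fall together.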


SCOUTS PREDICTION VS PERFORMANCE

After a preliminary analysis of how a variety of factors can affect an NBA player’s performance, we were interested in how applicable these statistics are in the real world. That is to say, we were interested in how scouts and sports statistics analysts use these statistics to predict how good a player is, especially given that many of them have access to much more advanced metrics than we do. We can study this by analyzing draft picks and comparing them to performance. The draft is held every year as a selection system in which teams choose new players for their rosters. Generally, the earlier a player is picked, the more sought after that player is. We can use the draft as a system for analyzing how good the NBA predicts each player to be.

First, we conduct an analysis of how draft picks perform on each of the four general stats of games played, points, assists, and rebounds.

From these scatter plots, we can see a strong negative linear trend: players picked later in the draft perform worse than players picked earlier. It is also interesting to note that the top five draft picks have substantially higher points per game, rebounds per game, and assists per game than the rest of the draft numbers. In all three categories, the values for these players lie outside our confidence interval. Games played is the only statistic for which the top five lottery picks fall within the confidence interval. Thus, we conclude that draft number correlates strongly with the skill of the player and that the top 5 picks demonstrate substantially more skill than the average trend would suggest.
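
The linear trend described above can be sketched as an ordinary least-squares line of a statistic on draft number. The Python example below fits only the line, not the confidence interval shown in the plots. The data points are made up for illustration, and the helper fit_line is hypothetical rather than the code used for this report.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (draft_number, pts) pairs for illustration only.
draft_number = [1, 5, 10, 20, 30]
pts = [18.0, 14.0, 10.0, 8.0, 5.0]

slope, intercept = fit_line(draft_number, pts)
print(f"pts is approximately {intercept:.2f} + ({slope:.3f}) * draft_number")
```

A negative slope corresponds to the pattern in the scatter plots: each later draft position is associated with a slightly lower average output.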

The NBA draft is also separated into rounds, where the first round represents the top tier of picks, the second round represents the second tier, and the third round, which has been discontinued, represents the bottom tier of picks. We can also cluster the players based on their performance and examine how the draft rounds compare to the resulting clusters.

The dendrogram shows three clusters. Within each colored cluster, there is a relatively strong sign of homogeneity in terms of draft rounds. The marks along the bottom of the dendrogram represent the draft round of each player, with blue for the first round, green for the second round, and red for the third round. In particular, the right cluster tends to contain players picked in the first round, and the green cluster tends to contain more players picked in the second round. The third-round picks lie mostly in the second cluster. Overall, we can see a clear separation between the first-round picks and the second- and third-round picks. This strongly suggests that second- and third-round picks perform comparatively worse than first-round picks. These graphs suggest that the advanced metrics that scouts use for draft picks are a good predictor of success in the NBA.
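
To illustrate the idea behind the dendrogram, the Python sketch below runs a small agglomerative (bottom-up) clustering on performance statistics and then checks which draft rounds end up in each cluster. The report does not state the distance or linkage used, so Euclidean distance and complete linkage are assumptions here. The data is made up, and the function agglomerate is hypothetical.

```python
from math import dist


def agglomerate(points, k):
    """Merge the two closest clusters (complete linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up (pts, reb, ast) per player and their draft rounds, for illustration only.
points = [(20, 8, 5), (18, 7, 6), (19, 9, 4), (6, 3, 1), (5, 2, 2), (7, 3, 1)]
draft_round = [1, 1, 1, 2, 2, 2]

clusters = agglomerate(points, k=2)
rounds_per_cluster = [sorted(draft_round[i] for i in c) for c in clusters]
print(rounds_per_cluster)
```

If the draft rounds are a good reflection of performance, each cluster should be dominated by a single round, which is the homogeneity described above.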

CONCLUSION

In conclusion, we summarize our main findings as follows:

Comparing age to multiple player performance statistics shows that these statistics peak for the 26-30 age group, although in general age does not seem to be the strongest predictor of player performance. The college that a player comes from also does not seem to be a strong predictor of player performance, because of the diverse backgrounds of the players and the large number of players who do not come from a college at all.

Comparison between the height range groups shows that the player height group has a strong positive relationship with rebounds and a strong negative relationship with assists, whereas games played and points do not show much of a relationship. The word clouds show that the USA is the most common country in which players were born. Excluding the USA from the samples, Canada and France are the most common countries, with Serbia, Turkey, Spain, Australia, Brazil, and Slovenia standing out after that.

Points, rebounds, and assists have all been on an upward trend over the last 5 years, which we attribute to players getting better and better. In addition, player heights are decreasing at a staggering rate, thanks in large part to the skyrocketing use of the three-pointer.

Generally speaking, the later a player is picked in the draft, the worse their performance is. This suggests that NBA draft experts and analysts generally have it right when predicting how good players are. Given that drafting players relies on statistics and advanced metrics, we see the usefulness and practicality of the metrics we’ve analyzed in our paper and how they have impacted the NBA.