Introduction

RAPTOR, which stands for Robust Algorithm (using) Player Tracking (and) On/Off Ratings, is FiveThirtyEight’s new NBA statistic. The dataset has 514 rows, each corresponding to an individual player in the 2020 NBA season, and 23 columns containing information and statistics about that player’s season. It includes categorical variables such as player_name, player_id, and season (2020), as well as quantitative variables such as mp (minutes played), raptor_offense (points above average per 100 possessions added by the player on offense), raptor_defense (points above average per 100 possessions added by the player on defense), war_total (wins above replacement across the regular season and playoffs), and pace_impact (the player’s impact on team possessions per 48 minutes). The dataset also includes predictive estimates of the points a player adds on defense and offense, denoted predator_defense and predator_offense; predator_total is the predicted points added by the player on offense and defense combined.

We constructed statistical graphics and visualizations to answer the following research questions:

Does a player’s defensiveness or offensiveness level determine his wins above replacement? If so, which one has more impact on the result?
How are a player’s offense and defense ratings related to how much he speeds up a game (measured by possessions per 48 minutes)?
Are there any clusters among players in terms of the variables we’re interested in? Which players are similar to each other?

Graphs

We wanted to learn whether a player’s defensiveness or offensiveness determines his wins above replacement, and which one has more impact. This suggests we should examine raptor_defense, raptor_offense and war_total.

Term	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	1.6122385	0.0851707	18.929496	0e+00
raptor_defense	0.0887660	0.0173690	5.110608	5e-07
raptor_offense	0.2412389	0.0190695	12.650507	0e+00

We drew a scatter plot and colored the points by wins above replacement. Points to the right of the blue line are brighter than points to the left, which suggests that higher offense and defense levels are both associated with a higher war_total. In addition, within the region to the right of the blue line, there appear to be more bright points at the top than at the bottom, which could suggest that offensiveness has more influence than defensiveness. To verify this, we ran a linear regression of war_total on both variables. The summary shows that raptor_offense has a larger coefficient (0.24) than raptor_defense (0.09), and both ratings are measured on the same scale (points above average per 100 possessions). Therefore, we conclude that a player’s offensiveness level has more impact on war_total than the player’s defensiveness level.
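As a rough illustration of the regression step, the sketch below fits a two-predictor least-squares model by solving the normal equations directly. It is written in plain Python (the report's own output appears to come from R), the function name `fit_two_predictor_ols` is hypothetical, and the toy data are made up rather than taken from the RAPTOR dataset.

```python
def fit_two_predictor_ols(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Solve the 2x2 system for the two slopes (requires non-collinear predictors).
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Made-up toy data (NOT from the dataset), generated exactly from
# war = 1.5 + 0.1 * defense + 0.25 * offense.
defense = [-2.0, -1.0, 0.0, 1.0, 2.0, 0.5]
offense = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0]
war = [1.5 + 0.1 * d + 0.25 * o for d, o in zip(defense, offense)]

intercept, coef_defense, coef_offense = fit_two_predictor_ols(defense, offense, war)
print(intercept, coef_defense, coef_offense)
```

Comparing the two slopes directly is only meaningful because both predictors share the same units; with predictors on different scales, standardized coefficients would be a safer basis for comparison.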

We would also like to analyze how a player’s “pace_impact,” or how much a player speeds up a game (measured by possessions per 48 minutes), relates to their offense and defense ratings, “raptor_offense” and “raptor_defense.” Our hypothesis is that players with high offensive ratings will tend to increase pace and have a higher “pace_impact,” while players with high defensive ratings will tend to decrease the pace of the game and have a lower “pace_impact.”

To do this, we graph a scatterplot of players with raptor_offense on the x axis, raptor_defense on the y axis, and the size of the point corresponding to pace_impact.

Note that we have removed 3 outliers with low playing time; sincere apologies to Marques Bolden, Max Strus, and Justin Wright-Foreman.

While the trends are not overwhelmingly obvious, we can still spot them.

We see that there are more negative raptor_offense observations than positive (it is hard to score in the NBA).

Eyeballing it, it does look like observations with positive raptor_offense have larger points and higher pace impact.

In particular, observations with positive raptor_offense and negative raptor_defense have on average even larger points.

It seems that raptor_offense can predict pace_impact, while raptor_defense is less informative (except perhaps in the case of positive offense and negative defense).

Let’s take a look at the means of various subsets of the data:

## [1] 0.03238585

## [1] 0.06504159

## [1] 0.0142659

## [1] 0.02949082

## [1] 0.03523583

## [1] 0.05017331

## [1] 0.08057811

## [1] 0.01746912

## [1] 0.01121521

The mean of pace_impact for our whole dataset is .032, so let’s use that as a baseline for looking at other subsets’ pace_impact.

We see that the mean pace impact for positive raptor_offense is .065, and for positive raptor_offense + negative raptor_defense it is .081, supporting our hypothesis.

However, positive raptor_defense pace impact mean is .033 and negative raptor_defense pace impact mean is .035. Moreover, the negative raptor_offense + positive raptor_defense case actually has a higher pace at .037, which is counter to our hypothesis. It seems that raptor_defense is not very predictive of pace_impact.

The smallest mean pace_impact among these comes from players with negative raptor_offense and negative raptor_defense. A possible explanation lies in how the game situation affects player lineups. In games that are effectively over (blowouts), coaches naturally tend to sub out star players and put in the bench. In blowout situations, the winning team “parks the bus,” playing passively to wind down the shot clock and maximize the chance of winning. This could explain why those with negative offense and defense ratings have lower pace.
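The subset means discussed above amount to filtering players by the sign of each rating and averaging pace_impact within each group. A minimal sketch in plain Python follows (the report itself appears to use R); the helper `mean_pace` and the records are hypothetical, with made-up values.

```python
from statistics import mean

# Hypothetical records (made-up values, NOT the real data): each tuple is
# (raptor_offense, raptor_defense, pace_impact).
players = [
    (2.1, -0.5, 0.40),
    (1.3, 0.8, 0.10),
    (-0.7, 1.2, -0.20),
    (-1.5, -0.9, -0.30),
    (0.4, -1.1, 0.60),
]


def mean_pace(rows, offense_sign=None, defense_sign=None):
    """Mean pace_impact over rows whose ratings have the requested sign.

    A sign of +1 keeps positive ratings, -1 keeps negative ratings, and
    None applies no filter on that rating.
    """
    selected = [
        pace
        for off, dfn, pace in rows
        if (offense_sign is None or off * offense_sign > 0)
        and (defense_sign is None or dfn * defense_sign > 0)
    ]
    return mean(selected)  # raises StatisticsError if no rows match


overall = mean_pace(players)
pos_off = mean_pace(players, offense_sign=1)
pos_off_neg_def = mean_pace(players, offense_sign=1, defense_sign=-1)
print(overall, pos_off, pos_off_neg_def)
```

Printing a label next to each mean would make it easier to match every value to its subset when reading the output.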

Now, let’s run t-tests on our two most promising candidates: one comparing pace_impact for the whole dataset against players with positive raptor_offense, and one comparing it against players with positive raptor_offense + negative raptor_defense:

## 
##  Welch Two Sample t-test
## 
## data:  nba$pace_impact and nba_posO$pace_impact
## t = -0.4166, df = 297.78, p-value = 0.6773
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1869179  0.1216064
## sample estimates:
##  mean of x  mean of y 
## 0.03238585 0.06504159

## 
##  Welch Two Sample t-test
## 
## data:  nba$pace_impact and nba_posO_negD$pace_impact
## t = -0.43011, df = 111.77, p-value = 0.6679
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2702023  0.1738177
## sample estimates:
##  mean of x  mean of y 
## 0.03238585 0.08057811

Okay, with p-values of .6773 and .6679, we fail to reject the null hypothesis and cannot say that the mean values of pace_impact differ among these subsets. Was all that work for nothing? Maybe, maybe not, but basketball is a game of inches!
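For reference, the t statistic and degrees of freedom reported by a Welch two-sample test can be reproduced by hand. The sketch below does this in plain Python (the report's output appears to come from R); `welch_t` is a hypothetical helper and the two samples are made up. It stops short of the p-value, which needs the t distribution's CDF; in Python, `scipy.stats.ttest_ind(x, y, equal_var=False)` is one way to get it.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    nx, ny = len(x), len(y)
    # Squared standard errors of the two sample means.
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Made-up samples (NOT pace_impact values from the dataset).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

The test assumes the two samples are independent, so comparing two non-overlapping groups (for example, positive versus negative raptor_offense) may be a cleaner design than comparing a subset against the full dataset that contains it.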

The WAR score tells us the wins above replacement for a player. We were curious how it relates to the RAPTOR score, and whether and how possessions and minutes played affect this relationship. Thus, we created the scatter plot shown below:

There is an interesting relation here: the slope of the line depends on the number of possessions! And during our EDA we also noticed that possessions and minutes played are strongly positively correlated. We can see that for very low possessions, the WAR total is relatively unaffected by raptor total, but the impact is increasingly positive as the possessions increase.
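One way to check that the slope changes with possessions is to fit a separate least-squares line within each possession level and compare the slopes. The sketch below does this in plain Python (the report itself appears to use R); the group labels and values are made up for illustration only.

```python
def slope(x, y):
    """Simple least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Made-up (raptor_total, war_total) values grouped by a hypothetical
# possession level; NOT values from the dataset.
groups = {
    "low": ([-2.0, 0.0, 2.0], [0.0, 0.1, 0.2]),
    "high": ([-2.0, 0.0, 2.0], [1.0, 3.0, 5.0]),
}

slopes = {level: slope(x, y) for level, (x, y) in groups.items()}
print(slopes)
```

A single regression with an interaction term between raptor total and possessions would express the same idea in one model.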

In order to get an understanding of how the different scores such as raptor and predator are related, and how other variables affect this correlation, we created the faceted plot below:

[Figure: Correlation between predator and raptor totals, given pace impact and possessions. x-axis: Predator total; y-axis: Raptor total.]

An interesting observation is that raptor and predator scores are strongly positively correlated, regardless of the pace impact of that player. The possessions are positively correlated with the scores too, as can be seen in the color gradient in the graph.
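The strength of a linear relationship like this one can be summarized with the Pearson correlation coefficient. Below is a minimal sketch in plain Python (the report itself appears to use R); `pearson_r` is a hypothetical helper and the values are made up, not taken from the dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up totals for five hypothetical players (NOT from the dataset).
predator_total = [-3.0, -1.0, 0.0, 2.0, 4.0]
raptor_total = [-2.5, -1.2, 0.3, 1.8, 4.1]

r = pearson_r(predator_total, raptor_total)
print(r)
```

Computing the coefficient separately within each pace-impact facet would show whether the correlation really is similar across panels.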

We would also like to know how players are connected to one another, so we used multidimensional scaling (MDS) to look for modes.

We colored the points by possession level and found an extremely strong negative relationship between the two coordinates for players with very low possessions. This pattern helps us understand what connects them and leads to our next research question: how are players clustered?

Next, we would like to explore potential clusters among players. We selected a few variables that we are interested in (raptor_total, war_total, predator_total, minutes played) and visualized the dataset with dendrograms using “complete linkage”.

The dendrogram above shows clear clustering patterns. The four clusters it identifies align well with the different possession levels in the nba dataset. For instance, the right-hand cluster (covered by purple branches) contains players with low possessions and a few with very low possessions. Meanwhile, the left-hand cluster contains players with high possessions (light blue labels) and very high possessions. The dark green cluster in the middle contains players with very low possessions. The last cluster, colored in green, contains mostly players with high possessions and a few with low possessions. Overall, the dendrogram indicates that players tend to cluster by their level of possessions, which makes sense because the number of possessions affects a player’s wins above replacement and points added on both defense and offense.
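To make the "complete linkage" idea concrete, the sketch below implements a naive agglomerative clustering in plain Python (the report itself appears to use R). The function `complete_linkage` is a hypothetical, unoptimized helper and the points are made up. It also assumes the variables are on comparable scales; the report does not say whether the variables were standardized before clustering.

```python
from math import dist


def complete_linkage(points, n_clusters):
    """Naive agglomerative clustering with complete linkage.

    Starts with every point in its own cluster and repeatedly merges the
    two clusters whose farthest pair of members is closest, until only
    n_clusters remain. Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance = max pairwise distance.
                d = max(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Made-up 2-D points (NOT the real data): two tight pairs and one loner.
points = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]

result = complete_linkage(points, 3)
print(result)
```

Because complete linkage uses the farthest pair of members, it tends to produce compact clusters, which is one reason it is a common choice for dendrograms.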

Conclusion

In this study, we used various statistical visualizations and models to answer our research questions. By analyzing the trends and patterns in our graphs, we reached the following conclusions: (1) A player’s defensiveness and offensiveness levels are both associated with his wins above replacement, and offensiveness contributes more to wins above replacement than defensiveness. (2) An individual player’s offensive capabilities appear to be predictive of their effect on the game’s pace (possessions per 48 minutes), while a player’s defensive capabilities do not, although our t-tests did not find a statistically significant difference in mean pace_impact. (3) Players tend to cluster by possessions played, which is associated with their wins above replacement and points added on both defense and offense.

For future work, we believe incorporating additional categorical variables, such as team affiliation or role played on the team, could give us a better idea of the connections among players. We could also take playoff games into consideration to analyze statistical differences between the regular season and the playoffs.

36-315 Project

Kevin Xiao, Mantek Singh, Lu Liu

Due Monday, May 4, 2020 (12pm eastern time)