Introduction

The modern_RAPTOR_by_team dataset contains RAPTOR data for every player, broken out by team, season, and season_type, since 2014. It has 23 variables (columns) and 570 rows, or 571 rows counting the header row of variable names. Each observation corresponds to one player on one team in one season and season type.

The types of variables in the data include:
1. Player Name
2. Player ID
3. Season
4. Season Type
5. Team
6. Possessions
7. Minutes Played
8-10. RAPTOR Box Offense/Defense/Total
11-13. RAPTOR OnOff Offense/Defense/Total
14-16. RAPTOR Offense/Defense/Total
17-19. WAR Total/RegularSeason/Playoffs
20-22. PREDATOR Offense/Defense/Total
23. Pace Impact

The RAPTOR variables measure the points above average per 100 possessions that a player adds. They are reported for offense, for defense, and in total, and come in three versions: a box-score estimate, an on-off (plus-minus) estimate, and an overall estimate that uses both the box and on-off components.

The WAR variables correspond to Wins Above Replacement, categorized by regular season games, or playoff games, or both.

The PREDATOR variables correspond to predictive points above average per 100 possessions, categorized by offense, defense, or both offense and defense.

The last variable (pace_impact) measures a player’s impact on team possessions per 48 minutes.

NBA RAPTOR Dataset:
https://fivethirtyeight.com/features/introducing-raptor-our-new-metric-for-the-modern-nba

https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-raptor/modern_RAPTOR_by_team.csv

Main Research Questions

  • Question: What is the relationship between RAPTOR metric, WAR metric, and the PREDATOR metric by the season type? In other words, do players perform differently in regular season games vs. playoff games?

  • Question: Which player, based on the player stats and certain variables, performed the best given the time frame? Who performed the best in a specific season?

  • Question: Which metric out of RAPTOR, WAR, and PREDATOR is the best indicator of player impact on success? Do better stats measure up to real life success?

Graphs

library(tidyverse)
library(ggplot2)
library(gridExtra)
nba_raptor <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-raptor/modern_RAPTOR_by_team.csv")

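# %in% tests each row's team against the whole set; == would recycle the vector and silently drop rows
nba_raptor_west <- subset(nba_raptor, team %in% c("LAC", "GSW", "HOU", "SAS", "POR", "OKC"))

# NOTE: confirm that "WSH" matches the Washington code in unique(nba_raptor$team)
nba_raptor_east <- subset(nba_raptor, team %in% c("MIL", "IND", "BOS", "TOR", "WSH", "CLE")) #before: IND, TOR, WSH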


nba.plot1 <- ggplot(nba_raptor_west, aes(x = season_type, y = raptor_total, color = team)) + 
  geom_boxplot() +
  labs(x = "Season Type", y = "Total Raptor Values", color = "Western Conference Teams",
       title = "Raptor Values for Top Western Conference Teams by Season Type")

nba.plot2 <- ggplot(nba_raptor_east, aes(x = season_type, y = raptor_total, color = team)) + 
  geom_boxplot() +
  labs(x = "Season Type", y = "Total Raptor Values", color = "Eastern Conference Teams",
       title = "Raptor Values for Top Eastern Conference Teams by Season Type")

grid.arrange(nba.plot1, nba.plot2)

nba.plot3 <- ggplot(nba_raptor_west, aes(x = season_type, y = pace_impact, color = team)) + 
  geom_boxplot() +
  labs(x = "Season Type", y = "Player Impact", color = "Western Conference Teams",
       title = "Player Impact Values for Top Western Conference Teams by Season Type")

nba.plot4 <- ggplot(nba_raptor_east, aes(x = season_type, y = pace_impact, color = team)) + 
  geom_boxplot() +
  labs(x = "Season Type", y = "Player Impact", color = "Eastern Conference Teams",
       title = "Player Impact Values for Top Eastern Conference Teams by Season Type")

grid.arrange(nba.plot3, nba.plot4)

We wanted to look at the performance of NBA players in regular-season and playoff games, and compare how variables like raptor_total or Player Impact (pace_impact) change with the game type. We split the NBA RAPTOR dataset into two subsets that include only the top six Western and the top six Eastern teams, ranked by the number of playoff appearances since 2014. Restricting the comparison to these twelve teams keeps all 30 teams from being crowded into one graph, and ensures that the playoff stats are represented by as much data as possible.
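The "top six by playoff appearances" selection can be made reproducible instead of typing team codes by hand. The following is a minimal sketch under stated assumptions: the toy data frame and the helper `top_playoff_teams` are hypothetical, only the column names `team`, `season`, and `season_type` come from the dataset, and the playoff code "PO" is an assumption that should be checked with `unique(nba_raptor$season_type)`.

```r
# Hypothetical toy rows with the same column names as the RAPTOR file.
toy <- data.frame(
  team        = c("GSW", "GSW", "GSW", "GSW", "HOU", "HOU", "POR", "LAL"),
  season      = c(2015, 2015, 2016, 2017, 2015, 2016, 2016, 2016),
  season_type = c("PO", "PO", "PO", "PO", "PO", "PO", "PO", "RS"),
  stringsAsFactors = FALSE
)

# Count the distinct playoff seasons of each team and return the n teams
# with the most appearances ("PO" is assumed to mark playoff rows).
top_playoff_teams <- function(df, n) {
  po <- unique(df[df$season_type == "PO", c("team", "season")])
  counts <- sort(table(po$team), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

top2 <- top_playoff_teams(toy, 2)

# %in% tests membership row by row, which is what a team filter needs.
toy_top <- toy[toy$team %in% top2, ]
```

On the real data, the same helper could be called with `n = 6` on the rows of each conference, although the conference membership of each team would have to be supplied separately because the file has no conference column.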

We can see the relationship between RAPTOR values for each team, and how the season type affects them. The playoffs show a much bigger spread in the boxplot than the regular season for Western teams, and there are differences for the Eastern teams as well. This could be a result of the regular season having more games, so that values average out closer to one another. Still, it is interesting that certain teams have lower RAPTOR values on average, such as Portland and OKC, while other teams seem to have higher RAPTOR values, like Milwaukee, Cleveland, or Boston.

In all, the spread is much larger in Playoffs than during the season for most teams, but a lot of teams show different Raptor values based on season type. This could explain certain teams underachieving in the Playoffs despite playing well during the regular season, like Portland and OKC, while other teams like the San Antonio Spurs or Cleveland stepped it up during the Playoffs and displayed better RAPTOR values.

For the Player Impact values (the pace_impact variable), we see a similar trend with slightly larger spread during the playoffs, but some teams like Portland or Milwaukee had better Player Impact values during the playoffs than during the regular season. So while similar trends occur for playoffs vs. regular season, Pace Impact and RAPTOR values do not always appear to move together.

ggplot(nba_raptor, aes(x=war_total, y=raptor_total, color=season_type)) + 
  geom_point(alpha=0.5) + scale_color_manual(values=c("red", "blue"))+
  labs(title= "Distribution of WAR and RAPTOR by Season Type",
       x = "WAR Total",
       y = "RAPTOR Total",
       color = "Season Type")

To further analyze the relationship between metrics like RAPTOR and WAR during the regular season vs. the playoffs, we looked at a scatterplot of the two. Playoff observations are concentrated near (0,0), with higher RAPTOR values and lower WAR values, whereas the regular season produced many observations with higher WAR values and lower RAPTOR values relative to the playoffs. Overall, the range of WAR values was much smaller for the playoffs than for the regular season, while the RAPTOR values fall in a much more concentrated range. Part of the WAR difference is expected: WAR accumulates with playing time, and the playoffs have far fewer games than the regular season. The RAPTOR variable corresponds to points above average per 100 possessions added by a player, and given our results, playoff games may feature higher-quality, better possession basketball than the regular season. This would make sense, as teams playing in the postseason are more likely to be better teams than those that didn’t make the playoffs.
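The visual comparison of ranges can be backed by numbers. The sketch below computes the width of the observed range of each metric per season type with base R; the toy values and the helper `spread` are hypothetical, and only the column names are taken from the dataset.

```r
# Hypothetical toy rows; only the column names come from the dataset.
toy <- data.frame(
  season_type  = c("RS", "RS", "RS", "PO", "PO", "PO"),
  war_total    = c(-1.0, 4.0, 12.0, -0.2, 0.5, 1.5),
  raptor_total = c(-3.0, 1.0, 6.0, -8.0, 0.0, 9.0)
)

# Width of the observed range (max minus min) of a numeric vector.
spread <- function(x) diff(range(x))

# One row per season type, one column per metric.
spread_by_type <- aggregate(
  cbind(war_total, raptor_total) ~ season_type,
  data = toy,
  FUN = spread
)
print(spread_by_type)
```

Replacing `toy` with `nba_raptor` would give the corresponding table for the real data. Because the range is sensitive to single extreme rows, `IQR` could be passed as `FUN` instead.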

library(gridExtra)

east_war <- ggplot(data = nba_raptor_east, aes(x=war_total, y=pace_impact, color=season_type)) + 
  geom_density2d() +
  geom_point()+
  labs(title="Pace Impact by WAR based on Season Type for Eastern Conference Teams",
       x = "WAR Total",
       y = "Player Impact",
       color = "Season Type")
west_war <- ggplot(data = nba_raptor_west, aes(x=war_total, y=pace_impact, color=season_type)) + 
  geom_density2d() +
  geom_point()+
  labs(title="Pace Impact by WAR based on Season Type for Western Conference Teams",
       x = "WAR Total",
       y = "Player Impact",
       color = "Season Type")

grid.arrange(east_war, west_war)

east_rap <- ggplot(data = nba_raptor_east, aes(x=raptor_total, y=pace_impact, color=season_type)) + 
  geom_density2d() +
  geom_point()+
  labs(title="Pace Impact by RAPTOR based on Season Type for Eastern Conference Teams",
       x = "RAPTOR Total",
       y = "Player Impact",
       color = "Season Type")

west_rap <- ggplot(data = nba_raptor_west, aes(x=raptor_total, y=pace_impact, color=season_type)) + 
  geom_density2d() +
  geom_point()+
  labs(title="Pace Impact by RAPTOR based on Season Type for Western Conference Teams",
       x = "RAPTOR Total",
       y = "Player Impact",
       color = "Season Type")

grid.arrange(east_rap, west_rap)

east_pred <- ggplot(data = nba_raptor_east, aes(x=predator_total, y=pace_impact, color=season_type)) + 
  geom_density2d() +
  geom_point()+
  labs(title="Pace Impact by PREDATOR based on Season Type for Eastern Conference Teams",
       x = "PREDATOR Total",
       y = "Player Impact",
       color = "Season Type")
west_pred <- ggplot(data = nba_raptor_west, aes(x=predator_total, y=pace_impact, color=season_type)) + 
  geom_density2d() +
  geom_point()+
  labs(title="Pace Impact by PREDATOR based on Season Type for Western Conference Teams",
       x = "PREDATOR Total",
       y = "Player Impact",
       color = "Season Type")

grid.arrange(east_pred, west_pred)

We wanted to explore further how players perform differently in the regular season and the playoffs. Contour maps show the joint distribution of player impact with each metric, so we can see how players’ performances are grouped by season type. We used the same subsets as for the boxplots, for similar reasons: to compare the season types, and so that all 30 teams are not presented in one graph. The contour lines show how the modes are spread out and reveal any important overlaps between the season types.

For the first pair, the contour maps of pace impact and WAR further suggest that players tend to play differently depending on the season type. For the Western Conference teams in particular, the regular-season and playoff spreads are perpendicular to each other. In addition, for both Western and Eastern Conference teams, the regular-season spread is much wider than the playoff spread.

For the second pair, the contour maps of pace impact and RAPTOR show that the playoff mode is bigger than the regular-season mode for both conferences, but the two also overlap (especially for Western Conference teams). The same occurs in the third pair of contour maps, of pace impact and PREDATOR.
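The modes read off the contour maps can also be located numerically. geom_density2d draws contours of a two-dimensional kernel density estimate computed with MASS::kde2d, so a minimal sketch can call that function directly. The simulated sample below is hypothetical and is not RAPTOR data; on the real data, `x` and `y` would be one metric and pace_impact for a single season type.

```r
library(MASS)  # provides kde2d(); MASS ships with standard R installations

set.seed(1)
# Hypothetical toy sample centred near (2, 0); not real RAPTOR data.
x <- rnorm(500, mean = 2, sd = 1)
y <- rnorm(500, mean = 0, sd = 0.5)

# Kernel density estimate evaluated on a 50 x 50 grid.
dens <- kde2d(x, y, n = 50)

# Grid cell with the highest estimated density, i.e. the estimated mode.
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
mode_xy <- c(x = dens$x[peak["row"]], y = dens$y[peak["col"]])
mode_xy
```

Running this once per season type and comparing the two `mode_xy` vectors gives a rough measure of how far apart the regular-season and playoff modes are.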

We also wanted to learn which teams were the best based on player statistics during this time frame (2014-2019). To do this, we plotted a dendrogram that groups teams by their total player impact, so that teams whose players have similar pace impact end up together. In general, teams grouped together should have similar win records or player quality.

nba <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-raptor/modern_RAPTOR_by_team.csv")

# Sum pace_impact within each team so that every leaf of the dendrogram is one team
# (clustering the raw rows would give one leaf per player record, not per team)
team.pace = sapply(split(nba$pace_impact, nba$team), sum)
pace.scaled = team.pace/sd(team.pace)
pace.dist = dist(pace.scaled)
pace.hc = hclust(pace.dist, method="complete")
pace.dend = as.dendrogram(pace.hc)

library(dendextend)

# Leaf labels are the team codes carried over from the data, so none can be misassigned
pace.dend = set(pace.dend, "labels_cex", 0.6)

plot(pace.dend)

From the above graph, we can see defined groups of teams that have similar player impact. For example, the LA Clippers (LAC), Miami Heat (MIA), Utah Jazz (UTA), Toronto Raptors (TOR), and Memphis Grizzlies (MEM) are grouped together, suggesting that these teams had similarly high player quality and impact. These were also generally good teams from 2014-2019, so it makes sense that they are grouped together. The same pattern holds for weaker teams. For example, the Indiana Pacers (IND), New Orleans Pelicans (NOP), Houston Rockets (HOU), and Milwaukee Bucks (MIL) are all grouped together, suggesting that these teams, and their players, were not as good by this measure.

In conclusion, we can see from the above graph which teams were similar in player quality. The LA Clippers (LAC), Miami Heat (MIA), Utah Jazz (UTA), Toronto Raptors (TOR), and Memphis Grizzlies (MEM) seem to be the strongest teams from the past 5 years in terms of player impact.
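Group membership can be read programmatically instead of by eye with cutree. The sketch below is a minimal illustration under stated assumptions: the team codes and pace values are hypothetical, the per-team mean is one possible aggregation, and the choice of two groups is arbitrary.

```r
# Hypothetical per-player pace impact values; team codes are placeholders.
toy <- data.frame(
  team        = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "DDD", "DDD"),
  pace_impact = c(1.0, 1.2, 0.9, 1.1, -1.0, -0.8, -1.1, -0.9)
)

# One value per team, so that each leaf of the tree is a team.
team_pace <- sapply(split(toy$pace_impact, toy$team), mean)

# Complete-linkage clustering on the standardized team values.
hc <- hclust(dist(team_pace / sd(team_pace)), method = "complete")

# Cut the tree into two groups and list the members of each group.
groups <- cutree(hc, k = 2)
split(names(groups), groups)
```

With the real data, `k` could be chosen by inspecting the heights at which branches merge in the plotted dendrogram.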

We were also interested in determining who the best players on the best teams were. To do this, we compared players from the last two NBA Finals: Golden State Warriors vs. Cleveland Cavaliers in 2018 and Golden State Warriors vs. Toronto Raptors in 2019. We plotted all the major statistics using a multidimensional scaling (MDS) plot.

nba.2018 = subset(nba, season==2018 & (team=="GSW" | team=="CLE"))
nba.stats = subset(nba.2018,  select = c(raptor_box_offense,raptor_box_defense,raptor_box_total,raptor_onoff_offense,raptor_onoff_defense,raptor_onoff_total,raptor_offense,raptor_defense,raptor_total,war_total,war_reg_season,war_playoffs,predator_offense,predator_defense,predator_total,pace_impact))

# Scale each statistic by its standard deviation so that no single metric dominates the distances
nba.stats = apply(nba.stats, MARGIN=2, FUN=function(x) x/sd(x))
mds = cmdscale(d=dist(nba.stats), k=2)

# Store the MDS coordinates next to the player data so ggplot can map them by column name
nba.2018$mds1 = mds[,1]
nba.2018$mds2 = mds[,2]

ggplot(data=nba.2018, aes(x=mds1, y=mds2, color=team, label=player_name, shape=season_type))+
  geom_point()+
  geom_text(check_overlap = TRUE)+
  labs(x = "MDS1", y = "MDS2", color="Team", shape= "Season Type")

As we can see from the above plot for the 2018 NBA Finals, three players stand out from the general group and can be considered the best players on their teams: LeBron James, Stephen Curry, and Kevin Durant. It also seems the better players tended to have higher MDS_2 and lower MDS_1, while average players congregated around an MDS_1 and MDS_2 of 0. The best players are at the top, excluding Damian Jones, and the quality of players generally decreases as you go lower in the plot. The worst players from the two teams were Jose Calderon, Ante Zizic, Damian Jones, and Chris Boucher.

Interestingly, the Warriors won the NBA finals, which shows that winning is not always about having the best player.

nba.2019 = subset(nba, season==2019 & (team=="GSW" | team=="TOR"))
nba.stats = subset(nba.2019,  select = c(raptor_box_offense,raptor_box_defense,raptor_box_total,raptor_onoff_offense,raptor_onoff_defense,raptor_onoff_total,raptor_offense,raptor_defense,raptor_total,war_total,war_reg_season,war_playoffs,predator_offense,predator_defense,predator_total,pace_impact))

# Scale each statistic by its standard deviation so that no single metric dominates the distances
nba.stats = apply(nba.stats, MARGIN=2, FUN=function(x) x/sd(x))
mds = cmdscale(d=dist(nba.stats), k=2)

# Store the MDS coordinates next to the player data so ggplot can map them by column name
nba.2019$mds1 = mds[,1]
nba.2019$mds2 = mds[,2]

ggplot(data=nba.2019, aes(x=mds1, y=mds2, color=team, label=player_name, shape=season_type))+
  geom_point()+
  geom_text(check_overlap = TRUE)+
  labs(x = "MDS1", y = "MDS2", color="Team", shape= "Season Type")

As we can see from the above plot for the 2019 NBA Finals, two players stand out from the general group and can be considered the best players on their teams: Kawhi Leonard (the name is not shown, rightmost blue point) and Stephen Curry. Kawhi Leonard was the best player during the Finals based on the MDS_1 and MDS_2 coordinates, which makes sense because he was the Finals MVP and was pivotal in the Raptors’ championship win. While Stephen Curry and Kevin Durant were the best Warriors players during the regular season, Draymond Green had the most impact during the postseason, which makes sense because Stephen Curry and Kevin Durant were both injured during the playoffs and did not perform as well.

It seems the better players tended to have higher MDS_1 and an MDS_2 closer to zero, while average players congregated around an MDS_1 and MDS_2 of 0. The best players are at the right, and the quality of players generally decreases as you go left in the plot. The worst players from the two teams are Jeremy Lin, Malcolm Miller, Eric Moreland, and Damian Jones. They also happen to be mostly Toronto Raptors players.

In conclusion, we can see that Stephen Curry and Kevin Durant have consistently been the best players on the Warriors during the past two Finals, Kawhi Leonard was the best player for Toronto, and LeBron James was the best player for Cleveland.
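The scaling-and-MDS step behind both Finals plots can be checked on a small example. The sketch below uses a hypothetical four-player matrix (not real RAPTOR values) and `scale()`, which centers as well as rescales each column; centering does not change the distances between players. Because the toy input has only two columns, a two-dimensional embedding should reproduce the original distances, which makes the behaviour of `cmdscale` easy to verify.

```r
# Hypothetical toy statistics for four players; not real RAPTOR values.
stats <- matrix(
  c( 6.0, 10.0,
     5.0,  9.0,
     0.0,  1.0,
    -1.0,  0.5),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3", "P4"), c("raptor_total", "war_total"))
)

# Put every column on a comparable scale (mean 0, standard deviation 1).
stats_scaled <- scale(stats)

# Classical MDS on the Euclidean distances between players.
d <- dist(stats_scaled)
mds <- cmdscale(d, k = 2)

# Largest absolute difference between embedded and original distances.
max_error <- max(abs(dist(mds) - d))
max_error
```

With 16 statistics, as in the Finals plots, two dimensions can only approximate the distances, so `max_error` would generally be larger, and the MDS axes have no direct interpretation as "better" or "worse".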

nba.subset <- dplyr::select(nba, raptor_total, war_total, predator_total, pace_impact, season_type)

library(GGally)
ggpairs(data = nba.subset, columns = c(1:4), mapping = aes(color = season_type)) +
  labs(title = "NBA Pairs Plot")

There is a high correlation between the PREDATOR and RAPTOR metrics (0.936), and a moderate correlation between PREDATOR and WAR (0.456). This implies that PREDATOR and RAPTOR are highly associated with each other, so each is a good indicator of the other’s values. We can also see a higher correlation between player impact and RAPTOR for the playoffs than for the regular season. There is also a higher correlation between player impact and WAR for the regular season than for the playoffs, and a higher correlation between player impact and PREDATOR for the playoffs than for the regular season. Other metric relationships are similar for both season types. Overall, the correlations of all three metrics (RAPTOR, WAR, and PREDATOR) with player impact are very low. This suggests that none of the metrics is a strong indicator of player impact on success. However, of the three, RAPTOR shows the strongest association with player impact.
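The per-season-type correlations shown in the pairs plot can also be produced as plain matrices, which makes them easier to quote. This is a minimal base R sketch; the toy values are hypothetical and only the column names come from the dataset.

```r
# Hypothetical toy rows; only the column names come from the dataset.
toy <- data.frame(
  season_type    = rep(c("RS", "PO"), each = 5),
  raptor_total   = c(-2, -1, 0, 1, 2,  -2, -1, 0, 1, 2),
  predator_total = c(-2.1, -0.9, 0.2, 1.1, 1.9,  -1.5, -1.2, 0.1, 0.8, 2.2),
  pace_impact    = c(0.3, -0.2, 0.1, -0.4, 0.2,  -0.5, 0.0, 0.4, -0.1, 0.3)
)

metrics <- c("raptor_total", "predator_total", "pace_impact")

# One Pearson correlation matrix per season type.
cor_by_type <- lapply(split(toy[metrics], toy$season_type), cor)
cor_by_type$RS
```

On the real data, `toy` would be replaced by `nba.subset` and `war_total` added to `metrics`. Passing `method = "spearman"` to `cor` would give rank correlations, which are less sensitive to the extreme values of low-minute players.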

Conclusion Paragraph

From exploring our three main research questions and producing multiple graphs for each, we can draw a few conclusions and identify possible questions for future work.

It seems that players do perform differently during the playoffs than during the regular season: some teams are better suited to playoff-level basketball and perform better in the postseason, as seen in the first few graphs, while other teams underperform when the stakes are high. In terms of the relationship between the RAPTOR and WAR metrics, we also see a different range of values for each season type (playoff/regular), which indicates a different level of performance from teams in general depending on the type of game.

To explore which teams were the best, based on certain variables or time frame, we created graphs showing that teams like Miami or Toronto, to name a few, were among the stronger teams of the last 5 years based on the Pace Impact variable. Using a multidimensional scaling plot, we can also see that players like LeBron James, Stephen Curry, Kevin Durant, and Kawhi Leonard were the best players in the last two Finals matchups, suggesting that these well-known players performed up to standard during the championship series.

To look at which metric was the best indicator of actual success, we produced a pairs plot that showed metrics RAPTOR and PREDATOR had a strong correlation, yet in general, all three metrics (RAPTOR, WAR, PREDATOR) had low correlation with Pace Impact (player impact). Thus, these metrics aren’t necessarily good indicators of player impact, but certain metrics have a stronger relationship with one another than others.