rm(list = ls())
setwd("/Users/angelachen/Desktop/36315/315project")
# Cleanup and Prep
library(tidyverse)
library(GGally)
library(readr)
library(dplyr)
library(ggpubr)
library(factoextra)

nba = read_csv("players.csv")
# Players who did not go to college have a missing College value; label them "NaN" so na.omit() keeps them
nba$College[is.na(nba$College)] <- "NaN"
nba = na.omit(nba)
nba$Height = as.numeric(substring(nba$Height,1,1)) + as.numeric(substring(nba$Height,4,4))*0.0833333
nba = nba[,-6]
nba2 = nba[,-c(1,3,7)] #remove categorical variables with > 30 categories

Introduction

This project explores a dataset of active NBA players in the 2020-2021 season. Each observation represents one player. The dataset contains variables describing the player, such as weight, height, 2020-2021 salary, and position. It also contains performance variables, such as the average number of points scored per game and the average number of assists per game.

We cleaned the original Kaggle dataset by removing rows with missing values (players with no recorded college were kept and labeled "NaN"). We also transformed some existing variables to better visualize and analyze relationships. The final dataset contains 394 rows and 10 columns, of which 7 are quantitative variables and 3 are categorical.
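One of the transformations mentioned above converts a feet-and-inches height string into decimal feet. The following is a minimal sketch of such a conversion, assuming the raw values look like `6' 10"`; the exact format in players.csv is an assumption, and `height_to_feet` is a hypothetical helper rather than the code used in this report.

```r
# Hypothetical helper (not from the original analysis): convert
# feet-and-inches strings such as "6' 10\"" to decimal feet.
# Assumes the first number is feet and the second, if present, is inches.
height_to_feet <- function(x) {
  # Extract every run of digits from each string
  parts <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(parts, function(p) {
    if (length(p) == 0) return(NA_real_)
    feet <- as.numeric(p[1])
    inches <- if (length(p) >= 2) as.numeric(p[2]) else 0
    feet + inches / 12
  }, numeric(1))
}

heights <- height_to_feet(c("6' 5\"", "6' 10\"", "7' 0\""))
print(round(heights, 3))
```

Reading every digit of the inches part matters for players listed at 10 or 11 inches, where taking a single character would understate the height.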

The quantitative variables of interest are:

Age: the player’s age in years
Height: the player’s height in feet
Weight: the player’s weight in pounds
Salary: the player’s salary in USD
Points: the average number of points the player scored per game
Rebounds: the average number of rebounds (i.e. retrievals of the basketball directly after a missed shot) per game
Assists: the average number of assists (i.e. passing of the ball to a teammate in a way that leads directly to a score by field goal) per game

The categorical variables of interest are:

Position: the player’s position on the court
Team: the team of the player
College: the college the player went to

Research Questions

Given the data, we were interested in the factors behind each player’s success. We measure a player’s success by their salary and by the points they score per game. In particular, we explored three research questions.

First, looking at the performance of players, (1) how do body measurements, namely weight and height, affect the player’s ability to score points, conditional on position?

Second, since we are college students with friends who are student athletes, we were interested in (2) whether there is an association between a player’s salary and the college they went to. Do NBA players who went to top basketball colleges earn more than those who did not? Is going to (a top basketball) college really worth it for basketball players?

Finally, speaking of money, we wanted to look at some other possible predictors of a player’s salary. Specifically, our question was: (3) How do position, team, age, height, weight, and performance affect salary?

Exploration and Visualization

Question 1

We wanted to better investigate whether body measurements influence the player’s performance on the basketball court, conditioned by the position, which suggests we should examine Weight, Height, Position, and Points. We plotted these variables using boxplots and histograms.

EDA

First, we explore the height and weight of current NBA players by position. Since each comparison involves one categorical variable and one numerical variable, side-by-side boxplots are a natural choice. In the analysis below, we collapsed the Position variable from 7 categories (C, F, G, PF, SF, PG, SG) to 3 categories (C, F, G).

nba_yu = nba
# Collapse the 7 position labels into 3 groups:
# guards (G, PG, SG), centers (C), and forwards (everything else)
nba_yu$Po = as.factor(
  ifelse(nba_yu$Position %in% c("G", "PG", "SG"), "G",
         ifelse(nba_yu$Position == "C", "C", "F")))

nba_yu %>% group_by(Po) %>% summarize(mn=mean(Points,na.rm=T))

#boxplot
#height by position
ggplot(nba_yu, aes(x=Height))+
  geom_boxplot(aes(fill=Po))+
  labs( title = "Height by Position")

#weight by position
ggplot(nba_yu, aes(x=Weight))+
  geom_boxplot(aes(fill=Po))+
  labs( title = "Weight by Position")

We see that players’ heights do not differ much between positions, but weights differ considerably. Specifically, centers (C) are the heaviest, whereas guards (G, which here includes PG and SG) are the lightest.

We then explore the number of points scored by each position, using the histogram below. The vertical lines mark the mean points per game for each position group.

ggplot(nba_yu, aes(x=Points))+
  geom_histogram(fill="white", color="black",binwidth=3,
                 aes(y = after_stat(density)))+
  geom_density()+
  geom_vline(aes(xintercept=9.87, color="C"),size=1.5) +
  geom_vline(aes(xintercept=9.96, color="F"),size=1) +
  geom_vline(aes(xintercept=12.0, color="G"),size=1.5) +
  scale_color_manual(name = "statistics",
                     values = c(C = "blue", F = "red",
                                G="purple"))+
  scale_x_continuous(limits = c(0, 20))+
  labs( title = "Points by Position")

From the graph above, the mean points for centers (9.87) and forwards (9.96) are nearly identical, while guards average about two points more (12.0). The histogram alone does not establish statistical significance, which we examine with the models below.
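The group means drawn as vertical lines are typed into the plotting code as constants. As a hedged sketch of how they could instead be computed from the data, the example below uses a small made-up data frame (`toy` is hypothetical and stands in for the real table).

```r
# Toy data standing in for the real player table (values are made up)
toy <- data.frame(
  Po     = c("C", "C", "F", "F", "G", "G"),
  Points = c(8, 12, 9, 11, 10, 16)
)

# Mean points per position group, returned as a named vector
group_means <- tapply(toy$Points, toy$Po, mean)
print(group_means)
```

Values computed this way could be passed to `geom_vline(xintercept = ...)` so that the lines stay in sync with the data if the cleaning steps change.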

Modelling

We now use position, height, and weight to predict the points scored by each player. We start by regressing points on the original position variable (with 7 categories), height, and weight. The results are shown below.

#model
model2<-lm(Points ~ Position + Height + Weight, data=nba_yu)
summary(model2)

## 
## Call:
## lm(formula = Points ~ Position + Height + Weight, data = nba_yu)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1647  -4.6420  -0.9638   3.4497  20.1889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.60919    9.36517  -0.599 0.549564    
## PositionF   -2.41587    2.39671  -1.008 0.314089    
## PositionG    5.00909    2.78443   1.799 0.072807 .  
## PositionPF   1.31716    1.15507   1.140 0.254860    
## PositionPG   7.13457    1.66032   4.297 2.19e-05 ***
## PositionSF   3.65520    1.25197   2.920 0.003711 ** 
## PositionSG   5.58976    1.42432   3.925 0.000103 ***
## Height      -0.80163    1.21920  -0.658 0.511248    
## Weight       0.08268    0.02172   3.806 0.000164 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.254 on 385 degrees of freedom
## Multiple R-squared:  0.07736,    Adjusted R-squared:  0.05819 
## F-statistic: 4.035 on 8 and 385 DF,  p-value: 0.0001274

PG and SG are both guard positions, and the separate G position is also marginally significant (p = 0.073). It is possible that playing guard is itself associated with higher points, which matches what we saw in the EDA histogram. We therefore combine these three positions into a new indicator variable G, which equals 1 if a player is a guard (PG, SG, or G) and 0 otherwise, and refit the regression below. We keep the Weight and Height variables as they are.

# Indicator: 1 if the player is a guard (G, PG, or SG), 0 otherwise
nba_yu$G = as.factor(ifelse(nba_yu$Position %in% c("G", "PG", "SG"), 1, 0))

model21<-lm(Points ~ G + Height + Weight, data=nba_yu)
summary(model21)

## 
## Call:
## lm(formula = Points ~ G + Height + Weight, data = nba_yu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.472  -4.534  -1.263   3.512  20.560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.67344    8.50069   0.785  0.43290    
## G1           3.29315    0.89820   3.666  0.00028 ***
## Height      -1.08678    1.19159  -0.912  0.36231    
## Weight       0.04468    0.01753   2.548  0.01121 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.331 on 390 degrees of freedom
## Multiple R-squared:  0.04235,    Adjusted R-squared:  0.03498 
## F-statistic: 5.748 on 3 and 390 DF,  p-value: 0.0007424

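The regression results above are consistent with our hypothesis that playing guard is associated with scoring more points: holding height and weight fixed, guards are estimated to score about 3.3 more points per game (p < 0.001). Weight also remains significant at the 0.05 level after controlling for the guard indicator, while height does not. Note, however, that the model explains only about 4% of the variance in points (R-squared = 0.042).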
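To make the coefficients concrete, the sketch below plugs them into the fitted equation for one hypothetical player profile. The coefficients are copied from the `summary(model21)` output above; the height and weight values are arbitrary and chosen only for illustration.

```r
# Coefficients copied from the summary(model21) output
b0       <- 6.67344   # intercept
b_guard  <- 3.29315   # G1 (guard indicator)
b_height <- -1.08678  # Height, in feet
b_weight <- 0.04468   # Weight, in pounds

# Hypothetical player profile (an assumption, not from the data)
height_ft <- 6.5
weight_lb <- 200

# Predicted points per game for a guard (1) or a non-guard (0)
predict_points <- function(is_guard) {
  b0 + b_guard * is_guard + b_height * height_ft + b_weight * weight_lb
}

non_guard <- predict_points(0)
guard     <- predict_points(1)
print(c(non_guard = non_guard, guard = guard, difference = guard - non_guard))
```

Whatever profile is chosen, the difference between the two predictions equals the G1 coefficient, because the model has no interaction between the guard indicator and the body measurements.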

Question 2

As the game of basketball has developed rapidly over the past two decades, the NBA has introduced new rules to adapt to these changes. One significant reform, made in 2005, prohibits drafting players directly out of high school. The league believed that playing college-level basketball and having a college experience would better prepare young prospects both on and off the court. But is this policy effective? Are players who have been to U.S. colleges really more successful? For the second research question, we are therefore interested in whether education plays a role in a player’s future salary. The goal is to find out whether attending a U.S. college is associated with greater success for NBA players, as measured by salary.

nba_college <- nba %>%
  group_by(College)%>%
  summarise(Count = n())

# The 10 colleges that produced the most players in this dataset
top_colleges <- c("Kentucky", "Duke", "UCLA", "Kansas", "NorthCarolina",
                  "Texas", "Arizona", "USC", "Villanova", "Washington")

nba3 <- mutate(nba,
       college_level = ifelse(College == "NaN", "N",
                              ifelse(College %in% top_colleges, "T", "U")))

First, we construct a new variable, college_level. From the dataset, we find the top 10 colleges in terms of the number of NBA players each school has produced. We label players who attended these colleges as T (top basketball colleges). Players who attended any other U.S. college are labeled U (average colleges). Finally, players who did not attend college in the U.S. are labeled N (never attended). Note that many players in the N category are not from the U.S. at all and may have attended college in their home country. Because we are trying to assess the effect of attending a U.S. college, we put everyone who did not attend college in the U.S. under the label N.
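The labeling rule described above can be sketched on a small made-up vector. This is an illustration under stated assumptions, not the code used in the report: the `college` values are invented, and `top_n` is set to 2 instead of 10 to keep the example short.

```r
# Toy vector of colleges (made-up data); "NaN" marks no U.S. college
college <- c("Duke", "Duke", "Duke", "Kentucky", "Kentucky", "Texas",
             "NaN", "NaN", "NaN", "NaN")
top_n <- 2  # the report uses the top 10

# Count players per college, ignoring those without a U.S. college
counts <- table(college[college != "NaN"])

# Names of the top_n colleges by player count
top_colleges <- names(sort(counts, decreasing = TRUE))[seq_len(top_n)]

# N = never attended, T = top college, U = any other U.S. college
college_level <- ifelse(college == "NaN", "N",
                        ifelse(college %in% top_colleges, "T", "U"))
print(table(college_level))
```

Deriving the list from the counts avoids retyping college names by hand, although ties at the cutoff would still need a manual decision.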

For this research question, we are going to examine whether this newly created variable college level has an effect on the mean and distribution of player’s salaries.

ggplot(nba3,aes(x=college_level))+
  geom_bar(aes(fill=college_level))+
  labs(x="College Level",
       y="count", 
       title = "College levels of players")+
  scale_fill_discrete(labels = c("Never attended", "Top college", "college"))

First, we construct a barplot of the counts in each college level category to assess the distribution of college_level and to better understand our variable of interest. From the plot, we can see that players who never attended a U.S. college have the lowest count (55), players who attended average U.S. colleges have the largest count (221), and players who attended top basketball colleges fall in the middle (118).

ggplot(nba3,aes(x=Salary))+
  geom_density(aes(color=college_level))+
  geom_vline(xintercept = 11169364, color="green",linetype="dotted")+
  geom_vline(xintercept = 8438443, color="blue",linetype="dotted")+
  geom_vline(xintercept = 9744256, color="red",linetype="dotted")+
  labs(title = "Salary by College Level")+
  scale_color_discrete(labels = c("Never attended", "Top University", "University"))

This second plot shows the distribution of players’ salaries, conditional on college level. We also added three dotted vertical lines to mark each group’s mean salary. From the plot, we can see that the salary distributions for players who didn’t attend U.S. colleges and for players who attended top basketball colleges are very similar: both are right-skewed with a single peak near the left end. The distribution for players who attended average U.S. colleges is slightly different: it is more right-skewed and has a much higher peak near the left end. As for the means, players who attended top basketball colleges have the highest mean salary, followed by players who didn’t attend any U.S. college. Players who attended average U.S. colleges have the lowest mean.

However, we cannot draw statistical conclusions from these graphs alone. We need formal statistical tests to gather enough evidence.

Statistical Tests

t.test(subset(nba3,college_level=="T")$Salary,
       subset(nba3,college_level=="U")$Salary)

## 
##  Welch Two Sample t-test
## 
## data:  subset(nba3, college_level == "T")$Salary and subset(nba3, college_level == "U")$Salary
## t = 2.1975, df = 185.08, p-value = 0.02922
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   279195.5 5182645.8
## sample estimates:
## mean of x mean of y 
##  11169364   8438443

t.test(subset(nba3,college_level=="T")$Salary,
       subset(nba3,college_level=="N")$Salary)

## 
##  Welch Two Sample t-test
## 
## data:  subset(nba3, college_level == "T")$Salary and subset(nba3, college_level == "N")$Salary
## t = 0.82226, df = 118.74, p-value = 0.4126
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2006791  4857006
## sample estimates:
## mean of x mean of y 
##  11169364   9744256

t.test(subset(nba3,college_level=="N")$Salary,
       subset(nba3,college_level=="U")$Salary)

## 
##  Welch Two Sample t-test
## 
## data:  subset(nba3, college_level == "N")$Salary and subset(nba3, college_level == "U")$Salary
## t = 0.87008, df = 78.689, p-value = 0.3869
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1681643  4293269
## sample estimates:
## mean of x mean of y 
##   9744256   8438443

We used three separate Welch two-sample t-tests to test whether the differences in mean salary across college levels are significant. The null hypothesis for each test is that the true difference in means is equal to 0. From the above results, only the first test yields a p-value that is significant at the 0.05 level. Therefore, there is enough evidence to suggest that the mean salary of players who attended a top basketball college differs from that of players who attended an average U.S. college. For the other two tests, we fail to reject the null hypothesis.
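For readers who want to see what the Welch test computes, the sketch below reproduces its t statistic and degrees of freedom by hand and compares them with `t.test`. The two samples are made-up numbers, not salaries from the dataset.

```r
# Toy samples in millions of USD (made-up numbers)
x <- c(12.1, 3.4, 25.0, 8.7, 1.9, 15.2)
y <- c(2.2, 9.8, 4.1, 6.3, 1.5, 7.7, 3.0)

# Per-group variance of the sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch's t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# t.test() uses the Welch version by default (var.equal = FALSE)
res <- t.test(x, y)
print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(res$parameter)))
```

The non-integer degrees of freedom in the outputs above (for example 185.08) come from this approximation, which does not assume equal variances in the two groups.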

ks.test(subset(nba3,college_level=="T")$Salary,
        subset(nba3,college_level=="U")$Salary)

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  subset(nba3, college_level == "T")$Salary and subset(nba3, college_level == "U")$Salary
## D = 0.14707, p-value = 0.0816
## alternative hypothesis: two-sided

ks.test(subset(nba3,college_level=="T")$Salary,
        subset(nba3,college_level=="N")$Salary)

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  subset(nba3, college_level == "T")$Salary and subset(nba3, college_level == "N")$Salary
## D = 0.092744, p-value = 0.9116
## alternative hypothesis: two-sided

ks.test(subset(nba3,college_level=="N")$Salary,
        subset(nba3,college_level=="U")$Salary)

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  subset(nba3, college_level == "N")$Salary and subset(nba3, college_level == "U")$Salary
## D = 0.1747, p-value = 0.1331
## alternative hypothesis: two-sided

Next, we used three separate two-sample KS tests to test whether the distributions of players’ salary, conditional on college level, are the same. The null hypothesis for each test is that the salary distributions of the two groups are the same. None of the three tests yields a p-value that is significant at the 0.05 level, so we cannot reject any of the three null hypotheses. We don’t have enough evidence to suggest that the conditional distributions of salary differ from one another across college levels.
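The D statistic reported by the KS test is the largest vertical gap between the two empirical distribution functions. The sketch below computes it by hand on made-up samples without ties and checks it against `ks.test`; the numbers are illustrative only.

```r
# Toy samples with no ties (made-up numbers)
x <- c(1.2, 3.4, 5.6, 7.8, 9.1)
y <- c(2.3, 4.5, 6.7, 8.9, 10.2, 11.5, 12.8)

# Evaluate both empirical CDFs at every observed value
pooled <- sort(c(x, y))
gap <- abs(ecdf(x)(pooled) - ecdf(y)(pooled))

# KS statistic: the largest absolute difference between the two ECDFs
d_manual <- max(gap)

res <- ks.test(x, y)
print(c(d_manual = d_manual, d_builtin = unname(res$statistic)))
```

Because D looks at the whole distribution rather than only the mean, two groups can have different means (as in the first t-test) without producing a significant KS result.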

In conclusion, we don’t think that education is an important factor in a player’s success as measured by salary. The conditional distributions of salary by college level are not significantly different, and the differences in means are not significant either, with the exception of players who attended top basketball colleges versus players who attended average colleges. Our findings suggest that if a player attended a U.S. college, then whether the college has a strong basketball program may make a difference in how much the player will earn in the NBA. Otherwise, attending a U.S. college doesn’t seem to have a large effect on players’ success. One reason could be that the NBA has become more and more international in recent years. Scouts and managers are paying more attention to overseas players and have faith in their ability to play in the current league.

One limitation of this research question is that the dataset does not include players’ nationality. In future work, if we could extend the dataset with columns such as nationality and the colleges that international players attended, we could examine the effect of education on players’ success more accurately and comprehensively. Additionally, we could improve the way we construct the college_level variable. A more widely accepted method of identifying top basketball colleges could change the distribution of college_level and yield different results.

Question 3

We also wanted to investigate whether the other variables affect a player’s salary. To understand the extent to which salary is affected by other variables, we start with scatter plots of salary against the three major performance variables: points, rebounds, and assists. Since some players can play multiple positions, we merge the positions into three types for simplicity: center, forward, and guard.

nba1 = nba
nba$Position %>% replace(., .=="PG"|.=="SG", "G") %>% 
  replace(., .=="PF"|.=="SF", "F") %>% factor() -> nba1$Position

m_1 = lm(Salary ~ Points + Points * Position, data = nba1) 

m_2 = lm(Salary ~ Rebounds + Rebounds * Position, data = nba1)

m_3 = lm(Salary ~ Assists + Assists * Position, data = nba1)

points <- ggplot(data = nba1, aes(x = Points, y = Salary, color = Position)) +
  geom_point(alpha = 0.6) + 
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE)+
  labs(title = "Salary and Points")
rebounds <- ggplot(data = nba1, aes(x = Rebounds, y = Salary, color = Position)) +
  geom_point(alpha = 0.6) + 
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE)+
  labs( title = "Salary and Rebounds")
assists <- ggplot(data = nba1, aes(x = Assists, y = Salary, color = Position)) +
  geom_point(alpha = 0.6) + 
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE)+
  labs( title = "Salary and Assists")
# Stack the three scatter plots vertically
ggarrange(points, rebounds, assists, ncol = 1)

All three variables have a positive relationship with salary. The slope of salary on points is almost the same for all positions. For rebounds, guards have the steepest slope and centers the shallowest. For assists, guards have a shallower slope than the other two positions.

summary(m_1)

## 
## Call:
## lm(formula = Salary ~ Points + Points * Position, data = nba1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -19516616  -3551919    -38982   2840190  23872957 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -3400771    1536149  -2.214   0.0274 *  
## Points            1240332     134196   9.243   <2e-16 ***
## PositionF         -602134    1831166  -0.329   0.7425    
## PositionG         -432813    1845279  -0.235   0.8147    
## Points:PositionF    28295     159002   0.178   0.8589    
## Points:PositionG   -62105     153360  -0.405   0.6857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6518000 on 388 degrees of freedom
## Multiple R-squared:  0.5927, Adjusted R-squared:  0.5874 
## F-statistic: 112.9 on 5 and 388 DF,  p-value: < 2.2e-16

summary(m_2)

## 
## Call:
## lm(formula = Salary ~ Rebounds + Rebounds * Position, data = nba1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -19313036  -4934994  -1010494   3090134  34211728 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2960022    2264359  -1.307  0.19191    
## Rebounds            1819569     315050   5.775 1.57e-08 ***
## PositionF          -1847052    2803623  -0.659  0.51041    
## PositionG            794100    2656506   0.299  0.76516    
## Rebounds:PositionF  1360537     477842   2.847  0.00464 ** 
## Rebounds:PositionG  2013254     494105   4.075 5.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8180000 on 388 degrees of freedom
## Multiple R-squared:  0.3585, Adjusted R-squared:  0.3502 
## F-statistic: 43.37 on 5 and 388 DF,  p-value: < 2.2e-16

summary(m_3)

## 
## Call:
## lm(formula = Salary ~ Assists + Assists * Position, data = nba1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -21800713  -4059112   -858652   3460331  27252577 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2358709    1313314   1.796   0.0733 .  
## Assists            4493650     674337   6.664 9.17e-11 ***
## PositionF         -2370731    1604964  -1.477   0.1405    
## PositionG         -2517267    1659505  -1.517   0.1301    
## Assists:PositionF   259453     778690   0.333   0.7392    
## Assists:PositionG -1271763     722231  -1.761   0.0790 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7389000 on 388 degrees of freedom
## Multiple R-squared:  0.4765, Adjusted R-squared:  0.4698 
## F-statistic: 70.63 on 5 and 388 DF,  p-value: < 2.2e-16

To test whether these slope differences are statistically meaningful, we fit the three linear regressions above, each regressing salary on one performance variable and its interaction with position (center is the reference level). Neither Points interaction is significant. Both Rebounds interactions are significant at the 0.01 level, and the Assists interaction for guards is significant only at the 0.1 level (p = 0.079).

Given this, we conclude that the ability to score points is similarly important for players in all positions to earn a high salary. On the other hand, assists appear somewhat less important for the salary of guards, and rebounds are more strongly associated with salary for guards and less strongly for centers.
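The position-specific slopes implied by the interaction models can be recovered by adding each interaction coefficient to the reference slope. The arithmetic below uses the coefficients copied from the `summary(m_2)` and `summary(m_3)` outputs above, with center (C) as the reference level.

```r
# Rebounds model (m_2): USD of salary per additional rebound per game
reb_base <- 1819569   # Rebounds (slope for the reference level, C)
reb_F    <- 1360537   # Rebounds:PositionF
reb_G    <- 2013254   # Rebounds:PositionG
rebound_slopes <- c(C = reb_base, F = reb_base + reb_F, G = reb_base + reb_G)

# Assists model (m_3): USD of salary per additional assist per game
ast_base <- 4493650   # Assists (slope for the reference level, C)
ast_F    <- 259453    # Assists:PositionF
ast_G    <- -1271763  # Assists:PositionG
assist_slopes <- c(C = ast_base, F = ast_base + ast_F, G = ast_base + ast_G)

print(rebound_slopes)
print(assist_slopes)
```

Read this way, the estimated rebound slope for guards is roughly twice that for centers, while the estimated assist slope for guards is the smallest of the three groups.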

Moving forward, we conducted a principal component analysis (PCA) to look at the differences between positions across all quantitative variables. We made a biplot of the first two dimensions and colored the points by position.

nba_quant = select(nba1, c(4,5,6,8,9,10,11))
nba_pca <- prcomp(nba_quant, center = T, scale. = T)
fviz_pca_biplot(nba_pca, label = "var",
                alpha.ind = 0.25,
                alpha.var = 0.75,
                habillage = nba1$Position , pointshape = 19)

From this biplot, we notice that the centers of the point clouds for all positions are at about the same value in the first dimension, but are very different in the second dimension. The guard group has a positive average dim-2 value, the center group has a negative average dim-2 value, and the forward group has its average near zero. For guards, the value in dimension two is higher than for the other two groups. In fact, very few points from the center and forward groups have a positive value in dimension two. This indicates that guards may have more assists, fewer rebounds, lower weight, and lower height. For centers, the negative value in dimension two may indicate greater height and weight, more rebounds, and fewer assists.
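The quantities read off the biplot (variable loadings and group centers on the first two dimensions) can also be inspected numerically. The sketch below does this on R's built-in `mtcars` data, which is only a stand-in: the column choice and the use of `cyl` as the grouping variable are assumptions for illustration and have nothing to do with the NBA data.

```r
# Built-in data used purely as a stand-in for the player table
quant <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# Centered and scaled PCA, matching the prcomp() settings used above
pca <- prcomp(quant, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: how each variable contributes to the first two dimensions
print(round(pca$rotation[, 1:2], 2))

# Average position of each group on the first two dimensions
scores <- as.data.frame(pca$x[, 1:2])
centers <- aggregate(scores, by = list(group = mtcars$cyl), FUN = mean)
print(centers)
```

Comparing the sign of a group's average score with the sign of each loading on the same dimension is what supports statements such as "guards tend to have more assists and lower weight".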

Conclusion

In this project, we have shown that weight is a significant predictor of points scored, even after controlling for whether the player is a guard, and that guards score more points on average. Further, we found little evidence that attending a U.S. college is related to a player’s salary overall. The only significant difference in mean salary was between players from top basketball colleges and players from other U.S. colleges. Finally, we found that players of different positions differ in body measurements and in their performance on the court. In particular, guards may have more assists than centers, while centers have more rebounds as well as greater weight and height.

Beyond our three research questions, some additional questions remain unanswered by this project. One area of further investigation would be fitting several statistical models that use the predictor variables in the dataset to predict a player’s salary. Another topic of interest would be a more formal analysis of how a player’s weight and height are related to their position on the court.

NBA Active Players in the 2021-2022 Season

Jintong Chang, Angela Chen, Xinhang Yu, Yiming Zhao

2022-04-29