rm(list = ls())
setwd("/Users/angelachen/Desktop/36315/315project")
# Cleanup and Prep
library(tidyverse)
library(GGally)
library(readr)
library(dplyr)
library(ggpubr)
library(factoextra)
nba = read_csv("players.csv")
# players who didn't go to college are marked as "NaN"
nba$College[is.na(nba$College)] <- "NaN"
nba = na.omit(nba)
nba$Height = as.numeric(substring(nba$Height,1,1)) + as.numeric(substring(nba$Height,4,4))*0.0833333
nba = nba[,-6]
nba2 = nba[,-c(1,3,7)] #remove categorical variables with > 30 categories
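The height conversion above reads a single character for feet and a single character for inches, and multiplies inches by 0.0833333 (roughly 1/12). If the raw strings can contain two-digit inches such as 10 or 11, only the first digit would be read. The sketch below shows one way to parse both numbers in full. It assumes heights are stored as feet and inches separated by non-digit characters; the helper name `height_to_feet` and the example strings are hypothetical and do not come from the dataset.

```r
# Hypothetical helper: convert "feet<separator>inches" strings to decimal feet.
# Assumes each string contains the feet first and the inches second.
height_to_feet <- function(h) {
  # Pull out every run of digits from each string.
  parts <- regmatches(h, gregexpr("[0-9]+", h))
  vapply(parts, function(p) {
    if (length(p) < 2) return(NA_real_)  # cannot parse: leave as missing
    as.numeric(p[1]) + as.numeric(p[2]) / 12
  }, numeric(1))
}

# Made-up example inputs in two possible formats.
example_heights <- c("6' 7\"", "6' 10\"", "7-0")
print(height_to_feet(example_heights))
```

Swapping this in for the `substring` call would change the Height values for any player with two-digit inches, so the regression outputs in this report would need to be regenerated.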
This project explores a dataset of active NBA players in the 2020-2021 season. Each observation represents one player. The dataset contains variables describing the player, such as their weight and height, their salary in 2020-2021, and their position. It also has variables describing each player’s performance, such as the average number of points scored per game and the average number of assists per game.
For our purposes, we cleaned the dataset by removing rows with missing values from the original Kaggle dataset. We also transformed some existing variables to better visualize and analyze relationships. The final dataset contains 394 rows and 10 columns, 7 of which are quantitative variables and 3 of which are categorical.
The quantitative variables of interest are:
Age: the player’s age in years
Height: the player’s height in feet
Weight: the player’s weight in pounds
Salary: the player’s salary in USD
Points: the average number of points the player scored per game
Rebounds: the average number of rebounds (i.e. retrievals of the basketball directly after a missed shot) per game
Assists: the average number of assists (i.e. passes to a teammate that lead directly to a score by field goal) per game
The categorical variables of interest are:
Position: the player’s position on the court
Team: the team of the player
College: the college the player went to
Given the data, we were interested in the factors behind each player’s success. For our purposes, a player’s success can be measured by their salary and the points they score per game. In particular, we wanted to explore three research questions.
First, looking at the performance of players, (1) how do body measurements, including weight and height, relate to the player’s ability to score points, conditional on position?
Second, since we are college students with friends who are student athletes, we were interested in (2) whether there is an association between a player’s salary and the college they went to. Do NBA players who went to top basketball colleges earn more than those who did not? Is going to (a top basketball) college really worth it for basketball players?
Finally, speaking of money, we wanted to look at some other possible predictors of a player’s salary. Specifically, our question was: (3) How do position, team, age, height, weight, and performance affect salary?
We wanted to better investigate whether body measurements influence the player’s performance on the basketball court, conditional on position, which suggests we should examine Weight, Height, Position, and Points. We plotted these variables using boxplots and histograms.
First, we explore the height and weight of current NBA players by position. Since each comparison involves one categorical variable and one numerical variable, side-by-side boxplots make the most sense. In the analysis below, we have collapsed the Position variable from 7 categories (C, F, G, PF, SF, PG, SG) to 3 categories (C, F, G).
nba_yu = nba
# collapse 7 positions into guard (G), center (C), and forward (F)
nba_yu$Po = as.factor(
  ifelse(nba_yu$Position %in% c("G","PG","SG"),"G",
  ifelse(nba_yu$Position=="C","C","F")))
nba_yu %>% group_by(Po) %>% summarize(mn=mean(Points,na.rm=T))
#boxplot
#height by position
ggplot(nba_yu, aes(x=Height))+
geom_boxplot(aes(fill=Po))+
labs( title = "Height by Position")
#weight by position
ggplot(nba_yu, aes(x=Weight))+
geom_boxplot(aes(fill=Po))+
labs( title = "Weight by Position")
We see that there is relatively little variation in players’ heights between positions, but there is rather large variation in weight. Specifically, centers (C) are the heaviest, whereas guards (G) are the lightest.
We then explore the number of points scored by each position with the histogram shown below.
ggplot(nba_yu, aes(x=Points))+
geom_histogram(fill="white", color="black",binwidth=3,
aes(y = after_stat(density)))+
geom_density()+
geom_vline(aes(xintercept=9.87, color="C"),size=1.5) +
geom_vline(aes(xintercept=9.96, color="F"),size=1) +
geom_vline(aes(xintercept=12.0, color="G"),size=1.5) +
scale_color_manual(name = "statistics",
values = c(C = "blue", F = "red",
G="purple"))+
scale_x_continuous(limits = c(0, 20))+
labs( title = "Points by Position")
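The three vertical lines above use hard-coded group means. A minimal sketch of deriving such means from the data instead is shown here, so the lines stay correct if the data changes. The data frame `toy` is a made-up stand-in for `nba_yu`, and `pos_means` is a hypothetical name.

```r
# Toy stand-in for nba_yu; the values are invented for illustration only.
toy <- data.frame(
  Po = c("C", "C", "F", "F", "G", "G"),
  Points = c(8, 12, 9, 11, 10, 14)
)

# One row per position with its mean points per game.
pos_means <- aggregate(Points ~ Po, data = toy, FUN = mean)
print(pos_means)
```

With ggplot2, a data frame like `pos_means` can be passed to a single layer, `geom_vline(data = pos_means, aes(xintercept = Points, color = Po))`, in place of the three separate `geom_vline` calls.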
From the graph above, the mean points for C and F are nearly identical (9.87 vs. 9.96), while guards average noticeably more (12.0).
We now use position, height, and weight to predict the points scored by each player. We start by regressing points on the original 7-category position variable together with height and weight; the results are below.
#model
model2<-lm(Points ~ Position + Height + Weight, data=nba_yu)
summary(model2)
##
## Call:
## lm(formula = Points ~ Position + Height + Weight, data = nba_yu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1647 -4.6420 -0.9638 3.4497 20.1889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.60919 9.36517 -0.599 0.549564
## PositionF -2.41587 2.39671 -1.008 0.314089
## PositionG 5.00909 2.78443 1.799 0.072807 .
## PositionPF 1.31716 1.15507 1.140 0.254860
## PositionPG 7.13457 1.66032 4.297 2.19e-05 ***
## PositionSF 3.65520 1.25197 2.920 0.003711 **
## PositionSG 5.58976 1.42432 3.925 0.000103 ***
## Height -0.80163 1.21920 -0.658 0.511248
## Weight 0.08268 0.02172 3.806 0.000164 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.254 on 385 degrees of freedom
## Multiple R-squared: 0.07736, Adjusted R-squared: 0.05819
## F-statistic: 4.035 on 8 and 385 DF, p-value: 0.0001274
PG and SG are both guard positions, and the dataset also has a separate generic guard position (G), whose coefficient is marginally significant (p ≈ 0.073). It is possible that playing guard is itself associated with higher points, which is also what we saw in the EDA histogram. Thus, we combine these three positions into a new indicator variable G, representing whether a player is a guard (PG, SG, G) or not, and obtain the regression results below. We keep the Weight and Height variables as they are.
# indicator: 1 if the player is any kind of guard, 0 otherwise
nba_yu$G = as.factor(ifelse(nba_yu$Position %in% c("G","PG","SG"),1,0))
model21<-lm(Points ~ G + Height + Weight, data=nba_yu)
summary(model21)
##
## Call:
## lm(formula = Points ~ G + Height + Weight, data = nba_yu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.472 -4.534 -1.263 3.512 20.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.67344 8.50069 0.785 0.43290
## G1 3.29315 0.89820 3.666 0.00028 ***
## Height -1.08678 1.19159 -0.912 0.36231
## Weight 0.04468 0.01753 2.548 0.01121 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.331 on 390 degrees of freedom
## Multiple R-squared: 0.04235, Adjusted R-squared: 0.03498
## F-statistic: 5.748 on 3 and 390 DF, p-value: 0.0007424
The regression results above are consistent with the hypothesis that playing guard is associated with the points a player scores. Also, even after controlling for the guard indicator, Weight remains a significant predictor of points scored at the 0.05 level.
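Because the guard indicator is a coarser version of Position, the second model is nested in the first, and the two fits can be compared with a partial F-test. The sketch below reconstructs that test from the rounded residual standard errors and degrees of freedom printed in the two summaries above. It is an illustrative check rather than part of the original analysis, and the variable names are hypothetical.

```r
# Values copied from the two printed summaries (rounded, so results are approximate).
rse_full <- 6.254;    df_full <- 385     # Points ~ Position + Height + Weight
rse_reduced <- 6.331; df_reduced <- 390  # Points ~ G + Height + Weight

# Residual sum of squares = (residual standard error)^2 * residual df.
rss_full <- rse_full^2 * df_full
rss_reduced <- rse_reduced^2 * df_reduced

# Partial F statistic for the extra position coefficients in the full model.
extra_df <- df_reduced - df_full
f_stat <- ((rss_reduced - rss_full) / extra_df) / (rss_full / df_full)
p_val <- pf(f_stat, extra_df, df_full, lower.tail = FALSE)
print(c(F = f_stat, p = p_val))
```

With these rounded inputs the F statistic comes out near 2.9 with a p-value below 0.05, which hints that the finer position categories carry some information beyond the guard indicator. With the fitted objects available, `anova(model21, model2)` performs the same comparison without rounding.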
As the game of basketball has developed rapidly over the past two decades, the NBA has made new rules to adapt to these changes. One significant reform the league made in 2005 was prohibiting the drafting of high school players. The league believed that playing college-level basketball and having a college experience would better prepare young prospects both on and off the court. But is this policy effective? Are players who attended U.S. colleges really more successful? For the second research question, we are interested in whether education plays a role in a player’s salary. The goal is to find out whether attending a U.S. college is associated with greater success for NBA players, measured by salary.
# number of players per college
nba_college <- nba %>%
  group_by(College)%>%
  summarise(Count = n())
# top 10 colleges by number of players in the dataset
top_colleges <- c("Kentucky","Duke","UCLA","Kansas","NorthCarolina",
                  "Texas","Arizona","USC","Villanova","Washington")
# N = no U.S. college, T = top basketball college, U = other U.S. college
nba3<-mutate(nba,
  college_level=ifelse(College=="NaN","N",
    ifelse(College %in% top_colleges,"T","U")))
First, we construct a new variable, college_level. From the dataset, we find the top 10 colleges in terms of the number of NBA players that the school has produced. Then, we label the players who attended these colleges with a college_level of T (top basketball colleges). Players who attended other colleges are labeled with a college_level of U (average colleges). Finally, we label those who didn’t attend college in the U.S. as N (never attended).
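The labeling rule described above can be sketched in base R so that the list of top colleges is derived from the counts rather than typed by hand. The vector `colleges` below is invented for illustration, and the cutoff `n_top` is set to 2 only because the toy data is small; the report uses the top 10.

```r
# Made-up college column; "NaN" marks players with no U.S. college.
colleges <- c("Duke", "Duke", "Kentucky", "NaN", "UCLA",
              "Duke", "Kentucky", "NaN", "Texas")

# Count players per college, ignoring the "NaN" placeholder.
counts <- sort(table(colleges[colleges != "NaN"]), decreasing = TRUE)

# Hypothetical cutoff: keep the n_top most common colleges.
n_top <- 2
top_colleges <- names(head(counts, n_top))

# N = no U.S. college, T = top college, U = any other college.
college_level <- ifelse(colleges == "NaN", "N",
                 ifelse(colleges %in% top_colleges, "T", "U"))
print(table(college_level))
```

Ties at the cutoff would need an explicit rule (for example, including every college tied with the last one kept), which this sketch does not handle.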
Note that many players in the N category are not from the U.S. at all. They may have attended college in their home country. In our case, we are trying to assess the effect of attending a U.S. college, so we put everyone who didn’t attend college in the U.S. under the label N.
For this research question, we examine whether the newly created variable college_level has an effect on the mean and distribution of players’ salaries.
ggplot(nba3,aes(x=college_level))+
geom_bar(aes(fill=college_level))+
labs(x="College Level",
y="Count",
title = "College levels of players")+
scale_fill_discrete(labels = c("Never attended", "Top college", "Average college"))
First, we construct a barplot of the counts of the different college_level categories to assess the distribution of college_level and to better understand our variable of interest. From the plot, we can see that players who never attended a U.S. college have the lowest count (55), players who attended average U.S. colleges have the largest count (221), and players who attended top basketball colleges fall in the middle (118).
ggplot(nba3,aes(x=Salary))+
geom_density(aes(color=college_level))+
geom_vline(xintercept = 11169364, color="green",linetype="dotted")+
geom_vline(xintercept = 8438443, color="blue",linetype="dotted")+
geom_vline(xintercept = 9744256, color="red",linetype="dotted")+
labs(title = "Salary by College Level")+
scale_color_discrete(labels = c("Never attended", "Top University", "University"))
This second plot shows the distribution of players’ salaries, conditional on their college level. We also added three dotted vertical lines to the plot to represent each group’s mean salary. From the plot, we can see that the salary distributions for players who didn’t attend U.S. colleges and players who attended top basketball colleges are very similar. Both are skewed to the right and have a single peak near the left end. In contrast, the distribution for players who attended average U.S. colleges is slightly different: it is more skewed to the right and has a much higher peak near the left end than the other two distributions. For the mean salaries, we see that players who attended top basketball colleges have the highest mean, followed by players who didn’t attend any U.S. college. Players who attended average U.S. colleges have the lowest mean.
However, we cannot draw statistical conclusions just from observing the above graphs. We need to conduct statistical tests to have enough evidence.
t.test(subset(nba3,college_level=="T")$Salary,
subset(nba3,college_level=="U")$Salary)
##
## Welch Two Sample t-test
##
## data: subset(nba3, college_level == "T")$Salary and subset(nba3, college_level == "U")$Salary
## t = 2.1975, df = 185.08, p-value = 0.02922
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 279195.5 5182645.8
## sample estimates:
## mean of x mean of y
## 11169364 8438443
t.test(subset(nba3,college_level=="T")$Salary,
subset(nba3,college_level=="N")$Salary)
##
## Welch Two Sample t-test
##
## data: subset(nba3, college_level == "T")$Salary and subset(nba3, college_level == "N")$Salary
## t = 0.82226, df = 118.74, p-value = 0.4126
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2006791 4857006
## sample estimates:
## mean of x mean of y
## 11169364 9744256
t.test(subset(nba3,college_level=="N")$Salary,
subset(nba3,college_level=="U")$Salary)
##
## Welch Two Sample t-test
##
## data: subset(nba3, college_level == "N")$Salary and subset(nba3, college_level == "U")$Salary
## t = 0.87008, df = 78.689, p-value = 0.3869
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1681643 4293269
## sample estimates:
## mean of x mean of y
## 9744256 8438443
We used three separate Welch two-sample t-tests to test whether the differences in mean salary across college levels are significant. The null hypothesis for each test is “the true difference in means is equal to 0”. From the above results, we can see that only the first test yields a p-value that is significant at the 0.05 level. Therefore, for the difference in means, what we are able to conclude is that there is enough evidence to suggest that the difference in mean salary between players who attended a top basketball college and players who attended an average U.S. college is not zero. For the other two tests, we fail to reject the null hypotheses.
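Running three pairwise tests on the same data raises the chance of at least one false positive, so it can be informative to look at the p-values after a multiple-comparison adjustment. The sketch below applies Holm’s method, one common choice that was not used in the analysis above, to the three p-values printed in the t-test output. The vector names are hypothetical.

```r
# p-values copied from the three Welch t-tests above.
p_raw <- c(T_vs_U = 0.02922, T_vs_N = 0.4126, N_vs_U = 0.3869)

# Holm's step-down adjustment for multiple comparisons (base R, stats package).
p_holm <- p.adjust(p_raw, method = "holm")
print(round(p_holm, 4))
```

Under this adjustment the top-college vs. average-college comparison has an adjusted p-value of about 0.088, so the evidence for that difference is weaker than the unadjusted 0.029 suggests.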
ks.test(subset(nba3,college_level=="T")$Salary,
subset(nba3,college_level=="U")$Salary)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: subset(nba3, college_level == "T")$Salary and subset(nba3, college_level == "U")$Salary
## D = 0.14707, p-value = 0.0816
## alternative hypothesis: two-sided
ks.test(subset(nba3,college_level=="T")$Salary,
subset(nba3,college_level=="N")$Salary)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: subset(nba3, college_level == "T")$Salary and subset(nba3, college_level == "N")$Salary
## D = 0.092744, p-value = 0.9116
## alternative hypothesis: two-sided
ks.test(subset(nba3,college_level=="N")$Salary,
subset(nba3,college_level=="U")$Salary)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: subset(nba3, college_level == "N")$Salary and subset(nba3, college_level == "U")$Salary
## D = 0.1747, p-value = 0.1331
## alternative hypothesis: two-sided
Next, we used three separate two-sample KS tests to test whether the distributions of players’ salaries conditional on college level are the same. The null hypothesis for each test is “the salary distributions of the two groups of players are the same”. According to the test results, none of the three tests yields a significant p-value at the 0.05 level. This means we cannot reject any of the three null hypotheses. We don’t have enough evidence to suggest that the conditional distributions of players’ salaries by college level differ from each other.
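The statistic D reported by each KS test is the largest vertical gap between the two groups’ empirical distribution functions. A minimal sketch of that definition follows, using simulated samples rather than the salary data; `x` and `y` are made-up stand-ins for two salary groups.

```r
# Simulated stand-ins for two groups; not the real salary data.
set.seed(1)
x <- rexp(50)
y <- rexp(80, rate = 0.8)

# Evaluate both empirical CDFs at every observed value and take the largest gap.
grid <- sort(c(x, y))
d_manual <- max(abs(ecdf(x)(grid) - ecdf(y)(grid)))

# The built-in test reports the same statistic.
d_builtin <- unname(ks.test(x, y)$statistic)
print(c(manual = d_manual, builtin = d_builtin))
```

Because D depends on the whole distribution rather than only the mean, a KS test and a t-test on the same pair of groups can disagree, as they do here for the top-college vs. average-college comparison.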
In conclusion, we don’t think that education is an important factor in a player’s success in terms of salary. The differences between the conditional distributions of salary by college level are not statistically significant, and the differences in means are not statistically significant either, with the exception of players who attended top basketball colleges vs. players who attended average colleges. Our findings suggest that if a player attended a U.S. college, then whether the college has a great basketball program may make a difference in how much the player will earn in the NBA. But otherwise, attending a U.S. college doesn’t seem to have a large effect on players’ success. One reason could be that the NBA has become more and more international in recent years. Scouts and managers are paying more attention to overseas players and have faith in their ability to play in the current league.
One limitation of this research question is that we don’t have the nationality of players as a variable in this dataset. In future work, if we could extend our dataset with more columns, such as players’ nationality and the colleges that international players attended, we could examine the effect of education level on players’ success in a more accurate and comprehensive way. Additionally, we could improve the way we construct the college_level variable. A more widely agreed-upon method of identifying top basketball colleges could change the distribution of college_level and yield different results.
We also wanted to investigate whether the other variables affect a player’s salary. To understand the extent to which salary is affected by other variables, we start with scatter plots of salary against the three major performance variables: points, rebounds, and assists. Since some players can play multiple positions, we merge the positions into three types for simplicity: center, forward, and guard.
nba1 = nba
nba$Position %>% replace(., .=="PG"|.=="SG", "G") %>%
replace(., .=="PF"|.=="SF", "F") %>% factor() -> nba1$Position
m_1 = lm(Salary ~ Points + Points * Position, data = nba1)
m_2 = lm(Salary ~ Rebounds + Rebounds * Position, data = nba1)
m_3 = lm(Salary ~ Assists + Assists * Position, data = nba1)
points <- ggplot(data = nba1, aes(x = Points, y = Salary, color = Position)) +
geom_point(alpha = 0.6) +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE)+
labs(title = "Salary and Points")
rebounds <- ggplot(data = nba1, aes(x = Rebounds, y = Salary, color = Position)) +
geom_point(alpha = 0.6) +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE)+
labs( title = "Salary and Rebounds")
assists <- ggplot(data = nba1, aes(x = Assists, y = Salary, color = Position)) +
geom_point(alpha = 0.6) +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE)+
labs( title = "Salary and Assists")
# stack the three plots vertically
ggarrange(points, rebounds, assists, ncol = 1)
All three variables have a positive relationship with salary. The slope between points and salary is almost the same for all positions, while in the plot of rebounds, guards have a higher slope and centers have a lower slope. For assists, guards have a lower slope than the other two positions.
summary(m_1)
##
## Call:
## lm(formula = Salary ~ Points + Points * Position, data = nba1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19516616 -3551919 -38982 2840190 23872957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3400771 1536149 -2.214 0.0274 *
## Points 1240332 134196 9.243 <2e-16 ***
## PositionF -602134 1831166 -0.329 0.7425
## PositionG -432813 1845279 -0.235 0.8147
## Points:PositionF 28295 159002 0.178 0.8589
## Points:PositionG -62105 153360 -0.405 0.6857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6518000 on 388 degrees of freedom
## Multiple R-squared: 0.5927, Adjusted R-squared: 0.5874
## F-statistic: 112.9 on 5 and 388 DF, p-value: < 2.2e-16
summary(m_2)
##
## Call:
## lm(formula = Salary ~ Rebounds + Rebounds * Position, data = nba1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19313036 -4934994 -1010494 3090134 34211728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2960022 2264359 -1.307 0.19191
## Rebounds 1819569 315050 5.775 1.57e-08 ***
## PositionF -1847052 2803623 -0.659 0.51041
## PositionG 794100 2656506 0.299 0.76516
## Rebounds:PositionF 1360537 477842 2.847 0.00464 **
## Rebounds:PositionG 2013254 494105 4.075 5.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8180000 on 388 degrees of freedom
## Multiple R-squared: 0.3585, Adjusted R-squared: 0.3502
## F-statistic: 43.37 on 5 and 388 DF, p-value: < 2.2e-16
summary(m_3)
##
## Call:
## lm(formula = Salary ~ Assists + Assists * Position, data = nba1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21800713 -4059112 -858652 3460331 27252577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358709 1313314 1.796 0.0733 .
## Assists 4493650 674337 6.664 9.17e-11 ***
## PositionF -2370731 1604964 -1.477 0.1405
## PositionG -2517267 1659505 -1.517 0.1301
## Assists:PositionF 259453 778690 0.333 0.7392
## Assists:PositionG -1271763 722231 -1.761 0.0790 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7389000 on 388 degrees of freedom
## Multiple R-squared: 0.4765, Adjusted R-squared: 0.4698
## F-statistic: 70.63 on 5 and 388 DF, p-value: < 2.2e-16
To check whether these slope differences are statistically meaningful, we fit a linear regression of salary on each of the three variables with a position interaction. The results above indicate that the rebound slopes for forwards and guards differ from the slope for centers (the baseline) at the 0.01 level, and that the assist slope for guards differs from the baseline at the 0.1 level (p = 0.079). None of the interaction terms for points are significant.
Given this, the data suggest that the ability to score points is similarly important for earning a high salary in all positions. On the other hand, assists matter less for the salary of guards, and rebounds matter more for guards and less for centers.
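In an interaction model, the slope for a non-baseline position is the baseline slope plus that position’s interaction coefficient. The short calculation below makes this explicit using the estimates printed in `summary(m_2)` and `summary(m_3)`; the vector names are hypothetical.

```r
# Estimates copied from summary(m_2): baseline (center) slope and interactions.
rebounds_base <- 1819569
rebounds_int <- c(C = 0, F = 1360537, G = 2013254)
rebound_slopes <- rebounds_base + rebounds_int

# Estimates copied from summary(m_3).
assists_base <- 4493650
assists_int <- c(C = 0, F = 259453, G = -1271763)
assist_slopes <- assists_base + assists_int

# Estimated change in salary (USD) per additional rebound or assist per game.
print(rebound_slopes)
print(assist_slopes)
```

For example, each additional rebound per game is associated with roughly 3.8 million USD more salary for guards versus roughly 1.8 million USD for centers, which matches the ordering of the fitted lines in the rebounds plot.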
Moving forward, we conducted a principal component analysis (PCA) to look at the differences between positions across all quantitative variables. We made a biplot of the first two dimensions and colored the points by position.
nba_quant = select(nba1, c(4,5,6,8,9,10,11))
nba_pca <- prcomp(nba_quant, center = T, scale. = T)
fviz_pca_biplot(nba_pca, label = "var",
alpha.ind = 0.25,
alpha.var = 0.75,
habillage = nba1$Position , pointshape = 19)
From this biplot, we can see that the centers of the point clouds for all positions are at about the same place in the first dimension, but are very different in the second dimension. The guard group has a positive average dim-2 value, the center group has a negative average dim-2 value, and the forward group has its average near zero. Guards have higher dim-2 values than the other two groups; in fact, very few points from the center and forward groups have a positive value in dimension two. Based on the directions of the variable arrows, this indicates that guards may have more assists, fewer rebounds, lower weight, and lower height. For centers, the negative value in dimension two may indicate greater height and weight, more rebounds, and fewer assists.
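The reading of the biplot above rests on two pieces of `prcomp` output: the loadings (which variables push a point up or down a dimension) and the group averages of the scores. A minimal sketch of inspecting both numerically follows. It uses the built-in `iris` data as a stand-in, since the NBA data is not reproduced here, and the object names are hypothetical.

```r
# Stand-in data: four numeric columns and one grouping factor from iris.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Loadings: the sign and size of each variable's contribution to PC1 and PC2.
loadings <- round(pca$rotation[, 1:2], 2)
print(loadings)

# Average score of each group on the first two components.
group_means <- aggregate(pca$x[, 1:2],
                         by = list(Group = iris$Species), FUN = mean)
print(group_means)

# Share of total variance captured by each of the first two components.
var_share <- summary(pca)$importance["Proportion of Variance", 1:2]
print(var_share)
```

Applied to `nba_pca` with `nba1$Position` as the grouping variable, the same three summaries would give the numbers behind the visual statements about dimension two.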
In this project, we have shown that weight is a significant predictor of points scored even after controlling for position, and that guards score more points on average. Further, we found little evidence that attending a U.S. college plays an important role in a player’s success in terms of salary. The one exception is that players who attended a top basketball college earn more on average than players who attended other U.S. colleges. Finally, we found that players in different positions differ in body measurements and in their performance on the basketball court. In particular, guards may have more assists than centers, while centers have more rebounds as well as greater weight and height.
Beyond our three research questions, there are additional questions that this project has not answered. One area of further investigation would be fitting several statistical models with the predictor variables in the dataset to predict a player’s salary. Another topic of interest would be how a player’s weight and height are related to the player’s position on the basketball court.