About the Dataset

The dataset we are working with is from Kaggle, and contains observations on NBA basketball games since the first NBA season in 1946-1947. Each observation of the dataset represents a game played between two different teams. It contains variables about the game such as the date of the game, the NBA season number, and the names of the teams that played. It also has variables describing the performances for both home and away teams in the game, such as points scored and field goals attempted and made, and other team metrics.

On the technical side, the dataset has 62015 rows and 150 columns, one column per variable. For our purposes, many of these variables were removed or left unused, either because of missing data or because they contained very little information or information unrelated to our research questions. To analyze and visualize the relationships that answer our questions, we had to clean the dataset in some instances. This was done through subsetting, removing missing values, and creating new variables or transforming existing ones.
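The three cleaning steps named above (subsetting, removing missing values, and creating or transforming variables) can be illustrated with a minimal Python sketch. It is not the code used for this report: the column names, the ISO-style date format, and the sample rows are all hypothetical stand-ins for the real dataset.

```python
# Hypothetical column names; the real dataset's headers may differ.
KEEP = ["game_date", "pts_home", "pts_away", "fgm_home", "fga_home"]


def clean_games(rows):
    """Subset columns, drop incomplete rows, and derive new variables."""
    cleaned = []
    for row in rows:
        # Subsetting: keep only the columns relevant to the research questions.
        subset = {col: row.get(col) for col in KEEP}
        # Removing missing values: skip any game with a blank or absent field.
        if any(value in (None, "") for value in subset.values()):
            continue
        # Transforming: convert the numeric fields from text to numbers.
        for col in KEEP[1:]:
            subset[col] = float(subset[col])
        # A game with zero recorded attempts has no defined shooting percentage.
        if subset["fga_home"] == 0:
            continue
        # Creating new variables: game year (assumes a YYYY-MM-DD date string)
        # and home field goal percentage.
        subset["year"] = int(subset["game_date"][:4])
        subset["fg_pct_home"] = subset["fgm_home"] / subset["fga_home"]
        cleaned.append(subset)
    return cleaned


# Made-up rows for illustration only.
raw = [
    {"game_date": "1985-01-05", "pts_home": "110", "pts_away": "102",
     "fgm_home": "44", "fga_home": "88", "video_available_home": "0"},
    {"game_date": "1950-11-02", "pts_home": "81", "pts_away": "",
     "fgm_home": "", "fga_home": ""},
]
games = clean_games(raw)
```

In this sketch the second row is dropped because several fields are blank, and the unused video-availability column is discarded from the first row.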

For those without much familiarity with the sport of basketball, we will try to explain the non-basic concepts we address in our analyses. To make sure the basics are covered as well: the object of basketball is to score more points than your opponent by shooting the basketball into the basket. The NBA is the professional basketball league whose teams are located in major cities across the United States, plus one in Canada. The teams are assigned to one of two conferences, the Eastern or Western Conference, based on geographic location. Each game has a home team, the team in whose arena the game is played, and an away team, their opponent (“away” from their own home arena). This should provide a foundation for the rest of our discussion below.

Research Questions

Given how large and extensive the dataset is, containing years' worth of NBA games, we decided to look into general trends and discussions raised by NBA fans and analysts over the years. Specifically, our research questions cover three main areas of interest. One, how have NBA games changed over time in terms of score and field goal percentage? Two, can we see any distinct differences in performance and team metrics between games played at home and games played away from home? Lastly, there’s always discussion about which NBA conference is better, Eastern or Western, so does the data suggest that the conferences are different and, if they are, which is better?

Graphs

To begin, we wanted to look at the general keys to victory in a basketball game: scoring points and making baskets. This graphic displays a scatterplot of points scored by the home team on the x-axis against the home team field goal percentage on the y-axis, colored by the outcome of a win or loss for the home team. Field goal percentage is the proportion of successful baskets made out of all attempts. This graph confirms what we suspected: the more points a team scores, the more likely they are to win. Similarly for field goal percentage, the more efficient a team is with their shot attempts, the more likely they are to win the game. Also as should be expected, the graph shows a positive correlation between points scored and field goal percentage, meaning the more efficient a team is at making baskets, the more points they score. Although this chart does not provide any unexpected discoveries, it shows that the perceived fundamental basketball statistics of points scored and baskets made are informative for the outcome of the game. It also shows us the first evidence of a “home field advantage”, the idea that the home team usually performs better than the away team. This is something that we will explore further through our report.
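To make the two quantities in this paragraph concrete, here is a minimal Python sketch that computes field goal percentage as made shots divided by attempts and then measures its linear association with points scored using the Pearson correlation coefficient. The numbers are invented for illustration and are not taken from the dataset, and the helper name `pearson` is our own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up (field goals made, field goals attempted, points) per home team.
sample_games = [(30, 85, 88), (36, 84, 99), (40, 86, 108), (45, 88, 121)]

# Field goal percentage: successful baskets out of all attempts.
fg_pct = [made / attempted for made, attempted, _ in sample_games]
points = [pts for _, _, pts in sample_games]

r = pearson(fg_pct, points)  # close to +1 for this toy sample
```

A value of `r` near +1 corresponds to the positive relationship described above; on the real data the association would be expected to be weaker than in this small hand-made sample.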

Continuing with the differences between home and away teams, the above graph displays the average score for each year for both home and away teams. For one, the home teams appear to consistently score higher than the away teams on a yearly average basis, which reinforces the “home field advantage” idea. We can also see that the peak average score occurred around 1960 for both home and away teams. This is interesting to consider, as many factors might play a role, such as different players in that period and perhaps slightly different rules. Although this is the scoring peak, the average appears to be moving back up in the present (2020-2021), and it might eventually exceed the earlier peak. From this scatterplot, we can clearly see the trend in annual average score over the years.
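The yearly averages plotted here amount to grouping games by year and taking the mean of the home and away scores separately. A minimal Python sketch of that aggregation follows, using invented scores rather than the real data; the function name and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def yearly_average_scores(games):
    """Return {year: (mean home points, mean away points)}."""
    # For each year keep running totals: [home points, away points, game count].
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for year, pts_home, pts_away in games:
        entry = totals[year]
        entry[0] += pts_home
        entry[1] += pts_away
        entry[2] += 1
    return {
        year: (home / count, away / count)
        for year, (home, away, count) in sorted(totals.items())
    }


# Made-up (year, home points, away points) rows for illustration only.
sample_games = [
    (1960, 120, 114),
    (1960, 116, 118),
    (2020, 112, 108),
    (2020, 110, 110),
]
averages = yearly_average_scores(sample_games)
# averages == {1960: (118.0, 116.0), 2020: (111.0, 109.0)}
```

Plotting the two values for each year against the year gives the kind of home vs away comparison discussed above.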

This graph shows how the average field goal percentage has changed over time since it began to be tracked in 1984. Once again, there is a clear distinction between the average for the home team and the average for the away team. In every year, the home teams collectively have a higher average field goal percentage than the away teams. There also appears to be a decreasing trend in field goal percentage from 1984 to about 2004. The graphic alone does not explain why this is the case, and it could certainly warrant further exploration through other variables or through topics outside of this dataset, such as rule changes or macro-strategic shifts in how teams play the game. The average field goal percentage rises modestly after 2004 and has remained roughly constant since, though it is not nearly as high as its peak in 1984.

To further discuss the change in score and field goal percentage, this dendrogram displays the games using three features for both the home and away teams: points scored, plus/minus differential (the net point margin, i.e. points scored minus points allowed), and field goal percentage. The leaf labels are colored by season of the year (summer is excluded because the NBA season is not played then). We can see the three main clusters, distinguished by branch color. Comparing the branch colors to the leaf colors, the two clearly do not line up: the leaf colors are spread throughout the leaf labels rather than grouped with the branches. This suggests that these game features aren’t affected by time of year, which makes sense, since these statistics relate more to the players’ abilities and logically shouldn’t depend on what season it is.

This graphic aims to help answer the question of whether some teams perform better than others, specifically teams in the Eastern vs Western conferences, as there are always debates about which conference is better. The above conditional density plots compare common team performance metrics (Field Goals, Assists, and Rebounds) for the teams in each conference. Since multiple teams have entered the league over the years, only games from 2005 onward were used in order to look at a stable league, and only home game variables were used to keep things simple. The plots show that, although the differences are very small, Eastern conference teams make fewer field goals and have fewer rebounds and assists than Western conference teams. The distributions look approximately normal for each category, and we will have to do additional analysis to determine whether the small differences are statistically significant.
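One possible form the additional analysis could take is a two-sided permutation test on the difference in means between the conferences. The Python sketch below is only an illustration of that technique under our own assumptions: it is not an analysis performed in this report, the sample values are invented, and the function name and parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the conference labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


# Made-up field goals per home game for a handful of games in each conference.
east = [36, 38, 37, 35, 39, 36]
west = [38, 39, 37, 40, 38, 41]
p_value = permutation_test(east, west, n_permutations=2_000)
```

A small p-value would suggest that a difference as large as the observed one rarely arises when conference labels are assigned at random. With tens of thousands of real games, even a very small difference can come out significant, so the size of the difference would still need to be judged separately.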

To continue examining the differences between Eastern and Western conference teams, we decided to run a principal component analysis (PCA). To get the right data for this analysis, we used all relevant quantitative variables, which meant removing irrelevant variables like “Video_Available_Home” and others that didn’t carry much information. Below is the plot of the first and second principal components found after running PCA. We can see that there’s a lot of overlap between the conferences, and without any variables shown it is difficult to make any interpretations, so we will move on to a more informative visual.

In order to look at all of the quantitative variables, we performed PCA on the dataset, plotted PC1 vs PC2 colored by group, and used ggbiplot to add the variables and group correlations. Many of the variables are difficult to read because of the overlap, but we can tell that the home stats are positive with PC1 and the away stats are positive with PC2. Every variable increases with at least one PC, i.e. there is no variable that increases as both PC1 and PC2 decrease. Lastly, the circles drawn for each group show us that although the Eastern and Western conferences are close, they do differ a bit, as Western is higher on PC1.
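For readers who want to see what the PCA step computes, the following Python sketch standardizes a small table of quantitative variables and obtains the component scores (the points in a PC1 vs PC2 plot) and the loadings (the variable directions drawn in a biplot) via a singular value decomposition. It is an illustration using NumPy, not the ggbiplot workflow used for the figures, and the data and column meanings are invented.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA on standardized columns; returns scores, loadings, variance share."""
    X = np.asarray(X, dtype=float)
    # Standardize so that variables on different scales contribute equally.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # game coordinates
    loadings = Vt[:n_components].T                   # variable directions
    explained = S**2 / np.sum(S**2)                  # share of total variance
    return scores, loadings, explained[:n_components]


# Made-up rows: (home points, home FG%, away points, away FG%) for five games.
X = [
    [110, 0.48, 102, 0.44],
    [98, 0.43, 105, 0.46],
    [120, 0.51, 99, 0.42],
    [105, 0.45, 108, 0.47],
    [93, 0.41, 96, 0.43],
]
scores, loadings, explained = pca(X)
```

Plotting the first column of `scores` against the second, colored by conference, gives the PC1 vs PC2 scatter. The sign of each entry in `loadings` indicates whether that variable increases or decreases along the corresponding component, though the overall sign of a component is arbitrary.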

Conclusions

This dataset provides a wealth of information on nearly every NBA basketball game ever played. We sought to use it to answer a few specific questions about the sport of basketball and the NBA. First, how have NBA games changed over time in terms of score and field goal percentage? Two of our visualizations addressed this question. We found that the average points scored each year show distinct increasing and decreasing trends over time, with an increasing trend over the last 10 years and a peak average score around 1960. The average field goal percentage each year also shows distinct increasing and decreasing trends over time, with a currently stable trend and a peak field goal percentage around 1984.

Second, can we see any distinct differences between the performance and metrics of home teams vs. away teams? Multiple visualizations helped address this question as well. Our first graphic shows that the winning team, i.e. the team that scores more points, tends to be the home team more often than not. The graphics showing the average points scored and field goal percentages over time also distinguish between home and away teams. Together, they show that home teams consistently outperform their opponents in the most important metrics. A principal component analysis shows that home team statistics are positive with PC1 and away team statistics are positive with PC2. All of this points to the validity of the idea of “home court advantage”, which posits that, all else equal, the home team carries an advantage because the game is played in its own venue. We have not yet shown this effect to be statistically significant, but it will be addressed more fully in our subsequent report.

Third, are there differences between the Eastern and Western Conferences, and is one actually better than the other? We show a set of density plots displaying key game metrics aggregated for each conference, and these plots appear to show that the Western Conference games have more field goals, rebounds, and assists per game than the Eastern Conference games. A principal component analysis also shows that the Eastern and Western Conferences are indeed different in their statistics. Though it might be premature to conclude that the Western Conference is significantly better than the Eastern Conference, the data shows that differences are present and this should warrant further analysis.

We enjoyed sifting through this dataset on NBA game information, and we are excited to continue searching for interesting and important statistical phenomena in the future.