Introduction

The dataset our team chose comes from Kaggle and covers Major League Baseball (MLB). It contains per-game batting statistics for every player with a plate appearance since 1901. Note that the dataset covers regular season games only. It is large, with 31 columns of player and game statistics. This breadth allowed us to explore several research questions through visualizations.

Research Question 1

First, we wanted to explore how MLB game statistics have changed over time. Perhaps players have improved thanks to better technology and training methods, which could raise the statistics, or perhaps stricter rules enforced in recent years have lowered them. We examine this change through a moving-average plot of mean batting averages from 1901 to 2021 and a time-series plot of the total number of MLB runs and hits from 2010 to 2019.

The graph below shows how players’ batting averages improved or weakened from 1901-2021.

This moving-average plot smooths out year-to-year noise and displays general trends in player performance over the years. Batting averages appear to rise significantly in the 1920s and to dip in the 1970s. Interestingly, batting averages show no overall increase even though the sport in general has been improving, perhaps because batting and pitching skills improve together. More recently, pitching analytics and pitching coaching strategies have been put in place, which may explain the recent decline in batting averages.
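The smoothing behind a plot like this can be sketched as follows. This is a minimal illustration, not the code used for the figure: it assumes the data has already been reduced to (season, batting average) pairs, uses a trailing window, and the function names, window length, and numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_by_season(records):
    """Map each season to the mean of the batting averages recorded in it."""
    by_season = defaultdict(list)
    for season, batting_average in records:
        by_season[season].append(batting_average)
    return {season: mean(vals) for season, vals in sorted(by_season.items())}


def moving_average(values, window):
    """Trailing moving average; the first window - 1 points are dropped."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Made-up numbers for illustration only (not taken from the dataset).
records = [
    (1901, 0.250), (1901, 0.270),
    (1902, 0.260), (1902, 0.280),
    (1903, 0.240), (1903, 0.260),
    (1904, 0.300), (1904, 0.280),
]
seasonal = mean_by_season(records)
smoothed = moving_average(list(seasonal.values()), window=3)
```

A wider window gives a smoother curve but hides short-lived changes, so the apparent size of features such as the 1920s rise depends partly on the window chosen.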

The second graph below examines the total number of MLB runs and hits from 2010-2019 to see if there are any visible differences.

We originally hypothesized that the total number of MLB runs and hits would increase over time because of technology, improved training methods, and more advanced equipment. From this graph, we see that the number of hits in 2010 and in 2019 is about the same, with minor fluctuations in between, which does not support our hypothesis for hits. Perhaps defensive and offensive skills are improving at the same time, which could explain the small variation in hits over time. It is especially interesting that there was a major dip in the total number of hits in 2018. On the other hand, the number of runs is higher in 2019 than in 2010 by about 3,000, which provides some support for our hypothesis and may be due to the reasons we proposed. Specifically, the total number of runs increased greatly from 2014 to 2017, decreased in 2018, and increased again in 2019.
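The season totals compared above amount to a grouped sum. A minimal sketch, assuming each game-level record can be reduced to a (season, runs, hits) tuple; the tuple layout, the function name, and the numbers are hypothetical and do not reflect the dataset's actual column names.

```python
from collections import Counter


def season_totals(rows, first, last):
    """Sum runs and hits per season for seasons in [first, last]."""
    runs, hits = Counter(), Counter()
    for season, r, h in rows:
        if first <= season <= last:
            runs[season] += r
            hits[season] += h
    return {season: (runs[season], hits[season]) for season in sorted(runs)}


# Made-up records for illustration only (not taken from the dataset).
rows = [(2010, 3, 7), (2010, 5, 9), (2011, 4, 8), (2009, 2, 6)]
totals = season_totals(rows, 2010, 2019)
```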

Research Question 2

We wanted to further explore how the two current MLB leagues, the American League and the National League, compare to one another. We wanted to investigate whether one league has been stronger or weaker than the other over the past century, and whether any differences are noticeable in visualizations. We therefore examine a series of graphs.

The first graph shows the number of wins and losses for both leagues by decade, from the 1900s to the present. It is important to note that many teams no longer play. We labeled these teams as defunct, as we were mainly interested in exploring teams that still exist.

The above graph suggests there are no visible differences between the number of wins and losses for the two leagues over the decades, so neither league appears to be performing, or to have ever performed, at a higher level than the other. We do, however, observe another interesting pattern in the graph: the number of defunct teams decreases over time, which is to be expected. Teams that played in the 1900s are less likely to exist today, as MLB teams change based on population and other economic factors.

The second graph shows the number of runs scored vs. the number of wins for every team, colored by league to visualize any differences between the National and American Leagues. This would show whether one league has visibly higher numbers of runs and wins. Teams that no longer exist are labeled as defunct. We also include a feature to hover over any given point to see which team it represents.

Our prediction that teams with a larger number of runs overall also have a larger number of wins is corroborated, as the scatterplot shows a roughly linear, positive pattern. In terms of differences between the leagues, we cannot identify a league that performs visibly better than the other, as there is no cluster separation between the American and National points on the graph. We do notice a cluster of defunct teams with a low number of runs and a low number of wins, which suggests that MLB teams have been improving over time.
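The strength of a linear pattern like this one can be summarized with a Pearson correlation coefficient. The sketch below is an illustration under assumptions: it takes one (runs, wins) pair per team, and the totals are made up rather than taken from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Made-up team totals for illustration only (not taken from the dataset).
team_runs = [61000, 72000, 85000, 90000, 40000]
team_wins = [7100, 8300, 9600, 10400, 4500]
r = pearson_r(team_runs, team_wins)
```

A value of r near 1 would agree with the positive pattern seen in the scatterplot. Because totals accumulate over a team's lifetime, long-lived teams have more of both runs and wins, so part of any correlation in totals reflects the number of games played.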

The third graph examines whether the number of wins per team in the American League is distinguishable from that in the National League.

One of our research questions was to compare the American League with the National League and see if we could find evidence that one league was better. This plot is a dendrogram built from the number of wins per team, colored by league. The first thing that draws attention is a cluster of National League teams on the left side of the plot and a cluster of American League teams on the right side. In the middle, there seems to be an even mix of teams from both leagues. The plot tells us that some pattern does seem to distinguish a group of American League teams from a group of National League teams based on the number of games won. However, it does not provide strong evidence that one league is better than the other.
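A dendrogram records the order in which clusters are merged by agglomerative clustering. The sketch below shows that merge process on one-dimensional win totals. It uses average linkage, which is an assumption (the linkage method behind the plot is not stated), and the team names and win totals are hypothetical.

```python
def agglomerate(values, labels):
    """Average-linkage agglomerative clustering on 1-D values.

    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [[i] for i in range(len(values))]
    merges = []

    def dist(a, b):
        # Mean absolute difference over all cross-cluster pairs.
        total = sum(abs(values[i] - values[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters (ties broken by index order).
        d, ia, ib = min(
            (dist(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        a, b = clusters[ia], clusters[ib]
        merges.append(([labels[i] for i in a], [labels[i] for i in b], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(a + b)
    return merges


# Made-up win totals for four hypothetical teams.
labels = ["Team A", "Team B", "Team C", "Team D"]
wins = [100, 102, 150, 155]
merges = agglomerate(wins, labels)
```

Here the two closest teams merge first, and the height of each join in a dendrogram corresponds to the merge distance. Coloring the leaves by league afterwards does not affect the clustering itself, which uses only the win totals.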

Research Question 3

Next, we were interested in exploring how coaching strategy affects batting statistics. Specifically, do coaches place their better batters at certain positions to maximize the number of hits and runs scored?

The graph below examines whether there is a relationship between player statistics and their batting order positions.

In this PCA plot of players’ quantitative statistics, the observations appear to be grouped by batting order position. The groups generally appear in decreasing order along PC1, that is, earlier batting positions have higher values of PC1. The groups generally appear in increasing order along PC2, that is, earlier batting positions have lower values of PC2. Within every position, a higher value of PC1 generally goes with a higher value of PC2, which gives rise to the positively sloped trend. Because we are able to see groupings in the PCA graph, we are led to believe that there is a relationship between player statistics and batting order position.
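The projection onto PC1 and PC2 can be sketched with NumPy as follows. This is an illustration under assumptions, not the code behind the plot: it standardizes each column before the decomposition (the report does not say whether the statistics were scaled), and it runs on random placeholder data rather than the dataset.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the first two principal components.

    Columns are standardized first (an assumption) and must not be constant.
    Returns the 2-D scores and the share of variance each component explains.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:2].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:2]


# Random placeholder data standing in for player statistics (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
scores, explained = pca_2d(X)
```

The sign of each principal component is arbitrary, so whether earlier batting positions land at high or low PC1 carries meaning only once the component loadings (the rows of `Vt`) are inspected.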

This visualization will help examine the relationship between a player’s quantitative playing stats (runs, hits, etc.) and which position they play.

Another question we were interested in studying was whether a player’s statistics relate to the position they play. Here we have a plot that uses PCA to reduce all of the quantitative columns to 2D, with points colored by the player’s position (battery, infield, or outfield). From this plot, we find no clear grouping of statistics that would indicate which position best suits a player. Looking closely for patterns, we could say that players in a battery position tend to lie more on the left side of the plot. Overall, however, there is no clear distinction between the different positions. This leads us to believe that there is little or no relationship between a player’s stats and the position they play.
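One way to check a visual impression such as "battery players lie more on the left" is to compare the average PC coordinates of each group. A minimal sketch, assuming the 2-D PCA scores and position labels are already available; the function name and the numbers are hypothetical.

```python
from collections import defaultdict


def group_centroids(scores, groups):
    """Mean (PC1, PC2) for each group label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (pc1, pc2), group in zip(scores, groups):
        entry = sums[group]
        entry[0] += pc1
        entry[1] += pc2
        entry[2] += 1
    return {g: (s[0] / s[2], s[1] / s[2]) for g, s in sums.items()}


# Made-up scores for illustration only (not taken from the dataset).
scores_2d = [(-1.0, 0.0), (-3.0, 2.0), (1.0, 1.0), (2.0, -1.0)]
positions = ["battery", "battery", "infield", "outfield"]
centroids = group_centroids(scores_2d, positions)
```

Centroids that sit close together relative to the spread of the points would be consistent with the lack of clear separation described above.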

Conclusion

Overall, we gained some interesting insights about baseball and made progress on the research questions we set out to answer at the beginning of this project. Our first question asked whether and how MLB game statistics have changed over time. To help answer it, we looked at players’ batting averages and the total number of runs and hits over time. Based on our visualizations, we concluded that, overall, there is no significant improvement in game statistics throughout the years. However, there are certain time frames in which players appeared to perform better and others in which they appeared to perform worse. These fluctuations could be explained by outside factors such as different coaching strategies and improvement in offensive skills.

The second question we dug into was whether there is a difference between the American League and the National League, and whether we could find evidence that one league performs better than the other. We looked at the number of wins and losses per league by decade, the total number of wins and runs per team, and the number of wins per team by league. Our conclusion to this question was that we did not find strong evidence that the leagues differ from each other. They appear to be fairly balanced in terms of the number of games won in each league as well as the number of runs scored in each league.

The last question of interest was how coaching strategy affects batting statistics and how a player’s statistics relate to their position in the game. To try to answer this question, we used PCA to reduce each player’s stats to 2D to make visualization easier, and we looked at these points by batting order position as well as by overall position in the game. We concluded that although a player’s stats do not appear to be related to the position they play, they do appear to be related to their batting order position.

It was really interesting to see how the data helped provide answers to our questions, especially when the graphs contradicted our initial hypotheses. Although we did answer our main research questions, this work provides a foundation for many more questions and an opportunity to investigate this dataset further.