For this project, we examined Lahman’s History of Baseball dataset, which contains historical data from the past 100+ years of Major League Baseball. The Lahman package as a whole contains information on batting, fielding, pitching, etc., but our group chose to focus specifically on batting data from the past one hundred years (1919-2019). The “Batting” table includes a mix of categorical variables (player ID, team ID, league ID) and quantitative variables (hits, home runs, at-bats, etc.), for a total of 22 variables.
Does offensive performance differ across leagues? Specifically, we look at batting average, on-base percentage, and slugging percentage for each league.
Are there any clusters among teams or leagues in terms of the offensive statistics we’re interested in? Which teams are similar to each other?
How does offensive performance change across decades?
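The three rate statistics named in the first question are not stored directly as columns; they are usually derived from counting statistics. As a minimal sketch, assuming the standard definitions and that missing values (for example, sacrifice flies in early seasons) have already been set to zero, they could be computed as follows. The function name and the sample numbers are hypothetical and are not taken from our analysis.

```python
def rate_stats(ab, h, doubles, triples, hr, bb, hbp, sf):
    """Return (batting average, on-base percentage, slugging percentage).

    Standard definitions:
      AVG = H / AB
      OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
      SLG = total bases / AB
    """
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    nan = float("nan")
    avg = h / ab if ab else nan
    obp_den = ab + bb + hbp + sf
    obp = (h + bb + hbp) / obp_den if obp_den else nan
    slg = total_bases / ab if ab else nan
    return avg, obp, slg


# Hypothetical season line, for illustration only (not from the dataset).
avg, obp, slg = rate_stats(ab=500, h=150, doubles=30, triples=5,
                           hr=20, bb=50, hbp=5, sf=5)
```

Players with zero at-bats yield NaN here rather than an error, so such rows can be filtered out before plotting.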
Our exploratory data analysis (EDA) shows the distributions of total hits and total runs, faceted by league. In both plots, hits and runs generally increase toward the present day, and controlling for league does not appear to make a large difference in that trend. Furthermore, the distributions of hits and runs are virtually the same in each league.
These graphs visualize the probability distributions of batting average, on-base percentage, and slugging percentage for each league over the past 100 years of the MLB. The National League (NL) is denoted by the blue curve and the American League (AL) by the red curve. With regard to batting average, the shapes of the two distributions appear to differ: the AL’s distribution is more evenly spread across batting average values, while the NL’s drops off sharply around a 0.270 batting average. For on-base percentage, the curves have similar shapes, but the NL distribution is shifted to the left relative to the AL distribution. Lastly, for slugging percentage, the AL has a wider spread, while the NL is concentrated around the lower slugging percentage values (with a peak at approximately 0.270 and a sharp dropoff afterward).
To determine whether these distributions are significantly different, we performed a two-sample Kolmogorov-Smirnov (KS) test for each variable. For each of batting average, on-base percentage, and slugging percentage, the p-value is less than .05, so we reject the null hypothesis that the two leagues’ values follow the same distribution. Therefore, there is statistically significant evidence that the distribution of offensive performance, as measured by batting average, on-base percentage, and slugging percentage, differs across leagues in the MLB.
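To make the test concrete, the sketch below computes the two-sample KS statistic (the largest vertical gap between the two empirical CDFs) and an approximate p-value from the asymptotic Kolmogorov distribution. It is an illustration under stated assumptions, not the code used for our results: the function name is hypothetical, the p-value uses the plain asymptotic formula without a small-sample correction, and the sample values are made up.

```python
import math
from bisect import bisect_right


def ks_two_sample(xs, ys):
    """Return (D, approximate p-value) for the two-sample KS test."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    # D is the largest gap between the two empirical CDFs; the maximum
    # can only occur at one of the observed values.
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
            for v in xs + ys)
    # Asymptotic tail probability:
    #   Q(lam) = 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 * j^2 * lam^2)
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series converges slowly here, and Q is essentially 1.
        return d, 1.0
    tail = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                     for j in range(1, 101))
    return d, min(1.0, max(0.0, tail))


# Hypothetical batting averages, for illustration only.
al = [0.250, 0.262, 0.271, 0.283, 0.290, 0.301]
nl = [0.231, 0.240, 0.249, 0.255, 0.263, 0.270]
d, p = ks_two_sample(al, nl)
```

In practice a library routine such as scipy.stats.ks_2samp in Python or ks.test in R is preferable, since those also handle exact p-values for small samples.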
We were also interested in the connection between offensive performance and divisions, specifically looking for any trends in how they are grouped together.
This dendrogram is built from seven offensive statistics (AB, R, H, HR, RBI, BB, SO) of MLB players over the last 15 years (2005-2019). The colors in the graph separate teams into their six divisions, with a seventh group consisting of players from old teams that no longer exist. The graph shows a clear distinction between the two leagues, with the three AL divisions clustered on the left and the three NL divisions (and the old-team group) on the right. This suggests that divisions within the same league are more similar to each other than to divisions in the other league. It is also interesting to note that the merges in the middle of the dendrogram occur at greater heights than those near the ends. Because merge height reflects dissimilarity, the clusters joined in the middle are less similar to one another than the clusters joined near the ends.
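For readers unfamiliar with how a dendrogram is built, the following sketch shows agglomerative clustering on standardized features. It records each merge and the height at which it happens, which is what the vertical axis of a dendrogram displays. The choice of complete linkage and Euclidean distance is an assumption made for illustration; the labels and numbers are hypothetical, and the function names are our own.

```python
import math


def standardize(rows):
    """Scale each column to mean 0 and (sample) standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - mu) ** 2 for x in c) / (len(c) - 1)) or 1.0
           for c, mu in zip(cols, means)]
    return [[(x - mu) / sd for x, mu, sd in zip(r, means, sds)] for r in rows]


def complete_linkage(points, labels):
    """Return the merges as (members_a, members_b, height), in merge order."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Complete linkage: distance between two clusters is the
                # distance between their farthest pair of members.
                h = max(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append(([labels[i] for i in clusters[a]],
                       [labels[i] for i in clusters[b]], h))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


# Hypothetical group averages on two made-up features, for illustration only.
labels = ["A", "B", "C", "D"]
rows = [[500.0, 60.0], [505.0, 62.0], [450.0, 40.0], [455.0, 43.0]]
merges = complete_linkage(standardize(rows), labels)
```

Standardizing first keeps a large-scale statistic such as at-bats from dominating the distances over a small-scale one such as home runs.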
This MDS plot similarly uses the offensive statistics we have been focusing on, with points colored by division. There do not appear to be any clear trends across divisions along MDS1. Along MDS2, however, the AL Central and NL Central have much more negative values than the other divisions. This pattern could be related to geographic location, which would be interesting to investigate further, since most of the teams in the Central divisions are in the Midwest, where weather, temperature, and similar conditions vary more over the course of the season.
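As a rough illustration of where MDS coordinates come from, the sketch below implements classical (Torgerson) MDS: it double-centers the squared distance matrix and takes the leading eigenvectors, found here by simple power iteration so that no external library is needed. Classical MDS is an assumption on our part for this example, as other MDS variants exist; the function name and the input points are hypothetical.

```python
import math


def classical_mds(dist, dims=2, iters=1000):
    """Embed a symmetric distance matrix into `dims` coordinates."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-center the squared distances to obtain the Gram matrix B.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [math.sin(i + k + 1.0) for i in range(n)]  # arbitrary fixed start
        for _ in range(iters):  # power iteration for the top eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(vi * bvi for vi, bvi in zip(v, bv))  # Rayleigh quotient
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]  # deflate the found component
    return coords


# Hypothetical points, for illustration only.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist)
```

The axes of an MDS embedding are only defined up to rotation and reflection, so the sign of MDS2 by itself carries no meaning; what matters is which points lie close together.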
We were interested in exploring how offensive performance has changed across years as well as between leagues, focusing on total number of home runs, on-base percentage, and batting average as our three metrics for performance.
This time series plot displays the total number of home runs per season from 1919 through 2020, separated by league. Both leagues follow the same general upward trend in home runs as seasons progress. Neither league clearly leads in total home runs overall; however, there is a distinct stretch of about 25 years, between 1975 and 2000, in which the American League out-homered the National League by a wider margin than usual.
In contrast, when we compare on-base percentage and batting average between leagues, we can see that the American League clearly overtook the National League in both metrics starting around the early 1970s. This gap in performance was largest between 1980 and the early 1990s, after which it began to diminish; however, for both metrics, the American League continues to outperform the National League. Another interesting observation is that the trends for on-base percentage and batting average are almost identical within each league. We found that this pattern also extended to slugging percentage (graph not included in this report). This suggests that the increase in total home runs per season is not necessarily associated with a similar increase in offensive performance on other metrics.
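The league-level series discussed above come from aggregating player rows by season and league. The sketch below shows one way to do this, assuming rows keyed by the Batting column names yearID, lgID, AB, H, and HR. It pools hits and at-bats across players before dividing, rather than averaging individual batting averages; that choice is an assumption for illustration, and the rows shown are hypothetical.

```python
from collections import defaultdict


def league_season_summary(rows):
    """Aggregate player rows into per-(season, league) HR totals and AVG."""
    totals = defaultdict(lambda: {"HR": 0, "H": 0, "AB": 0})
    for r in rows:
        t = totals[(r["yearID"], r["lgID"])]
        for col in ("HR", "H", "AB"):
            t[col] += r[col]
    # Pooled batting average: total hits divided by total at-bats.
    return {
        key: {"HR": t["HR"],
              "AVG": t["H"] / t["AB"] if t["AB"] else float("nan")}
        for key, t in sorted(totals.items())
    }


# Hypothetical player-season rows, for illustration only.
rows = [
    {"yearID": 1980, "lgID": "AL", "AB": 400, "H": 120, "HR": 20},
    {"yearID": 1980, "lgID": "AL", "AB": 600, "H": 150, "HR": 10},
    {"yearID": 1980, "lgID": "NL", "AB": 500, "H": 125, "HR": 15},
]
summary = league_season_summary(rows)
```

Pooling weights each player by playing time, so a player with very few at-bats cannot distort the league figure the way a simple mean of individual averages could.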