For our 36-315 final project, our group decided to utilize Sean Lahman’s Baseball Database. Lahman’s database can be installed using the command install.packages("Lahman") and can be accessed by running library("Lahman") in the R console.
As our group is full of avid baseball fans, we wanted to use the Lahman database to answer a handful of research questions; our 3 research questions are listed below.
1) How have the outcomes of a baseball game changed over time, with special regard to the three true outcomes?
2) What background factors are significant or highly correlated with producing successful Major League Baseball players?
3) What factors determine a winning MLB team, and how much does it impact revenue and front office budgetary decisions when constructing an MLB team?
Before we dive into our research questions, we will provide some context on the Lahman Database and the data we are working with. The Lahman Database is compiled by an investigative reporter named Sean Lahman, who works for USA Today. He refers to the database as “an open-source collection of baseball statistics”.
The Lahman Database contains 25+ dataframes, which hold the cumulative pitching, hitting, fielding, team, and award statistics for MLB across the different eras of baseball, beginning in 1871. If you would like more information on the database and its different tables, feel free to run the command help(Lahman).
Given the plethora of datasets and fields to explore in the Lahman database, we chose specific tables to tackle each question. We go into detail on the dataframes we used for each research question in the respective sections below.
As stated above, our first research question was: How have the outcomes of a baseball game changed over time, with special regard to the three true outcomes? The three true outcomes are walks, home runs, and strikeouts, the results of a plate appearance that do not involve the fielders. The three true outcome trend refers to these results making up a growing share of at-bat outcomes over time. We investigated this change through time-series analysis.
To investigate the trends in at-bat outcomes, we pulled our data from the Batting table. The Batting table consists of batting statistics, with each row holding one player's hitting metrics for a given year (and stint with a team). The important variables we utilized are the following:
We used the variables above to aggregate league-wide trends through a time-series analysis to investigate the changes in the outcomes in baseball.
After loading the Lahman package, the Batting table can be accessed directly by name. We display its first five rows below.
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1
## IBB HBP SH SF GIDP
## 1 NA NA NA NA 0
## 2 NA NA NA NA 0
## 3 NA NA NA NA 1
## 4 NA NA NA NA 0
## 5 NA NA NA NA 0
We performed filtering such that our data consists of statistics from the year 2000 and on, as this three true outcome trend has been a recent development over the past couple of decades. We decided to perform preprocessing to get league-wide aggregates for our desired statistics. Furthermore, we decided to group by different leagues across the different years.
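The filtering and aggregation described above can be sketched roughly as follows. This is a minimal base R sketch, not our exact pipeline: the toy rows are invented, the helper name league_rates is hypothetical, and we assume that each rate is computed per at-bat and that extra base hits are doubles plus triples.

```r
# Hypothetical player-season rows using the Batting column names;
# the numbers are invented for illustration only.
toy_batting <- data.frame(
  yearID = c(1999, 2000, 2000, 2000, 2001),
  lgID   = c("AL", "AL", "AL", "NL", "NL"),
  AB     = c(500, 400, 600, 550, 500),
  H      = c(150, 100, 180, 150, 140),
  X2B    = c(30, 20, 40, 30, 25),
  X3B    = c(5, 2, 3, 4, 5),
  HR     = c(20, 10, 30, 22, 25),
  SO     = c(90, 80, 120, 110, 100),
  BB     = c(50, 40, 60, 55, 45)
)

# league_rates is a hypothetical helper name, not part of the Lahman package
league_rates <- function(batting, from_year = 2000) {
  # Keep only seasons from `from_year` onward
  recent <- batting[batting$yearID >= from_year, ]
  # Sum the counting statistics within each year and league
  totals <- aggregate(
    cbind(AB, H, X2B, X3B, HR, SO, BB) ~ yearID + lgID,
    data = recent, FUN = sum
  )
  # Rates per at-bat (assumed denominator)
  totals$hrate    <- totals$HR / totals$AB
  totals$xbrate   <- (totals$X2B + totals$X3B) / totals$AB  # assumed definition
  totals$avg      <- totals$H / totals$AB
  totals$krate    <- totals$SO / totals$AB
  totals$walkrate <- totals$BB / totals$AB
  totals[order(totals$yearID, totals$lgID), ]
}

rates <- league_rates(toy_batting)
print(rates)
```

With the Lahman package loaded, the same function could be applied to the real table as league_rates(Batting).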
We display the first five rows of our newly constructed dataframe with the outcome rates below.
## # A tibble: 5 x 14
## # Groups: yearID [3]
## yearID lgID hr sb ab xb h so bb hrate xbrate avg
## <int> <fct> <int> <dbl> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2000 AL 2688 0.0165 78547 4689 21652 14012 8502 0.0342 0.0597 0.276
## 2 2000 NL 3005 0.0183 88743 5164 23594 17344 9735 0.0339 0.0582 0.266
## 3 2001 AL 2506 0.0211 78134 4640 20852 14496 7239 0.0321 0.0594 0.267
## 4 2001 NL 2952 0.0165 88100 5101 23027 17908 8567 0.0335 0.0579 0.261
## 5 2002 AL 2464 0.0159 77788 4651 20519 14233 7325 0.0317 0.0598 0.264
## # … with 2 more variables: krate <dbl>, walkrate <dbl>
From our newly constructed statistics, we made the following plots below.
Our first plot displays the historical walk and strikeout rates from the last 20 MLB seasons. We have a geom_line using the loess method on our ggplot for both walk and strikeout rates, and we see a stark increase in the strikeout rate over the past 20 years and a slight increase in the walk rate over the past 5 years. This graph shows that the share of three true outcomes is increasing, and it illustrates how baseball outcomes keep changing from era to era.
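To illustrate what the loess smoothing does, here is a minimal sketch using stats::loess from base R on an invented yearly series. The values are hypothetical and are not taken from the Lahman data; the span of 0.75 is simply the function's default.

```r
# Hypothetical yearly strikeout rates (invented values, not Lahman data)
years <- 2000:2019
krate <- 0.165 + 0.004 * (years - 2000) + 0.003 * sin(years)

# Fit a local regression of rate on year; 0.75 is the default span
fit <- loess(krate ~ years, span = 0.75)

# Smoothed value for each season
smoothed <- predict(fit, newdata = data.frame(years = years))

print(data.frame(years, raw = round(krate, 4), smoothed = round(smoothed, 4)))
```

In ggplot2, a comparable smoothed line can be drawn with geom_smooth(method = "loess"), which fits the same kind of local regression.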
Our second plot displays the historical home run and extra base hit rates from the last 20 MLB seasons. We again have a geom_line using the loess method on our ggplot for both home run and extra base hit rates. We see a steep increase in the home run rate and a decrease in the extra base hit rate over the years. This graph further supports our three true outcome hypothesis; we see an inverse relationship between extra base hits and home runs. As batters emphasize hitting hard-hit balls over the wall rather than throughout the field, fewer balls are put in play and more home runs are hit.
For our final plot regarding our first research question, we plotted a time-series graph of the historical trend in MLB-wide batting average, split and colored by league. This scatterplot with loess trend lines shows that the batting average, which is the rate at which batters get a hit per at-bat, has progressively decreased over the past 20 years. We also see a difference across the two leagues: because the AL (American League) has a designated hitter bat in place of the pitcher, the AL trend line sits higher. This graph again suggests that the ball is being put in play less often in MLB, as the game moves towards the three true outcomes.
Our second research question was: What background factors are significant or highly correlated with producing successful Major League Baseball players? As we are given biographical data consisting of players’ birthdays, hometowns, and home countries, we wanted to see if there were any consistent trends in producing MLB players.
For this research question, we pulled from the People table. The People table consists of biographical data for every player to play in the MLB. We used the following variables from the People table to conduct our analysis.
Here is a look into the first five rows of the People dataframe.
## playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 aardsda01 1981 12 27 USA CO Denver
## 2 aaronha01 1934 2 5 USA AL Mobile
## 3 aaronto01 1939 8 5 USA AL Mobile
## 4 aasedo01 1954 9 8 USA CA Orange
## 5 abadan01 1972 8 25 USA FL Palm Beach
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 NA NA NA <NA> <NA> <NA> David
## 2 2021 1 22 USA GA Atlanta Hank
## 3 1984 8 16 USA GA Atlanta Tommie
## 4 NA NA NA <NA> <NA> <NA> Don
## 5 NA NA NA <NA> <NA> <NA> Andy
## nameLast nameGiven weight height bats throws debut finalGame
## 1 Aardsma David Allan 215 75 R R 2004-04-06 2015-08-23
## 2 Aaron Henry Louis 180 72 R R 1954-04-13 1976-10-03
## 3 Aaron Tommie Lee 190 75 R R 1962-04-10 1971-09-26
## 4 Aase Donald William 190 75 R R 1977-07-26 1990-10-03
## 5 Abad Fausto Andres 184 73 L L 2001-09-10 2006-04-13
## retroID bbrefID deathDate birthDate
## 1 aardd001 aardsda01 <NA> 1981-12-27
## 2 aaroh101 aaronha01 2021-01-22 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001 aasedo01 <NA> 1954-09-08
## 5 abada001 abadan01 <NA> 1972-08-25
For this research question, we used heat maps of the United States. We use map_data("state") to get the longitude and latitude values needed to draw our heat map across the different states. We display the first 5 rows of this dataframe below. We also load the usdata library in order to convert the state abbreviations to match the region state names in the dataframe below.
## long lat group order region subregion
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
## 4 -87.53076 30.33239 1 4 alabama <NA>
## 5 -87.57087 30.32665 1 5 alabama <NA>
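The abbreviation-to-name conversion can be sketched without any extra package. The snippet below swaps in base R's built-in state.abb and state.name vectors in place of the usdata helper we used, and the function name abbr_to_region is hypothetical.

```r
# Map two-letter postal codes to the lowercase names used in the "region" column.
# Uses base R's built-in state.abb / state.name (50 states only).
abbr_to_region <- function(abbr) {
  tolower(state.name[match(abbr, state.abb)])
}

abbr_to_region(c("CA", "AL", "NY"))
# Codes outside the 50 states (for example "DC") come back as NA
abbr_to_region("DC")
```

Because the built-in vectors cover only the 50 states, any other code would need to be handled separately before joining.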
For our first plot, we made a bar plot displaying the frequencies of the top 8 countries that produce the most MLB players. From this graph, we get an understanding of where baseball players come from globally; most MLB players are from the USA, followed by countries in Latin America.
As we saw that most players are from the United States, we wanted to understand which regions of the United States produce the most players. Thus, we created a heat map of the United States, with the color of each state representing the number of players produced by that state. We see that California produces the most players, with somewhat elevated counts in the Northeast, Midwest, and Southeast as well.
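The data behind such a heat map can be assembled roughly as follows. This is a base R sketch under stated assumptions: the player rows reuse the birth states from the People rows shown earlier, the polygon rows are a rounded excerpt of the map table, and the column name n_players is our own choice.

```r
# Birth states of the five players shown in the People table above
people <- data.frame(
  playerID     = c("aardsda01", "aaronha01", "aaronto01", "aasedo01", "abadan01"),
  birthCountry = "USA",
  birthState   = c("CO", "AL", "AL", "CA", "FL")
)

# A few polygon rows in the shape returned by map_data("state") (values rounded)
state_map <- data.frame(
  long   = c(-87.46, -87.48, -87.53),
  lat    = c(30.39, 30.37, 30.37),
  group  = 1,
  order  = 1:3,
  region = "alabama"
)

# Count US-born players per state
us <- people[people$birthCountry == "USA", ]
counts <- as.data.frame(table(birthState = us$birthState),
                        responseName = "n_players", stringsAsFactors = FALSE)

# Convert abbreviations to the lowercase region names used by the map table
counts$region <- tolower(state.name[match(counts$birthState, state.abb)])

# Attach the counts to every polygon vertex, then restore the drawing order,
# since merge() does not preserve the original row order
map_counts <- merge(state_map, counts, by = "region", all.x = TRUE)
map_counts <- map_counts[order(map_counts$group, map_counts$order), ]
print(map_counts)
```

Re-sorting by group and order matters because polygons are drawn by connecting vertices in sequence; a shuffled table produces scrambled state outlines.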
For our third plot, we created a heat map based on the number of Hall of Famers by state. With this graph, we inspect whether the states that produce the most truly successful players differ from the states that produce the most MLB players overall. In order to do so, we accessed the HallOfFame table in the Lahman Database. By filtering the HallOfFame table for inducted == "Y", we get a dataframe of all the Hall of Fame members.
## playerID yearID votedBy ballots needed votes inducted category needed_note
## 1 cobbty01 1936 BBWAA 226 170 222 Y Player <NA>
## 2 ruthba01 1936 BBWAA 226 170 215 Y Player <NA>
## 3 wagneho01 1936 BBWAA 226 170 215 Y Player <NA>
## 4 mathech01 1936 BBWAA 226 170 205 Y Player <NA>
## 5 johnswa01 1936 BBWAA 226 170 189 Y Player <NA>
In order to create our heat map, we merge the inducted Hall of Fame table with the player biographical table by playerID. Then, we follow the same procedure as in Plot 2 to find the number of players by state by joining with our longitudinal and latitudinal table.
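A minimal base R sketch of the filter-and-merge step is shown below. The three inducted rows mirror the HallOfFame rows displayed above with their well-known birth states; the fourth row is a hypothetical non-inducted player added only to show the filter, and the column name n_hof is our own.

```r
# Three rows from the HallOfFame output above, plus one hypothetical
# non-inducted row ("hypoth01") to demonstrate the filter
hall <- data.frame(
  playerID = c("cobbty01", "ruthba01", "wagneho01", "hypoth01"),
  yearID   = c(1936, 1936, 1936, 1936),
  inducted = c("Y", "Y", "Y", "N")
)

# Matching biographical rows (birth state only)
people <- data.frame(
  playerID   = c("cobbty01", "ruthba01", "wagneho01", "hypoth01"),
  birthState = c("GA", "MD", "PA", "CA")
)

# Keep inducted members only, then attach their biographical data
inducted   <- hall[hall$inducted == "Y", ]
hof_people <- merge(inducted, people, by = "playerID")

# Number of Hall of Famers per birth state
hof_by_state <- as.data.frame(table(birthState = hof_people$birthState),
                              responseName = "n_hof", stringsAsFactors = FALSE)
print(hof_by_state)
```

From here, the per-state counts can be joined to the map table in the same way as for Plot 2.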
In the plot, we see that the heat map loosely follows the heat map from Plot 2; thus, we can observe that the same top states (California, New York, and others in the Midwest and Northeast) produce the MLB’s top talent.
The third research question we posed was: What factors determine a winning MLB team, and how much does it impact revenue and front office budgetary decisions when constructing an MLB team?
In order to investigate this question, we utilized the Salaries and Teams tables given to us in the Lahman Database.
From the Salaries table, we find the player salary data for a given year and team. We display the first five rows of the table below.
## yearID teamID lgID playerID salary
## 1 1985 ATL NL barkele01 870000
## 2 1985 ATL NL bedrost01 550000
## 3 1985 ATL NL benedbr01 545000
## 4 1985 ATL NL campri01 633333
## 5 1985 ATL NL ceronri01 625000
The Teams table gives the performance statistics and standings of each team for a given year. The table has 48 variables; however, we will mainly utilize the following variables:
We want to use these variables to calculate the winning percentage and run differential of a team for a given year. Following the Pythagorean Theorem of Baseball (https://www.baseball-reference.com/bullpen/Pythagorean_Theorem_of_Baseball#:~:text=The%20Pythagorean%20Theorem%20of%20Baseball,a%20team’s%20actual%20winning%20percentage.), which estimates a team’s winning percentage from its runs scored and runs allowed, we want to use run differential as a predictor of winning percentage.
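As a small worked example of these quantities, the sketch below computes winning percentage, run differential, and the classic Pythagorean estimate R^2 / (R^2 + RA^2). The two teams are hypothetical, the column names W, L, R, and RA follow the Teams table, and we assume winning percentage is wins divided by decided games.

```r
# Two hypothetical team seasons (invented numbers)
teams <- data.frame(
  teamID = c("AAA", "BBB"),
  W  = c(95, 70),
  L  = c(67, 92),
  R  = c(800, 650),   # runs scored
  RA = c(700, 760)    # runs allowed
)

# Observed winning percentage and run differential
teams$win_pct  <- teams$W / (teams$W + teams$L)
teams$run_diff <- teams$R - teams$RA

# Pythagorean expectation with the classic exponent of 2
pythag_win_pct <- function(runs_scored, runs_allowed, exponent = 2) {
  runs_scored^exponent / (runs_scored^exponent + runs_allowed^exponent)
}
teams$pythag <- pythag_win_pct(teams$R, teams$RA)

print(teams)
```

For the first team, 800 runs scored against 700 allowed gives an expected winning percentage of 640000 / 1130000, or about 0.566.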
With these statistics, we create our first plot.
We create a scatterplot with a trend line, with x = Run Differential and y = Winning Percentage, for all the teams in our dataset from the year 1901 onward. We see a clear positive relationship between run differential and winning percentage, which suggests that the higher a team's run differential, the higher its winning percentage is likely to be. We can see that run differential is a strong linear indicator of winning percentage, regardless of league.
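The linear relationship can be checked numerically with a simple regression. The sketch below uses six invented team seasons, so the fitted numbers are illustrative only and are not results from our data.

```r
# Hypothetical team seasons (invented values)
season <- data.frame(
  run_diff = c(-150, -80, -20, 30, 90, 160),
  win_pct  = c(0.41, 0.45, 0.49, 0.52, 0.56, 0.60)
)

# Ordinary least squares fit of winning percentage on run differential
fit <- lm(win_pct ~ run_diff, data = season)

# Slope: change in winning percentage per additional run of differential
print(coef(fit))

# Strength of the linear association
print(cor(season$run_diff, season$win_pct))
```

A positive slope together with a correlation near 1 is what a tight, upward-sloping scatterplot would look like in numbers.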
Now, we want to analyze the relationship between team salary and a team’s winning percentage, from the point of an MLB front office. We ask ourselves the question: “How much money does it take to build a competent team?”.
In order to get team salary data, we conduct the following processing. We get the total salary for each team in our database, and then join it with our Teams table to get the performance statistics and standings for each team. We filter for teams beginning from the year 1985 to restrict to the modern era of ownership.
Furthermore, we want to account roughly for salary inflation over time. Thus, we make a decade variable that we facet on in our plot, so that salaries are compared only within the same decade. We also filter to exclude the year 2020, as the database lacks the full data for the season that was changed by the COVID-19 pandemic.
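The processing described above can be sketched in base R as follows. The 1985 salary rows are the ones shown in the Salaries output earlier; the 2020 salary rows and all win and loss values are hypothetical, and the column names payroll, decade, and win_pct are our own.

```r
# 1985 rows from the Salaries output above, plus two hypothetical 2020 rows
salaries <- data.frame(
  yearID   = c(1985, 1985, 1985, 1985, 1985, 2020, 2020),
  teamID   = "ATL",
  playerID = c("barkele01", "bedrost01", "benedbr01", "campri01",
               "ceronri01", "hypoth01", "hypoth02"),
  salary   = c(870000, 550000, 545000, 633333, 625000, 1000000, 2000000)
)

# Hypothetical team records (invented wins and losses)
teams <- data.frame(
  yearID = c(1985, 2020),
  teamID = "ATL",
  W = c(70, 30),
  L = c(92, 30)
)

# Total payroll per team and year
payroll <- aggregate(salary ~ yearID + teamID, data = salaries, FUN = sum)
names(payroll)[names(payroll) == "salary"] <- "payroll"

# Join payroll to team performance, keep 1985 onward, drop 2020
team_pay <- merge(teams, payroll, by = c("yearID", "teamID"))
team_pay <- team_pay[team_pay$yearID >= 1985 & team_pay$yearID != 2020, ]

# Decade label used for faceting, and winning percentage
team_pay$decade  <- team_pay$yearID %/% 10 * 10
team_pay$win_pct <- team_pay$W / (team_pay$W + team_pay$L)

print(team_pay)
```

Faceting on decade does not convert salaries to constant dollars; it only limits comparisons to teams that paid salaries in roughly the same period.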
We plot our Winning Percentage vs Salary plot below, faceted by decade. We see that with each successive decade, there is greater parity in salary and a more positive relationship between winning percentage and a team’s salary. However, from a front office point of view, we see that although more money typically leads to more success, there are many teams in the 2000s and 2010s that had a high winning percentage with a low salary. Thus, keeping a balance is ideal and possible in the world of managing baseball teams!
Through our three research questions, we were able to tell a story with data and statistical visualizations. Using the Lahman Database, we carried out a time-series analysis of the changing outcomes of a baseball game, mapped the geographical hot spots that produce MLB players, and examined the importance of run differential and team salary to a team’s success. We truly enjoyed working with this database and we hope that you enjoyed our analysis. Thank you and feel free to reach out if you have any questions!