Introduction

For our 36-315 final project, our group decided to utilize Sean Lahman’s Baseball Database. Lahman’s database can be installed using the command install.packages("Lahman") and can be accessed by running library("Lahman") in the R console.

As our group is full of avid baseball fans, we wanted to use the Lahman database to answer a handful of research questions; our 3 research questions are listed below.

1) How have the outcomes of a baseball game changed over time, with special regard to the three true outcomes?

2) What background factors are significant or highly correlated with producing successful Major League Baseball players?

3) What factors determine a winning MLB team, and how much does it impact revenue and front office budgetary decisions when constructing an MLB team?

Lahman Database

Before we dive into our research questions, we will provide some context on the Lahman Database and the data we are working with. The Lahman Database is compiled by an investigative reporter named Sean Lahman, who works for USA Today. He refers to the database as “an open-source collection of baseball statistics”.

The Lahman Database contains 25+ dataframes, which hold the cumulative pitching, hitting, fielding, team, and award statistics for MLB across the different eras of baseball, from 1871 onward. If you would like more information on the database and its different tables, feel free to run the command help(Lahman).

Given the plethora of datasets and fields to explore in the Lahman database, we chose specific tables to tackle each question. We go into detail on the dataframes we used for each research question in its respective section below.

First Research Question

As stated above, our first research question was: “How have the outcomes of a baseball game changed over time, with special regard to the three true outcomes?” The three true outcomes are walks, home runs, and strikeouts, and the three true outcome trend refers to these becoming increasingly likely results of an at-bat over time. We investigated this change through time-series analysis.

For this research question, investigating the trends of at-bat outcomes, we pulled our data from the Batting table. The Batting table consists of batting statistics, with each row holding a single player’s hitting metrics for a given year. The important variables we utilized are the following:

We used the variables above to aggregate league-wide trends through a time-series analysis to investigate the changes in the outcomes in baseball.

You can access the Batting table once the Lahman package has been loaded. We display the first five rows of Batting below.
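The loading step described here might look like the following sketch. It assumes the Lahman package is already installed; the requireNamespace guard and the name first_rows are illustrative additions, not part of the original analysis.

```r
# Sketch: load Lahman and show the first five rows of Batting.
# Assumes the package was installed with install.packages("Lahman").
first_rows <- NULL
if (requireNamespace("Lahman", quietly = TRUE)) {
  library(Lahman)                  # makes Batting, People, Teams, ... available
  first_rows <- head(Batting, 5)   # first five player-season rows
  print(first_rows)
}
```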

##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0  2
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2  1
##   IBB HBP SH SF GIDP
## 1  NA  NA NA NA    0
## 2  NA  NA NA NA    0
## 3  NA  NA NA NA    1
## 4  NA  NA NA NA    0
## 5  NA  NA NA NA    0

We filtered the data to seasons from 2000 onward, as the three true outcome trend is a recent development of the past couple of decades. We then preprocessed the data to get league-wide aggregates for our desired statistics, grouped by league and by year.
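The filtering and aggregation described above can be sketched in base R as follows. This is not the report's own code: the helper name outcome_rates is hypothetical, and the definitions xb = X2B + X3B, krate = SO / AB, and walkrate = BB / AB are assumptions (only hrate, xbrate, and avg can be checked against the per-at-bat values shown in the table below). The toy rows are made up purely to exercise the function.

```r
# Sketch: league-wide totals and per-at-bat rates by year and league.
# Column names follow Lahman's Batting table. ASSUMPTIONS: xb is doubles
# plus triples, and krate / walkrate are computed per at-bat.
outcome_rates <- function(batting, from_year = 2000) {
  recent <- batting[batting$yearID >= from_year, ]
  recent$xb <- recent$X2B + recent$X3B
  totals <- aggregate(
    recent[, c("AB", "H", "HR", "xb", "SO", "BB")],
    by = list(yearID = recent$yearID, lgID = recent$lgID),
    FUN = sum
  )
  totals$hrate    <- totals$HR / totals$AB
  totals$xbrate   <- totals$xb / totals$AB
  totals$avg      <- totals$H  / totals$AB
  totals$krate    <- totals$SO / totals$AB   # assumed definition
  totals$walkrate <- totals$BB / totals$AB   # assumed definition
  totals[order(totals$yearID, totals$lgID), ]
}

# Hypothetical toy rows (not real players) to show the shape of the result.
toy <- data.frame(
  yearID = c(1999, 2000, 2000, 2000),
  lgID   = c("AL", "AL", "AL", "NL"),
  AB  = c(500, 400, 600, 500), H   = c(150, 100, 180, 125),
  X2B = c(30, 20, 30, 25),     X3B = c(5, 0, 10, 5),
  HR  = c(20, 10, 30, 25),     SO  = c(100, 80, 120, 150),
  BB  = c(50, 40, 60, 50)
)
rates <- outcome_rates(toy)
print(rates)
```

With the Lahman package loaded, the same helper could be applied as outcome_rates(Batting).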

We display the first five rows of our newly constructed dataframe with the outcome rates below.

## # A tibble: 5 x 14
## # Groups:   yearID [3]
##   yearID lgID     hr     sb    ab    xb     h    so    bb  hrate xbrate   avg
##    <int> <fct> <int>  <dbl> <int> <int> <int> <int> <int>  <dbl>  <dbl> <dbl>
## 1   2000 AL     2688 0.0165 78547  4689 21652 14012  8502 0.0342 0.0597 0.276
## 2   2000 NL     3005 0.0183 88743  5164 23594 17344  9735 0.0339 0.0582 0.266
## 3   2001 AL     2506 0.0211 78134  4640 20852 14496  7239 0.0321 0.0594 0.267
## 4   2001 NL     2952 0.0165 88100  5101 23027 17908  8567 0.0335 0.0579 0.261
## 5   2002 AL     2464 0.0159 77788  4651 20519 14233  7325 0.0317 0.0598 0.264
## # … with 2 more variables: krate <dbl>, walkrate <dbl>

From our newly constructed statistics, we made the following plots.

Plot 1

Our first plot displays the historical walk and strikeout rates from the last 20 MLB seasons. We have a geom_line using the loess method on our ggplot for both walk and strikeout rates, and we see a stark increase in the strikeout rate over the past 20 years and a slight increase in the walk rate over the past 5 years. This graph shows that the three true outcomes are becoming more common, and reflects the ever-changing nature of baseball outcomes across different eras.

Plot 2

Our second plot displays the historical home run and extra base hit rates from the last 20 MLB seasons. We again have a geom_line using the loess method on our ggplot for both home run and extra base hit rates. We see a steep increase in the home run rate and a decrease in the extra base hit rate as the years progress. This graph adds further support to our three true outcome hypothesis; we see an inverse relationship between extra base hits and home runs. As batters emphasize hitting the ball hard over the wall rather than throughout the field, fewer balls are put in play and more home runs are hit.

Plot 3

For our final plot regarding our first research question, we plotted a time-series graph of the historical trend in MLB-wide batting average, split and colored by league. This scatterplot with loess trend lines shows that the average, which is the rate at which batters attain a hit from an at-bat, has progressively decreased over the past 20 years. We also see a difference between the two leagues; as the AL (American League) has a designated hitter bat for the pitcher, this inflates the average for the AL trend line. This graph shows again that the ball is being put in play less in MLB, and that the game is moving towards the three true outcomes.

Second Research Question

Our second research question was: “What background factors are significant or highly correlated with producing successful Major League Baseball players?” As we are given biographical data consisting of players’ birthdays, hometowns, and home countries, we wanted to see if there were any consistent trends in producing MLB players.

For this research question, we pulled from the People table. The People table consists of biographical data for every player to play in MLB. We used the following variables from the People table to conduct our analysis.

Here is a look into the first five rows of the People dataframe.

##    playerID birthYear birthMonth birthDay birthCountry birthState  birthCity
## 1 aardsda01      1981         12       27          USA         CO     Denver
## 2 aaronha01      1934          2        5          USA         AL     Mobile
## 3 aaronto01      1939          8        5          USA         AL     Mobile
## 4  aasedo01      1954          9        8          USA         CA     Orange
## 5  abadan01      1972          8       25          USA         FL Palm Beach
##   deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1        NA         NA       NA         <NA>       <NA>      <NA>     David
## 2      2021          1       22          USA         GA   Atlanta      Hank
## 3      1984          8       16          USA         GA   Atlanta    Tommie
## 4        NA         NA       NA         <NA>       <NA>      <NA>       Don
## 5        NA         NA       NA         <NA>       <NA>      <NA>      Andy
##   nameLast      nameGiven weight height bats throws      debut  finalGame
## 1  Aardsma    David Allan    215     75    R      R 2004-04-06 2015-08-23
## 2    Aaron    Henry Louis    180     72    R      R 1954-04-13 1976-10-03
## 3    Aaron     Tommie Lee    190     75    R      R 1962-04-10 1971-09-26
## 4     Aase Donald William    190     75    R      R 1977-07-26 1990-10-03
## 5     Abad  Fausto Andres    184     73    L      L 2001-09-10 2006-04-13
##    retroID   bbrefID  deathDate  birthDate
## 1 aardd001 aardsda01       <NA> 1981-12-27
## 2 aaroh101 aaronha01 2021-01-22 1934-02-05
## 3 aarot101 aaronto01 1984-08-16 1939-08-05
## 4 aased001  aasedo01       <NA> 1954-09-08
## 5 abada001  abadan01       <NA> 1972-08-25

For this research question, we employed heat maps of the United States. We use map_data("state") to get the longitude and latitude values needed to draw our heat map across the different states. We display the first 5 rows of this dataframe below. We also load the usdata library in order to convert the state abbreviations to match the region state names in the dataframe below.
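The abbreviation-to-region matching can be sketched as below. As an assumption for illustration, this version uses base R's built-in state.abb and state.name vectors as a stand-in for the usdata conversion, and the helper name players_by_region is hypothetical. The toy input reuses the playerID, birthCountry, and birthState values from the five People rows shown earlier.

```r
# Sketch: count US-born players per state and convert abbreviations to the
# lowercase names used in the region column of map_data("state").
# ASSUMPTION: base R's state.abb / state.name stand in for the usdata helper.
players_by_region <- function(people) {
  us <- people[!is.na(people$birthCountry) & people$birthCountry == "USA", ]
  region <- tolower(state.name[match(us$birthState, state.abb)])
  # table() drops abbreviations that did not match one of the 50 states
  as.data.frame(table(region = region),
                responseName = "players", stringsAsFactors = FALSE)
}

# The five People rows displayed above, reduced to the columns needed here.
toy_people <- data.frame(
  playerID     = c("aardsda01", "aaronha01", "aaronto01", "aasedo01", "abadan01"),
  birthCountry = "USA",
  birthState   = c("CO", "AL", "AL", "CA", "FL")
)
counts <- players_by_region(toy_people)
print(counts)
```

The resulting counts could then be joined to the map coordinates by the shared region column, for example with merge(map_data("state"), counts, by = "region"), before plotting.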

##        long      lat group order  region subregion
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
## 4 -87.53076 30.33239     1     4 alabama      <NA>
## 5 -87.57087 30.32665     1     5 alabama      <NA>

Plot 1

For our first plot, we made a bar plot displaying the frequencies of the 8 countries that produce the most MLB players. From this graph, we get an understanding of where baseball players come from globally; most MLB players are from the USA, followed by countries in Latin America.

Plot 2

As we saw that most players are from the United States, we want to get an understanding of which regions of the United States produce the most players. Thus, we create a heat map of the United States, with the color of each state representing the number of players produced by that state. We see that the most players come from California, with smaller concentrations in the Northeast, Midwest, and Southeast.

Plot 3

For our third plot, we created a heat map based on the number of Hall of Famers by state. With this graph, we inspect whether the states that produce the most truly successful players differ from the states that produce the most MLB players overall. In order to do so, we accessed the HallOfFame table in the Lahman Database. By filtering the HallOfFame table for inducted == "Y", we get a dataframe of all the Hall of Fame members.

##    playerID yearID votedBy ballots needed votes inducted category needed_note
## 1  cobbty01   1936   BBWAA     226    170   222        Y   Player        <NA>
## 2  ruthba01   1936   BBWAA     226    170   215        Y   Player        <NA>
## 3 wagneho01   1936   BBWAA     226    170   215        Y   Player        <NA>
## 4 mathech01   1936   BBWAA     226    170   205        Y   Player        <NA>
## 5 johnswa01   1936   BBWAA     226    170   189        Y   Player        <NA>

In order to create our heat map, we merge the inducted Hall of Fame table with the player biographical table by playerID. Then, we follow the same procedure as in Plot 2 to find the number of players by state, joining with our table of longitude and latitude values.

In the plot, we see that the heat map loosely follows the heat map from Plot 2; thus, we can observe that the same top states (California, New York, and others in the Midwest and Northeast) produce MLB’s top talent.
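A minimal sketch of the filter-and-merge step follows. The helper name hof_by_state is hypothetical. The toy HallOfFame rows reuse the five playerIDs displayed above plus one clearly hypothetical non-inducted row, and the birth states are filled in by hand for illustration rather than read from the People table.

```r
# Sketch: keep inducted members, attach birth state by playerID, count by state.
hof_by_state <- function(hall, people) {
  inducted <- hall[hall$inducted == "Y", ]
  bios <- merge(inducted, people, by = "playerID")
  as.data.frame(table(birthState = bios$birthState),
                responseName = "hall_of_famers", stringsAsFactors = FALSE)
}

# Five inducted rows from the table above, plus one HYPOTHETICAL row
# ("examppl01") that was not inducted and should be filtered out.
toy_hall <- data.frame(
  playerID = c("cobbty01", "ruthba01", "wagneho01", "mathech01",
               "johnswa01", "examppl01"),
  inducted = c("Y", "Y", "Y", "Y", "Y", "N")
)
# Birth states entered by hand for illustration (assumption, not from People).
toy_bios <- data.frame(
  playerID   = c("cobbty01", "ruthba01", "wagneho01", "mathech01",
                 "johnswa01", "examppl01"),
  birthState = c("GA", "MD", "PA", "PA", "KS", "CA")
)
hof_counts <- hof_by_state(toy_hall, toy_bios)
print(hof_counts)
```

From here the counts can go through the same abbreviation-to-region conversion and map join used for Plot 2.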

Third Research Question

Our third research question was: “What factors determine a winning MLB team, and how much does it impact revenue and front office budgetary decisions when constructing an MLB team?”

In order to investigate this question, we utilized the Salaries and Teams tables given to us in the Lahman Database.

From the Salaries table, we find the player salary data for a given year and team. We display the first five rows of the table below.

##   yearID teamID lgID  playerID salary
## 1   1985    ATL   NL barkele01 870000
## 2   1985    ATL   NL bedrost01 550000
## 3   1985    ATL   NL benedbr01 545000
## 4   1985    ATL   NL  campri01 633333
## 5   1985    ATL   NL ceronri01 625000

The Teams table gives us the performance statistics and standings of each team in a given year. This table has 48 variables; however, we will mainly utilize the following variables:

We want to use the mentioned variables to calculate the winning percentage and run differential of a team for a given year. Following the Pythagorean Theorem of Baseball (https://www.baseball-reference.com/bullpen/Pythagorean_Theorem_of_Baseball), we want to use run differential as a predictor of winning percentage.
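These quantities can be sketched as below. The column names W, L, R, and RA (wins, losses, runs scored, runs allowed) are assumed to be the Teams variables in question, since the variable list is not reproduced here. The exponent of 2 is the classic form of the Pythagorean formula, and the single toy team is hypothetical.

```r
# Sketch: winning percentage, run differential, and Pythagorean expectation.
# ASSUMPTION: W, L, R, RA are the Teams columns for wins, losses,
# runs scored, and runs allowed.
team_metrics <- function(teams) {
  data.frame(
    win_pct  = teams$W / (teams$W + teams$L),          # actual winning percentage
    run_diff = teams$R - teams$RA,                     # runs scored minus allowed
    pyth_pct = teams$R^2 / (teams$R^2 + teams$RA^2)    # classic exponent of 2
  )
}

# HYPOTHETICAL team, not a row from the Teams table.
toy_team <- data.frame(W = 95, L = 67, R = 800, RA = 700)
metrics <- team_metrics(toy_team)
print(metrics)
```

A team that outscores its opponents gets a Pythagorean expectation above 0.500, which is why run differential is a natural predictor to plot against winning percentage.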

With these statistics, we create our first plot.

Plot 1

We create a scatterplot with a trend line, with x = Run Differential and y = Winning Percentage, for all the teams in our dataset from 1901 onward. We see a clear positive relationship between run differential and winning percentage: the higher a team’s run differential, the higher its winning percentage tends to be. Run differential is a strong linear indicator of winning percentage, regardless of league.

Plot 2

Now, we want to analyze the relationship between team salary and a team’s winning percentage, from the point of view of an MLB front office. We ask ourselves the question: “How much money does it take to build a competent team?”

In order to get team salary data, we conduct the following processing. We get the total salary for each team in our database, and then join it with our Teams table to get the performance statistics and standings for each team. We filter for teams beginning from the year 1985 to restrict to the modern era of ownership.

Furthermore, we want to account for inflation across decades. Thus, we make a decade variable that we facet on in our plot, so that salaries are only compared within the same decade. We also filter to exclude the year 2020, as the database lacks full data for the season that was changed by the COVID-19 pandemic.

We plot our Winning Percentage vs Salary plot below, faceted by decade. We see that with each successive decade, there is greater parity in salary and a more positive relationship between winning percentage and a team’s salary. However, from a front office point of view, we see that although more money typically leads to more success, there are many teams in the 2000s and 2010s that had a high winning percentage with a low salary. Thus, keeping a balance is ideal and possible in the world of managing baseball teams!
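The salary processing described above can be sketched as follows. The helper name team_payroll is hypothetical, and joining on yearID and teamID is an assumption about how the two tables were matched. The toy salaries are the five 1985 ATL rows shown earlier, while the W and L values in the toy Teams row are illustrative placeholders, not figures taken from the Teams table.

```r
# Sketch: total payroll per team-season, joined to team results, with a
# decade label for faceting. ASSUMPTION: tables are matched on yearID + teamID.
team_payroll <- function(salaries, teams) {
  payroll <- aggregate(salary ~ yearID + teamID, data = salaries, FUN = sum)
  joined <- merge(payroll, teams, by = c("yearID", "teamID"))
  joined <- joined[joined$yearID >= 1985 & joined$yearID != 2020, ]
  joined$decade  <- (joined$yearID %/% 10) * 10   # e.g. 1985 -> 1980
  joined$win_pct <- joined$W / (joined$W + joined$L)
  joined
}

# The five Salaries rows displayed above.
toy_salaries <- data.frame(
  yearID   = 1985,
  teamID   = "ATL",
  playerID = c("barkele01", "bedrost01", "benedbr01", "campri01", "ceronri01"),
  salary   = c(870000, 550000, 545000, 633333, 625000)
)
# ILLUSTRATIVE win/loss values, not read from the Teams table.
toy_teams <- data.frame(yearID = 1985, teamID = "ATL", W = 66, L = 96)

payroll <- team_payroll(toy_salaries, toy_teams)
print(payroll)
```

The decade column is what a faceted plot would split on, so each panel only compares payrolls from the same ten-year span.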

Conclusion

Through our three research questions, we were able to tell a story with data and statistical visualizations. Using the Lahman Database, we carried out a time-series analysis of the changing outcomes of a baseball game, mapped the geographical hot spots that produce MLB players, and examined the importance of run differential and team salary to a team’s success. We truly enjoyed working with this database and we hope that you enjoyed our analysis. Thank you and feel free to reach out if you have any questions!