Analysis of 2020 MLB Season

library(tidyverse)
library(ggseas)
library(ggmap)

baseball <- read_csv("https://raw.githubusercontent.com/binz777/datasets/main/baseball_train.csv")
parks <- read_csv("https://raw.githubusercontent.com/binz777/datasets/main/baseball_park_dimensions.csv")

set.seed(36315)

# Some data cleaning

baseball$game_date <- as.Date(baseball$game_date, format ="%Y-%m-%d")
baseball$strikes <- as.factor(baseball$strikes)
baseball$balls <- as.factor(baseball$balls)

# Order teams by their most recent game date (latest first)
team_levels <- baseball %>%
  arrange(desc(game_date)) %>%
  pull(batter_team) %>%
  unique()

baseball$batter_team <- fct_relevel(baseball$batter_team, team_levels)

baseball %>%
  group_by(pitch_name) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) -> pitch_name_lvl

baseball$pitch_name <- fct_relevel(baseball$pitch_name, c(pitch_name_lvl$pitch_name))

lon <- c(-112.067413,-84.468239, -76.622368, -71.095764, -87.655800, 
         -84.507103, -81.200053, -104.994865, -87.633698, -83.048134, 
         -95.356209, -94.480682, -117.883438, -118.240784, -80.220352, 
         -87.9716667, -93.2777, -73.926186, -73.846237, -122.201553, 
         -75.167465, -80.006363, -117.157516, -122.3325, -122.389717, 
         -90.193329, -82.653553, -97.082604, -79.39, -77.007996)

lat <- c(33.445564, 33.890781, 39.284176, 42.346268, 41.948463, 39.097458, 
         41.411683, 39.756229, 41.830002, 42.338356, 29.757017, 39.051910, 
         33.800560, 34.073814, 25.778301, 43.0288889, 44.9818, 40.757256,
         40.829659, 37.751637, 39.906216, 40.447105, 32.707535, 47.5914, 
         37.778572, 38.622780, 27.768188, 32.751319, 43.6416, 38.873055)

coordinates <- data.frame(park = 0:29, lon, lat)

Introduction

The baseball_train dataset, obtained from Kaggle, contains one row for each at-bat of the 2020 MLB season: 46,244 at-bats across the shortened three-month season. Its 25 variables include the number of balls and strikes in the batting count, the pitch speed, and whether the hit was a home run. The accompanying park_dimensions dataset describes each ballpark, with variables such as the name of the park, the distance to each field’s wall, and the height of each field’s wall.

Research Questions

Our overarching research question is: What factors contribute to whether a home run is hit in a specific at-bat?

Using the baseball and park datasets, we want to explore these three main questions:

What factors relating to batters correlate to a higher proportion of home runs?
What factors relating to pitchers correlate to a lower proportion of home runs?
What geographical and facility features are related to a higher proportion of home runs?

Research Question 1

The first research question concerns what variables relating to batters correlate with a higher proportion of home runs. In the first graph, we look at the variables game_date, home_team, batter_team, and is_home_run.

We created a time series plot to see whether batting at one’s home field correlates with more or fewer home runs.

baseball %>%
  mutate(home_batting = ifelse(home_team == batter_team, "Yes", "No")) %>%
  group_by(game_date, home_batting) %>%
  summarise(prop_home_run = mean(is_home_run)) %>%
  ggplot(aes(x = game_date, y = prop_home_run)) + 
  stat_rollapplyr(width = 7, align = "left", aes(color = home_batting)) + 
  labs(title = "Proportion of Home Runs by Game Date",
       x = "Game Date", 
       y = "Proportion of Hits That Are Home Runs",
       color = "Is the Home \nTeam Batting?") +
  theme(plot.title = element_text(hjust = 0.5))

In the time series plot, we see that players on the home team tend to hit a slightly higher proportion of home runs than players on the away team. During the regular season the difference is quite small; in the postseason, however, the proportion of home runs for the home team rises substantially. One possible explanation for the higher postseason proportions is that playoff rosters feature stronger offensive players and rely more heavily on relief pitchers.
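The smoothing in the plot is a 7-day rolling mean of the daily proportions. As a minimal base-R sketch of that calculation, the helper `rolling_mean_left` and the daily values below are hypothetical and made up for illustration. The window is assumed to start at each date and extend forward, which is how we read `align = "left"`.

```r
# Hypothetical helper: left-aligned rolling mean (window starts at position i).
rolling_mean_left <- function(x, width) {
  n <- length(x)
  if (n < width) return(numeric(0))
  vapply(
    seq_len(n - width + 1),
    function(i) mean(x[i:(i + width - 1)]),
    numeric(1)
  )
}

# Made-up daily home run proportions for nine consecutive game dates.
daily_prop <- c(0.04, 0.06, 0.05, 0.07, 0.03, 0.05, 0.06, 0.08, 0.04)

# Nine days with a 7-day window give three smoothed values.
smoothed <- rolling_mean_left(daily_prop, width = 7)
print(round(smoothed, 4))
```

Because each smoothed value averages seven days, a single unusual game date moves the line much less than it would move the raw daily proportion.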

Next, we made a faceted scatter plot to see which ranges of launch speed and launch angle usually produce home runs. We also faceted by the part of the field the ball was hit to and by whether the batter was left-handed. In this graph, we used the variables launch_speed, launch_angle, is_batter_lefty, bearing, and is_home_run.

baseball %>%
  sample_n(12000) %>%
  mutate(`Home Run?` = ifelse(is_home_run == 0, "No", "Yes"),
         bl = ifelse(is_batter_lefty == 0, "Righty Batter", "Lefty Batter"),
         b = ifelse(bearing == "center", "Center",
                    ifelse(bearing == "left", "Left",
                           "Right"))) %>%
  ggplot() +
  geom_point(aes(x = launch_angle, y = launch_speed, color = `Home Run?`), alpha = .3) +
  facet_grid(bl ~ b) +
  labs(title = "Home Runs by Launch Angle, Launch Speed, Bearing, and Batting Hand",
       x = "Launch Angle", 
       y = "Launch Speed") +
  theme(plot.title = element_text(hjust = 0.5))

In the scatter plots above, home runs usually have a launch speed of around 100-110 mph and a launch angle of around 25-40 degrees. Additionally, judging from the sparseness of the blue points (home runs), far fewer home runs are hit to right field by right-handed batters and to left field by left-handed batters. This pattern could be attributed to the concept of “pulling”: hitters tend to make contact on the early side and therefore hit the ball toward the side of the field from which they bat. Overall, the majority of home runs are hit to center field.
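To put a number on the launch window described here, one option is to compare the home run proportion inside and outside that window. The sketch below uses a small made-up data frame (not rows from the dataset) with the same column names; the 100-110 mph and 25-40 degree bounds are taken from the paragraph above.

```r
# Made-up batted balls; values are illustrative only, not from the dataset.
toy <- data.frame(
  launch_speed = c(105, 108, 102, 95, 88, 110, 99, 104),
  launch_angle = c(30, 28, 35, 12, 45, 38, 27, 5),
  is_home_run  = c(1, 1, 0, 0, 0, 1, 0, 0)
)

# Flag batted balls inside the launch window (bounds inclusive).
in_window <- toy$launch_speed >= 100 & toy$launch_speed <= 110 &
  toy$launch_angle >= 25 & toy$launch_angle <= 40

# Proportion of home runs outside (FALSE) vs. inside (TRUE) the window.
hr_rate <- tapply(toy$is_home_run, in_window, mean)
print(hr_rate)
```

The same two lines could be run on the full baseball data frame, where rows with missing launch measurements would need to be dropped first.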

Lastly, for the first research question, we want to explore whether certain batting counts are associated with more home runs than others. In this side-by-side bar graph, we include the variables balls, strikes, and is_home_run.

baseball %>%
  group_by(strikes, balls) %>%
  summarise(prop_home_run = mean(is_home_run)) %>%
  mutate(p = as.character(round(prop_home_run, 3)),
         tag = str_sub(p, 2, nchar(p))) %>%
  ggplot(aes(x = balls, y = prop_home_run)) +
  geom_col(aes(fill = balls)) +
  facet_grid(~ strikes, switch = "x") + 
  labs(title = "Proportion of Home Runs by Batting Counts",
       x = "Strikes", 
       y = "Proportion of Hits That Are Home Runs",
       fill = "Balls") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  geom_text(aes(label = tag), vjust=-0.25)

From the bar graph, we see that players hit a higher proportion of home runs when the batting count has many balls and few strikes. About 15% of hits are home runs when the count is three balls and zero strikes (3-0), and about 8% are home runs when the count is three balls and one strike (3-1) or two balls and zero strikes (2-0). Home runs are hit most often in these three counts because they favor the hitter. When the count is 3-0, managers often give their players the “green light” to swing at the next pitch, because the pitcher must throw a strike to avoid a walk. We also observe that, for a given number of strikes, the proportion of home runs increases as the number of balls increases. Likewise, for a given number of balls, the proportion of home runs decreases as the number of strikes increases.
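A difference between counts can be checked more formally with a two-sample test of proportions. The sketch below uses base R's `prop.test`. The counts are hypothetical, chosen only so the sample proportions resemble the 15% figure above, and they are not taken from the dataset.

```r
# Hypothetical counts, chosen only to illustrate the test:
# 30 home runs in 200 hits on a hitter-friendly count,
# 40 home runs in 1000 hits on all other counts.
home_runs <- c(hitter_count = 30, other_count = 40)
hits      <- c(hitter_count = 200, other_count = 1000)

# Two-sample test of equal proportions (stats package, loaded by default).
result <- prop.test(x = home_runs, n = hits)

print(result$estimate)  # sample proportions for the two groups
print(result$p.value)   # small values suggest the proportions differ
```

With the real data, `home_runs` and `hits` would come from summing is_home_run and counting rows within each group of counts.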

From the three graphs that we analyzed above, we can see that playing at home vs. away and the batting count both are associated with the proportion of home runs hit. The launch angle and launch speed can help us determine whether the hit could be a home run.

Research Question 2

The second research question concerns what variables relating to pitchers correlate to a lower proportion of home runs. In the first graph we look at the variables: pitch_name, strikes, balls, and pitch_mph.

We created a faceted histogram to see if the batting count or specific pitch type would correlate to certain pitch speeds.

baseball %>%
  filter(pitch_name %in% c("4-Seam Fastball", "Sinker", "Slider", "Changeup")) %>%
  rename(Strikes = strikes,
         Balls = balls) %>%
  ggplot() +
  geom_histogram(aes(x = pitch_mph, fill = pitch_name), position="identity", alpha = .2) +
  facet_grid(Balls ~ Strikes, labeller = "label_both", scales = "free_y") +
  labs(title = "Pitch Speed by Batting Count and Pitch Name",
       x = "Pitch Speed (in mph)", 
       y = "Frequency",
       fill = "Pitch Name") +
  theme(plot.title = element_text(hjust = 0.5))

In the faceted histogram, which uses the four most common pitch types, pitch speed follows a unimodal distribution for each combination of strikes, balls, and pitch name. In counts with 0 strikes or 3 balls, the 4-Seam Fastball and Sinker dominate and generally have pitch speeds of 90 mph or higher. The Slider and Changeup are used less frequently in these counts, and their pitch speeds are generally between 75 and 90 mph. In the other 6 counts (neither 0 strikes nor 3 balls), the speed of each pitch type stays about the same as in counts with 0 strikes or 3 balls, but the 4 pitches are used more equally. This suggests that when a pitcher needs a strike, they typically go with their most reliable, faster pitch, whereas in other counts they mix pitches to keep the batter off-balance. Further, a pitcher tries to keep the speed of each pitch consistent, because a fast pitch thrown too slowly (e.g. a 4-Seam Fastball) or a slow pitch thrown too fast (e.g. a Changeup) might be easier to hit.
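The claim about pitch mix can be read off a table of row proportions rather than from histogram heights. The sketch below builds such a table from a few made-up pitches; the `count` column (written balls-strikes) is a hypothetical helper variable, not a column in the dataset.

```r
# Made-up pitches; illustrative only, not from the dataset.
toy <- data.frame(
  count = c("3-0", "3-0", "3-0", "3-0", "1-2", "1-2", "1-2", "1-2"),
  pitch_name = c("4-Seam Fastball", "4-Seam Fastball", "4-Seam Fastball", "Sinker",
                 "4-Seam Fastball", "Slider", "Changeup", "Sinker")
)

# Share of each pitch type within each count (each row sums to 1).
pitch_mix <- prop.table(table(toy$count, toy$pitch_name), margin = 1)
print(round(pitch_mix, 2))
```

In this toy table the 3-0 row is concentrated on the fastball while the 1-2 row is spread evenly, which is the pattern the histogram is described as showing.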

Next, we made a faceted time series plot to see whether certain pitches were hit for home runs more often during particular parts of the season. In this graph, we used the variables pitch_name, game_date, and is_home_run.

baseball %>%
  mutate(home_batting = ifelse(home_team == batter_team, "Yes", "No")) %>%
  filter(pitch_name %in% c("4-Seam Fastball", "Sinker", "Slider", "Changeup")) %>%
  group_by(game_date) %>%
  mutate(total_thrown_on_day = n()) %>%
  group_by(game_date, pitch_name) %>%
  summarise(`# Home Runs / # Hits` = mean(is_home_run),
            num_thrown = n(),
            # total_thrown_on_day is constant within a date, so take its first value
            `# Pitches / All Pitches` = num_thrown / first(total_thrown_on_day)) %>%
  gather(key, value, c(`# Pitches / All Pitches`, `# Home Runs / # Hits`)) %>%
  ggplot(aes(x = game_date, y = value)) + 
  labs(x = "Game Date", y = "Proportions") +
  stat_rollapplyr(width = 7, align = "left", aes(color = pitch_name, linetype = key)) + 
  labs(title = "Proportion of Home Runs and Pitches by Time",
       linetype = "Key",
       color = "Pitch Name") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(pitch_name ~ key, scales = "free_y", ncol = 2) +
  theme(strip.text.x = element_blank())

In the faceted time series plot, again using the four most common pitch types, the 4-Seam Fastball made up its largest share of all pitches at the beginning of the season (July) and in the postseason (late September - October). A higher proportion of home runs were hit off this pitch in the postseason, when it was thrown more frequently. The Sinker made up its largest share of pitches in the postseason, and again a higher proportion of home runs were hit off it then. The Slider was thrown most often in the middle of the season (late August - late September) and much less frequently in the postseason. Interestingly, though, a higher proportion of home runs were hit off the Slider in the postseason, when it was thrown less frequently. Finally, the Changeup was thrown most often from the beginning to the middle of the season and much less frequently in the postseason; a higher proportion of home runs were hit off it in the beginning and middle of the season, when it was thrown more frequently. With the exception of the Slider, this suggests that when a specific pitch is thrown more frequently, a higher proportion of home runs are hit off that pitch. One possible explanation is that a pitcher who does not mix pitches enough becomes predictable, so the batter can look for that pitch and hit it hard.
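One hedged way to check whether usage and home run rate move together for a single pitch type is a simple correlation between the two series. The weekly values below are invented for illustration and do not come from the dataset; with real data they would be the two smoothed series from the plot.

```r
# Invented weekly values for one pitch type; illustrative only.
usage_share <- c(0.30, 0.35, 0.40, 0.45)  # share of all pitches thrown
hr_rate     <- c(0.04, 0.05, 0.05, 0.07)  # proportion of hits that are home runs

# Pearson correlation; a positive value means the two series rise together.
r <- cor(usage_share, hr_rate)
print(round(r, 2))
```

A correlation computed this way describes co-movement only. It cannot separate a predictability effect from other changes over the season, such as the stronger lineups in the postseason.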

From the two graphs analyzed above, we can see that pitchers typically choose which pitch to throw based on the batting count, yet the speed of each pitch type stays relatively constant. The type of pitch thrown, how often it is thrown, and the time in the season are associated with the proportion of home runs hit.

Research Question 3

The third research question looks at certain geographical and facility features that correlate to a higher proportion of home runs. In the first graph, we look at the variables: CF_Dim (center field dimension), is_home_run, and Cover.

We created a map of the U.S. to see if the center field dimension and cover type of specific MLB ballparks correlated to the proportion of home runs.

home_run_by_park <- baseball %>%
  group_by(park) %>%
  summarise(num.homeruns = mean(is_home_run))

parks_with_home_run_info <- inner_join(home_run_by_park, parks, by = "park")

frame <- inner_join(parks_with_home_run_info, coordinates, by = "park")

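# Bounding box for the map (named bbox to avoid masking grDevices::png)
bbox <- c(left = -125,
          bottom = 24,
          right = -66,
          top = 50)

# Note: Stamen tiles are now hosted by Stadia Maps, so recent ggmap versions
# may require get_stadiamap() and an API key instead of get_stamenmap().
map_base <- get_stamenmap(bbox,
                          maptype = "terrain",
                          zoom = 5)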
ggmap(map_base, extent = "device", ylab = "Latitude", xlab = "Longitude") +
  geom_point(aes(x = lon, y = lat, color = CF_Dim, size = num.homeruns, shape = Cover), 
             data = frame) +
  scale_color_gradient2(low = "firebrick4", mid = "white", high = "royalblue4", midpoint = mean(parks$CF_Dim)) +
  labs(title = "Proportion of Home Runs and Center Field Dimension",
       size = "Proportion of Home Runs",
       color = "Center Field Dimension (in ft)") +
  theme(plot.title = element_text(hjust = 0.5))

In the map of the U.S., we see six MLB parks with roof covers and one domed park; the rest are outdoor. The shortest center field dimension is in Boston, while the longest is in Detroit. The highest proportion of home runs was hit in Cincinnati, while the lowest was hit in Miami. Interestingly, most West Coast (primarily California) and Northeastern parks are outdoor stadiums with shorter center field dimensions, between 390 and 400 feet. These parks also saw consistently higher proportions of home runs, with two exceptions: Yankee Stadium and Citi Field in NYC, whose center field dimensions are slightly greater than 400 feet. This suggests that it is easier to hit home runs in parks with short center field dimensions, which makes sense because the fence is closer.

The center field dimension is not the only contributing factor to why certain parks are more home run friendly. Now we will look at the heights of the walls in each field in all the parks. Below, we have a faceted scatter plot with linear fits which includes the variables: CF_W (height of center field wall), LF_W (height of left field wall), RF_W (height of right field wall), and is_home_run.

# home_run_by_park and parks_with_home_run_info are reused from the map code above
wall_heights <- parks_with_home_run_info %>%
  dplyr::select(num.homeruns,
                `Left Field` = LF_W,
                `Center Field` = CF_W,
                `Right Field` = RF_W) %>%
  gather(key = "Field_Position", value = "Height", -num.homeruns)

ggplot(wall_heights, aes(x = Height, y = num.homeruns)) +
  geom_point(size = 2, shape = 23) +
  geom_smooth(method = lm, se = FALSE, color = "dark blue") +
  labs(title = "Proportion of Home Runs by Height of Each Field's Wall",
       x = "Height of Field Wall (ft)",
       y = "Proportion of Home Runs") +
  facet_grid(rows = vars(Field_Position))

From the graph, we notice that the slope of the line of best fit is negative for all three fields: as the height of the wall increases, the proportion of home runs decreases. This makes sense because it is harder to hit a home run over a taller wall than over a shorter one. The slope for the left field plot is less negative than the others, likely because of an outlier. That outlier is the tallest wall in MLB (37 feet), the famous “Green Monster” at Fenway Park in Boston; hitting a ball over the “Green Monster” is a notable accomplishment.
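The effect of a single tall wall on the fitted slope can be illustrated by fitting the line with and without it. The sketch below uses five made-up parks (heights and proportions are invented, apart from the 37-foot height mentioned above), so it shows the mechanism and is not a result from our data.

```r
# Made-up left field walls; only the 37 ft height echoes the text above.
walls <- data.frame(
  height        = c(8, 9, 10, 12, 37),
  prop_home_run = c(0.060, 0.055, 0.050, 0.040, 0.050)
)

# Slope of the least-squares line using every park.
slope_all <- coef(lm(prop_home_run ~ height, data = walls))[["height"]]

# Slope after dropping the one very tall wall.
slope_no_outlier <- coef(
  lm(prop_home_run ~ height, data = subset(walls, height < 30))
)[["height"]]

print(c(all_parks = slope_all, without_outlier = slope_no_outlier))
```

In this toy example both slopes are negative, but the slope that includes the tall wall is much closer to zero, because one point far from the others on the x-axis has high leverage on the fit.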

Based on the graphs above, both the distance to center field and the height of the outfield walls are negatively associated with the proportion of home runs hit.

Conclusion

Key Takeaways

Above, we analyzed three different research questions looking at factors of batters, factors of pitchers, and geographical and facility features all relating to home runs hit. From examining visualizations and performing statistical analyses, we concluded the following:

Factors of Batters: Playing at home rather than away is associated with a slightly higher proportion of home runs hit. A batting count in favor of the batter (i.e. more balls than strikes) is associated with a higher proportion of home runs hit. Finally, the proportion of home runs is associated with launch speed and launch angle.
Factors of Pitchers: Pitchers typically decide what pitch to throw based on the count (i.e. faster pitches when they are behind and a mix of pitches when the count is in their favor). The number of times a pitch is thrown relative to the total number of pitches is associated with the proportion of home runs.
Geographical and Facility Features: Shorter center field dimensions, which are common among outdoor parks in specific parts of the U.S., are associated with a higher proportion of home runs hit. Taller outfield walls are associated with a lower proportion of home runs.

Limitations and Future Work

Though we were able to draw several conclusions from our data, the data come from the three-month 2020 MLB season, which took place during the Covid-19 pandemic. The season was delayed significantly: it started in July and ended in September, so each team played only 60 regular season games. Typically, each MLB team plays 162 regular season games starting in April and ending in October. Additionally, the postseason had an expanded format featuring 16 teams instead of the usual 10. As a result, it may be more informative to examine data collected during a normal season.

For future work, we could explore batting average, which is a broader measure of offensive performance than the is_home_run variable. Another useful performance metric would be whether the batter got on base or got out. This would allow us to investigate questions such as whether certain pitches in the strike zone are more effective than others.