Success-Related Factors for the Astros and the Phillies in the 2022 World Series

Dataset Description and Research Questions

The dataset presented here describes how each starting member of both teams, the Philadelphia Phillies and the Houston Astros, batted throughout the season. This provides insight into which team could score more runs during the World Series, possibly indicating a higher probability of winning. We will use the following variables:

Team: Which team each player plays for (Houston Astros / Philadelphia Phillies).
Name: The name of the player (Martin Maldonado, Yuli Gurriel, etc).
Age: The age of each player (35, 38, etc).
Games: How many games the player played throughout the season (113, 146, etc).
X2B: How many hits resulted in a double (12, 40, etc).
X3B: How many hits resulted in a triple (0, 1, etc).
HR: How many hits resulted in a home run (15, 8, etc).
RBI: Runs Batted In (45, 53, etc).
BA: Batting Average (0.186, 0.242, etc); equivalent to AVG (average batting).
OBP: On-Base Percentage (0.248, 0.288, etc).
SLG: Slugging (0.352, 0.360, etc).
OPS: On-Base Plus Slugging (0.600, 0.647, etc).
TB: Total Bases (121, 196, etc).
HBP: How many times a player was hit by Pitch (7, 6, etc).
BB: How many times a player advanced to first base through Base on Balls (7, 6, etc).

Further explanation of these variables will be provided as we develop our analysis, since they assess much more than simple hits, home runs, etc. For now, it may be helpful to note that for the abbreviated batting statistics such as RBI, BA, OBP, SLG, and OPS, higher values are better.

There were three main research questions we wanted to address with this study. All of them relate in some way to factors that may have led to overall success. First, what quantitative player statistics seem to be related to winning? Second, did each player contribute to their team relatively evenly, or were there clear “star players”? Finally, which pitchers brought their teams the most success, and what factors might have contributed to this?

Research Question 1

First, we address quantitative factors that seemed to lead to success. To win, a team needs to score; this is a basic fundamental of baseball. While many factors may influence the outcome of a game, here we examine the batting statistics of the players who played in the 2022 World Series.

While batting average may be the most universal way to see whether a batter is hitting well, it does not tell us everything about productivity, because it only takes into account at-bats (AB). By analyzing variables that use plate appearances (PA), we get more information on a player’s overall batting performance, since we include more data such as whether a batter reached base through a base on balls (BB) and how many bases a batter advanced by hitting (X2B, X3B, etc).

For example, a player with a 0.300 BA could have hit successfully in 3 of his 10 at-bats. This is a great average compared with the 0.259 mean and 0.258 median of the best two teams of the season, as seen earlier. However, in reality, the player may also have been hit by a pitch or walked to first base after receiving four pitches the umpire called balls. Neither outcome counts as an at-bat, so he was more productive at the plate than his BA shows. The player still achieved his goal of reaching base regardless of the method, and so performed better than the 0.300 BA suggests. Furthermore, what if his 3 hits had all been doubles rather than singles? Although both cases give a 0.300 BA, hitting 3 doubles contributes more to scoring runs than hitting 3 singles. It is debatable whether, in modern baseball, singles may have higher value than doubles, triples, or even home runs in certain situations. However, that depends too heavily on specific tactics and situations, so for this project we will value the types of hits as follows:

Let $1B$ denote a single, $2B$ a double, $3B$ a triple, and $HR$ a home run. Then $1B < 2B < 3B < HR$.

As discussed above, regardless of the method, a batter’s goal should be to get on base. Batting average does not provide enough information because it is calculated only from the number of at-bats (AB).

OPS, On-Base Plus Slugging, is a variable that combines every batting outcome that leads to productivity. It includes both on-base percentage and slugging, making it the most suitable variable for our project. It is calculated with the following equation:

$OPS = \frac{AB \times (H + BB + HBP) + TB \times (AB + BB + SF + HBP)}{AB \times (AB + BB + SF + HBP)}$, which is equivalent to $OPS = OBP + SLG$. Here $H$ denotes hits and $SF$ denotes sacrifice flies.
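The equivalence between the two forms of the OPS formula above can be checked numerically. The following is a minimal Python sketch, not the code used for this report; the function names and the stat line are hypothetical and chosen only for illustration.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times on base divided by AB + BB + SF + HBP."""
    return (h + bb + hbp) / (ab + bb + sf + hbp)


def slg(tb, ab):
    """Slugging: total bases per at-bat."""
    return tb / ab


def ops(h, bb, hbp, ab, sf, tb):
    """OPS written as the sum OBP + SLG."""
    return obp(h, bb, hbp, ab, sf) + slg(tb, ab)


def ops_single_fraction(h, bb, hbp, ab, sf, tb):
    """The same quantity written as the single fraction from the text."""
    denom = ab + bb + sf + hbp
    return (ab * (h + bb + hbp) + tb * denom) / (ab * denom)


# Hypothetical stat line (not from the dataset): 3 doubles in 10 at-bats,
# plus one walk and one hit-by-pitch, no sacrifice flies.
example = dict(h=3, bb=1, hbp=1, ab=10, sf=0, tb=6)

print(round(ops(**example), 3))                  # 1.017
print(round(ops_single_fraction(**example), 3))  # 1.017
```

Note that this hypothetical batter has a BA of only 0.300, yet the walk, the hit-by-pitch, and the extra bases all raise his OPS.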

From the plots, the pairs involving BA seem to have lower correlations than the other pairs in general. This matches what we mentioned previously: batting average is not enough to show overall batting productivity. Thus, we continue our analysis using OPS, since the variables associated with OPS, aside from BA, have high correlations with it (0.912, 0.975).
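The pairwise correlations referred to above are Pearson correlation coefficients. As a rough illustration of how one such coefficient is computed, here is a small Python sketch; the `pearson_r` helper and the sample values are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up SLG and OPS values for five batters, for illustration only.
slg_values = [0.352, 0.360, 0.410, 0.480, 0.610]
ops_values = [0.600, 0.647, 0.720, 0.830, 1.010]

print(round(pearson_r(slg_values, ops_values), 3))
```

A value close to 1 indicates that the two statistics rise together almost linearly, which is expected here because SLG is itself one of the two terms in OPS.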

We define a new variable, PAA, that compares each player’s plate appearances to the mean plate appearances across all 18 batters in the World Series, in an attempt to relate win rates to OPS.
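One way to construct such a variable is sketched below in Python. This is only an illustration under stated assumptions: the `flag_paa` name, the "above"/"below" labels, the choice to put a player exactly at the mean into the "below" group, and the plate-appearance counts are all hypothetical.

```python
from statistics import mean


def flag_paa(plate_appearances):
    """Label each player 'above' or 'below' the mean number of plate appearances.

    Assumption: a player exactly at the mean is labeled 'below'.
    """
    average = mean(plate_appearances.values())
    return {
        name: "above" if pa > average else "below"
        for name, pa in plate_appearances.items()
    }


# Made-up plate-appearance counts; the mean of these three is 460.
sample_pa = {"Batter A": 610, "Batter B": 450, "Batter C": 320}

print(flag_paa(sample_pa))
# {'Batter A': 'above', 'Batter B': 'below', 'Batter C': 'below'}
```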

We define another new variable, WR, that gives the win rate of a player’s team throughout the season. Instead of individual win rates, we chose overall team win rates to account for external factors such as momentum and team spirit. It is highly probable that mental factors influence physical performance and thus alter how a team performs throughout the season.

Here, we see that both box plots, for players with plate appearances above and below the mean, show that the Astros were more productive at bat throughout the season. Not only are their means higher, but their upper and lower quartiles are also all higher than those of the Phillies. Players with low PA seemed to have lower OPS in general than players with high PA. This may be due to many factors, such as reduced physical condition due to injuries or being substituted out after bad performances. A higher PA suggests more consistency, and such players were more likely to bat in the World Series as they had throughout the season.

## 
##  Pearson's Chi-squared test
## 
## data:  table(bat_data.PAA.above$OPS, bat_data.PAA.above$WR)
## X-squared = 8.9833, df = 9, p-value = 0.4388

## 
##  Pearson's Chi-squared test
## 
## data:  table(bat_data.PAA.below$OPS, bat_data.PAA.below$WR)
## X-squared = 7, df = 6, p-value = 0.3208

We see that for both the higher and lower PAA groups, the p-values are greater than 0.05. We fail to reject the null hypothesis of independence, so for all players, regardless of high or low plate appearances, we cannot conclude that there is a relationship between OPS and win rate. This may be because defensive performance mattered more to winning or losing a game for both teams, or because our sample size was too small and the test did not have enough power to detect a relationship.
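To make the mechanics of the test above concrete, the following Python sketch computes Pearson’s chi-squared statistic and degrees of freedom from a contingency table. It is an illustration, not the R call that produced the output above; the function names are hypothetical, the closed-form p-value only applies when the degrees of freedom are even, and the example table is constructed, not taken from the data.

```python
import math


def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


def chi_square_p_even_df(statistic, df):
    """Upper-tail p-value; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = statistic / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# Constructed example: seven distinct row categories with one observation each,
# split 4 / 3 between two column categories.
table = [[1, 0]] * 4 + [[0, 1]] * 3
stat, df = chi_square_statistic(table)
print(round(stat, 4), df, round(chi_square_p_even_df(stat, df), 4))
```

In a two-column table where every row holds a single observation, the statistic always equals the number of observations, whatever the row values are. If each OPS value in a subset is unique (an assumption about the data, not something the output confirms), the test therefore carries very little information, which is another reason to read the p-values above with caution.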

Finally, we will look at average exit velocity and average distance and compare these between teams.

Interestingly, values of average exit velocity and average distance seem to be higher for the Phillies, even though the Astros ended up winning. This may indicate that these properties of a hit do not actually determine whether that hit will get the team runs.

Research Question 2

We will now turn our attention to the question of whether the teams seemed to have “star players” or whether each player contributed fairly evenly.

First we will look at the average values of a couple of important variables for each team member. Slugging percentage “represents the total number of bases a player records per at-bat. Slugging percentage differs from batting average in that all hits are not valued equally.” ¹

We use the following equation to calculate slugging percentage: $SLG = \frac{1B + 2 \times 2B + 3 \times 3B + 4 \times HR}{AB}$, which is equivalent to $SLG = TB / AB$.

By using slugging percentage rather than batting average, we are able to look more closely at a batter’s contribution to scoring runs.
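The difference between the two measures can be seen in a short Python sketch that reuses the earlier example of a batter with 3 hits in 10 at-bats. The function names are hypothetical, and the two stat lines are illustrative rather than drawn from the dataset.

```python
def total_bases(singles, doubles, triples, home_runs):
    """Total bases: each hit weighted by the number of bases it is worth."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def slugging(singles, doubles, triples, home_runs, at_bats):
    """Slugging percentage: total bases per at-bat."""
    return total_bases(singles, doubles, triples, home_runs) / at_bats


def batting_average(singles, doubles, triples, home_runs, at_bats):
    """Batting average: every hit counts the same."""
    return (singles + doubles + triples + home_runs) / at_bats


# Two illustrative batters, each with 3 hits in 10 at-bats.
three_singles = dict(singles=3, doubles=0, triples=0, home_runs=0, at_bats=10)
three_doubles = dict(singles=0, doubles=3, triples=0, home_runs=0, at_bats=10)

for label, line in [("3 singles", three_singles), ("3 doubles", three_doubles)]:
    print(label, batting_average(**line), slugging(**line))
# 3 singles 0.3 0.3
# 3 doubles 0.3 0.6
```

Both batters share a 0.300 BA, but the one who hit doubles has twice the slugging percentage.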

When looking at batting average, it appears as though each player contributes fairly equally, particularly on the Phillies. Slugging percentage is a bit more varied. On the Astros, Yordan Alvarez is notably higher than the other players. The Phillies are, again, more even. Overall, these graphs make it seem as though each player contributes relatively evenly.

To measure each teammate’s contribution more directly, we will look at runs and hits. These show how much each player was able to produce for their team.

Unsurprisingly, runs and hits are positively correlated. More interestingly, this provides evidence that each team has a few star players. We can see that Jose Altuve, Yordan Alvarez, and Alex Bregman seem to have contributed the most for the Astros, while Alec Bohm, Rhys Hoskins, and Kyle Schwarber seem to have contributed the most to the Phillies. Overall, the evidence seems to lean in favor of the “star player” hypothesis.

Research Question 3

The pitching dataset covers the same two teams, the Houston Astros and the Philadelphia Phillies, and comes from the 2022 MLB World Series. Each row contains general player information, team, and various pitching statistics for that player. The variables below will be used in analyzing the pitchers for both teams:

Team: The team the player pitches for.
Pos: The type of pitcher the player is.
Name: The name of the player.
W.L.: Win-loss percentage of that player.
ERA: Earned run average of the player.
GS: Number of games where this player started pitching.
H9: Number of hits allowed per 9 innings for that player.
BB9: Number of walks per 9 innings for that player.
SO9: Number of strikeouts per 9 innings for that player.
SO.W: Strikeout-to-walk ratio for that player.

One of the main goals of the analyses performed on this pitching data is to answer the question: which pitchers brought their teams the most success, and what factors might have contributed to this?

Here is a density plot of earned run averages (ERAs) for pitchers by team. ERA is the average number of earned runs a pitcher allows per nine innings, so lower values indicate a more effective pitcher. The density curves are drawn by position, with RP denoting relief pitchers and SP denoting starting pitchers.
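For reference, ERA and the other rate statistics listed above are conventionally scaled to nine innings. The Python sketch below shows that arithmetic; the `pitching_rates` helper and the season line are hypothetical, and innings are assumed to be given as a true decimal (not the baseball ".1/.2" thirds notation).

```python
def pitching_rates(innings, earned_runs, hits, walks, strikeouts):
    """Per-nine-inning rates and strikeout-to-walk ratio for one pitcher.

    Assumption: `innings` is a true decimal number of innings pitched.
    """
    per_nine = 9 / innings
    return {
        "ERA": earned_runs * per_nine,   # earned runs allowed per 9 innings
        "H9": hits * per_nine,           # hits allowed per 9 innings
        "BB9": walks * per_nine,         # walks per 9 innings
        "SO9": strikeouts * per_nine,    # strikeouts per 9 innings
        "SO.W": strikeouts / walks,      # strikeout-to-walk ratio
    }


# Made-up season line, for illustration only.
line = pitching_rates(innings=180.0, earned_runs=40, hits=150, walks=45, strikeouts=190)
for name, value in line.items():
    print(name, round(value, 2))
```

Lower is better for ERA, H9, and BB9, while higher is better for SO9 and SO.W.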

The Philadelphia Phillies have a high density of decent relief pitchers, with a spike around an ERA of 3, and most of their starting pitchers have a slightly higher ERA. In contrast, the spread of ERA for Astros pitchers is wider, which is better in this case: the Astros have a good density of starting and relief pitchers with a low ERA, with two standouts being Justin Verlander (starting) and Ryne Stanek (relief). The closing pitcher position, denoted CL, is not shown because each team has only one closer, which is too few for a density plot.

Here we again see a breakdown of pitchers by team. This scatterplot further explores the earned run average of each pitcher. The size of each point is proportional to the pitcher’s strikeout-to-walk ratio, which is simply the number of strikeouts the pitcher recorded divided by the number of walks they allowed. Again, pitcher position is indicated by the color of the point. Additionally, a trendline for each team across all pitchers shows W/L rate against ERA; the two are negatively correlated, which makes sense.

For each point, a larger size indicates a higher strikeout-to-walk ratio, which is indicative of a strong pitcher who strikes out more batters and does not allow many walks. Here we see strong Astros starting pitcher Justin Verlander with a relatively high SO/W of 6.38 and a low ERA of 1.75. The better pitchers on a given team seem to be above the trendline. Interestingly, the closing pitcher for each team has a win-loss rate of no more than around 50%.

From this graph it can be seen that most of the Astros’ pitchers have a low ERA, meaning they allow fewer earned runs while pitching. Verlander has an impressively low 1.75 ERA, but the Astros also have another pitcher with a low ERA, the relief pitcher Ryne Stanek. While the Phillies do not have any explicit standout players, their starting pitchers tend to have a higher strikeout-to-walk ratio (SO/W), meaning they concede few walks to opposing batters relative to their strikeouts.

Here we see a principal component analysis (PCA) of several quantitative variables, of which only two, win-loss rate and games started, are not on a per-nine-innings basis. The 5 statistics included in this PCA are win-loss rate, strikeouts per 9 innings, walks per 9 innings, hits per 9 innings, and games the pitcher started (which distinguishes relief from starting pitchers). The ellipses show the clustering of the data, which, as we can see, shows the Astros as having more pitchers with a higher win-loss rate.

From the data, we can see that the numbers of walks and hits allowed per 9 innings are negatively correlated with win-loss rate, but are also negatively correlated with each other. Strikeouts per 9 innings are positively correlated with win-loss rate, as expected. Games started measures how often a pitcher starts; its positive relationship with win-loss rate makes sense, since starting pitchers tend to be better than relief pitchers.

Overall, based on the PCA of the two teams, the Astros have pitchers that are more likely either to start the game or to have more strikeouts per 9 innings than other pitchers. This corresponds to their pitchers’ win-loss rate, as W/L rate is positively associated with strikeouts per 9 innings and with games started. Most Phillies pitchers have relatively more walks and hits per 9 innings than the Astros pitchers. This is indicative of the Astros having better pitchers than the Phillies, since the displayed variables account for 70% of the variation within the data and the Astros pitchers tend to have higher win-loss rates than the Phillies pitchers.
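As a rough illustration of what the PCA above computes, the Python sketch below standardizes the variables, builds their correlation matrix, and finds the first principal component by power iteration. This swaps in a simple hand-written method for illustration and is not the routine used to produce the plot; the function names and pitcher rows are hypothetical, and the sketch assumes every column has non-zero variance.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1)) for c, m in zip(cols, means)]
    return [[(row[j] - means[j]) / sds[j] for j in range(len(cols))] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)] for a in range(p)]


def first_component(corr, iterations=500):
    """Leading eigenvector (loadings) and its share of total variance."""
    p = len(corr)
    # Non-uniform start so the vector is unlikely to be orthogonal to the answer.
    v = [1.0 + 0.1 * j for j in range(p)]
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(p)) for a in range(p))
    return v, eigenvalue / p  # trace of a correlation matrix equals p


# Made-up pitchers: (W-L%, SO9, BB9, H9, GS), for illustration only.
pitchers = [
    [0.70, 10.1, 2.0, 6.5, 28],
    [0.55, 8.4, 3.1, 8.0, 25],
    [0.50, 9.0, 3.5, 7.8, 0],
    [0.40, 7.2, 4.0, 9.1, 0],
    [0.62, 9.8, 2.4, 7.0, 30],
]
loadings, share = first_component(correlation_matrix(standardize(pitchers)))
print([round(x, 2) for x in loadings], round(share, 2))
```

Loadings with the same sign move together along the component, while opposite signs move against each other; the overall sign of the loading vector is arbitrary.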

While it is hard to pinpoint exactly what makes a great pitcher, some of the strongest driving factors appear to be a low ERA, a high SO/W, and the number of games started. These factors are closely tied to a pitcher’s W/L% throughout the league season. Additionally, strikeouts, hits allowed, and walks per 9 innings are seen as factors affecting W/L% in the above PCA. Better pitchers are those who minimize walks and hits allowed while maximizing SO/W.

Conclusion

We note a few characteristics of our dataset. First of all, we have a very small dataset, which may lead to low power when testing variables. For example, the two chi-squared tests we used to check independence between OPS and win rate both had p-values well above 0.05 (0.4388 and 0.3208). Low power is likely to affect tests throughout the project, since we consistently work with small samples. Second, despite the very small dataset, the percentages used for batting analysis, such as OPS and BA, came from fairly large sample sizes. Thus, we can reason that such statistics may be reliable. For example, Yordan Alvarez had an OPS of 1.019 (the best of all 18 players) over 561 plate appearances through the season, both of which were higher than the median and mean. We can reasonably assume that he may have been one of the best batters during the World Series. Extra data on each player’s performance in individual games, for both pitchers and batters, would have increased power and given us more certainty in many statistical inferences.

Despite the limitations of our data, we were able to draw a few interesting insights. It seems as though OPS is a useful quantitative factor for assessing a team’s batting success, although our chi-squared tests could not confirm a relationship with win rate. We were also able to conclude, by looking at slugging percentage, hits, and runs, that the teams tended to have certain players contributing more than others, especially the Astros. Finally, we found that the most successful pitchers tended to allow few hits and walks, have many strikeouts, and have a lower ERA, and that the Astros’ pitchers having these qualities likely helped them succeed.