Introduction

Per MLB, shifting describes the situational defensive realignment of fielders away from their “traditional” (or for our purposes, standard) starting positions. Given that most recent discussion around shifting has been focused on defensive movement in the infield, we will center our analysis on the infield. Statcast defines infield positioning in three ways:

1. Standard - The infield is in their traditional positions.

2. Infield Shift - Three or more infielders are positioned on one side of second base.

3. Strategic - A catch-all category for positioning that does not fit either category.

In recent years, the shift has become the topic of heated debate in the baseball world. As use of the shift has increased, opposition to it has grown as well. The case against the shift is built on the idea that it “steals” hits at a rate that puts hitters at a significant disadvantage relative to the defense. According to Zachary Rymer of Bleacher Report, between 2015 and 2016, hard-hit balls (exit velocity of 95+ MPH) to the pull side resulted in a hit 58 percent of the time. This number is down to 50 percent for 2020 and 2021, with left-handed hitters suffering more than their right-handed counterparts. Shifting trends are observed to be different for right-handed hitters, a phenomenon that will serve as the main motivation for our work.

Tom Tango looks into this in The Psychology of the Infield Shift, highlighting that wOBA is the main differentiator between the shift’s effect on righties versus lefties. Like on-base percentage, wOBA measures how often a player reaches base, but it goes further by accounting for how the player reached base. The value for each way of reaching base is determined by how much the event is worth in terms of projected runs scored, so some outcomes (doubles) are worth more than others (singles). In Tango’s work, lefties displayed a lower wOBA when shifted against, while righties displayed a higher wOBA, both relative to non-shifted situations. We will also look into BABIP, which measures batting average exclusively on balls in play, excluding several batted-ball outcomes that do not involve the influence of the defense, such as home runs. However, given the right-handed hitter phenomenon, wOBA is the primary statistic of interest for our analysis.
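As a rough illustration of how these two statistics are computed, the sketch below applies the standard wOBA and BABIP formulas to a single stat line. The linear weights are illustrative placeholders (the actual wOBA weights are recalculated every season), and the stat line, dictionary keys, and function names are hypothetical rather than taken from our data.

```python
# Illustrative linear weights only; real wOBA weights change every season.
ASSUMED_WEIGHTS = {"bb": 0.69, "hbp": 0.72, "single": 0.88,
                   "double": 1.25, "triple": 1.58, "hr": 2.03}


def woba(s, weights=ASSUMED_WEIGHTS):
    """Weighted on-base average: each way of reaching base is weighted by run value."""
    numerator = (
        weights["bb"] * (s["bb"] - s["ibb"])  # unintentional walks only
        + weights["hbp"] * s["hbp"]
        + weights["single"] * s["single"]
        + weights["double"] * s["double"]
        + weights["triple"] * s["triple"]
        + weights["hr"] * s["hr"]
    )
    denominator = s["ab"] + s["bb"] - s["ibb"] + s["sf"] + s["hbp"]
    return numerator / denominator


def babip(s):
    """Batting average on balls in play: home runs and strikeouts are excluded."""
    hits = s["single"] + s["double"] + s["triple"] + s["hr"]
    return (hits - s["hr"]) / (s["ab"] - s["so"] - s["hr"] + s["sf"])


# Hypothetical season stat line for one batter.
stat_line = {"ab": 500, "single": 90, "double": 30, "triple": 3, "hr": 20,
             "bb": 50, "ibb": 5, "hbp": 5, "sf": 5, "so": 120}

season_woba = woba(stat_line)
season_babip = babip(stat_line)
```

Note that a home run raises wOBA (it carries the largest weight) but is removed from both the numerator and denominator of BABIP, which is why the two statistics can move differently for the same hitter.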

The baseballr Package

For our work, we accessed pitch-by-pitch Statcast data from the baseballr package (Bill Petti). This dataset contained pitch-level information dating back to 2015, when Statcast technology was introduced in all 30 MLB stadiums. The initial dataset had 93 variables per observation; however, during preprocessing, we performed several transformations to prepare the dataset for analysis:

After preprocessing, each observation had 88 variables, such as:

After preprocessing, we also added several additional variables:

To promote consistency within our analysis, we decided to observe individual years with assumed independence. Thus, players who had 200 or more plate appearances in multiple years were included as separate player instances. We chose 200 as the threshold because it allows for the calculation of more stable performance statistics for each player.

To account for switch hitters, we required that the threshold of 200 or more plate appearances be met with a particular batting stance within a single season. If, for example, a player meets the threshold both as a left-handed batter and as a right-handed batter within a single season, this player is included as two separate observations according to their stance. Given that both the 2020 season (60 games played) and the 2022 season (the season was not completed at the start of the project) have significantly fewer observations using this threshold, we elected to exclude these years from our analysis.
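The inclusion rule above can be sketched as a simple grouping step. This is a minimal illustration, assuming one record per plate appearance; the record keys (`batter`, `season`, `stance`), the function name, and the toy data are hypothetical and not the column names of the actual dataset.

```python
from collections import Counter

PA_THRESHOLD = 200               # minimum plate appearances, from the text
EXCLUDED_SEASONS = {2020, 2022}  # shortened / incomplete seasons


def qualifying_instances(plate_appearances, threshold=PA_THRESHOLD):
    """Return the (batter, season, stance) groups that meet the threshold."""
    counts = Counter(
        (pa["batter"], pa["season"], pa["stance"])
        for pa in plate_appearances
        if pa["season"] not in EXCLUDED_SEASONS
    )
    return {key for key, n in counts.items() if n >= threshold}


# Hypothetical toy data: one record per plate appearance.
toy_plate_appearances = (
    [{"batter": "switch_hitter", "season": 2019, "stance": "L"}] * 210
    + [{"batter": "switch_hitter", "season": 2019, "stance": "R"}] * 205
    + [{"batter": "part_timer", "season": 2019, "stance": "R"}] * 150
    + [{"batter": "regular", "season": 2020, "stance": "L"}] * 250
)

instances = qualifying_instances(toy_plate_appearances)
```

In this toy example the switch hitter contributes two separate instances (one per stance), the part-time player falls below the threshold, and the 2020 player is dropped because the season is excluded.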

IMPORTANT NOTE: For EDA purposes, we used all available data from 2015-2022, excluding 2020 and 2022. For modeling purposes, this range was reduced to 2019-2021, excluding 2020, as the large quantity of pitch-by-pitch data would be sufficient over this time interval.

EDA

While MLB has seen a steady rise in shifting overall, left-handed batters see the shift in a significantly higher proportion of their plate appearances than right-handed batters do.

REMOVING GAMES: During initial EDA, we discovered a high proportion of missing shift values in the Spring Training pitch data (~78%), as well as a low proportion of missing shift values in the Regular Season (~0.01%). From this, we chose to exclude the Spring Training data from each year, and all observations with missing shift values in either the infield or outfield alignment column.
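A check of this kind might look like the sketch below, which computes the share of pitches with a missing alignment per game type and then applies the exclusion. The keys `game_type`, `if_alignment`, and `of_alignment`, the code `"S"` for Spring Training, and the toy records are assumptions made for illustration.

```python
from collections import defaultdict


def missing_shift_rate(pitches):
    """Share of pitches, per game type, with a missing infield or outfield alignment."""
    totals, missing = defaultdict(int), defaultdict(int)
    for p in pitches:
        totals[p["game_type"]] += 1
        if p["if_alignment"] is None or p["of_alignment"] is None:
            missing[p["game_type"]] += 1
    return {g: missing[g] / totals[g] for g in totals}


def usable_pitches(pitches):
    """Drop Spring Training pitches and any pitch with a missing alignment."""
    return [
        p for p in pitches
        if p["game_type"] != "S"
        and p["if_alignment"] is not None
        and p["of_alignment"] is not None
    ]


# Hypothetical toy pitches ("S" = Spring Training, "R" = Regular Season).
toy_pitches = [
    {"game_type": "S", "if_alignment": None, "of_alignment": None},
    {"game_type": "S", "if_alignment": "Standard", "of_alignment": "Standard"},
    {"game_type": "R", "if_alignment": "Infield shift", "of_alignment": "Standard"},
    {"game_type": "R", "if_alignment": "Standard", "of_alignment": None},
    {"game_type": "R", "if_alignment": "Standard", "of_alignment": "Standard"},
    {"game_type": "R", "if_alignment": "Strategic", "of_alignment": "Standard"},
]

rates = missing_shift_rate(toy_pitches)
kept = usable_pitches(toy_pitches)
```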

OUTFIELD ALIGNMENTS: Within each batter-pitcher matchup, the proportion of observations in which the outfield is shifted is relatively low, regardless of the alignment of the infield. According to MLB.com, both cases we classify as a shifted outfield - Three OF on One Side of 2B and 4th Outfielder - are considered extremely rare. Strategic shifting, which we group with Standard, accounts for only 7% of all situations, with the rest falling under Standard. This, in addition to Dr. Brodie’s advice on which situations exhibit the most prominent shifting influences, led us to focus primarily on infield shifting alignment.

INFIELD-IN: In situations with a runner on third base and fewer than two outs, the infield may be shifted in. The strategic shift category contains a significant proportion of these scenarios - about 17% each of strategic-standard and strategic-strategic observations (infield-outfield). These situations often inflate offensive outcomes, as moving the infield in promotes variance in run scoring to maximize win probability at the expense of expected runs allowed. To confirm this phenomenon, we looked at the differences in four offensive statistics - wOBA, BABIP, walk rate, and strikeout rate - between infield-in and non-infield-in situations. We used confidence intervals to verify that the differences were statistically significant:

Confidence Intervals

Infield-In

2.5 % 97.5 %
(Intercept) 0.3623651 0.37457

Traditional Alignment

2.5 % 97.5 %
(Intercept) 0.3354222 0.3404923
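The `(Intercept)` rows suggest intervals from an intercept-only model, which amounts to an interval for a group mean. The sketch below shows one way to build such an interval and to check whether two intervals overlap. It swaps in a normal approximation for the t quantile (a reasonable simplification only for large samples), and the function names are hypothetical. The two reported intervals are copied from the output above.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Normal-approximation confidence interval for a mean (assumes a large sample)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Intervals reported above.
reported_infield_in = (0.3623651, 0.37457)
reported_traditional = (0.3354222, 0.3404923)

overlap = intervals_overlap(reported_infield_in, reported_traditional)
```

Because the lower bound for infield-in situations sits above the upper bound for the traditional alignment, the two intervals do not overlap, which is the pattern we read as a significant difference.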

To adjust for the influence of such situations, we decided to remove observations in which the infield could be shifted in. However, to avoid excluding situations in which there was a runner on third base with fewer than two outs but the infield was not shifted in, we only removed the observations that fit the aforementioned condition and were marked as “shift”.

Methodology - V1

Prior to modeling, we decided to create a more descriptive value capturing a player’s wOBA:

For our methodology, we focused on the trends of two offensive statistics: wOBA and BABIP. Both were provided at the pitch level, but we considered each value only at the end of an at-bat, since those instances involve the actual effect of the shift, and at the seasonal level.

Our first instinct was to fit a linear model to describe at-bat-level wOBA values. To model offensive outcomes at the at-bat level, we used several explanatory variables and interactions:

However, as we assessed the conditions for this model, we realized that at-bat-level wOBA could not be modeled effectively by a linear model because it is not a continuous variable. Although it is numeric, it is discrete, with different values corresponding to different at-bat outcomes. In the residual plot below, we observed various linear patterns in the residuals. This prompted us to consider other methods of quantifying at-bat outcomes.

Quantifying At-Bat Outcomes

First, we considered what constitutes a successful at-bat. We contemplated whether to count any method of getting on base or only batted balls. Because any method of reaching base is a success for the offensive team, we decided that getting on base would be considered a successful at-bat outcome. Conversely, we considered not getting on base to be an unsuccessful at-bat outcome. Since this reduced at-bat outcomes to a binary variable, it led us to logistic regression.

From here, we created a new binary variable to assess whether the batter reaches base or not, as indicated by the wOBA value of a particular outcome. As mentioned previously, the wOBA value takes a discrete set of non-negative values, one per outcome type. Using this, our new binary on-base variable took a value of 0 if wOBA was equal to 0 (the batter did not get on base) for a particular outcome, and a value of 1 if wOBA was not equal to 0 (the batter got on base in some way, excluding reaching base on a fielder’s choice or an error).
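The recoding described above is small enough to show directly. In this sketch the function name and the sample wOBA values are hypothetical, chosen only to illustrate the 0/1 mapping.

```python
def reached_base(woba_value):
    """Return 1 if the outcome carries a non-zero wOBA value, else 0."""
    return int(woba_value != 0)


# Hypothetical at-bat-ending wOBA values (e.g. out, walk, single, out, home run).
toy_woba_values = [0.0, 0.7, 0.9, 0.0, 2.0]

on_base = [reached_base(v) for v in toy_woba_values]
on_base_rate = sum(on_base) / len(on_base)
```

The recoding deliberately discards how the batter reached base: a walk and a home run both map to 1, which is the trade-off made in exchange for a binary response.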

Logistic - Binary At-Bat Outcomes

We used a logistic regression model as a way to predict whether an at bat would be successful based on at bat level characteristics, using the binary indicator mentioned above. If the batter successfully made it on base, we interpreted this as our positive outcome; if they got out, this was our negative outcome.

As this model was our follow-up to the aforementioned linear model, we included the same variables describing the at-bat, such as whether there was a shift, the batter’s stance, pitcher handedness, pitch location, etc. The harmonic mean weighted delta wOBA value was used as an indicator of player skill on the season.

To assess the fit of our logistic model to our data, we plotted our predicted probability of a successful at bat versus the observed probability of a successful at bat:

From this plot, we observed that our logistic regression resulted in three distinct groupings. For both low and high probabilities of getting on base, our model was underpredicting, and for moderate probabilities of getting on base, our model was overpredicting.

Initially, we completed further EDA to determine if these groupings were influenced by any levels of the categorical variables included in our model. However, none of these variables had three levels, so we quickly moved on to figuring out what else might be causing these relationships.
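A predicted-versus-observed comparison of this kind is usually built by binning the predicted probabilities and comparing each bin's average prediction with its observed success rate. The sketch below is one minimal way to do that. The equal-width binning, the function name, and the toy inputs are assumptions, not necessarily how our plot was produced.

```python
def calibration_table(predicted, observed, n_bins=10):
    """Per bin: (mean predicted probability, observed success rate, count)."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # count, sum of predictions, successes
    for p, y in zip(predicted, observed):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[i][0] += 1
        bins[i][1] += p
        bins[i][2] += y
    return [
        (sum_pred / count, successes / count, count)
        for count, sum_pred, successes in bins
        if count > 0
    ]


# Hypothetical predictions and 0/1 outcomes.
toy_predicted = [0.1, 0.1, 0.3, 0.3, 0.3, 0.3]
toy_observed = [0, 1, 0, 0, 1, 0]

table = calibration_table(toy_predicted, toy_observed, n_bins=5)
```

A bin whose observed rate exceeds its mean prediction indicates underprediction, and the reverse indicates overprediction. In a well-calibrated model the two values are close in every bin.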

Logistic Regression with the Added GAM Model

Additionally, we looked into the breakdown of this predicted versus observed relationship by batter handedness and shift occurrence.

Observing the Residual Patterns by Batter-Handedness and In-Field Alignment

With these modeling approaches, we naively treated the last pitch of each at-bat as a sufficient summary of the entire at-bat. This failed to capture important, within-at-bat information that may influence one’s success (or lack thereof) against the shift, such as the number of certain pitches seen, or where those pitches are thrown throughout the at-bat.

Thus, our initial models yielded largely inconclusive results in the context of our research question, so we moved forward with other options.

Methodology - V2

Modeling

After our V1 methodology yielded inconclusive results, we considered an alternative option moving forward.

First, to best gauge the differences between right-handed and left-handed batters against the shift, we split the data into two subsets: right-handed hitters and left-handed hitters. With this approach, as mentioned above, switch hitters would be treated as separate observations based on their handedness for a particular plate appearance. From here, we chose to use a logistic regression to perform statistical inference on each subset, as a logistic model would allow us to explore the relationship between a dependent, binary variable and one or more independent (or predictor) variables.

We also decided to control for three variables: game year, pitcher handedness, and batter. By controlling for the year, we are able to observe changes in getting on base independent of general year-to-year trends. Controlling for the batter removes the effect of the hitter (e.g., hitter tendencies, how good a player is) from our explanation of the on-base outcome. Controlling for pitcher handedness had a similar effect, limiting the influence of pitcher-batter matchups (e.g., right-handed pitcher versus left-handed hitter).

We used the same binary indicator variable as our initial model, where a success is considered getting on-base, and a failure is not getting on-base.

Rather than filtering out all information outside of the final pitch of an at-bat, we used the whole at-bat to create two summary statistics before modeling on at-bat outcomes:

  • Proportion of Pitch Type Hit Per AB: Based on the pitch type of the last pitch of the at-bat, this variable is the number of times the batter saw that pitch type divided by the total pitches seen in the at-bat.

  • Proportion of Pitch Zones Per AB: Using the strike zone regions of inside, middle, outside, and out of zone, this variable is the proportion of pitches located in each region out of the total pitches seen in the at-bat.

Using these, we were still able to focus on offensive outcomes while maintaining a sufficient summary of that at-bat.
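The two at-bat summaries can be sketched as follows, assuming each at-bat is an ordered list of pitches. The record keys (`pitch_type`, `zone`), the zone labels, the function names, and the toy at-bat are hypothetical.

```python
from collections import Counter

ZONES = ("inside", "middle", "outside", "out_of_zone")


def final_pitch_type_share(pitches):
    """Share of pitches in the at-bat that match the type of the final pitch."""
    final_type = pitches[-1]["pitch_type"]
    return sum(p["pitch_type"] == final_type for p in pitches) / len(pitches)


def zone_shares(pitches):
    """Share of pitches in the at-bat located in each strike zone region."""
    counts = Counter(p["zone"] for p in pitches)
    return {zone: counts[zone] / len(pitches) for zone in ZONES}


# Hypothetical five-pitch at-bat, in the order the pitches were thrown.
toy_at_bat = [
    {"pitch_type": "fastball", "zone": "inside"},
    {"pitch_type": "breaking", "zone": "out_of_zone"},
    {"pitch_type": "fastball", "zone": "middle"},
    {"pitch_type": "changeup", "zone": "out_of_zone"},
    {"pitch_type": "fastball", "zone": "outside"},
]

type_share = final_pitch_type_share(toy_at_bat)
shares = zone_shares(toy_at_bat)
```

Each at-bat then contributes one row to the model, with the final-pitch outcome as the response and these proportions carrying information about the pitches that preceded it.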

As for the model, we used several explanatory variables and interactions:

  • The Interaction Between Pitch Type - Fastball, Changeup, and Breaking-Ball - and Fielding Alignment

  • The Interaction Between Pitch Count (Per At-Bat) and Fielding Alignment

  • The Interaction Between Pitch Zone (of the Strikezone) and Fielding Alignment

  • The Proportion of Pitch Types (Per At-Bat)

  • The Proportion of Pitches Per Zone (Per At-Bat)

Similar to our first model, we were only able to observe patterns that allude to the differences in success against the shift, but could not draw definitive conclusions for any explanatory variable.

Drawing upon our modeling results and initial EDA, we decided to operate on the conjecture that the explanation for right-handed hitters’ success against the shift may involve non-BABIP events, namely home runs and/or walks.

This led us to perform Welch’s t-test to explore non-BABIP related outcomes.

Welch’s t-test

Welch’s t-test is generally applied when we want to assess differences in the means of two populations. Similar to our last logistic model, we have four populations of interest: right-handed batters, shifted right-handed batters, left-handed batters, and shifted left-handed batters.

We ran four t-tests with four similar null hypotheses:

  1. For right-handed batters, the average walk rate when shifted against is equal to the average walk rate when not shifted against.

  2. For right-handed batters, the average home run rate when shifted against is equal to the average home run rate when not shifted against.

  3. For left-handed batters, the average walk rate when shifted against is equal to the average walk rate when not shifted against.

  4. For left-handed batters, the average home run rate when shifted against is equal to the average home run rate when not shifted against.
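For reference, the test statistic behind each of these hypotheses can be sketched with the standard library alone. The function below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for two samples. The function name and the toy samples are hypothetical and are not our data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Per-sample variance of the mean; the two variances are not pooled.
    var_a = variance(sample_a) / len(sample_a)
    var_b = variance(sample_b) / len(sample_b)
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, df


# Hypothetical samples with unequal variances.
toy_shifted = [1, 2, 3, 4, 5]
toy_not_shifted = [2, 4, 6, 8, 10]

t_stat, df = welch_t(toy_shifted, toy_not_shifted)
```

A two-sided p-value follows from the t distribution with `df` degrees of freedom. In practice a library routine such as SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly.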

Results

We expected the explanation for right-handed hitters’ success against the shift to lie in the average difference in walk rate and/or home run rate when a batter is shifted upon.

However, we observed similar trends in home run rate and walk rate between both groups of hitters.

Discussion

Limitations

We encountered several limitations with our exploratory work:

  • We elected to exclude observations from both 2020 and 2022, as neither year included a full season’s worth of pitch data, and thus neither sufficiently represented recent trends.

  • The shifting categories are very broad and the differences between shift groupings are not distinct. Similarly, we did not have player positional information. Both of these limited our overall understanding of the specifics of each shift.

  • Due to how we handled players playing in multiple seasons and switch hitters, we violated independence assumptions. Because of this, we must be careful in assessing our conclusions.

Future Work

We have several ways to build upon our findings:

  • Look to develop visualizations to better convey our results

  • Determine ways to assess player-specific attributes and their association with offensive output

  • Address the confounding influence of season-level player tendencies

    • Such aspects would be interesting to explore on their own, but they could also be controlled for to better understand at-bat level relationships relative to offensive output

Important Note: Extreme shifts have been banned in MLB for the upcoming 2023 season.

  • Moving forward, we can assess how this rule change influences trends in offensive production, as well as how teams adjust their shifting strategies.

Acknowledgements

We would like to thank Dr. Adam Brodie for pitching this project idea to us and for taking the time to meet with us and offer guidance along the way. Also, thank you to Dr. Ron Yurko for directing this program and for all the help and suggestions along the way.

References

Baseball Savant

Bill Petti’s baseballr Package

Glossary of Shifting Terms - MLB