This project will begin on Monday June 13th, and conclude with a 10-15 minute presentation on Friday, June 24th (either during the morning session from 10:30 to 12 PM or in the afternoon from 1:30 to 3 PM). The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.
Your team is expected to produce R Markdown
slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:
Explanation of the data structure of the dataset,
Three hypotheses you are interested in exploring,
Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
One clustering example,
Conclusions reached for the hypotheses based on your EDA and data visualizations.
There will be two submission deadlines:
Friday, June 17th @ 5:00 PM EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.
Thursday, June 23rd @ 11:59 PM EST - Slides and full code must be completed and ready for presentation. Send your slides to Prof Yurko’s email (ryurko@andrew.cmu.edu). All code, visualizations, and presentations must be made in R
. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!
Your team is assigned the MLB batted balls data. This dataset contains all batted balls from the current 2022 MLB season, through June 10th, courtesy of baseballsavant.com and accessed using the baseballr
package. The code chunk at the end shows how this dataset was constructed in R
.
Each row of the dataset corresponds to a batted ball and has the following columns:
player_name
: Name of the batter in Last, First formatbatter
: Unique identifier for the batterstand
: handedness of hitter, either L
(left) or R
(right), note that switch hitters are in this table but they’ll switch depending on the pitcher,events
: categorical event denoting the outcome of the batted ball,hc_x
: horizontal coordinate of where a batted ball is first touched by a fielder,hc_y
: vertical coordinate of where a batted ball is first touched by a fielder (note you should take the multiply this by -1 when plotting),hit_distance_sc
: distance of the batted ball in feet according to Statcast,launch_speed
: exit velocity of the ball off the bat (mph),launch_angle
: the vertical angle of the ball off the bat measured from a line parallel to the ground,hit_location
: positional number of the player who fielded the ball, possible values are 1-9 for each player (see here),bb_type
: batted ball type,barrel
: indicator if the batted ball was a “barrel”,pitch_type
: type of pitch thrown according to MLB’s algorithm,release_speed
: speed of the pitch measured when the ball is released (mph),effective_speed
: perceived velocity of the ball, i.e., the velocity of the pitch is adjusted for how close it is to home when it is released (mph)if_fielding_alignment
: type of infield shift by defenseof_fielding_alignment
: type of outfield shift by defensegame_date
: date of the game (mm/dd/yyyy)balls
: number of balls in the count,strikes
: number of strikes in the count,outs_when_up
: number of outs when the batter is up,on_1b
: unique identifier for a runner on first base (if there is one),on_2b
: unique identifier for a runner on second base (if there is one),on_3b
: unique identifier for a runner on third base (if there is one),inning
: the inning number,inning_topbot
: top or bottom of the inning,home_score
: home team score before batted ball,away_score
: away team score before batted ball,post_home_score
: home team score after batted ball,post_away_score
: away team score after batted ball,des
: description of the batted ball and play.Note that a full glossary of the features available from MLB’s Statcast data can be found here.
library(baseballr)
library(tidyverse)
# Scrape all data for this season:
batted_balls_2022 <-
scrape_statcast_savant_batter_all(start_date = "2022-01-01",
end_date = "2022-06-10") %>%
dplyr::filter(type == "X")
batted_balls_2022 <- batted_balls_2022 %>%
# Only select columns regarding the batted ball with discrete pitch type
# information (except for the speed) for now:
dplyr::select(# Batter info:
player_name, batter, stand,
# Batted ball info:
events, hc_x, hc_y, hit_distance_sc, launch_speed, launch_angle,
hit_location, bb_type, barrel,
# Pitch info:
pitch_type, release_speed, effective_speed,
# Shift info:
if_fielding_alignment, of_fielding_alignment,
# Game level context:
game_date, balls, strikes, outs_when_up, on_1b, on_2b, on_3b,
inning, inning_topbot, home_score, away_score, post_home_score,
post_away_score,
# Description of play:
des)
# Save this file:
write_csv(batted_balls_2022,
"data/sports/eda_projects/mlb_batted_balls_2022.csv")