This project will begin on Monday June 13th, and conclude with a 10-15 minute presentation on Friday, June 24th (either during the morning session from 10:30 to 12 PM or in the afternoon from 1:30 to 3 PM). The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.
Your team is expected to produce R Markdown slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:
Explanation of the data structure of the dataset,
Three hypotheses you are interested in exploring,
Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
One clustering example,
Conclusions reached for the hypotheses based on your EDA and data visualizations.
There will be two submission deadlines:
Friday, June 17th @ 5:00 PM EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.
Thursday, June 23rd @ 11:59 PM EST - Slides and full code must be completed and ready for presentation. Send your slides to Prof Yurko’s email (ryurko@andrew.cmu.edu).
All code, visualizations, and presentations must be made in R. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!
Your team is assigned the NFL passing plays data. This dataset contains all passing plays from the 2021 NFL regular season accessed using the nflfastR package (accessed using nflreadr). The code chunk at the end shows how this dataset was constructed in R.
Each row of the dataset corresponds to a single passing play (including sacks) and has the following columns:
passer_player_name: name for the player that attempted the pass,
passer_player_id: unique identifier for the player that attempted the pass,
posteam: abbreviation for the team with possession,
complete_pass: indicator denoting whether or not the pass was completed,
interception: indicator denoting whether or not the pass was intercepted by the defense,
yards_gained: yards gained (or lost) by the possessing team, excluding yards gained via fumble recoveries and laterals,
touchdown: indicator denoting if the play resulted in a touchdown,
pass_location: categorical location of pass,
pass_length: categorical length of pass,
air_yards: distance in yards perpendicular to the line of scrimmage at where the targeted receiver either caught or didn’t catch the ball,
yards_after_catch: distance in yards perpendicular to the yard line where the receiver made the reception to where the play ended,
epa: expected points added (EPA) by the posteam for the given play,
wpa: win probability added (WPA) for the posteam,
shotgun: indicator for whether or not the play was in shotgun formation,
no_huddle: indicator for whether or not the play was in no_huddle formation,
qb_dropback: indicator for whether or not the QB dropped back on the play (pass attempt, sack, or scrambled),
qb_hit: indicator if the QB was hit on the play,
sack: indicator for if the play ended in a sack,
receiver_player_name: name for the targeted receiver,
receiver_player_id: unique identifier for the receiver that was targeted on the pass,
defteam: abbreviation for the team on defense,
posteam_type: indicating whether the posteam team is home or away,
play_id: unique identifier for a single play,
yardline_100: distance in the number of yards from the opponent’s endzone for the posteam,
side_of_field: abbreviation for which team’s side of the field the team with possession is currently on,
down: down for the given play,
qtr: quarter of the game (5 is overtime),
play_clock: time on the playclock when the ball was snapped,
half_seconds_remaining: seconds remaining in the half,
game_half: indicating which half the play is in,
game_id: ten digit identifier for NFL game,
home_team: abbreviation for the home team,
away_team: abbreviation for the away team,
home_score: total points scored by the home team,
away_score: total points scored by the away team,
desc: detailed description for the given play.
Note that a full glossary of the features available for NFL play-by-play data can be found here.
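As a rough starting point for the visualization requirement, the sketch below summarizes and plots completion by pass_location and pass_length, which makes it a multivariate categorical display. It assumes the dplyr and ggplot2 packages are installed. The toy_plays rows are made up purely for illustration and are not real 2021 plays; only the column names come from the description above. On the real data you may also need to decide how to handle missing values (for example on sacks), which this sketch does not cover.

```r
library(dplyr)
library(ggplot2)

# Toy rows for illustration only (NOT real plays); column names follow the
# dataset description above.
toy_plays <- data.frame(
  pass_location = c("left", "left", "middle", "middle", "right", "right"),
  pass_length   = c("short", "deep", "short", "short", "deep", "short"),
  complete_pass = c(1, 0, 1, 1, 0, 1)
)

# Completion rate for each location / length combination:
completion_by_group <- toy_plays %>%
  group_by(pass_location, pass_length) %>%
  summarize(n_passes = n(),
            completion_rate = mean(complete_pass),
            .groups = "drop")

# Stacked proportion bars, faceted by pass length:
completion_plot <- toy_plays %>%
  ggplot(aes(x = pass_location, fill = factor(complete_pass))) +
  geom_bar(position = "fill") +
  facet_wrap(~ pass_length) +
  labs(x = "Pass location", y = "Proportion of passes",
       fill = "Completed?")

print(completion_by_group)
```

Printing completion_plot draws the figure. Replacing toy_plays with the real dataset should work as long as the three columns are present, though the choice of chart type is only one of many reasonable options.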
# Load all passing plays from the 2021 regular season:
library(tidyverse) # provides %>%, filter(), select(), and write_csv()
library(nflreadr)
nfl_2021_data <- nflreadr::load_pbp(2021, file_type = "rds")
nfl_passing_plays <- nfl_2021_data %>%
  # Keep regular season passing plays with a valid EPA and possession team:
  filter(play_type == "pass", season_type == "REG",
         !is.na(epa), !is.na(posteam), posteam != "") %>%
  select(# Player info attempting the pass:
         passer_player_name, passer_player_id, posteam,
         # Info about the pass:
         complete_pass, interception, yards_gained, touchdown,
         pass_location, pass_length, air_yards, yards_after_catch, epa, wpa,
         shotgun, no_huddle, qb_dropback, qb_hit, sack,
         # Context about the receiver:
         receiver_player_name, receiver_player_id,
         # Team context (posteam is already selected above):
         defteam, posteam_type,
         # Play and game context:
         play_id, yardline_100, side_of_field, down, qtr, play_clock,
         half_seconds_remaining, game_half, game_id,
         home_team, away_team, home_score, away_score,
         # Description of play
         desc)
# Save this file:
write_csv(nfl_passing_plays,
          "data/sports/eda_projects/nfl_passing_plays_2021.csv")
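For the clustering requirement, one possible approach (a sketch, not a prescribed method) is to aggregate plays to the passer level and run k-means on standardized summaries. The example below assumes dplyr is installed and uses simulated plays so that it runs on its own. The toy_plays values, the choice of avg_air_yards and avg_epa as features, and the use of two clusters are all illustrative assumptions rather than recommendations.

```r
library(dplyr)

set.seed(2021) # k-means uses random starting centers

# Simulated plays for illustration only (NOT real data): six hypothetical
# passers with 20 plays each.
toy_plays <- data.frame(
  passer_player_id = rep(paste0("QB", 1:6), each = 20),
  air_yards = c(rnorm(60, mean = 5, sd = 3), rnorm(60, mean = 12, sd = 3)),
  epa = rnorm(120, mean = 0.1, sd = 1)
)

# One row per passer; na.rm = TRUE guards against missing values:
passer_summary <- toy_plays %>%
  group_by(passer_player_id) %>%
  summarize(n_plays = n(),
            avg_air_yards = mean(air_yards, na.rm = TRUE),
            avg_epa = mean(epa, na.rm = TRUE),
            .groups = "drop")

# Standardize the features so neither dominates the distance calculation:
cluster_input <- scale(passer_summary[, c("avg_air_yards", "avg_epa")])

# k-means with 2 clusters and several random restarts:
kmeans_fit <- kmeans(cluster_input, centers = 2, nstart = 25)
passer_summary$cluster <- factor(kmeans_fit$cluster)

print(passer_summary)
```

With the real data it is probably worth filtering to passers with a minimum number of plays before clustering, since averages based on a handful of attempts are noisy. The threshold is a judgment call.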