This project will begin on Monday, June 13th, and conclude with a 10-15 minute presentation on Friday, June 24th (either during the morning session from 10:30 AM to 12:00 PM or in the afternoon session from 1:30 to 3:00 PM). The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.
Your team is expected to produce R Markdown slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:
Explanation of the data structure of the dataset,
Three hypotheses you are interested in exploring,
Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
One clustering example,
Conclusions reached for the hypotheses based on your EDA and data visualizations.
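As a rough illustration of what a clustering example might look like, the sketch below runs k-means on simulated shot locations using base R only. Everything in it is an assumption made for illustration: the values are randomly generated rather than taken from the NHL data, the data frame name toy_shots is hypothetical, and the choice of two clusters is arbitrary. Your own example should use the real dataset and justify the variables and number of clusters you choose.

```r
# Simulated shot locations for illustration only (NOT real NHL data).
# The column names mirror two columns of the project dataset.
set.seed(2022)
toy_shots <- data.frame(
  arenaAdjustedXCord = c(rnorm(50, mean = 80, sd = 3),
                         rnorm(50, mean = 45, sd = 3)),
  arenaAdjustedYCord = c(rnorm(50, mean = 0, sd = 5),
                         rnorm(50, mean = 0, sd = 15))
)

# Standardize so both coordinates contribute on a comparable scale
scaled_coords <- scale(toy_shots)

# Run k-means; k = 2 is an arbitrary choice for this sketch, and
# nstart = 25 repeats the algorithm from several random starts
toy_kmeans <- kmeans(scaled_coords, centers = 2, nstart = 25)

# Attach the cluster labels and count the shots in each cluster
toy_shots$cluster <- factor(toy_kmeans$cluster)
table(toy_shots$cluster)
```

The cluster labels can then be mapped to color in a scatterplot of the two coordinates to see whether the groups correspond to recognizable areas of the ice.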
There will be two submission deadlines:
Friday, June 17th @ 5:00 PM EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.
Thursday, June 23rd @ 11:59 PM EST - Slides and full code must be completed and ready for presentation. Send your slides to Prof Yurko’s email (ryurko@andrew.cmu.edu).
All code, visualizations, and presentations must be made in R. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!
Your team is assigned the NHL shot data. This dataset contains all shot attempts from the 2022 NHL playoffs (through June 10th), courtesy of MoneyPuck.com. The code chunk at the end shows how this dataset was processed in R (note that the dataset was originally downloaded from the MoneyPuck site).
Each row of the dataset corresponds to a shot attempt and has the following columns:
shooterPlayerId: The NHL player id of the skater taking the shot
shooterName: The first and last name of the player taking the shot
team: The team taking the shot
shooterLeftRight: Whether the shooter is a left or right shot
shooterTimeOnIce: Playing time in seconds that has passed since the shooter started their shift
shooterTimeOnIceSinceFaceoff: Minimum of the playing time in seconds since the last faceoff and the playing time that has passed since the shooter started their shift
event: Whether the shot was a shot on goal (SHOT), a goal (GOAL), or missed the net (MISS)
location: The zone the shot took place in
shotType: Type of the shot
shotAngle: The angle of the shot in degrees; positive if the shot is from the left side of the ice
shotAnglePlusRebound: The difference in angle between the previous shot and this shot if this shot is a rebound; otherwise set to 0
shotDistance: The distance of the shot from the net in feet, where the net is defined as being at the (89,0) coordinates
shotOnEmptyNet: Indicator if the shot was on an empty net
shotRebound: Indicator if the shot is a rebound, i.e., if the last event was a shot and within 3 seconds of this shot
shotRush: Indicator if the shot was on a rush, i.e., if the last event was in another zone and within 4 seconds
shotWasOnGoal: Indicator if the shot was on net, i.e., either a goal or a goalie save
shotGeneratedRebound: Indicator if the shot generated a rebound shot within 3 seconds of this shot
shotGoalieFroze: Indicator if the goalie froze the puck within 1 second of the shot
arenaAdjustedShotDistance: Shot distance adjusted for arena recording bias; uses the same methodology as War On Ice proposed by Schuckers and Curro
arenaAdjustedXCord: x coordinate of the arena adjusted shot location; always a positive number
arenaAdjustedYCord: y coordinate of the arena adjusted shot location
goalieIdForShot: The NHL player id for the goalie the shot is on
goalieNameForShot: The first and last name of the goalie the shot is on
teamCode: The team code of the shooting team
isHomeTeam: Indicator if the shooting team is the home team
homeSkatersOnIce: The number of skaters on the ice for the home team (does not count the goalie)
awaySkatersOnIce: The number of skaters on the ice for the away team (does not count the goalie)
game_id: The NHL Game_id of the game the shot took place in
homeTeamCode: Home team in the game
awayTeamCode: Away team in the game
homeTeamGoals: Home team goals before the shot took place
awayTeamGoals: Away team goals before the shot took place
time: Seconds into the game of the shot
period: Period of the game
Note that a full glossary of the features available for NHL shot data can be found here.
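The shotDistance definition can be checked by hand. The following is a minimal sketch in base R, assuming shot coordinates are available as plain numeric vectors measured in feet. The function name shot_distance_from_net and the toy coordinates are hypothetical and are not part of this dataset; the only value taken from the glossary is the net location at (89, 0). Recorded shotDistance values may not match this calculation exactly, depending on how MoneyPuck orients or adjusts the coordinates.

```r
# Hypothetical helper: straight-line distance (in feet) from a shot location
# to the net, which the glossary places at the (89, 0) coordinates.
shot_distance_from_net <- function(x, y, net_x = 89, net_y = 0) {
  sqrt((x - net_x)^2 + (y - net_y)^2)
}

# Toy coordinates for illustration only (not taken from the dataset)
toy_x <- c(86, 89, 59)
toy_y <- c(4, 10, 0)

# Distances for the three toy shots: 5, 10, and 30 feet
shot_distance_from_net(toy_x, toy_y)
```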
# Load the tidyverse for read_csv(), write_csv(), the pipe, and dplyr verbs
library(tidyverse)

# Accessed NHL shots from 2021-2022 season from MoneyPuck.com, but will
# simplify the dataset to be easier to work with. Load the original data, and
# then just filter to the playoff game shots (as of June 11th)
playoff_shot_data <- read_csv("data/sports/xy_examples/nhl_shots_2021.csv") %>%
  dplyr::filter(isPlayoffGame == 1)

# Now only select columns to work with for this task:
playoff_shot_data <- playoff_shot_data %>%
  dplyr::select(# Player info attempting the shot
                shooterPlayerId, shooterName, team, shooterLeftRight,
                shooterTimeOnIce, shooterTimeOnIceSinceFaceoff,
                # Info about the shot:
                event, location, shotType, shotAngle, shotAnglePlusRebound,
                shotDistance, shotOnEmptyNet, shotRebound, shotRush,
                shotWasOnGoal, shotGeneratedRebound, shotGoalieFroze,
                # Adjusted for arena locations
                arenaAdjustedShotDistance, arenaAdjustedXCord, arenaAdjustedYCord,
                # Goalie info:
                goalieIdForShot, goalieNameForShot,
                # Team context
                teamCode, isHomeTeam, homeSkatersOnIce, awaySkatersOnIce,
                # Game context
                game_id, homeTeamCode, awayTeamCode, homeTeamGoals, awayTeamGoals,
                time, period)

# Save this file:
write_csv(playoff_shot_data,
          "data/sports/eda_projects/nhl_playoffs_shots_2022.csv")