Overview

This project will begin on Monday, June 7th, and conclude with a 10-15 minute presentation on Thursday, June 17th, during lab from 2:30 to 4:00 PM EDT. The goal of this project is to practice understanding the structure of a dataset and generating and evaluating hypotheses using fundamental EDA and data visualization techniques.

Deliverables

Your team is expected to produce R Markdown slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:

Timeline

There will be two submission deadlines:

Friday, June 11th @ 4:00pm EDT - Each student will push their individual code for the project so far to their GitHub account for review. We will then provide feedback on the submitted code.

Thursday, June 17th @ 2:00pm EDT - Slides and full code must be completed and ready for presentation. Send your slides to Ron's email (ryurko@andrew.cmu.edu). All code, visualizations, and presentations must be made in R. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!

Data

Your team is assigned the NHL shot data. This dataset contains all shot attempts from the 2021 NHL playoffs (through June 5th), courtesy of MoneyPuck.com. The code chunk at the end shows how this dataset was processed in R (the dataset was originally downloaded from the MoneyPuck site).

Each row of the dataset corresponds to a shot attempt and has the following columns:

Note that a full glossary of the features available for NHL shot data can be found here.

Code to build dataset

# Load the tidyverse for read_csv(), the pipe (%>%), and the dplyr verbs
library(tidyverse)

# Accessed NHL shots from the 2020-2021 season from MoneyPuck.com, but will
# simplify the dataset to be easier to work with. Load the original data, and
# then keep only the playoff game shots (as of June 5th)
playoff_shot_data <- read_csv("http://www.stat.cmu.edu/cmsac/sure/2021/materials/data/xy_examples/moneypuck_shots_2020.csv") %>%
  dplyr::filter(isPlayoffGame == 1)

# Now only select columns to work with for this task:
playoff_shot_data <- playoff_shot_data %>%
  dplyr::select(# Player info attempting the shot
                shooterPlayerId, shooterName, team, shooterLeftRight, 
                shooterTimeOnIce, shooterTimeOnIceSinceFaceoff,
                # Info about the shot:
                event, location, shotType, shotAngle, shotAnglePlusRebound, 
                shotDistance, shotOnEmptyNet, shotRebound, shotRush, 
                shotWasOnGoal, shotGeneratedRebound, shotGoalieFroze,
                # Adjusted for arena locations
                arenaAdjustedShotDistance, 
                arenaAdjustedXCord, arenaAdjustedYCord,
                # Goalie info:
                goalieIdForShot, goalieNameForShot,
                # Team context
                teamCode, isHomeTeam, homeSkatersOnIce, awaySkatersOnIce,
                # Game context
                game_id, homeTeamCode, awayTeamCode, homeTeamGoals,
                awayTeamGoals, time, period)
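
As a possible first step once the dataset is built, the sketch below checks the structure of the data and computes a simple grouped summary that could feed a hypothesis about shot types. It is only an illustration under stated assumptions: `toy_shots` is a made-up stand-in so the example runs without downloading anything, the player names and values are invented, and the `event` and `shotType` codes (such as "GOAL" and "WRIST") are assumed rather than taken from the glossary. Only the column names come from the `select()` call above.

```r
library(dplyr)

# Toy stand-in for playoff_shot_data: the column names come from the select()
# call above, but every value here is made up for illustration. The event and
# shotType codes are assumptions; check the glossary for the real ones.
toy_shots <- tibble::tribble(
  ~shooterName, ~event, ~shotType, ~shotDistance,
  "Player A",   "GOAL", "WRIST",   12.0,
  "Player B",   "SHOT", "WRIST",   35.5,
  "Player C",   "MISS", "SLAP",    55.0,
  "Player A",   "SHOT", "SLAP",    48.0,
  "Player D",   "GOAL", "TIP",      8.5,
  "Player B",   "MISS", "WRIST",   27.0
)

# Structure check: column types and a preview of the values
glimpse(toy_shots)

# One summary row per shot type: attempts, goals, goal rate, average distance
shot_type_summary <- toy_shots %>%
  group_by(shotType) %>%
  summarize(n_attempts = n(),
            n_goals = sum(event == "GOAL"),
            goal_rate = n_goals / n_attempts,
            avg_distance = mean(shotDistance),
            .groups = "drop") %>%
  arrange(desc(n_attempts))

shot_type_summary
```

To try the same summary on the real data, replace `toy_shots` with `playoff_shot_data`, after confirming how goals are actually coded in the `event` column.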