This project will begin on Monday, June 13th, and conclude with a 10-15 minute presentation on Friday, June 24th (either during the morning session from 10:30 AM to 12:00 PM or in the afternoon from 1:30 to 3:00 PM). The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.
Your team is expected to produce R Markdown slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:
Explanation of the data structure of the dataset,
Three hypotheses you are interested in exploring,
Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
One clustering example,
Conclusions reached for the hypotheses based on your EDA and data visualizations.
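One way the clustering and multivariate visualization requirements above might be approached is sketched here. This is only an illustration under stated assumptions: the data are simulated (not real shots), only the column names mirror the shot dataset, the choice of four clusters is arbitrary, and the names toy_shots, shot_locations, shot_kmeans, and cluster_plot are hypothetical.

```r
library(ggplot2)

# Simulated stand-in data (NOT real WNBA shots); only the column names
# match the shot dataset used in this project.
set.seed(2022)
toy_shots <- data.frame(
  coordinate_x = runif(300, min = 0, max = 50),
  coordinate_y = runif(300, min = 0, max = 40),
  scoring_play = sample(c(TRUE, FALSE), 300, replace = TRUE)
)

# Standardize so both coordinates contribute equally to the distances
shot_locations <- scale(toy_shots[, c("coordinate_x", "coordinate_y")])

# ASSUMPTION: k = 4 is an arbitrary illustrative choice, not a recommendation
shot_kmeans <- kmeans(shot_locations, centers = 4, nstart = 25)
toy_shots$cluster <- factor(shot_kmeans$cluster)

# Multivariate view: location (x, y), cluster (color), outcome (shape)
cluster_plot <- ggplot(toy_shots,
                       aes(x = coordinate_x, y = coordinate_y,
                           color = cluster, shape = scoring_play)) +
  geom_point(alpha = 0.7) +
  coord_fixed() +
  labs(x = "Horizontal location (feet)", y = "Vertical location (feet)",
       color = "Cluster", shape = "Made shot") +
  theme_bw()
```

Printing cluster_plot displays the figure. With real data, the number of clusters and the variables used for clustering are choices your team would need to justify.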
There will be two submission deadlines:
Friday, June 17th @ 5:00 PM EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.
Thursday, June 23rd @ 11:59 PM EST - Slides and full code must be completed and ready for presentation. Send your slides to Prof. Yurko’s email (ryurko@andrew.cmu.edu).
All code, visualizations, and presentations must be made in R. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!
Your team is assigned the WNBA shot data. This dataset contains all shot attempts in the 2022 WNBA season (through June 10th) accessed using the wehoop package. The code chunk at the end shows how this dataset was constructed in R.
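A first look at the data structure might follow the minimal sketch below. It writes a tiny made-up stand-in file so that it runs on its own; the toy values (including the shot type labels) and the names toy_path, wnba_shots, and shot_type_summary are illustrative assumptions. In practice, read_csv() would point at your copy of the shot data.

```r
library(readr)
library(dplyr)

# Stand-in file with made-up rows so the sketch is self-contained
toy_path <- tempfile(fileext = ".csv")
write_csv(
  tibble(
    type_text = c("Jump Shot", "Jump Shot", "Layup Shot", "Layup Shot"),
    scoring_play = c(TRUE, FALSE, TRUE, TRUE),
    score_value = c(3, 2, 2, 2)
  ),
  toy_path
)

wnba_shots <- read_csv(toy_path, show_col_types = FALSE)

# Check the dimensions and column types before forming hypotheses
glimpse(wnba_shots)

# Attempts and proportion of made shots for each shot type
shot_type_summary <- wnba_shots %>%
  group_by(type_text) %>%
  summarize(attempts = n(),
            make_rate = mean(scoring_play),
            .groups = "drop")
```

Grouped summaries like shot_type_summary are one way to turn a hypothesis about shot types into numbers that a visualization can then display.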
Each row of the dataset corresponds to a single shot attempt and has the following columns:
sequence_number: number indicating order the shot attempt took place in the game
period_display_value: Name display of period
period_number: Period number
home_score: Home team score following shot attempt
coordinate_x: Horizontal location in feet of shot attempt where the hoop would be located at 25 feet
coordinate_y: Vertical location in feet of shot attempt with respect to the target hoop (the hoop should be a little in front of 0 but ESPN’s coordinate system is not exact)
scoring_play: Indicator if shot was made or not
clock_display_value: Time displayed on clock moment of shot
team_id: Unique identifier for team the player is on
type_id: Identifier for the shot type
type_text: Text description of the shot type
away_score: Away team score following the shot attempt
text: Detailed text description of shot attempt
score_value: Value of the shot attempt
participants_0_athlete_id: Unique identifier for the first person listed in the text description of the shot attempt (typically the person attempting the shot unless it is blocked)
participants_1_athlete_id: Unique identifier for the potential second person listed in the text description of the shot attempt (typically the person who assisted the shot attempt if scoring_play == TRUE)
game_id: Unique identifier for a game
away_team_id: Unique identifier for away team
away_team_name: Away team name
away_team_mascot: Away team mascot
away_team_abbrev: Abbreviation for away team
away_team_name_alt: Alternate name for away team
home_team_id: Unique identifier for home team
home_team_name: Home team name
home_team_mascot: Home team mascot
home_team_abbrev: Abbreviation for home team
home_team_name_alt: Alternate name for home team
clock_minutes: Minutes remaining in the period displayed on the clock
clock_seconds: Seconds remaining in the period displayed on the clock
half: Game half
lag_half: Previous play’s game half (can ignore)
lead_half: Following play’s game half (can ignore)
game_play_number: number indicating order the shot attempt took place in the game (similar to sequence number it appears…)
Note that a full glossary of the features available for the WNBA shot data can be found here.
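The coordinate and clock columns can be combined into derived features that may be easier to reason about. The sketch below is one possibility, using made-up rows. It assumes the hoop sits at exactly (25, 0), which is only an approximation given the note on coordinate_y, and the names toy_shots, hoop_x, hoop_y, shot_distance, and period_seconds_left are hypothetical.

```r
# Toy rows using the column names from the glossary (values are made up)
toy_shots <- data.frame(
  coordinate_x = c(25, 10, 40),
  coordinate_y = c(5, 20, 0),
  clock_minutes = c(9, 0, 4),
  clock_seconds = c(30, 12, 0)
)

# ASSUMPTION: treat the hoop as sitting at (25, 0); the glossary notes the
# hoop is really a little in front of 0 and the coordinates are not exact.
hoop_x <- 25
hoop_y <- 0

# Straight-line distance in feet from the shot location to the assumed hoop
toy_shots$shot_distance <- sqrt((toy_shots$coordinate_x - hoop_x)^2 +
                                  (toy_shots$coordinate_y - hoop_y)^2)

# Total seconds remaining in the period at the moment of the shot
toy_shots$period_seconds_left <- 60 * toy_shots$clock_minutes +
  toy_shots$clock_seconds
```

A continuous variable such as shot_distance could feed a histogram or density plot, while period_seconds_left might support hypotheses about late-period shot selection.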
library(wehoop)
library(tidyverse)  # provides %>%, filter(), select(), and write_csv()

# Load the play-by-play data for the 2022 season:
wnba_pbp_data <- load_wnba_pbp(2022)

# Get the shots and clean this data a bit:
wnba_shots_data <- wnba_pbp_data %>%
  filter(shooting_play)

# Remove unnecessary columns:
wnba_shots_data <- wnba_shots_data %>%
  dplyr::select(-shooting_play, -id, -participants_2_athlete_id,
                -type_abbreviation, -season, -season_type,
                -home_team_spread, -game_spread, -home_favorite)

# Save this file:
write_csv(wnba_shots_data,
          "data/sports/eda_projects/wnba_shots_2022.csv")