This project will begin on Monday, June 13th, and conclude with a 10-15 minute presentation on Friday, June 24th (either during the morning session from 10:30 AM to 12 PM or in the afternoon from 1:30 to 3 PM). The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.
Your team is expected to produce R Markdown slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:
Explanation of the data structure of the dataset,
Three hypotheses you are interested in exploring,
Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
One clustering example,
Conclusions reached for the hypotheses based on your EDA and data visualizations.
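For the clustering requirement, the general shape of the workflow can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated (not real WTA matches), the column names `w_ace` and `minutes` are borrowed from the dataset purely for readability, and k-means with two clusters is an arbitrary choice rather than a recommended method for this project.

```r
set.seed(2022)  # kmeans() picks random starting centers, so fix the seed

# Simulated per-match values (NOT real WTA data); the column names are
# borrowed from the dataset for illustration only.
toy <- data.frame(
  w_ace   = c(rpois(20, 2), rpois(20, 9)),
  minutes = c(rnorm(20, 70, 8), rnorm(20, 130, 12))
)

# Standardize so both variables contribute on the same scale.
toy_std <- scale(toy)

# k = 2 is an arbitrary choice here; nstart reruns the algorithm from
# several random starts and keeps the best solution.
km <- kmeans(toy_std, centers = 2, nstart = 25)

# Attach the cluster labels so they can be used in a plot or summary.
toy$cluster <- factor(km$cluster)
table(toy$cluster)
```

Standardizing first matters because k-means relies on Euclidean distance, so a variable measured in minutes would otherwise dominate one measured in aces.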
There will be two submission deadlines:
Friday, June 17th @ 5:00 PM EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.
Thursday, June 23rd @ 11:59 PM EST - Slides and full code must be completed and ready for presentation. Send your slides to Prof Yurko’s email (ryurko@andrew.cmu.edu).
All code, visualizations, and presentations must be made in R. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!
Your team is assigned the WTA match data. This dataset contains all WTA matches between 2018 and 2022 (through June 10th), courtesy of Jeff Sackmann’s famous tennis repository. The code chunk at the end shows how this dataset was processed in R.
Each row of the dataset corresponds to a single WTA match and has the following columns:
tourney_id: a unique identifier for each tournament, such as 2020-888. The exact formats are borrowed from several different sources, so while the first four characters are always the year, the rest of the ID doesn’t follow a predictable structure
tourney_name: name of the tournament
surface: type of court surface
draw_size: number of players in the draw, often rounded up to the nearest power of 2. (For instance, a tournament with 28 players may be shown as 32.)
tourney_level: see link below
tourney_date: eight digits, YYYYMMDD, usually the Monday of the tournament week
match_num: a match-specific identifier. Often starting from 1, sometimes counting down from 300, and sometimes arbitrary
winner_id: the player identifier for the winner of the match
winner_seed: seed of winning player
winner_entry: ‘WC’ = wild card, ‘Q’ = qualifier, ‘LL’ = lucky loser, ‘PR’ = protected ranking, ‘ITF’ = ITF entry, and there are a few others that are occasionally used
winner_name: name of the winning player
winner_hand: R = right, L = left, U = unknown. For ambidextrous players, this is their serving hand
winner_ht: height in centimeters, where available
winner_ioc: three-character country code
winner_age: age, in years, as of the tourney_date
loser_id, loser_seed, loser_entry, loser_name, loser_hand, loser_ht, loser_ioc, loser_age: see the corresponding winner columns above, but now for the losing player
score: final match score
best_of: ‘3’ or ‘5’, indicating the number of sets for this match
round: tournament round
minutes: match length in minutes
w_ace: winner’s number of aces
w_df: winner’s number of double faults
w_svpt: winner’s number of serve points
w_1stIn: winner’s number of first serves made
w_1stWon: winner’s number of first-serve points won
w_2ndWon: winner’s number of second-serve points won
w_SvGms: winner’s number of serve games
w_bpSaved: winner’s number of break points saved
w_bpFaced: winner’s number of break points faced
l_ace, l_df, l_svpt, l_1stIn, l_1stWon, l_2ndWon, l_SvGms, l_bpSaved, l_bpFaced: see the corresponding winner columns above, but now for the losing player
winner_rank: winner’s WTA rank, as of the tourney_date, or the most recent ranking date before the tourney_date
winner_rank_points: number of ranking points, where available
loser_rank, loser_rank_points: see winner_rank and winner_rank_points above, but now for the losing player
Note that a full glossary of the features available for match data can be found here.
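Several of the columns described above are raw counts or encoded values, so they often need a small transformation before plotting. The sketch below shows one possible way to do that in base R. The rows are made-up values for illustration only, and the derived column names `w_1st_in_pct` and `w_bp_save_pct` are hypothetical, not part of the dataset.

```r
# Toy rows with made-up values (NOT real matches), using column names
# from the list above.
matches <- data.frame(
  tourney_date = c(20200113, 20210628),
  w_svpt       = c(60, 80),
  w_1stIn      = c(39, 48),
  w_bpSaved    = c(3, 0),
  w_bpFaced    = c(4, 0)
)

# tourney_date is stored as eight digits (YYYYMMDD); convert it to a Date.
matches$tourney_date <- as.Date(as.character(matches$tourney_date),
                                format = "%Y%m%d")

# Winner's first-serve percentage: first serves made / serve points.
matches$w_1st_in_pct <- matches$w_1stIn / matches$w_svpt

# Share of break points saved; NA when no break points were faced,
# to avoid dividing by zero.
matches$w_bp_save_pct <- ifelse(matches$w_bpFaced > 0,
                                matches$w_bpSaved / matches$w_bpFaced,
                                NA_real_)

matches
```

Rates like these are usually easier to compare across matches than raw counts, since longer matches naturally have more serve points.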
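# Load the tidyverse for map_dfr(), read_csv(), mutate(), write_csv(), and %>%
library(tidyverse)

# Read each season's match file and stack the rows into one data frame
wta_2018_2022_matches <-
  map_dfr(2018:2022,
          function(year) {
            read_csv(paste0("https://raw.githubusercontent.com/JeffSackmann/tennis_wta/master/wta_matches_",
                            year, ".csv")) %>%
              # Cast seeds to character so the column has one type in every year's data
              mutate(winner_seed = as.character(winner_seed),
                     loser_seed = as.character(loser_seed))
          })

# Save this file:
write_csv(wta_2018_2022_matches,
          "data/sports/eda_projects/wta_matches_2018_2022.csv")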