EDA Project: WTA data

Overview

This project will begin on Monday, June 13th, and conclude with a 10-15 minute presentation on Friday, June 24th (either during the morning session from 10:30 AM to 12 PM or in the afternoon from 1:30 to 3 PM). The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.

Deliverables

Your team is expected to produce R Markdown slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:

Explanation of the data structure of the dataset,
Three hypotheses you are interested in exploring,
Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
One clustering example,
Conclusions reached for the hypotheses based on your EDA and data visualizations.
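
As a rough illustration of what a clustering example could look like, here is a minimal sketch using base R's kmeans() on made-up numbers. The data frame toy_players, its columns ace_rate and df_rate, and the choice of two clusters are all hypothetical and not taken from the WTA dataset. Your own example should use variables and a method that fit your hypotheses.

```r
# Toy serve statistics for six hypothetical players (values are made up)
toy_players <- data.frame(
  ace_rate = c(0.02, 0.03, 0.025, 0.10, 0.11, 0.095),
  df_rate  = c(0.03, 0.035, 0.03, 0.07, 0.08, 0.075)
)

# Standardize so both variables contribute equally to the distances
toy_scaled <- scale(toy_players)

# k-means depends on random starting centers, so fix the seed and
# try several starts (nstart) to get a stable result
set.seed(2022)
toy_kmeans <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Attach the cluster labels for plotting or summarizing
toy_players$cluster <- factor(toy_kmeans$cluster)
table(toy_players$cluster)
```

Standardizing first matters whenever the variables are on different scales, since k-means relies on Euclidean distance.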

Timeline

There will be two submission deadlines:

Friday, June 17th @ 5:00 PM EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.

Thursday, June 23rd @ 11:59 PM EST - Slides and full code must be completed and ready for presentation. Send your slides to Prof. Yurko’s email (ryurko@andrew.cmu.edu). All code, visualizations, and presentations must be made in R. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!

Data

Your team is assigned the WTA match data. This dataset contains all WTA matches between 2018 and 2022 (through June 10th), courtesy of Jeff Sackmann’s famous tennis repository. The code chunk at the end shows how this dataset was processed in R.

Each row of the dataset corresponds to a single WTA match and has the following columns:

tourney_id: a unique identifier for each tournament, such as 2020-888. The exact formats are borrowed from several different sources, so while the first four characters are always the year, the rest of the ID doesn’t follow a predictable structure
tourney_name: name of the tournament
surface: type of court surface
draw_size: number of players in the draw, often rounded up to the nearest power of 2. (For instance, a tournament with 28 players may be shown as 32.)
tourney_level: see link below
tourney_date: eight digits, YYYYMMDD, usually the Monday of the tournament week.
match_num: a match-specific identifier. Often starting from 1, sometimes counting down from 300, and sometimes arbitrary.
winner_id: the player identifier for the winner of the match
winner_seed: seed of winning player
winner_entry: ‘WC’ = wild card, ‘Q’ = qualifier, ‘LL’ = lucky loser, ‘PR’ = protected ranking, ‘ITF’ = ITF entry, and there are a few others that are occasionally used
winner_name: Name of the winning player
winner_hand: R = right, L = left, U = unknown. For ambidextrous players, this is their serving hand.
winner_ht: height in centimeters, where available
winner_ioc: three-character country code
winner_age: age, in years, as of the tourney_date
loser_id: the player identifier for the loser of the match (the loser_ columns below mirror the corresponding winner_ columns above, but describe the losing player)
loser_seed:
loser_entry:
loser_name:
loser_hand:
loser_ht:
loser_ioc:
loser_age:
score: final match score
best_of: ‘3’ or ‘5’, indicating the number of sets for this match
round: tournament round
minutes: match length in minutes
w_ace: winner’s number of aces
w_df: winner’s number of double faults
w_svpt: winner’s number of serve points
w_1stIn: winner’s number of first serves made
w_1stWon: winner’s number of first-serve points won
w_2ndWon: winner’s number of second-serve points won
w_SvGms: winner’s number of serve games
w_bpSaved: winner’s number of break points saved
w_bpFaced: winner’s number of break points faced
l_ace: loser’s number of aces (the l_ columns below mirror the corresponding w_ columns above, but describe the losing player)
l_df:
l_svpt:
l_1stIn:
l_1stWon:
l_2ndWon:
l_SvGms:
l_bpSaved:
l_bpFaced:
winner_rank: winner’s WTA rank, as of the tourney_date, or the most recent ranking date before the tourney_date
winner_rank_points: number of ranking points, where available
loser_rank:
loser_rank_points:

Note that a full glossary of the features available for match data can be found here.
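
Two of the column descriptions above suggest some light preprocessing before plotting: tourney_date is stored as eight digits rather than as a date, and the serve columns are raw counts rather than rates. The sketch below shows one possible way to handle both on a small made-up table. The object toy_matches and the derived columns w_1st_in_pct and w_1st_won_pct are hypothetical names chosen for illustration, and the values are invented.

```r
library(dplyr)

# Toy rows that mimic a few columns of the match data (values are made up)
toy_matches <- tibble(
  tourney_date = c(20180101, 20210628),
  w_svpt       = c(60, 75),
  w_1stIn      = c(36, 45),
  w_1stWon     = c(27, 30)
)

toy_matches <- toy_matches %>%
  mutate(
    # Convert the eight-digit YYYYMMDD value into a proper Date
    tourney_date = as.Date(as.character(tourney_date), format = "%Y%m%d"),
    # Share of the winner's serve points where the first serve went in
    w_1st_in_pct = w_1stIn / w_svpt,
    # Share of first serves in that the winner went on to win
    w_1st_won_pct = w_1stWon / w_1stIn
  )

toy_matches
```

Rates like these are usually easier to compare across matches than raw counts, because matches differ in length. In the real data some matches may be missing serve statistics, so expect NA values in derived columns.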

Code to build dataset

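# Load the tidyverse for map_dfr(), read_csv(), mutate(), and write_csv()
library(tidyverse)

# Read one CSV of matches per season and stack them into a single data frame
wta_2018_2022_matches <-
  map_dfr(2018:2022,
          function(year) {
            read_csv(paste0("https://raw.githubusercontent.com/JeffSackmann/tennis_wta/master/wta_matches_",
                            year, ".csv")) %>%
              # Store seeds as character so the yearly files share a column
              # type when they are stacked
              mutate(winner_seed = as.character(winner_seed),
                     loser_seed = as.character(loser_seed))
          })

# Save this file:
write_csv(wta_2018_2022_matches, 
          "data/sports/eda_projects/wta_matches_2018_2022.csv")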