Overview

This project will begin on Friday, June 5 and conclude with a 15-minute presentation one week later on Friday, June 12. Students will be paired into groups of two and randomly assigned one of 8 sports datasets. The goal of this project is to practice understanding the structure of a dataset, generating hypotheses, and using exploratory data analysis and data visualization to investigate those hypotheses.

Deliverables

Each team is expected to produce slides to accompany their 15-minute presentation with the following information:

Timeline

There will be two submission deadlines:

Tuesday, June 9 @ 4:00pm EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.

Friday, June 12 @ 12:00pm EST - Slides and full code must be completed and ready for presentation. Send your slides to Nick’s email. All code and visualizations must be done in R, but the slides may be created in any program.

Data

EDA projects data overview

There are eight different datasets for the EDA projects:

MLB batted ball data

Each row in the top_hitters_2019_batted_balls.csv dataset corresponds to a single batted ball by one of the top five hitters in the 2019 MLB season: Mike Trout, Christian Yelich, Cody Bellinger, Josh Bell, or Joey Gallo. Each batted ball contains the following information:

  • hit_x, hit_y - the x,y coordinates for each batted ball,
  • exit_velocity - speed (miles per hour) off the bat that the ball was hit,
  • launch_angle - angle (in degrees) off the bat that the ball was hit where 0 means parallel to the field, so positive means in the air and negative means towards the ground,
  • batted_ball_type - categorical label for type of batted ball, either fly_ball, ground_ball, line_drive, or popup,
  • outcome - categorical event denoting the outcome of the batted ball. The following are positive events from the perspective of the batter: single, double, triple, home_run, sac_fly, and field_error. The following are negative events from the batter's perspective: double_play, field_out, fielders_choice_out, force_out, and grounded_into_double_play,
  • pitch_type - two letter abbreviation denoting the type of pitch that was thrown,
  • player - last name of hitter: either trout, yelich, bellinger, bell, or gallo,
  • batter_stand - handedness of hitter, either L (left) or R (right). Note that Josh Bell is the only switch hitter in this data (i.e. he switches between L and R depending on the pitcher),
  • hit_distance - recorded distance in feet that the ball traveled.
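
As a minimal sketch of how the outcome column could be collapsed into a positive/negative indicator, the R snippet below uses the list of positive events given above. The rows in batted_balls are invented for illustration, and the column name positive_outcome is a hypothetical choice, not part of the dataset.

```r
# Positive outcome labels, copied from the column description above.
positive_outcomes <- c("single", "double", "triple", "home_run",
                       "sac_fly", "field_error")

# Toy rows invented for illustration; the real data would be read from
# top_hitters_2019_batted_balls.csv.
batted_balls <- data.frame(
  player  = c("trout", "yelich", "bell", "gallo"),
  outcome = c("home_run", "field_out", "single", "force_out"),
  stringsAsFactors = FALSE
)

# TRUE when the outcome is good for the batter, FALSE otherwise.
# positive_outcome is a hypothetical column name.
batted_balls$positive_outcome <- batted_balls$outcome %in% positive_outcomes

# Cross-tabulate players against the indicator.
outcome_counts <- table(batted_balls$player, batted_balls$positive_outcome)
print(outcome_counts)
```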

The two letter pitch_type abbreviations represent the following types of pitches that can be summarized by two groups, (1) fastballs:

  • FF - four-seam fastball (most common pitch in baseball),
  • FT - two-seam fastball (more movement than FF),
  • FC - cutter (look up Mariano Rivera),
  • FS, SI, or SF - sinker / split-fingered,

and (2) offspeed pitches:

  • SL - slider,
  • CH - changeup,
  • CB or CU - curveball,
  • KC - knuckle-curve,
  • KN - knuckleball,
  • EP - eephus.
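
The two pitch groups above can be encoded as a small lookup. This is only a sketch: the function name pitch_group and the labels "fastball" and "offspeed" are illustrative choices, and any abbreviation not listed above is mapped to NA.

```r
# Abbreviations taken from the two lists above.
fastballs <- c("FF", "FT", "FC", "FS", "SI", "SF")
offspeed  <- c("SL", "CH", "CB", "CU", "KC", "KN", "EP")

# Hypothetical helper: map each pitch_type to its group,
# returning NA for abbreviations that are not in either list.
pitch_group <- function(pitch_type) {
  ifelse(pitch_type %in% fastballs, "fastball",
         ifelse(pitch_type %in% offspeed, "offspeed", NA_character_))
}

pitch_group(c("FF", "CU", "SI", "XX"))
# "fastball" "offspeed" "fastball" NA
```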

Note that the batted ball data may not contain all of the different pitch types. The list above is deliberately comprehensive, for anyone who wants to look at more pitch-level MLB data. This data is in the same format as the Trout batted ball data example.

NFL team season summary data

Each row in the nfl_teams_season_summary.csv dataset corresponds to a single NFL team in a single regular season. The column names are organized below by the type of information they contain, with the first set of columns being self-explanatory:

  • meta - team (three letter abbreviation), season, division,
  • season outcomes - points_scored, points_allowed, wins, losses, ties.

The remaining columns correspond to offensive and defensive summaries of the team’s performance in the season separated by pass and run plays. We have the following offensive passing statistics:

  • pass_off_epa_per_att = expected points added (EPA) per pass attempt on offense,
  • pass_off_total_epa = total EPA from passing plays on offense,
  • pass_off_total_yards_gained = total yards gained from passing plays on offense,
  • pass_off_yards_gained_per_att = yards per pass attempt on offense.

The remaining columns (e.g. run_off_epa_per_att, pass_def_epa_per_att, etc.) report the same statistics for each combination of play type (pass_ or run_) and side of the ball: the team’s offensive (_off_) or defensive (_def_) statistics.
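
The naming pattern can be sketched by generating the expected column names in R. This assumes every play type and side combination carries all four statistics with the same suffixes as the passing offense columns, which should be checked against names() of the actual data.

```r
play_types <- c("pass", "run")
sides      <- c("off", "def")
# Suffixes taken from the offensive passing columns listed above.
stats      <- c("epa_per_att", "total_epa", "total_yards_gained",
                "yards_gained_per_att")

# Every combination of statistic, side, and play type.
combos <- expand.grid(stat = stats, side = sides, play = play_types,
                      stringsAsFactors = FALSE)

# Assumed column names, e.g. "run_def_total_epa".
summary_columns <- paste(combos$play, combos$side, combos$stat, sep = "_")
print(summary_columns)
```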

The EPA variables are advanced NFL statistics, conveying how much value a team is adding over the average team in a given situation. EPA is on a points scale instead of the typically used yards, because not all yards are created equal in American football (a 10 yard gain on 3rd and 15 is much less valuable than a 2 yard gain on 4th and 1). For offensive stats, the higher the EPA the better; for defensive stats, the lower (more negative) the EPA the better.
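
Because higher is better on offense and lower is better on defense, one possible way to put both on a single scale is to subtract the defensive total from the offensive total. The sketch below uses invented values and placeholder team codes; the run_ and _def_ column names are inferred from the naming pattern, and net_total_epa is a hypothetical derived column.

```r
# Toy values invented for illustration only.
teams <- data.frame(
  team = c("AAA", "BBB"),
  pass_off_total_epa = c(50, -10),
  run_off_total_epa  = c(5, -20),
  pass_def_total_epa = c(-30, 40),
  run_def_total_epa  = c(-10, 15)
)

# Total EPA generated on offense and allowed on defense.
off_epa <- teams$pass_off_total_epa + teams$run_off_total_epa
def_epa <- teams$pass_def_total_epa + teams$run_def_total_epa

# Higher is better: good offense adds, good (negative) defense also adds.
teams$net_total_epa <- off_epa - def_epa
print(teams[, c("team", "net_total_epa")])
```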

WNBA Championship game 5

Each row in the wnba_championship_game_five.csv dataset corresponds to an event taking place during game 5 of the WNBA Finals on October 10, 2019, between the Washington Mystics and Connecticut Sun. Each event contains the following information:

  • period - period of WNBA game in which event took place,
  • clock - time remaining in period, in MM:SS format (M = minutes, S = seconds),
  • was_score, con_score - score for the Washington Mystics (was) and Connecticut Sun (con) respectively,
  • description - a full string description of the event,
  • team - abbreviation (either was or con) of the team with the event (e.g. the team taking the shot),
  • event - the type of event, either: field_goal_attempt, free_throw_attempt, rebound, foul, or turnover,
  • x_loc, y_loc - the x,y coordinates for each event, where x_loc denotes the left-to-right location of the event with respect to the hoop at (0, 0) and y_loc denotes the vertical location of the event. All event locations are oriented as if both teams shoot at the same hoop. Free throw attempt coordinates are not correct, and rebound coordinates correspond to the shot location instead of the location of the rebound,
  • field_goal_type - categorical label of shot, either two_pointer or three_pointer,
  • shot_made - binary indicator denoting whether the shot was made (1) or not (0),
  • assisted - binary indicator denoting whether the made field goal was assisted (1) or not (0),
  • shooting_foul - binary indicator denoting whether a shooting foul occurred (1) or not (0),
  • distance_from_hoop - distance in feet from hoop that the event took place,
  • shot_angle - the angle from the hoop that the event took place,
  • shot_type - one of jump, layup, or hook.
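
Since clock is stored in MM:SS format, a numeric version is often easier to plot. The helper below is a sketch that assumes the column is read as text; some CSV readers may parse it as a time type instead, in which case it would need converting first. The name clock_to_seconds is hypothetical.

```r
# Hypothetical helper: convert "MM:SS" strings to seconds remaining
# in the period.
clock_to_seconds <- function(clock) {
  parts <- strsplit(as.character(clock), ":", fixed = TRUE)
  vapply(parts,
         function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

clock_to_seconds(c("10:00", "07:35", "00:04"))
# 600 455 4
```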

Age of Empires 2: Definitive Edition ranked 1v1 & team game leaderboard

Each row in the aoe2_leaderboard_sample.csv dataset corresponds to a single AoE2: DE ranked player. An individual player has up to two rows: one for a Team ranking and one for a 1v1 ranking. Playing 10 games in a format is the only requirement to earn an official ranking. The column names are listed below in the order in which they appear in the dataset.

  • profile_id - unique numeric key for each player account,
  • name - player username,
  • rank - player ranking on either 1v1 or Team leaderboard,
  • rating - player’s current numeric rating, similar to an Elo rating in chess,
  • country - nation the player account is registered under, given as an ISO 3166-1 alpha-2 country code,
  • games - total number of games played,
  • wins - games won,
  • losses - games lost, includes games dropped,
  • drops - games where a player has disconnected and received an automatic loss,
  • game_type - Either 1v1 or Teams, the leaderboards for each are separate.

Team ratings and 1v1 ratings are ranked separately, so a player can have a higher rating but a lower ranking on one leaderboard than on the other.
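
A few derived quantities follow directly from these columns. The sketch below uses invented rows; win_pct and drop_share are hypothetical column names, and the drop share is undefined for a player with zero losses.

```r
# Toy rows invented for illustration only.
leaderboard <- data.frame(
  profile_id = c(1, 1, 2),
  game_type  = c("1v1", "Teams", "1v1"),
  games      = c(100, 40, 10),
  wins       = c(60, 30, 4),
  losses     = c(40, 10, 6),
  drops      = c(2, 0, 3),
  stringsAsFactors = FALSE
)

# Share of games won.
leaderboard$win_pct <- leaderboard$wins / leaderboard$games

# Share of losses that were drops (NaN when losses is 0).
leaderboard$drop_share <- leaderboard$drops / leaderboard$losses

# Players who appear on both the 1v1 and Team leaderboards have two rows.
rows_per_player <- table(leaderboard$profile_id)
both_formats <- names(rows_per_player)[rows_per_player == 2]
print(both_formats)
```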

NTT Data IndyCar 2019 season race results

Each row in the indycar_2019.csv dataset corresponds to a single race entry (driver-team-race combination) in the 2019 NTT Data IndyCar Series season. The season consisted of 17 races contested among 36 drivers. Each driver race entry contains the following information:

  • date - the date the race occurred,
  • race_id - the name of the particular race,
  • start - what position the driver started in,
  • finish - what position the driver finished the race in,
  • points - championship points earned in the event. The Indianapolis 500 and the Grand Prix of Monterey were double-points races,
  • drive_id - the name of the driver,
  • team_id - the name of the team entry / team owner,
  • track_type - what type of track the race was run at,
  • chassis_engine_tires - the combination of chassis, engine and tire brand the car was entered with.

Notes on the final two variables:

  • track_type - P refers to oval circuits, S refers to temporary street circuits, and R refers to permanent road courses,
  • chassis_engine_tires - all entries use a Dallara chassis and Firestone tires. The only difference is the engine. ‘D/C/F’ refers to Dallara-Chevrolet-Firestone, while ‘D/H/F’ refers to Dallara-Honda-Firestone.
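
Since only the engine varies, chassis_engine_tires can be reduced to an engine label, and start and finish can be combined into positions gained. This is a sketch on invented rows; positions_gained and engine are hypothetical column names.

```r
# Toy rows invented for illustration only.
entries <- data.frame(
  start  = c(1, 12, 5),
  finish = c(3, 4, 5),
  chassis_engine_tires = c("D/H/F", "D/C/F", "D/H/F"),
  stringsAsFactors = FALSE
)

# Positive values mean the driver finished ahead of the starting position.
entries$positions_gained <- entries$start - entries$finish

# The engine code is the middle element of the "D/x/F" string.
engine_code <- vapply(
  strsplit(entries$chassis_engine_tires, "/", fixed = TRUE),
  function(p) p[2],
  character(1)
)
entries$engine <- ifelse(engine_code == "C", "Chevrolet",
                         ifelse(engine_code == "H", "Honda", NA_character_))
print(entries)
```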

NWSL player season statistics 2017-2019

Each row in the nwsl_season_stats.csv dataset contains the statistics for one field player in one season with one team. Players who changed teams in the middle of a season have one row per team played for in that season. The dataset covers all field players who participated in at least one match from 2017 to 2019. The columns are as follows:

  • player_id - unique numeric key identifying player,
  • name - player first and last name,
  • season - the season which the statistics refer to,
  • team_id - three character abbreviation of the player’s team,
  • nation - three character abbreviation for the player’s nationality,
  • pos - the player’s position,
  • matches - matches played,
  • starts - matches in which the player was in the starting 11,
  • minutes - total minutes played,
  • goals - goals scored,
  • assists - assists earned,
  • yellow_cards - yellow cards earned,
  • red_cards - red cards earned.

Players can play multiple positions throughout a season. The abbreviations for the pos variable mean the following:

  • FW - forward (or striker)
  • MF - midfielder
  • DF - defender
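
Because a player who changed teams has more than one row in a season, season-level rates are easier to compare after combining those rows. The sketch below sums invented rows per player and season and then computes goals per 90 minutes; goals_per_90 is a hypothetical derived column.

```r
# Toy rows invented for illustration: player 1 changed teams mid-season.
player_stats <- data.frame(
  player_id = c(1, 1, 2),
  season    = c(2018, 2018, 2018),
  team_id   = c("AAA", "BBB", "AAA"),
  minutes   = c(900, 450, 1800),
  goals     = c(3, 3, 2),
  stringsAsFactors = FALSE
)

# One row per player-season, summing across teams.
season_totals <- aggregate(cbind(minutes, goals) ~ player_id + season,
                           data = player_stats, FUN = sum)

# Scoring rate scaled to a full 90-minute match.
season_totals$goals_per_90 <- season_totals$goals / season_totals$minutes * 90
print(season_totals)
```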

T20 women’s cricket career bowling statistics

Each row in the womens_cricket_bowling.csv dataset corresponds to one active player’s career bowling statistics for T20 cricket. T20, or Twenty20 cricket, restricts innings to a maximum of 20 overs to shorten game length. The column definitions are as follows:

  • player - player name,
  • country - player nationality,
  • start - player’s first season,
  • end - player’s most recent season. All players last played in either 2019 or 2020,
  • matches - matches played,
  • innings - total innings bowled,
  • overs - the number of overs bowled. An over consists of six consecutive balls bowled,
  • maidens - the number of maiden overs, which is an over in which the bowler conceded zero runs,
  • runs - the number of runs conceded,
  • wickets - the number of wickets taken,
  • average - the average number of runs conceded per wicket taken,
  • economy - the average number of runs conceded per over,
  • strike_rate - the average number of balls bowled per wicket taken.
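
The last three columns can be reproduced from runs, wickets, and overs. The worked example below uses an invented bowler and assumes overs follows the usual cricket notation in which 10.3 means 10 overs and 3 balls; that assumption should be checked against the data. overs_to_balls is a hypothetical helper.

```r
# Hypothetical helper: convert overs in cricket notation to balls bowled,
# assuming the digit after the decimal point counts extra balls (0 to 5).
overs_to_balls <- function(overs) {
  whole <- floor(overs)
  extra <- round((overs - whole) * 10)
  whole * 6 + extra
}

# Toy bowler invented for illustration only.
overs   <- 10.3
runs    <- 63
wickets <- 3

balls       <- overs_to_balls(overs)  # 63 balls
average     <- runs / wickets         # runs conceded per wicket: 21
economy     <- runs / (balls / 6)     # runs conceded per over: 6
strike_rate <- balls / wickets        # balls bowled per wicket: 21
```

All three rates are undefined or infinite for a bowler with zero wickets or zero overs, so such rows may need separate handling.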

NHL 2015-16 Stanley Cup Final game 6 play by play

Each row in the nhl_pit_sj_game6.csv data set corresponds to a single event during game 6 of the 2015-16 Stanley Cup Final, played on June 12, 2016. Events include shots, hits, penalties, turnovers, faceoffs and stoppages. The column definitions are as follows:

  • period - game period,
  • period_time - time within period, counting up from 0:00 to 20:00,
  • event - the class of event the row describes. Possible values include: period_end, period_start, goal, shot, blocked_shot, missed_shot, giveaway, takeaway, stoppage, faceoff, hit,
  • team - the team the event corresponds to,
  • description - a brief description of the event,
  • player_one - this column has different meanings depending on the event:

    • event = goal, shot, blocked_shot, or missed_shot - player_one refers to the shooter of the puck,
    • event = hit - player_one refers to the player who did the hitting,
    • event = takeaway or giveaway - player_one refers to the player who committed the giveaway or takeaway,
    • event = faceoff - player_one refers to the player who won the faceoff,
    • event = penalty - player_one refers to the player who committed the penalty,
    • event = anything else - player_one is NA,
  • player_two - this column has different meanings depending on the event:

    • event = hit - player_two refers to the player who was hit,
    • event = faceoff - player_two refers to the player who lost the faceoff,
    • event = penalty - player_two refers to the player who drew the penalty,
    • event = blocked_shot - player_two refers to the player who blocked the shot,
    • event = anything else - player_two is NA,
  • event_type - the type of penalty or the type of shot attempt (except for missed shots); NA otherwise,
  • x_cord, y_cord - the x,y coordinates for each event, where x_cord denotes the left-to-right location of the event with respect to center ice at (0, 0) and y_cord denotes the vertical location of the event. The net is located at (89, 0), and all event locations are oriented as if both teams attack in the same direction,
  • empty_net - TRUE if a goal was scored with the net empty, FALSE if a goal was scored on a goaltender. NA if not a goal event,
  • pit_score - the Pittsburgh Penguins score at the time of the event,
  • sjs_score - the San Jose Sharks score at the time of the event,
  • shot_attempt - 1 if the event constitutes an attempted shot on goal, 0 otherwise.
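
With the net at (89, 0), the straight-line distance from an event to the net follows from the coordinates. A minimal sketch, with invented coordinates and a hypothetical helper name; the result is in whatever units x_cord and y_cord use.

```r
# Net location as stated in the x_cord, y_cord description.
net_x <- 89
net_y <- 0

# Hypothetical helper: Euclidean distance from an event to the net.
shot_distance <- function(x_cord, y_cord) {
  sqrt((net_x - x_cord)^2 + (net_y - y_cord)^2)
}

# Toy coordinates invented for illustration only.
shot_distance(c(89, 86, 59), c(0, 4, 40))
# 0 5 50
```

In practice this would be applied only to rows where shot_attempt is 1.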

