Overview

This project will begin on Friday, June 19 and involve a 15-minute presentation one week later on Friday, June 26. There is also a written report with a maximum of 4 pages due by 5:00pm EST on Wednesday, July 1. Both the presentation and the written report will follow the same structure: a summary of key takeaways first, followed by EDA, the modeling process, results and conclusions. Students will work in teams of 2, and data sets will be assigned randomly.

The goal of this project is to practice planning, creating, evaluating and interpreting linear regression models in R. Each team will fit a linear model with one of the continuous variables in their provided data set as the response. Models should be tested with out-of-sample prediction methods, and must be interpretable. Making a model needlessly complicated can lead to overfitting and make results difficult to interpret.
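
As one possible way to approach out-of-sample testing, the sketch below uses 5-fold cross-validation to compare a small linear model with a larger one. It is only an illustration: it uses R's built-in mtcars data as a stand-in for a project data set, and the helper name cv_rmse, the choice of 5 folds, and the two formulas are all assumptions rather than project requirements.

```r
# Stand-in data: the built-in mtcars data set, NOT one of the project data sets.
set.seed(2020)  # make the random fold assignment reproducible

n_folds <- 5
# Randomly assign each row to one of the folds.
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(mtcars)))

# Hypothetical helper: average held-out RMSE of a linear model for mpg.
cv_rmse <- function(model_formula) {
  fold_rmse <- sapply(seq_len(n_folds), function(k) {
    train <- mtcars[fold_id != k, ]   # fit on all folds except k
    test  <- mtcars[fold_id == k, ]   # evaluate on the held-out fold k
    fit   <- lm(model_formula, data = train)
    preds <- predict(fit, newdata = test)
    sqrt(mean((test$mpg - preds)^2))  # root mean squared error on fold k
  })
  mean(fold_rmse)
}

rmse_small <- cv_rmse(mpg ~ wt + hp)  # two predictors
rmse_large <- cv_rmse(mpg ~ .)        # every other column as a predictor
print(c(small = rmse_small, large = rmse_large))
```

A lower held-out RMSE suggests better predictions on unseen data. If the larger model does not clearly improve on the smaller one, the smaller model is usually the easier one to interpret and present.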

The first slide of the presentation, and the first page of the written report, will be for key takeaways. For the presentation this can be in the form of a single graph, bullet points or a statement. For the written report it should be a combination of the three. This is your work in elevator pitch form, meant to grab the audience’s attention and give them a reason to be interested. In the working world you may have only a short time to pitch results to superiors, and will have to start with the conclusions before explaining your process.

The rest of the presentation, and the second page and beyond of the written report, will follow a more traditional data analysis report format: Introduction, EDA, Modeling, Results, Conclusion.

Deliverables & Timeline

There will be 3 different deliverable deadlines:

Tuesday, June 23 @ 4:00pm EST - Each student will push an R Markdown file with their analysis so far to their GitHub account for review. We will then provide feedback on the code submitted.

Friday, June 26 @ 12:00pm EST - Slides must be completed and ready for presentation. Send your slides to Nick’s email . All code and visualizations must be done in R, but the slides may be created in any program. Presentations will be 15 minutes long with 5 minutes for Q&A.

As a reminder, the presentation should start with a Key Takeaway slide, and then lead into a traditional data analysis report (Introduction, EDA, Modeling, Results, Conclusion).

Wednesday, July 1 @ 5:00pm EST - Each team’s final written report must be emailed to Nick’s email .

As a reminder, the written report will have the ‘model pitch’ on the first page, which is a short executive summary and potentially a key graphic. The second page and beyond of the written file will follow the traditional data analysis report format (Introduction, EDA, Modeling, Results, Conclusion). These last pages should set up and support your key takeaways.

Notes

Although this project will use linear regression, feel free to contact any of the instructors if you have questions about potential GLMs (Poisson, logistic) using any of the data sets provided.

Data Sets

There are eight different datasets for the regression projects (three of which were generated via the init_regression_project_data.R script):

NBA team season summary data

Each row in the nba_team_season_summary.csv dataset corresponds to a single NBA team in a single regular season dating back to 2003. The column names are self-explanatory, but note that the columns ending in _perc are percentage-based statistics.

Tennis grand slams data

Each row in the data corresponds to a grand slam match between two players. A variety of summary statistics of the match are reported along with winner and loser information. Variables include:

  • tournament - one of the four grand slams: Australian Open, French Open, US Open, and Wimbledon
  • year
  • winner_name and loser_name
  • winner_rank and loser_rank ATP or WTA ranking of the winner and loser, respectively, at the time of the tournament
  • Retirement whether the match ended in a retirement (i.e. one player was unable to finish the match). Logical: TRUE means the match ended in retirement
  • Tour either WTA or ATP
  • round R128 - Round of 128, R64 - Round of 64, R32 - Round of 32, R16 - Round of 16, QF - Quarter Final, SF - Semi Final, and F - Final
  • w_* and l_* statistics for the winner and loser, respectively, where the suffix is one of many summary statistics, including:
      • ave_serve_speed average serve speed in mph
      • n_aces number of aces
      • n_winners number of winners including aces
      • n_netpt_w number of net points won
      • n_netpt number of net points played
      • n_bp_w number of break points won (to break the opponent)
      • n_bp number of break points (to break the opponent)
      • n_ue number of unforced errors
      • n_sv number of serves
      • n_sv_w number of service points won
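
Several of these counts come in "won" and "played" pairs, so they can be turned into rates before modeling. The sketch below shows one way to do that for the match winner's break points. The toy data frame, the new column w_bp_conv, and the full column names w_n_bp_w and w_n_bp (built from the w_ prefix and the suffixes above) are assumptions for illustration, not values from the real data.

```r
# Toy rows standing in for the tennis data; the numbers are made up.
toy_matches <- data.frame(
  w_n_bp_w = c(5, 3, 0),   # break points won by the match winner
  w_n_bp   = c(10, 4, 0)   # break point chances for the match winner
)

# Break point conversion rate for the winner.
# Matches with no break point chances get NA instead of 0 / 0.
toy_matches$w_bp_conv <- ifelse(
  toy_matches$w_n_bp > 0,
  toy_matches$w_n_bp_w / toy_matches$w_n_bp,
  NA_real_
)

print(toy_matches)
```

The same pattern would apply to the other pairs, such as net points won and net points played.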

MLB Batting Statistics 2010-2019

Each row in the baseball_batting.csv data corresponds to the batting statistics for a single player in a single season between 2010 and 2019. The first few variables, as well as the singles, doubles and triples, are self-explanatory. The other baseball variables are defined as follows:

  • G games played
  • AB at bats: Plate appearances, not including bases on balls, being hit by pitch, sacrifices, interference, or obstruction.
  • PA plate appearances
  • H hits
  • HR home runs
  • R runs scored; the number of times a player crosses home plate
  • RBI runs batted in: the number of runners who score due to a batter’s action
  • BB walks ‘base on balls’
  • IBB intentional base on balls, times walked intentionally by pitcher
  • HBP hit by pitch: walked as a result of being hit by a pitch
  • SF sacrifice fly: fly balls hit to the outfield which although caught for an out, allow a baserunner to advance
  • SH sacrifice hit: number of sacrifice bunts which allow runners to advance on the basepaths
  • GDP ground into double-play: number of ground balls that became double plays
  • SB stolen bases
  • CS number of times caught stealing
  • AVG batting average
  • Pitches number of pitches faced
  • Balls number of balls faced
  • Strikes number of strikes faced
  • SO strike outs
  • BB_K walks / strike outs. Walk to strike out ratio
  • OBP on base percentage
  • SLG slugging average: total bases achieved on hits / at bats
  • OPS on-base plus slugging: on-base percentage plus slugging average
  • ISO isolated power: a hitter’s ability to hit for extra bases, calculated by subtracting batting average from slugging percentage
  • wOBA weighted on base average
  • WAR wins above replacement: a non-standard formula to calculate the number of wins a player contributes to his team over a “replacement-level player”
  • WPA_plus win probability added, positive total
  • WPA_minus win probability added, negative total
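
Because several of these rate statistics are computed from the counting statistics, they are closely related to one another. The sketch below works through AVG, SLG and ISO for one made-up player, using the standard total bases weighting (1 for a single, 2 for a double, 3 for a triple, 4 for a home run). The function name batting_rates and its argument names are hypothetical, and the actual column names for singles, doubles and triples in the file may differ.

```r
# Hypothetical helper: derive AVG, SLG and ISO from counting statistics.
batting_rates <- function(singles, doubles, triples, HR, AB) {
  hits        <- singles + doubles + triples + HR
  total_bases <- singles + 2 * doubles + 3 * triples + 4 * HR
  AVG <- hits / AB          # batting average: hits / at bats
  SLG <- total_bases / AB   # slugging average: total bases / at bats
  ISO <- SLG - AVG          # isolated power: slugging minus batting average
  data.frame(AVG = AVG, SLG = SLG, ISO = ISO)
}

# Made-up season line: 160 hits and 275 total bases in 500 at bats.
toy_player <- batting_rates(singles = 100, doubles = 30, triples = 5,
                            HR = 25, AB = 500)
print(toy_player)
```

Predictors that are built from the same counts (for example SLG, ISO and OPS) tend to be highly correlated, which is worth checking during EDA before putting them in the same model.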

NCAA College Football D1 2019

Each row in the cfb_2019_games.csv data set refers to a single game played between two NCAA D1 schools during the 2019 season. Most of the variables are self-explanatory. Venue and team ids have been provided but will not necessarily be needed. The definitions for less clear variables are as follows:

  • id game id
  • netural_site was the game played on a neutral site, meaning at neither team’s home stadium?
  • conference_game is the game between two opponents from the same conference?
  • excitement_index a numerical value measuring the excitement of the game, calculated using win probability throughout the game.
  • home_1_pts - home_4_pts the number of points scored by the home team in the 1st/2nd/3rd/4th quarters
  • away_1_pts - away_4_pts the number of points scored by the away team in the 1st/2nd/3rd/4th quarters

Overwatch League Results and Odds

Each row in the overwatch_odds.csv data set contains data on a single Overwatch League match played between 2018 and 2020. Data includes the two teams, stage, and winner, as well as information on the two teams’ success in the season thus far and in their history up until that point. Columns include:

  • id game id
  • corona_virus_isolation was the game played under coronavirus isolation measures?
  • t1_wins_season, t2_wins_season how many games team 1 has won in the season prior to the game (t2 for team 2)
  • t1_losses_season, t2_losses_season how many games team 1 has lost in the season prior to the game (t2 for team 2)
  • t1_win_percent team 1 (t2 for team 2) win percentage in the season, in the last X games, or all-time, depending on the variable name
  • t1_odds American betting odds for team 1 to win the game
      • Positive figures: the odds state the profit on a 100 dollar bet (e.g. American odds of 110 would win 110 on a 100 dollar bet)
      • Negative figures: the odds state how much must be bet to win 100 in profit (e.g. American odds of -90 would win 100 on a 90 dollar bet)
  • t2_odds betting odds for team 2 to win the game
  • t1_probability the implied win probability for team 1 given the betting odds
  • t2_probability the implied win probability for team 2 given the betting odds
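
The link between American odds and an implied win probability can be sketched as a break-even calculation. For positive odds, a 100 dollar stake returns the odds as profit, so the break-even probability is 100 / (odds + 100). For negative odds, a stake of |odds| returns 100 in profit, so it is |odds| / (|odds| + 100). The function name american_to_prob is hypothetical, and this may not be exactly how the t1_probability and t2_probability columns were computed (for example, they may have been rescaled so that the two sides sum to 1).

```r
# Hypothetical helper: convert American odds to a break-even win probability.
american_to_prob <- function(odds) {
  ifelse(
    odds > 0,
    100 / (odds + 100),      # positive odds: risk 100 to win `odds`
    -odds / (-odds + 100)    # negative odds: risk |odds| to win 100
  )
}

# The two examples from the odds definitions above.
example_probs <- american_to_prob(c(110, -90))
print(round(example_probs, 4))
```

Raw implied probabilities for the two sides of a match do not have to sum to exactly 1, so it may be worth comparing this calculation against the probability columns provided in the data.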

Women’s NCAA D1 Soccer 2018 & 2019 Team Statistics

Each row in the womens_ncaa_soccer_20182019.csv data set refers to the statistics for a single school in a particular season. There are 668 school-season combinations spanning 2018 and 2019. Variables include:

  • assists the total number of assists earned by players on the team
  • team_games games played
  • assists_gp total assists earned per game played
  • corners corner kicks taken. This variable is unavailable for 2018
  • corners_gp corner kicks taken per game played. This variable is unavailable for 2018
  • fouls total fouls called on the team.
  • fouls_gp fouls called on the team per game played
  • ga goals against
  • team_min total minutes played by the team, including stoppage time
  • gaa goals against per game played
  • ps penalty kicks scored on
  • psatt penalty kicks attempted
  • pk_pct percentage of penalty kicks completed
  • points total points (goals + assists) accumulated for all players on the team
  • points_gp points accumulated by players on the team per game played
  • saves saves made by team goalkeepers
  • save_pct percentage of shots faced that goalkeepers saved
  • saves_gp number of saves made per game played
  • goals total goals scored by team
  • gpg the number of goals scored by the team per game played
  • sog total shots on goal
  • shatt total shot attempts
  • sog_pct percentage of shot attempts that were on goal
  • won games won
  • lost games lost
  • tied games tied
  • win_pct winning percentage
  • sog_gp number of shots on goal per game played
  • season the season the data refers to

Women’s NCAA D1 Volleyball 2018 & 2019 Team Statistics

Each row in the womens_ncaa_volleyball_20182019.csv data set refers to the statistics for a single school in a particular season. There are 666 school-season combinations spanning 2018 and 2019. Variables include:

  • s number of sets played
  • aces aces hit. An ace is a serve which lands in the opponent’s court without being touched, or is touched, but unable to be kept in play by one or more receiving team players
  • aces_per_set aces earned per set played
  • assists total team assists. Assists are awarded to a player who passes the ball to a teammate who attacks the ball for a kill. Can be awarded off a dig (first contact), provided the attack comes on the second contact
  • assists_per_set assists earned per set played
  • block_solos total team solo blocks. A player blocks the ball into the opponent’s court, leading to a point or side out
  • block_assists total team assisted blocks
  • blocks_per_set total team blocks per set
  • digs total team digs. A dig occurs when a player passes the ball which has been attacked by the opposition. Digs are only given when players receive an attacked ball and it is kept in play
  • digs_per_set team digs per set played
  • kills team kills. An attack by a player that is not returnable by the receiving player on the opposing team and leads directly to a point or loss of rally
  • errors total team serve errors
  • total_attacks total attack attempts. An attack is any overhead contact of the ball designed to score
  • hit_pct Hitting percentage is calculated by totaling kills, subtracting the hitting errors, then dividing that number by the total number of attack attempts.
  • kills_per_set kills earned per set played
  • w team wins
  • l team losses
  • win_pct team winning percentage
  • season season
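
The hitting percentage definition above can be written as a one-line calculation. In the sketch below the function name hitting_pct and the argument hitting_errors are hypothetical. The errors column is described above as serve errors, so it is an assumption that a hitting error count would have to come from elsewhere.

```r
# Hypothetical helper: hitting percentage as described for hit_pct,
# (kills - hitting errors) / total attack attempts.
hitting_pct <- function(kills, hitting_errors, total_attacks) {
  (kills - hitting_errors) / total_attacks
}

# Made-up numbers: (50 - 20) / 120
toy_hit_pct <- hitting_pct(kills = 50, hitting_errors = 20, total_attacks = 120)
print(toy_hit_pct)
```

Since hit_pct is built from kills and total_attacks, including all three in one model means the predictors overlap heavily, which can make coefficients harder to interpret.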

Women’s NCAA D1 Lacrosse 2018, 2019 & 2020 Team Statistics

Each row in the womens_ncaa_lacrosse.csv data set refers to the D1 women’s lacrosse statistics for a single school in a particular season. There are 348 school-season combinations spanning 2018, 2019 and 2020. Due to the coronavirus pandemic, no team played more than 10 games in 2020. Note that all _gp variables are the per-game-played versions of the variable they name (e.g. assists_gp is total team assists per game played). Other variable definitions are:

  • assists total team assists. The player who passes the ball to the player who scores a goal is credited with an assist
  • caused_tos total turnovers caused by the team. Also referred to as ‘takeaways’
  • draw_controls total team draw controls. A draw control occurs when a player successfully gains control of the ball after a draw.
  • fouls total team fouls
  • clears total team clears. A clear occurs when a team passes the offensive restraining line and is clearly able to get an offensive attempt.
  • clr_att total team attempted clears
  • clr_pct percent of team clear attempts that were successful
  • opp_dc opponents’ draw control total
  • drawc_control_pct percent of total draws that the team controlled
  • freepos_goals total free position goals. A free-position shot in women’s lacrosse is similar to a foul shot in basketball, awarded to an offensive player when a defender commits a major foul inside the 8-meter arc
  • freepos_shots total free position shots taken
  • free_position_pct percent of free position shots which resulted in goals
  • goals total team goals scored
  • points total points earned by all players on the team (goals + assists)
  • team_min total team minutes played
  • goals_allowed total goals allowed
  • saves total saves from all team goalkeepers
  • sv_pct team percentage of shots allowed which were saved, and not goals against
  • ga_gp goals allowed per game played
  • margin goals scored minus goals allowed, per game played
  • gf_gp goals scored per game played
  • sog total shots on goal
  • turnovers total team turnovers committed
  • won games won
  • lost games lost
  • win_pct team winning percentage
  • yellow_cards total yellow cards earned by all players on the team
  • season season
