Overview
This project will begin on Friday, June 19 and involve a 15-minute presentation one week later on Friday, June 26. There is also a written report, with a maximum of 4 pages, due by 5:00pm EST on Wednesday, July 1. Both the presentation and the written submission will follow the same structure: starting with a summary of key takeaways and then walking through EDA, the modeling process, results and conclusions. Students will work in teams of 2, and data sets will be assigned randomly.
The goal of this project is to practice planning, creating, evaluating and interpreting linear regression models in R. Each team will fit a linear model using one of the continuous variables in their provided data set as the response. Models should be tested with out-of-sample prediction methods, and must be interpretable. Making a model needlessly complicated can lead to overfitting and make results difficult to interpret.
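As one possible way to test a model out of sample, the sketch below runs 5-fold cross-validation on a linear model. It is only an illustration: the built-in mtcars data, the formula mpg ~ wt + hp, the seed and the number of folds are all stand-in assumptions, not part of the project data.

```r
# Stand-in data: the built-in mtcars data set (not one of the project data sets).
set.seed(2020)                        # make the fold assignment reproducible
n_folds <- 5
fold_id <- sample(rep(1:n_folds, length.out = nrow(mtcars)))

# For each fold, fit on the other folds and predict the held-out rows.
cv_rmse <- sapply(1:n_folds, function(k) {
  train <- mtcars[fold_id != k, ]
  test  <- mtcars[fold_id == k, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - preds)^2))    # root mean squared error on held-out data
})

mean(cv_rmse)                         # average out-of-sample RMSE across folds
```

Comparing this average error across a few candidate models is one way to check whether an added variable actually improves prediction or just adds complexity.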
The first slide of the presentation, and the first page of the written report, will be for key takeaways. For the presentation this can be in the form of a single graph, bullet points or a statement. For the written report it should be a combination of the three. This is your work in elevator pitch form, meant to grab the audience’s attention and give them a reason to be interested. In the working world you may have only a short time to pitch results to superiors, and will have to start with the conclusions before explaining your process.
The rest of the presentation, and the second page and beyond of the written report, will follow a more traditional data analysis report format. The format will be as follows:
- Introduction of the data set & sport
- Exploratory Data Analysis conducted
- Modeling analysis & model validation
  - this step includes variable selection, error rates, and checking diagnostic plots
- Model Results
  - this step should include interpretation of the coefficients in the model
- Conclusion
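For the modeling and results steps listed above, a minimal sketch of checking diagnostic plots and reading off coefficients in base R might look like the following. The mtcars data and the formula are stand-in assumptions used only for illustration.

```r
# Stand-in model on the built-in mtcars data (not one of the project data sets).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: each slope is the expected change in mpg for a one-unit
# increase in that predictor, holding the other predictor fixed.
coef_table <- summary(fit)$coefficients
print(coef_table)

# 95% confidence intervals for the coefficients.
print(confint(fit))

# The four default diagnostic plots for an lm fit, drawn on one 2 x 2 panel.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)                          # restore the previous plotting layout
```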
Deliverables & Timeline
There will be 3 different deliverable deadlines:
Tuesday, June 23 @ 4:00pm EST - Each student will push an R Markdown file with their analysis so far to their GitHub account for review. We will then provide feedback on the code submitted.
Friday, June 26 @ 12:00pm EST - Slides must be completed and ready for presentation. Send your slides to Nick at ncitrone@pittsburghpenguins.com. All code and visualizations must be done in R, but the slides may be created in any program. Presentations will be 15 minutes long with 5 minutes for Q&A.
As a reminder, the presentation should start with a Key Takeaway slide, and then lead into a traditional data analysis report (Introduction, EDA, Modeling, Results, Conclusion).
Wednesday, July 1 @ 5:00pm EST - Each team’s final written report must be emailed to Nick at ncitrone@pittsburghpenguins.com.
As a reminder, the written report will have the ‘model pitch’ on the first page, which is a short executive summary and potentially a key graphic. The second page and beyond of the written file will follow the traditional data analysis report format (Introduction, EDA, Modeling, Results, Conclusion). These last pages should set up and support your key takeaways.
Notes
Although this project will use simple linear regression, feel free to contact any of the instructors if you have questions about potential GLM (Poisson, logistic) models using any of the data sets provided.
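For reference, a hedged sketch of what the two GLMs mentioned above look like in R. The data here are simulated, and the column names x, goals and win are hypothetical; nothing below comes from the provided data sets.

```r
set.seed(1)
# Simulated stand-in data; x, goals and win are hypothetical column names.
sim <- data.frame(x = rnorm(200))
sim$goals <- rpois(200, lambda = exp(0.5 + 0.3 * sim$x))         # count response
sim$win   <- rbinom(200, size = 1, prob = plogis(-0.2 + sim$x))  # 0/1 response

# Poisson regression for counts (log link by default).
pois_fit <- glm(goals ~ x, family = poisson, data = sim)

# Logistic regression for binary outcomes (logit link by default).
logit_fit <- glm(win ~ x, family = binomial, data = sim)

exp(coef(pois_fit))   # multiplicative change in the expected count per unit of x
exp(coef(logit_fit))  # odds ratios per unit of x
```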
Data Sets
There are eight different datasets for the regression projects (three of which were generated via the init_regression_project_data.R script):
- nba_team_season_summary.csv - summary of regular season performance for each NBA team since 2003, courtesy of NBA stats via the nbastatR package,
- tennis_2013_2017_GS.csv - tennis grand slam statistics for 3066 ATP and WTA matches between 2013 and 2017. Data from Jeff Sackmann’s tennis data repo, retrieved by Stephanie Kovalchik’s R deuce package, and synthesized in Gallagher, Frisoli, and Luby’s R courtsports package,
- baseball_batting.csv - MLB player season level batting statistics for 1429 player-seasons from 2010 to 2019. Data generated by FanGraphs and accessed courtesy of the baseballr package,
- cfb_2019_games.csv - Results from all NCAA D1 College Football games in 2019. Includes team data, final score and an excitement rating for the game. Data accessed via the cfbscrapR package,
- overwatch_odds.csv - Overwatch E-Sports League head to head match results with betting odds data. Data from Kaggle’s E-Sports Data Sets,
- womens_ncaa_soccer_20182019.csv - NCAA Women’s D1 soccer offensive and defensive team statistics from the 2018 and 2019 seasons. Data acquired from NCAA.com.
- womens_ncaa_volleyball_20182019.csv - NCAA Women’s D1 volleyball offensive and defensive team statistics from the 2018 and 2019 seasons. Data acquired from NCAA.com.
- womens_ncaa_lacrosse.csv - NCAA Women’s D1 lacrosse offensive and defensive team statistics from the 2018, 2019 and shortened 2020 seasons. Data acquired from NCAA.com.
NBA team season summary data
Each row in the nba_team_season_summary.csv dataset corresponds to a single NBA team in a single regular season dating back to 2003. The column names are self-explanatory, but note that the columns named *_perc hold the percentage-based statistics.
Tennis grand slams data
Each row in the data corresponds to a grand slam match between two players. A variety of summary statistics of the match are reported along with winner and loser information. Variables include:
tournament
one of the four grand slams: Australian Open, French Open, US Open, and Wimbledon
year
year of the tournament
winner_name and loser_name
names of the match winner and loser
winner_rank and loser_rank
rank according to ATP or WTA, respectively, at the time of the tournament
Retirement
whether the match ended in a retirement (i.e. one player was unable to finish the match). Logical: TRUE means the match ended in retirement
Tour
either WTA or ATP
round
R128 - Round of 128, R64 - Round of 64, R32 - Round of 32, R16 - Round of 16, QF - Quarter Final, SF - Semi Final, and F - Final
w_* and l_*
stand for winner and loser, respectively, where the suffix is one of many summary statistics including
ave_serve_speed
in mph
n_aces
number of aces
n_winners
number of winners including aces
n_netpt_w
number of net points won
n_netpt
number of net points played
n_bp_w
number of break points won (to break the opponent)
n_bp
number of break points (to break the opponent)
n_ue
number of unforced errors
n_sv
number of serves
n_sv_w
number of service points won
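To illustrate how the w_ and l_ prefixes combine with the suffixes above, here is a small sketch that derives a break point conversion rate for each side. The two matches are made up, and the full column names (w_n_bp_w, l_n_bp and so on) are assumed from the prefix and suffix description rather than checked against the file.

```r
# Toy stand-in for tennis_2013_2017_GS.csv with two made-up matches.
tennis_toy <- data.frame(
  w_n_bp_w = c(5, 3), w_n_bp = c(10, 4),
  l_n_bp_w = c(2, 1), l_n_bp = c(8, 5)
)

# Break point conversion rate: break points won / break points played.
tennis_toy$w_bp_rate <- tennis_toy$w_n_bp_w / tennis_toy$w_n_bp
tennis_toy$l_bp_rate <- tennis_toy$l_n_bp_w / tennis_toy$l_n_bp

# Winner-minus-loser difference in conversion rate.
tennis_toy$bp_rate_diff <- tennis_toy$w_bp_rate - tennis_toy$l_bp_rate
tennis_toy
```

A match in which a player had no break points would give a 0 / 0 division, so such rows produce NaN and need handling before modeling.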
MLB Batting Statistics 2010-2019
Each row in the baseball_batting.csv data corresponds to the batting statistics for a single player in a single season between 2010 and 2019. The first few variables, as well as the singles, doubles and triples, are self-explanatory. The other baseball variables are defined as follows:
G
games played
AB
at bats: Plate appearances, not including bases on balls, being hit by pitch, sacrifices, interference, or obstruction.
PA
plate appearances
H
hits
HR
home runs
R
runs scored; the number of times a player crosses home plate
RBI
runs batted in: the number of runners who score due to a batter’s action
BB
walks (bases on balls)
IBB
intentional base on balls, times walked intentionally by pitcher
HBP
hit by pitch: walked as a result of being hit by a pitch
SF
sacrifice fly: fly balls hit to the outfield which although caught for an out, allow a baserunner to advance
SH
sacrifice hit: number of sacrifice bunts which allow runners to advance on the basepaths
GDP
ground into double-play: number of ground balls that became double plays
SB
stolen bases
CS
number of times caught stealing
AVG
batting average
Pitches
number of pitches faced
Balls
number of balls faced
Strikes
number of strikes faced
SO
strike outs
BB_K
walks / strike outs. Walk to strike out ratio
OBP
on base percentage
SLG
slugging average: total bases achieved on hits / at bats
OPS
on-base plus slugging: on-base percentage plus slugging average
ISO
isolated power: a hitter’s ability to hit for extra bases, calculated by subtracting batting average from slugging percentage
wOBA
weighted on base average
WAR
wins above replacement: a non-standard formula to calculate the number of wins a player contributes to his team over a “replacement-level player”
WPA_plus
win probability added, positive total
WPA_minus
win probability added, negative total
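Several of the rate statistics above are simple functions of the counting statistics. The worked example below uses a made-up batting line to show the arithmetic. The variable names singles, doubles and triples are hypothetical (the actual column names in the file may differ), and the OBP formula is the standard one rather than something stated in this document.

```r
# Made-up season line for one hypothetical player.
AB <- 500; H <- 150; HR <- 15
singles <- 100; doubles <- 30; triples <- 5   # hypothetical variable names
BB <- 50; HBP <- 5; SF <- 5; SO <- 100

AVG  <- H / AB                                       # batting average: 0.300
TB   <- singles + 2 * doubles + 3 * triples + 4 * HR # total bases: 235
SLG  <- TB / AB                                      # slugging average: 0.470
OBP  <- (H + BB + HBP) / (AB + BB + HBP + SF)        # standard OBP formula
OPS  <- OBP + SLG                                    # on-base plus slugging
ISO  <- SLG - AVG                                    # isolated power: 0.170
BB_K <- BB / SO                                      # walk to strikeout ratio: 0.5

round(c(AVG = AVG, OBP = OBP, SLG = SLG, OPS = OPS, ISO = ISO, BB_K = BB_K), 3)
```

Because OPS and ISO are exact functions of other columns, putting them in the same model as their components makes the predictors collinear, which is worth keeping in mind during variable selection.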
Overwatch League Results and Odds
Each row in the overwatch_odds.csv data set contains data on a single Overwatch League match played between 2018 and 2020. Data includes the two teams, stage and winner, as well as information on the two teams’ success in the season thus far and in their history up until that point. Columns include:
id
game id
corona_virus_isolation
was the game played under coronavirus isolation measures?
t1_wins_season
t2_wins_season
how many games team 1 has won in the season prior to the game (t2 for team 2)
t1_losses_season
t2_losses_season
how many games team 1 has lost in the season prior to the game (t2 for team 2)
t1_win_percent
team 1 (t2 for team 2) win percentage in the season, in the last X games or all-time depending on variable name
t1_odds
betting odds for team 1 to win the game.
Positive figures: The odds state the winnings on a 100 dollar bet (e.g. American odds of 110 would win 110 on a 100 dollar bet).
Negative figures: The odds state how much must be bet to win 100 profit (e.g. American odds of -90 would win 100 on a 90 dollar bet).
t2_odds
betting odds for team 2 to win the game
t1_probability
the implied win probability for team 1 given the betting odds
t2_probability
the implied win probability for team 2 given the betting odds
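The usual conversion from American odds to an implied win probability follows directly from the two definitions above: the stake divided by the total returned on a win. The function below is a sketch of that conversion; american_to_prob is a hypothetical helper name, and whether the t1_probability and t2_probability columns were computed exactly this way is an assumption.

```r
# Convert American odds to the implied win probability (stake / total return).
american_to_prob <- function(odds) {
  ifelse(odds > 0,
         100 / (odds + 100),        # positive odds: stake 100 to win `odds`
         -odds / (-odds + 100))     # negative odds: stake `-odds` to win 100
}

# The two examples from the definitions above.
american_to_prob(c(110, -90))       # roughly 0.476 and 0.474
```

The implied probabilities for the two sides of a match typically sum to more than 1 because of the bookmaker's margin, so they may need rescaling before being read as true probabilities.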
Women’s NCAA D1 Soccer 2018 & 2019 Team Statistics
Each row in the womens_ncaa_soccer_20182019.csv data set refers to the statistics for a single school in a particular season. There are 668 school-season combinations spanning 2018 and 2019. Variables include:
assists
the total number of assists earned by players on the team
team_games
games played
assists_gp
total assists earned per game played
corners
corner kicks taken. This variable is unavailable for 2018
corners_gp
corner kicks taken per game played. This variable is unavailable for 2018
fouls
total fouls called on the team.
fouls_gp
fouls called on the team per game played
ga
goals against
team_min
total minutes played by the team, including stoppage time
gaa
goals against per game played
ps
penalty kicks scored on
psatt
penalty kicks attempted
pk_pct
percentage of penalty kicks completed
points
total points (goals + assists) accumulated for all players on the team
points_gp
points accumulated by players on the team per game played
saves
saves made by team goalkeepers
save_pct
percentage of shots faced that goalkeepers saved
saves_gp
number of saves made per game played
goals
total goals scored by team
gpg
the number of goals scored by the team per game played
sog
total shots on goal
shatt
total shot attempts
sog_pct
percentage of shot attempts that were on goal
won
games won
lost
games lost
tied
games tied
win_pct
winning percentage
sog_gp
number of shots on goal per game played
season
the season the data refers to
Women’s NCAA D1 Volleyball 2018 & 2019 Team Statistics
Each row in the womens_ncaa_volleyball_20182019.csv data set refers to the statistics for a single school in a particular season. There are 666 school-season combinations spanning 2018 and 2019. Variables include:
s
number of sets played
aces
aces hit. An ace is a serve which lands in the opponent’s court without being touched, or is touched, but unable to be kept in play by one or more receiving team players
aces_per_set
aces earned per set played
assists
total team assists. Assists are awarded to a player who passes the ball to a teammate who attacks the ball for a kill. Can be awarded off a dig (first contact), provided the attack comes on the second contact
assists_per_set
assists earned per set played
block_solos
total team solo blocks. A solo block occurs when a single player blocks the ball into the opponent’s court, leading to a point or side out
block_assists
total team assisted blocks
blocks_per_set
total team blocks per set
digs
total team digs. A dig occurs when a player passes the ball which has been attacked by the opposition. Digs are only given when players receive an attacked ball and it is kept in play
digs_per_set
team digs per set played
kills
team kills. An attack by a player that is not returnable by the receiving player on the opposing team and leads directly to a point or loss of rally
errors
total team serve errors
total_attacks
total attack attempts. An attack is any overhead contact of the ball designed to score
hit_pct
hitting percentage: kills minus hitting errors, divided by the total number of attack attempts
kills_per_set
kills earned per set played
w
team wins
l
team losses
win_pct
team winning percentage
season
season
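The hitting percentage definition above can be written as a one-line function. Note that the errors column is described above as serve errors, so hitting_errors here is a hypothetical input rather than a column known to be in the data, and the season totals are made up.

```r
# Hitting percentage = (kills - hitting errors) / total attack attempts.
hit_pct <- function(kills, hitting_errors, total_attacks) {
  (kills - hitting_errors) / total_attacks
}

# Made-up season totals for one team: (1500 - 600) / 4000 = 0.225.
example_hit_pct <- hit_pct(kills = 1500, hitting_errors = 600, total_attacks = 4000)
example_hit_pct
```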
Women’s NCAA D1 Lacrosse 2018, 2019 & 2020 Team Statistics
Each row in the womens_ncaa_lacrosse.csv data set refers to the D1 Women’s lacrosse statistics for a single school in a particular season. There are 348 school-season combinations spanning 2018, 2019 and 2020. Due to the coronavirus pandemic, no team played more than 10 games in 2020. Note that all _gp variables are the per game played versions of the variable they name (e.g. assists_gp is total team assists per game played). Other variable definitions are:
assists
total team assists. The player who passes the ball to the player who scores a goal is credited with an assist
caused_tos
total turnovers caused by the team. Also referred to as ‘takeaways’
draw_controls
total team draw controls. A draw control occurs when a player successfully gains control of the ball after a draw.
fouls
total team fouls
clears
total team clears. A clear occurs when a team passes the offensive restraining line and is clearly able to get an offensive attempt.
clr_att
total team attempted clears
clr_pct
percent of team clear attempts that were successful
opp_dc
opponents’ draw control total
drawc_control_pct
percent of total draws that the team controlled
freepos_goals
total free position goals. A free position shot in women’s lacrosse is similar to a foul shot in basketball, awarded to an offensive player when a defender commits a major foul inside the 8-meter arc
freepos_shots
total free position shots taken
free_position_pct
percent of free position shots which resulted in goals
goals
total team goals scored
points
total points earned by all players on the team (goals + assists)
team_min
total team minutes played
goals_allowed
total goals allowed
saves
total saves from all team goalkeepers
sv_pct
team percentage of shots allowed which were saved, and not goals against
ga_gp
goals allowed per game played
margin
difference in goals scored - goals allowed per game played
gf_gp
goals scored per game played
sog
total shots on goal
turnovers
total team turnovers committed
won
games won
lost
games lost
win_pct
team winning percentage
yellow_cards
total yellow cards earned by all players on the team
season
season
