About The Conference

Now in its sixth year, the Carnegie Mellon Sports Analytics Conference is dedicated to highlighting the latest sports research from the statistics and data science community.


Interested in presenting your research at CMSAC? Then submit an abstract to present a poster at the conference using the form below!


You can find out more about #CMSAC22 in the schedule below, along with information about this year's speakers. Check out our 2019, 2020, and 2021 conferences.


Registration

Register now!

You can register to attend #CMSAC22 in person or virtually. Virtual attendees will be able to attend the workshop and speaker events (excluding the mock interview) via a Zoom webinar. While virtual attendees will be able to ask questions, priority will be given to in-person attendees. Additionally, in-person attendees will have access to the poster session and networking opportunities throughout the conference. Click the button above to register and see below for pricing.

Virtual Attendance Registration (limited access to networking opportunities)

  • High School / Undergrad / Grad students Conference + Workshop: FREE (with school ID)
  • Non-students Conference: $25
  • Non-students Workshop: $10
  • Non-students Conference + Workshop: $30

In-Person Registration (until Oct 28th)

  • High School / Undergrad / Grad students Conference: $15 (with school ID)
  • High School / Undergrad / Grad students Workshop: $10 (with school ID)
  • High School / Undergrad / Grad students Conference + Workshop: $20 (with school ID)
  • Non-students Conference: $50
  • Non-students Workshop: $20
  • Non-students Conference + Workshop: $60
Registering indicates agreement to abide by the Code of Conduct.

Hotel information

We have a room block with Hilton Garden Inn (reserve now with this link).

Conference Location

Carnegie Mellon University
Giant Eagle Auditorium
4909 Frew St, Pittsburgh, PA 15213
From PIT Airport

1. Head northeast on Airport Blvd
2. Keep left to stay on Airport Blvd - 0.6 mi
3. Keep left to stay on Airport Blvd - 0.7 mi
4. Continue straight to stay on Airport Blvd - 0.2 mi
5. Keep left at the fork, follow signs for I-376 E/I-79 E/Pittsburgh/Pennsylvania Turnpike E and merge onto I-376 E - 0.6 mi
6. Merge onto I-376 E - 16.4 mi
7. Keep right to stay on I-376 E - 2.1 mi
8. Take exit 72A to merge onto Forbes Ave toward Oakland - 0.3 mi
9. Merge onto Forbes Ave - 1.0 mi
10. Turn right onto Schenley Drive Extension - 449 ft
11. Turn left onto Schenley Drive - 0.2 mi
12. Turn left onto Frew St - 0.2 mi
13. Destination will be on the left


Schedule Details

Big Data Bowl Workshop in Giant Eagle Auditorium, led by Ron Yurko

  • 5 PM

    Access slides for workshop here

  • 6:30 PM

    Break for food (pizza to be provided)

  • 6:45 to 7:15 PM

    Q&A with previous Big Data Bowl Winners

Conference sessions in Giant Eagle Auditorium

  • 8:00 AM

    Registration

  • 8:50 AM

    Welcome and Opening Remarks

    CMU Statistics & Data Science
  • 9:00 AM

    Keynote Address: Doug Fearing

    Zelus Analytics
  • 9:45 AM

    Coffee Break

  • 10:00 AM

    Women’s Olympic Hockey - Developing a better Power Play

    Robyn Ritchie
  • 10:30 AM

    Data in the W

    Jacob Goldstein
  • 11:00 AM

    Coffee Break

  • 11:20 AM

    Open-Sourcing the Sports Analytics Hiring Process

    Sam Ventura, Katy McKeough, and Carleen Markey
  • 12:05 PM

    SCORE Spotlight

    Rebecca Nugent
  • 12:20 PM

    Poster Previews

    Poster Presenters
  • 12:30 PM

    Lunch and Poster Session

  • 1:45 PM

    Determining the Critical Influences Affecting Offensive Outcomes of Shifted At-Bats

    CMSACamp: Amber Potter and Nicholas Esposito
  • 2:00 PM

    Is Breanna Stewart the Lebron James of the WNBA? Developing WNBA and NBA Archetypes and Playstyle Comparisons

    CMSACamp: Amor Ai and Mykalyster Homberg
  • 2:15 PM

    A Regularized Adjusted Plus-Minus Model in Soccer with Box Score Prior

    CMSACamp: Boyuan (Gary) Zhang
  • 2:30 PM

    Break

  • 2:45 PM

    Does Icing the Kicker Work in the NFL?

    CMSACamp: Adriana Gonzalez Sanchez and Sierra Martinez
  • 3:00 PM

    Identifying Defensive Coverage Types in NFL Tracking Data

    CMSACamp: Vivek Shah
  • 3:15 PM

    Paying Players What They Are Worth

    CMSACamp: Eric Warren
  • 3:30 PM

    Does Not Playing Hockey Make You Worse At Hockey?

    CMSACamp: Michele Sezgin and Jackie Jovanovic
  • 3:45 PM

    Break

  • 4:00 PM

    Intro to Reproducible Research Competition

  • 4:00 PM

    Refining Search Spaces in Hierarchical Tournament Brackets using Chalk Index

    Student-Track Methods Winner: Chris Toukmaji
  • 4:20 PM

    fRisbee: An Open-Source Package for College and Professional Ultimate Frisbee Data and Modeling

    Student-Track Data/Software Finalist: Benjamin Wieland and Akiva Dienstfrey
  • 4:30 PM

    fastRhockey: A Package For Women’s Hockey Data

    Student-Track Data/Software Finalist: Ben Howell
  • 4:40 PM

    Break / Voting (Student-Track)

  • 5:00 PM

    hockeyR: Easy access to detailed NHL play-by-play data

    Open-Track Data/Software Winner: Dan Morse
  • 5:10 PM

    The Racial Imbalance in College Football Coaching

    Open-Track Methods Finalist: Robert Binion and Mark Wood
  • 5:30 PM

    Estimating Aging Curves: Using Multiple Imputation to Examine Career Trajectories of MLB Offensive Players

    Open-Track Methods Finalist: Quang Nguyen
  • 5:50 PM

    cricWAR: A reproducible system for evaluating player performance in limited-overs cricket

    Open-Track Methods Finalist: Hassan Rafique
  • 6:10 PM

    Voting (Open-Track)

  • 6:20 PM

    Awards and Closing Remarks

  • 6:30 to 8:00 PM

    Networking Reception (Baker Patio and coffee lounge)


Conference Speakers

Conference Keynote Speaker

Doug Fearing

Zelus Analytics

Biography:

Doug Fearing is the Co-Founder & CEO of Zelus Analytics. Over the past three years, Zelus has built one of the largest analytics teams in sports, serving pro teams in the MLB, NBA, NFL, NHL, IPL, and European soccer. Prior to Zelus, Doug founded the LA Dodgers Baseball R&D department and served as its Director for four seasons. Through his role, he helped integrate data science, performance technologies, and software engineering into all functions of Baseball Operations. His Dodgers experience culminated with World Series appearances in 2017 and 2018. Doug received his B.S. in Computer Science from CMU and his Ph.D. in Operations Research from MIT. After MIT, he spent five years in academia, teaching at Harvard Business School and the UT Austin McCombs School of Business. During that time, he consulted for the Tampa Bay Rays as a Senior Advisor to the Baseball R&D department.

Conference Speakers

Robyn Ritchie

Simon Fraser University

Robyn Ritchie is a PhD candidate at Simon Fraser University outside of Vancouver, Canada. Her current research focuses on advancing curling analytics with statistical learning to inform decision making in the game. She completed her Master’s in statistics at the University of Manitoba, where she estimated the scoring rates of various teams in the English Premier League and compared home and away performances and scoring patterns throughout additional time. In the past year, Robyn and her team won the NFL’s Big Data Bowl competition with a project that sought to determine the optimal path to get the punt returner to the end zone. Her team was the first college entry to win the competition, and she was the first-ever female grand champion. After this, Robyn put together another team to enter and win the Big Data Cup. Her team used women’s ice hockey power play data from the 2022 Olympics to gain insight into passing.

Women’s Olympic Hockey - Developing a better Power Play

Historically, tracking data in hockey has not been publicly available. Even less available is data on women’s professional sports. This year’s Big Data Cup allowed researchers to dive into unknown territory and advance both the game of hockey and the women’s competition. Using event and tracking data from the elimination round games during the 2022 Winter Olympics, we evaluated passes in order to assess players’ risk-reward behaviors in these high-intensity moments. We developed a physics- and motion-based model for both the puck and the players to determine potential targets for any pass and assessed interception chances. In addition, we incorporated rink control and scoring probability to better understand the factors that go into the decision to attempt an action, the likelihood of its success, and its contribution to the desired end result: a goal. After developing a multitude of novel metrics, we evaluated passes made throughout the available power plays and compared them to the optimal options at that time. This can be used to distinguish risky players from conservative ones and to advise a coach on who should be on the power play to target their desired result. Additionally, this can be used from a player development standpoint to improve a player's passing ability and to develop better passing strategies on the power play.

Jacob Goldstein

Monumental Sports & Entertainment

Jacob Goldstein is beginning his third year as a Research Analyst at Monumental Sports & Entertainment, supporting the NBA's Wizards, WNBA's Mystics, G-League's Go-Go and 2K League’s Wizards District Gaming. His research aids decision making across many aspects of basketball operations including the front office, coaching, and player performance. Prior to joining MSE, Jacob was a co-founder of BBall Index where he built a database of unique basketball evaluation metrics including Player Impact Plus-Minus. He received a degree in Mechanical Engineering from Lehigh University.

Data in the W

With more talent across the WNBA than ever before, teams are turning to data analysis to find new competitive advantages. From the unique position of supporting multiple teams across multiple leagues, Jacob compares the availability and accessibility of data across leagues, explores what it takes to support a modern WNBA team, and looks at how the rise in analytics is changing how teams operate.

Open-Sourcing the Sports Analytics Hiring Process

Sam Ventura

Buffalo Sabres

Sam Ventura is the Vice President of Hockey Strategy and Research for the Buffalo Sabres, and an affiliated faculty member at Carnegie Mellon University’s Department of Statistics & Data Science. He also serves as an advisory board member for the University of Pittsburgh’s MS in Quantitative Economics program. Prior to that, he was the Director of Hockey Operations and Director of Hockey Research for the Pittsburgh Penguins; a professor of Statistics at CMU, where he also received his PhD (Statistics, 2015); an assistant coach for Carnegie Mellon’s ice hockey team; and the faculty advisor to the CMU Sports Analytics Club. Sam has co-authored multiple R packages for open-source data collection and analysis, including nhlscrapr, nflscrapR, and spew, and he co-founded war-on-ice.com. He co-organizes the annual Carnegie Mellon Sports Analytics Conference.

Katy McKeough

Boston Red Sox

Katy McKeough is an Analyst for the Boston Red Sox Baseball Analytics team. At the Red Sox, she works on statistical and machine learning models that help drive decision-making for Major League player acquisition and in-game strategy. Katy has a Ph.D. in Statistics from Harvard University. She is also a Carnegie Mellon alumna and graduated with a B.S. in Physics and Statistics in 2015.

Carleen Markey

Carnegie Mellon University

Carleen is a physics PhD student at Carnegie Mellon University who has worked extensively with women's hockey data over the past four years, including public projects on the Premier Hockey Federation, Olympic Women's Hockey, the Canadian Women's Hockey League, as well as volunteering with Chatham Women's Ice Hockey doing private analytics projects. She primarily focuses on integrating and optimizing analytics in data-poor leagues, as well as systems research and player development.

CMSACamp 2022 Student Speakers

Amber Potter (view report)

Duke University

Amber Potter is a senior at Duke University double majoring in statistics and computer science with a concentration in data science. At Duke, she is a member of the Sports Analytics Club and is a student manager of analytics for Duke Baseball. Motivated by her love for baseball, much of her work looks into capturing relationships associated with batting success among professional and college baseball. Her goal is to work as a research and development analyst for an MLB team, although she is also considering pursuing her PhD in statistics.

Determining the Critical Influences Affecting Offensive Outcomes of Shifted At-Bats

In recent years, the infield shift has become a topic of heated debate in the world of professional baseball. As use of the shift has increased, the number of people in opposition has grown as well. The counterargument is built on the idea that the shift “steals” hits at a rate that puts batters at a significant disadvantage, relative to the defense. However, shifting trends are observed to be different for right-handed batters in comparison with left-handed batters, a phenomenon that serves as the main motivation for this study. In this study, we use wOBA (weighted on-base average) as the primary statistic of interest for offensive production/outcomes by batter handedness and infield orientation, and we also explore BABIP (batting average on balls in play). Our observations of wOBA by batter handedness since 2015 reflected our initial motivation, as league average wOBA for right-handed batters was higher in shifted situations than non-shifted situations. This was not the case for left-handed batters. We also determined that, because BABIP and wOBA followed different trends by handedness, factors outside of the BABIP calculation (walks, home runs, swing tendencies, etc.) could be associated with right-handed batters’ relative success against the shift. This intuition was a basis for our modeling, which attempts to capture how the relationships between these factors and success against the infield shift differ by batter handedness.
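As a rough illustration of why BABIP and wOBA can move differently, the sketch below computes BABIP with the commonly used formula (H - HR) / (AB - K - HR + SF). Walks never enter the calculation and home runs are subtracted out, which is why such factors sit "outside of the BABIP calculation". The function name and the sample batting line are hypothetical and are not taken from the study.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    # Home runs are removed from the numerator; walks are not an input at all.
    return (hits - home_runs) / balls_in_play


# Hypothetical batting line, for illustration only.
sample_line = dict(hits=150, home_runs=30, at_bats=550, strikeouts=120, sac_flies=5)
print(round(babip(**sample_line), 3))
```

A batter who adds walks or home runs raises wOBA but leaves the BABIP numerator and denominator essentially unaffected, so the two statistics need not trend together.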

Nicholas Esposito (view report)

University of Florida

Nicholas is a third-year undergraduate at the University of Florida studying data science with a minor in mathematics. Beyond academics and extracurriculars, he is always involved with sports, whether it is playing casually with friends, taking part in intramurals on campus, or watching games. Following completion of his undergraduate degree, he plans to pursue graduate school in the field of data science. His long-term career goal is to work in the data analytics department (research & development) of a professional sports team.

Determining the Critical Influences Affecting Offensive Outcomes of Shifted At-Bats

In recent years, the infield shift has become a topic of heated debate in the world of professional baseball. As use of the shift has increased, the number of people in opposition has grown as well. The counterargument is built on the idea that the shift “steals” hits at a rate that puts batters at a significant disadvantage, relative to the defense. However, shifting trends are observed to be different for right-handed batters in comparison with left-handed batters, a phenomenon that serves as the main motivation for this study. In this study, we use wOBA (weighted on-base average) as the primary statistic of interest for offensive production/outcomes by batter handedness and infield orientation, and we also explore BABIP (batting average on balls in play). Our observations of wOBA by batter handedness since 2015 reflected our initial motivation, as league average wOBA for right-handed batters was higher in shifted situations than non-shifted situations. This was not the case for left-handed batters. We also determined that, because BABIP and wOBA followed different trends by handedness, factors outside of the BABIP calculation (walks, home runs, swing tendencies, etc.) could be associated with right-handed batters’ relative success against the shift. This intuition was a basis for our modeling, which attempts to capture how the relationships between these factors and success against the infield shift differ by batter handedness.

Amor Ai (view report)

Carnegie Mellon University

Amor Ai is a junior at Carnegie Mellon University, studying Statistics and Decision Science with a minor in Psychology. She is deeply passionate about utilizing and mobilizing data to tell stories and drawing on behavioral and psychological insights to better understand human behavior. Originally from the Bay Area, she played on the #1 ranked high school basketball team in the nation and developed an intense love for playing and watching sports. Now, she lives in Seattle and is a big sneaker enthusiast, an aspiring amateur CrossFit athlete, and a sucker for good podcasts.

Is Breanna Stewart the Lebron James of the WNBA? Developing WNBA and NBA Archetypes and Playstyle Comparisons

Even though interest in women’s sports has skyrocketed with the growing influence of social media and the heightened popularity of global stars, it has nevertheless been difficult to cultivate an audience and build a market for women’s sports when they receive minimal media coverage compared to their male counterparts. The lack of nationally televised games and the limited marketing budgets available to showcase the players throughout the year have been a persistent barrier for sports fans, especially established NBA fans, to consistently engage with the WNBA even when they are interested. Therefore, we seek to provide more convenient and accessible information on WNBA players in order to ultimately promote sustained fan engagement and interactions with the league. To do so, our project is centered around a public-facing Shiny app that allows fans to compare WNBA players to specific archetypes and similar NBA players. We believe that labeling each WNBA player with an archetype and developing an NBA player comparison can boost year-round engagement and bring the WNBA into the spotlight, keeping it there for years to come.

Mykalyster Homberg (view report)

Harvard University

Mykalyster Homberg is a junior at Harvard University, concentrating in Economics with a secondary in Spanish. At Harvard, he is a student-manager for the Men’s Basketball Team and a member of the Harvard Sports Analysis Collective. He is interested in the applications of data science within Salary Cap Management, Player Development, and Coaching in the NBA. His long-term career goal is to work in Basketball Operations and Strategy for an NBA team.

Is Breanna Stewart the Lebron James of the WNBA? Developing WNBA and NBA Archetypes and Playstyle Comparisons

Even though interest in women’s sports has skyrocketed with the growing influence of social media and the heightened popularity of global stars, it has nevertheless been difficult to cultivate an audience and build a market for women’s sports when they receive minimal media coverage compared to their male counterparts. The lack of nationally televised games and the limited marketing budgets available to showcase the players throughout the year have been a persistent barrier for sports fans, especially established NBA fans, to consistently engage with the WNBA even when they are interested. Therefore, we seek to provide more convenient and accessible information on WNBA players in order to ultimately promote sustained fan engagement and interactions with the league. To do so, our project is centered around a public-facing Shiny app that allows fans to compare WNBA players to specific archetypes and similar NBA players. We believe that labeling each WNBA player with an archetype and developing an NBA player comparison can boost year-round engagement and bring the WNBA into the spotlight, keeping it there for years to come.

Adriana Gonzalez Sanchez (view report)

Slippery Rock University

Adriana Gonzalez Sanchez is a senior at Slippery Rock University where she is majoring in Physics and Mathematics with a minor in Statistics. She is currently interested in causal inference and sports analytics, more specifically tennis analytics. At Slippery Rock University she is a member of the E-Board of the Statistics Club, and a member of the women’s tennis team. After graduation, she plans to pursue a PhD in Statistics and Data Analytics. Beyond academia, she enjoys swimming and watching soccer.

Does Icing the Kicker Work in the NFL?

In football, the term “icing the kicker” refers to the defensive team calling a timeout immediately before a field goal attempt in the hope that it will affect the kicker’s ability to make the field goal. This matters because the kick is often a game-deciding play. This project is designed to determine whether icing the kicker (i.e., calling a timeout before a field goal) impacts a kicker’s ability to make the field goal. To approach this problem, a causal inference perspective was taken because the data were non-random observational data. The data used for this project were NFL play-by-play data from the 1999 season to the 2021 season, the treatment variable was whether a kick was iced or not, and the set of potential controls was all non-iced kicks. Three different models were created: one using all data without matching, another using data matched on win probability, and the last using a propensity score for kick success. These three models were then fit using logistic regression, more specifically a generalized additive model (GAM), and the results were compared. The results showed no evidence that a kicker’s ability to make a field goal is affected by timeouts. However, in the future, factors such as roof or surface type, as well as a running clock, will be considered to reach new conclusions.
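The idea of matching iced kicks to non-iced controls on win probability can be sketched in a few lines. This is a minimal illustration only, assuming one-to-one nearest-neighbor matching with replacement on a single covariate; it is not the presenters' model (which used logistic regression / GAMs), and the function names and toy numbers below are hypothetical.

```python
import numpy as np


def nearest_control(treated_x, control_x):
    """For each treated unit, return the index of the closest control unit."""
    treated_x = np.asarray(treated_x, dtype=float)
    control_x = np.asarray(control_x, dtype=float)
    # Pairwise absolute gaps: rows are treated units, columns are controls.
    gaps = np.abs(treated_x[:, None] - control_x[None, :])
    return gaps.argmin(axis=1)


def matched_difference(treated_x, treated_y, control_x, control_y):
    """Mean outcome difference between treated units and their matched controls."""
    idx = nearest_control(treated_x, control_x)
    treated_y = np.asarray(treated_y, dtype=float)
    control_y = np.asarray(control_y, dtype=float)
    return float(np.mean(treated_y - control_y[idx]))


# Made-up toy data: win probability at kick time and whether the kick was made.
iced_wp, iced_made = [0.20, 0.80], [1, 0]
plain_wp, plain_made = [0.10, 0.25, 0.75, 0.90], [0, 1, 1, 1]
print(matched_difference(iced_wp, iced_made, plain_wp, plain_made))
```

Comparing each iced kick only with a non-iced kick taken in a similar game situation is what keeps the comparison from being driven by when coaches choose to call the timeout.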

Sierra Martinez (view report)

University of Oklahoma

Sierra Martinez is a junior mechanical engineering student at the University of Oklahoma with an interest in sports analytics. Sierra played competitive soccer through freshman year of college and, growing up in Texas, has always loved sports, especially football.

Does Icing the Kicker Work in the NFL?

In football, the term “icing the kicker” refers to the defensive team calling a timeout immediately before a field goal attempt in the hope that it will affect the kicker’s ability to make the field goal. This matters because the kick is often a game-deciding play. This project is designed to determine whether icing the kicker (i.e., calling a timeout before a field goal) impacts a kicker’s ability to make the field goal. To approach this problem, a causal inference perspective was taken because the data were non-random observational data. The data used for this project were NFL play-by-play data from the 1999 season to the 2021 season, the treatment variable was whether a kick was iced or not, and the set of potential controls was all non-iced kicks. Three different models were created: one using all data without matching, another using data matched on win probability, and the last using a propensity score for kick success. These three models were then fit using logistic regression, more specifically a generalized additive model (GAM), and the results were compared. The results showed no evidence that a kicker’s ability to make a field goal is affected by timeouts. However, in the future, factors such as roof or surface type, as well as a running clock, will be considered to reach new conclusions.

Vivek Shah (view report)

American School Bombay

Vivek Shah is a senior in high school at the American School Bombay. He enjoys working on data science and machine learning projects. He is deeply passionate about sports and loves watching football as well as playing basketball, soccer, and golf.

Identifying Defensive Coverage Types in NFL Tracking Data

The proliferation of data in sports has led to the increased use of analytics in strategic decisions and the development of more specialized features. However, publicly available statistics for defensive players in the National Football League (NFL) are limited. Nevertheless, the NFL has added chips to players’ shoulder pads that monitor the position, speed, and direction of each player every tenth of a second. A subset of this data is released to the public and can be used to construct new measures to analyze player performance on a play-by-play level. The aim of this paper is to develop an unsupervised learning-based approach to predict player-level soft labels for man and zone coverage in the NFL. This is done by fitting a multivariate t mixture model to the tracking data. The model displays both a high Adjusted Rand Index (ARI) value and a high accuracy on a manually labeled dataset of man and zone coverage and can be used to explore how the coverage prediction for a defensive player changes over the course of a play. This paper also introduces a website with a simple interface that lets users label whether a defensive player is in man or zone coverage and thus can be used to create a large crowd-sourced dataset of defensive coverage labels. The dataset can be used to further improve the model and to explore anomalous defensive situations where the model’s predictions are radically different from human labels.

Eric Warren (view report)

North Carolina State University

Eric Warren is a senior at North Carolina State University, where he is studying Statistics. After graduation, he plans to pursue a career in data analytics with interests in sports and finance. He is an avid football and hockey fan, who enjoys watching and playing any sport in his free time.

Paying Players What They Are Worth

Many professional sports leagues have a salary cap. The National Hockey League has no luxury tax or rollover cap, unlike some other professional leagues. For this reason, balancing spending against player performance is a must, since all teams are on a set budget. This project looks at each player's statistical outputs during a season and estimates the salary they should be receiving, compared with their actual salary.

Boyuan (Gary) Zhang (view report)

New York University

Boyuan Zhang is a third-year undergraduate student at New York University, where he is joint majoring in Data Science and Computer Science, and double minoring in Mathematics and Business Studies. He is passionate about drawing insight from data, and his academic interest is to develop, apply, and analyze machine learning techniques to solve real-world problems. Boyuan is currently a research assistant at the Visualization Imaging and Data Analysis Center (VIDA) at NYU Tandon School of Engineering. He is looking forward to pursuing graduate education in the field of data science and statistics.

A Regularized Adjusted Plus-Minus Model in Soccer with Box Score Prior

Evaluating the impact of individual players on their team’s performance is an important question to consider when analyzing team sports. Variants of the Adjusted Plus-Minus (APM) model have been viewed as all-in-one player performance evaluation metrics, and have been widely utilized in basketball and hockey. However, because of the low number of substitutions and scoring chances in soccer, the APM model has not been shown to be effective in evaluating players' performance. This talk introduces a new kind of Regularized Adjusted Plus-Minus (RAPM) model, which incorporates priors generated from box score statistics into a regularized regression framework, performing point estimation on the player’s contribution to the expected goals per 90 minutes. In particular, using data from the 2021-2022 season of the English Premier League, we show that our RAPM model with box score prior has better predictability and interpretability than the APM model, RAPM model without priors, and RAPM model with FIFA ratings as prior. This model could be further utilized to evaluate the impact of player transfers, simulate teams' performance, and forecast players' market value.
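One common way to put a prior into a regularized regression is to shrink the coefficients toward the prior values instead of toward zero. The sketch below is a minimal illustration of that idea under an assumed ridge (L2) penalty, minimizing ||y - Xb||^2 + lam * ||b - prior||^2; it is not necessarily the formulation used in the talk, and the function name, design matrix, and numbers are hypothetical.

```python
import numpy as np


def ridge_with_prior(X, y, prior, lam):
    """Closed-form minimizer of ||y - X b||^2 + lam * ||b - prior||^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    prior = np.asarray(prior, dtype=float)
    n_players = X.shape[1]
    # Setting the gradient to zero gives (X'X + lam*I) b = X'y + lam*prior.
    lhs = X.T @ X + lam * np.eye(n_players)
    rhs = X.T @ y + lam * prior
    return np.linalg.solve(lhs, rhs)


# Toy example: two players who are always on the pitch together, so the
# on-pitch data alone cannot tell them apart; the prior breaks the tie.
X = [[1.0, 1.0], [1.0, 1.0]]   # hypothetical segment-by-player indicators
y = [0.4, 0.6]                 # hypothetical expected-goal differentials
prior = [0.3, 0.1]             # hypothetical box-score-based prior ratings
print(ridge_with_prior(X, y, prior, lam=1.0))
```

With few substitutions, many players share nearly identical rows in the design matrix, so shrinking toward an informative prior rather than toward zero is what lets the estimates differ between them.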

Michele Sezgin (view report)

Smith College

Michele is a senior Computer Science and Statistical & Data Sciences major at Smith College in Northampton, Massachusetts. She is interested in hockey analytics and hopes to pursue a career in sports analytics post-graduation. She is captain of the Smith College club hockey team and enjoys playing hockey, lifting, and listening to music in her free time.

Does Not Playing Hockey Make You Worse At Hockey?

The COVID-19 pandemic disrupted all aspects of life when it began in 2020, and professional sports were no exception. In 2020-2021, many hockey leagues had shortened or canceled seasons due to the restrictions caused by the pandemic. Some players participated in alternative leagues or tournaments while others did not record any games for that season. Our goal in this research was to investigate the impact of pandemic restrictions on hockey player development in the OHL. We defined our treatment variable as whether a player participated in at least one game during the COVID-19 impacted season (2020-2021), and we assessed the impact on a player’s performance (points per game) the following year. We used a variety of approaches including multiple variations of regression (particularly ordinary least squares, gamma regression, and mixed-effects models) and causal analysis in the form of propensity score matching and Bayesian Additive Regression Trees. In this talk, we will share our findings and our plans for future study.

Jackie Jovanovic (view report)

Elon University

Jackie is a senior at Elon University majoring in Statistics and Music in the Liberal Arts. She is a member of the Baseball Analytics Club, part of Sigma Kappa Kappa Zeta, and an academic mentor. Her passions for analytics, music, and sports have culminated in many interesting projects that combine the fields. After graduation she hopes to pursue a career in baseball or hockey analytics.

Does Not Playing Hockey Make You Worse At Hockey?

The COVID-19 pandemic disrupted all aspects of life when it began in 2020, and professional sports were no exception. In 2020-2021, many hockey leagues had shortened or canceled seasons due to the restrictions caused by the pandemic. Some players participated in alternative leagues or tournaments while others did not record any games for that season. Our goal in this research was to investigate the impact of pandemic restrictions on hockey player development in the OHL. We defined our treatment variable as whether a player participated in at least one game during the COVID-19 impacted season (2020-2021), and we assessed the impact on a player’s performance (points per game) the following year. We used a variety of approaches including multiple variations of regression (particularly ordinary least squares, gamma regression, and mixed-effects models) and causal analysis in the form of propensity score matching and Bayesian Additive Regression Trees. In this talk, we will share our findings and our plans for future study.

Big Data Bowl Workshop Q&A Experts

Brendan Kumagai

Zelus Analytics, Simon Fraser University

Brendan Kumagai is a data scientist at Zelus Analytics and a Master’s in Statistics student at Simon Fraser University where he is currently doing research on the NHL draft. He was part of the championship-winning teams for the 2021 Big Data Cup and 2022 NFL Big Data Bowl and has previously interned with Stathletes, the Canadian Tire Sports Analytics team, and the McMaster Visual Neuroscience Lab.

Lucas Wu

Zelus Analytics, Simon Fraser University

Lucas Wu is a PhD student at Simon Fraser University and a data scientist at Zelus Analytics. His research interests include Sports Analytics, Bayesian Statistics and Causal Inference. His team participated in the first three Big Data Bowl competitions, winning the college division in 2019, earning an honorable mention in 2020, and finishing as a finalist in the open division in 2021.

Matthew Reyers

Zelus Analytics

Matthew Reyers is a Machine Learning Engineer at Zelus Analytics working in Gridiron Football. He is a graduate of Simon Fraser University's Master of Statistics program and a former winner of the NFL Big Data Bowl alongside SFU teammates Lucas Wu, Dani Chu, and James Thomson. His team's work has aided the addition of new drills to the NFL Combine, pairing an NFL tradition with data-informed alterations.

Robyn Ritchie

Simon Fraser University

Robyn Ritchie is a PhD candidate at Simon Fraser University outside of Vancouver, Canada. Her current research focuses on advancing curling analytics with statistical learning to inform decision making in the game. She completed her master’s in statistics at the University of Manitoba, where she estimated the scoring rates of various teams in the English Premier League and compared home and away performances and scoring patterns throughout additional time. In the past year, Robyn and her team won the NFL’s Big Data Bowl competition with a project that sought to determine the optimal path to get the punt returner to the end zone. Her team was the first college entry to win the competition and she was the first-ever female grand champion. After this, Robyn put together another team to enter and win the Big Data Cup. Her team used women’s ice hockey power play data from the 2022 Olympics to gain insight into passing.

Marc Richards

Kansas City Chiefs, University of Pittsburgh

Marc Richards is a Football Research Analyst for the Kansas City Chiefs. At the Chiefs, he supports both the Coaching Staff and Front Office in data-driven decision making. Marc is also a PhD student in the Department of Statistics at the University of Pittsburgh. Prior to the Chiefs, Marc was a part of a team that won the 2021 NFL Big Data Bowl and finished as finalists in the 2022 competition. He received his bachelor's degree in mathematics from St. Olaf College where he also played college hockey.

Meyappan Subbaiah

NFL (starting in November)

Meyappan Subbaiah spent the past three years as a Data/Product Scientist at Zelus, working on various sports (baseball, cricket, and football). He is starting a new job as a Data Scientist in the NFL league office in November. He received his M.S. in Business Analytics at the University of Texas. His team was a finalist in the open division of the 2021 Big Data Bowl. Passionate about collegiate sports, he developed an R package called cfbscrapR, which gives college football fans an easy way to get data with embedded EPA/WPA models.


Reproducible Research Competition

Student-Track Methods Winner

Chris Toukmaji (view paper)

University of California, Santa Cruz

Chris Toukmaji is a senior at the University of California, Santa Cruz majoring in Computer Science and minoring in Statistics. His research interests include natural language processing (NLP), explainable artificial intelligence, neural representation, and statistical modeling of financial and sports analytics. Chris is actively involved in NLP research in the JLab, an NLP/Deep Learning Lab at UCSC, and leads the NeuroTechX Chapter at UC Santa Cruz. Chris’s industry experience includes internships at Capital One and SapientX. In his free time, Chris enjoys watching sports and developing models to predict the outcome.

Refining Search Spaces in Hierarchical Tournament Brackets Using Chalk Index

In this paper, we propose "Chalk Index", a new metric that quantifies the variation between a "true" bracket and the Chalk Bracket. The Chalk Index is then applied in bracket outcome prediction. We use the previous eleven years of NCAA Division 1 Men’s Basketball Tournaments as a case application of Chalk Index. The NCAA Division 1 Men’s Basketball Tournament, further referred to as "the tournament", is an annual single elimination bracket-style tournament of 64 college basketball teams, all vying to be crowned the champions of the collegiate basketball world. Game analytics for each tournament team are logged by the statistical database "KenPom". However, KenPom updates its analytics as the tournament is played, so historical data will be biased by tournament results. Since KenPom does not store the pre-tournament data, we use the WayBackMachine, which allows us to view the web pages before KenPom has updated its database to reflect the tournament results. We use the combination of KenPom and the WayBackMachine to parse and curate pre-tournament datasets for all the tournament teams from 2011 to 2022, with the exception of 2020 when the tournament was canceled due to the COVID-19 pandemic. Previous research has shown that there are 2^63 (approximately 9.22 x 10^18) possible tournament brackets each year, only one of which perfectly predicts each game. We observe that sampling from the distribution of historical true Chalk Indexes improves performance in Stochastic Generation and Logistic Regression prediction methods.
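
The 2^63 figure quoted in the abstract can be checked with a short calculation. The sketch below is not taken from the paper; the function name `bracket_count` is hypothetical, and it assumes a single-elimination field whose size is a power of two, with exactly two possible outcomes per game. A 64-team field then has 63 games, hence 2^63 ways to fill out the bracket.

```python
def bracket_count(num_teams: int) -> int:
    """Count the distinct ways to fill out a single-elimination bracket.

    Hypothetical helper for illustration only. Assumes num_teams is a
    power of two and that every game has exactly two possible outcomes.
    """
    if num_teams < 2 or num_teams & (num_teams - 1):
        raise ValueError("num_teams must be a power of two, at least 2")
    games = num_teams - 1  # each game eliminates exactly one team
    return 2 ** games      # two possible winners per game


total = bracket_count(64)
# Print the exact count and its scientific-notation approximation.
print(f"{total:,} brackets (about {total:.2e})")
```

Shrinking this search space is what motivates sampling from the historical Chalk Index distribution instead of enumerating brackets.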

Student-Track Data + Software Finalists

Ben Howell (view paper)

University of Texas

Ben Howell is a senior Sport Management student at the University of Texas who works for the Texas Baseball team as an Analytics Intern. He has spent the past two summers interning with the San Diego Padres as a Baseball Research & Development Intern and pursues research in women’s hockey, creating the open source fastRhockey package and more.

fastRhockey: A Package For Women’s Hockey Data

The boxscore and roster data scraped through fastRhockey allow for the aggregation of summary statistics and enable fantasy PHF hockey to be played. The processed play-by-play data fastRhockey delivers contains every event that occurs within a PHF game, the primary/secondary players, as well as player-on-ice data for goals. fastRhockey was designed to increase access to women’s hockey data, and the data returned by the package enables the women’s hockey analytics space to grow, replicating and reproducing work done regarding men’s hockey, as well as novel research specific to the PHF. fastRhockey was developed as a part of the SportsDataverse universe, and was merged with hockeyR to include NHL data and capabilities under the fastRhockey umbrella, with plans to continue building out women’s hockey functions as the women’s hockey landscape grows.

Benjamin Wieland (view paper)

University of Virginia

Ben Wieland is a sophomore at the University of Virginia majoring in applied statistics with a concentration in data science. He is a member of the Sports Analytics and Statistics Laboratory and writes for Virginia sports blog Streaking The Lawn. His work focuses on making new data accessible and making current data understandable through effective data analysis and visualization. His goal is to work as a data analyst for a professional sports team and potentially pursue an M.S. in Data Science.

fRisbee: An Open-Source Package for College and Professional Ultimate Frisbee Data and Modeling

The popularity of Ultimate Frisbee at the collegiate level is skyrocketing: over 650 men’s and women’s teams competed in the 2021-22 season. fRisbee is an open-source R package designed to create a fast and accessible way to access collegiate men’s and women’s Ultimate Frisbee rankings, results, and historical data. The package provides ease of access to previously infeasible datasets, including access to aggregated game-by-game information on over 5,400 Ultimate Frisbee matches at the collegiate level during the 2021-22 season; it also includes web scraping functions wrapped around www.frisbee-rankings.com to access up-to-date results and data. The package also makes Ultimate Frisbee modeling simple: it comes equipped with two pre-trained and easily deployable models for win probability and margin of victory projections, and includes easy access to the aforementioned historical data repositories for users to train and fit their own models.

Akiva Dienstfrey (view paper)

University of Virginia

Akiva is a second-year undergraduate at the University of Virginia majoring in economics and statistics with a concentration in data science. At UVA, he is a member of the Data Science Club and plays on the Ultimate Frisbee Club team. After graduation, he hopes to pursue a career in data analytics with interests in sports and utilization of resources.

fRisbee: An Open-Source Package for College and Professional Ultimate Frisbee Data and Modeling

The popularity of Ultimate Frisbee at the collegiate level is skyrocketing: over 650 men’s and women’s teams competed in the 2021-22 season. fRisbee is an open-source R package designed to create a fast and accessible way to access collegiate men’s and women’s Ultimate Frisbee rankings, results, and historical data. The package provides ease of access to previously infeasible datasets, including access to aggregated game-by-game information on over 5,400 Ultimate Frisbee matches at the collegiate level during the 2021-22 season; it also includes web scraping functions wrapped around www.frisbee-rankings.com to access up-to-date results and data. The package also makes Ultimate Frisbee modeling simple: it comes equipped with two pre-trained and easily deployable models for win probability and margin of victory projections, and includes easy access to the aforementioned historical data repositories for users to train and fit their own models.

Open-Track Data + Software Winner

Dan Morse (view paper)

Dan was born and raised just outside the city of Seattle, and still lives there with his wife and three children. He works for a small business by day and spends his free time in the evenings on statistical analysis across multiple sports. His journey into sports analytics began five years ago when he decided to learn how to code to better analyze hockey games, and has since grown into several side projects involving football apps, baseball twitter bots, and of course an R package to make it easier to analyze hockey games.

hockeyR: Easy access to detailed NHL play-by-play data

Public sports analytics largely hinges on access to as much raw and detailed play-by-play data as can be made available. Getting started with football data analysis in the public sphere has never been easier since the advent of nflscrapR (and subsequently nflfastR), but a version for hockey doesn’t really exist. There have been multiple scrapers created and shared online in the past, but the creators generally take them down or can’t keep them updated because they keep getting hired away by professional sports teams. The hockeyR package seeks to not only share a publicly available set of functions to scrape NHL play-by-play data but to also keep an up-to-date folder of easily accessible play-by-play data in the form of both .rds and .csv files. This will lower the barrier for entry into the world of hockey analytics, allowing any analyst to access cleaned and complete play-by-play data with a single function call (or even by downloading a .csv file to their own computer manually) without the hassle of running a scraper themselves. Data for every season going back to 2010-11 is already available on GitHub, and future seasons will be automatically updated with GitHub Actions.

Open-Track Methods Finalists

Robert Binion (view paper)

Georgia Tech

Robert Binion grew up in Atlanta, GA and now lives in Salt Lake City, UT with his wife and two children. He does data analysis and software quality engineering within a data lake team for a large restaurant corporation. Previously, he spent over ten years in non-profit and church work before coming to the world of data. His foray into the world of sports analytics started with charting play-by-play data for Georgia Tech football games and has expanded from there. He currently writes analytics-driven football content for Football Outsiders and From The Rumble Seat. He has a B.S. in Chemical Engineering from Georgia Tech and is currently pursuing a Master of Science in Analytics at Georgia Tech. Robert enjoys watching football, reading across all sorts of genres, and spending time in the mountains with his family.

The Racial Imbalance in College Football Coaching

College football has had a race problem since its inception. From issues of team integration to paying student athletes to demographics of the coaching community, race is never far afield from the preeminent issues that permeate the college football landscape. This paper presents a data-driven look at just how inequitable the current racial distribution is in FBS coaching and provides some insights into how this can change. This project uses regression, community detection, and classification models to show that there are persistent racial imbalances in the ranks of college football coaching. The imbalances are not due to performance but are perpetuated by reliance on connections within the coaching community and potentially exacerbated further by implicit or explicit racial bias. To mitigate these imbalances, significant shifts are needed within the paradigms that govern hires made at the highest level of the sport.

Mark Wood (view paper)

Georgia Tech

Mark Wood is from Atlanta, GA and is a design & construction Project Manager for Georgia Tech. He has previously worked on commercial and industrial projects throughout the southeast U.S. and in South Africa. He has a B.S. in Civil Engineering from Georgia Tech and is currently pursuing a Master of Science in Analytics at Georgia Tech. In his free time he enjoys spending time with his wife and two kids, watching all sports except baseball, and going to concerts.

The Racial Imbalance in College Football Coaching

College football has had a race problem since its inception. From issues of team integration to paying student athletes to demographics of the coaching community, race is never far afield from the preeminent issues that permeate the college football landscape. This paper presents a data-driven look at just how inequitable the current racial distribution is in FBS coaching and provides some insights into how this can change. This project uses regression, community detection, and classification models to show that there are persistent racial imbalances in the ranks of college football coaching. The imbalances are not due to performance but are perpetuated by reliance on connections within the coaching community and potentially exacerbated further by implicit or explicit racial bias. To mitigate these imbalances, significant shifts are needed within the paradigms that govern hires made at the highest level of the sport.

Quang Nguyen (view paper)

Carnegie Mellon University

Quang Nguyen is a first-year PhD student in the Department of Statistics & Data Science at Carnegie Mellon University. He previously completed his MS in Applied Statistics at Loyola University Chicago and BS in Mathematics & Data Science at Wittenberg University in Springfield, Ohio. Quang is broadly interested in applications of statistics and machine learning in sports. He is a die-hard supporter of Manchester United F.C. of the English Premier League.

Estimating Aging Curves: Using Multiple Imputation to Examine Career Trajectories of MLB Offensive Players

In sports, an aging curve depicts the relationship between average performance and age in athletes’ careers. This paper investigates the aging curves for offensive players in Major League Baseball. We study this problem in a missing data context and account for different types of dropouts of baseball players during their careers. In particular, the performance metrics associated with the missing seasons are imputed using a multiple imputation model for multilevel data, and the aging curves are constructed based on the imputed datasets. We first perform a simulation study to evaluate the effects of different dropout mechanisms on the estimation of aging curves. Our method is then illustrated with analyses of MLB player data from past seasons. Results suggest an overestimation of the aging curves constructed without imputing the unobserved seasons, whereas a better estimate is achieved with our approach.

Hassan Rafique (view paper)

University of Indianapolis

Hassan Rafique is an assistant professor at the University of Indianapolis Department of Mathematical Sciences. His research work has been at the intersection of optimization and machine learning. In addition, he is particularly interested in sports analytics, data for social good, and data visualization. Outside work, he enjoys being outdoors, playing and watching cricket, and letting his new favorite sport, American football, consume his Sundays.

cricWAR: A reproducible system for evaluating player performance in limited-overs cricket

Cricket statistics mainly comprise simple averages (batting avg., bowling avg., strike rates, etc.) to account for player performance. For limited-overs cricket, these statistics fall short of providing a comprehensive picture of a player’s performance and contribution to the team’s success. Due to the rise of T20 cricket leagues, there is significant interest in comprehensive statistics that capture the net value-added by an individual player. Inspired by sabermetrics, we develop metrics such as runs above average (RAA), value over replacement player (VORP), and wins above replacement (WAR) for batters and bowlers in limited-overs cricket. These metrics are calculated using ball-by-ball data readily available through R package cricketdata, co-developed by us. We associate run value with each actual run scored using estimated expected runs. Some positions/situations are more conducive for scoring (or defending) during the innings, and we adjust the run values by using a variation of the Leverage Index. Run values are adjusted for the venue, bowling pace (spin vs. pace), platoon advantage, and innings (first vs. second) using regression to estimate RAA. We assess the uncertainty in RAA and WAR estimates through a resampling and simulation-based approach and present results for the IPL 2019 regular season. Finally, we discuss the further avenues of research and comment on the possible implications of this work for the T20 teams.




CALL FOR POSTER ABSTRACTS

In an effort to foster intellectual growth and discovery among the statistics and data science community, we gladly welcome research submissions from the public.

Submit your research project to present your work as a poster using the form by September 16th. Note that there are limited spaces available, and abstracts for posters will be accepted on a rolling basis until slots are filled. Final acceptance notifications will be sent out by early October.

Here's a recap of important dates and requirements to remember:

- September 16th: Abstract submission deadline.
- Abstracts will be selected on a rolling basis, with final notification by early October.

NOTE: This research submission form is not considered for entry into the reproducible research competition, meaning it does not require publicly available data and sharing of code.


Our Sponsors

Presenting Sponsor

AvenueFour

Workshop Sponsor

Sumer

Lunch Break Sponsor

Zelus

Coffee Break Sponsor

Penguins

Coffee Break Sponsor

Pirates

Supporting Sponsor

Astros

Please see the sponsorship form for more information about sponsorship opportunities, and contact cmsac@stat.cmu.edu if you have any questions.

You can also directly sponsor a student's registration here!


Contact Us

The Carnegie Mellon Sports Analytics Conference is proudly hosted by the Department of Statistics & Data Science and the Carnegie Mellon Sports Analytics Club.


CMSAC Program Committee:

Carnegie Mellon Sports Analytics Club Executives
  • Eric Hoeffel (President)
  • Jordan Gilbert (Vice President)
  • Quinn Murrary (Officer)
  • Marion Haney (Officer)
Questions can be directed to cmsac@stat.cmu.edu.

CMSAC Activities Conduct Policy

(modeled on the ASA Activities Conduct Policy approved November 30, 2018 by the American Statistical Association Board of Directors)

The Carnegie Mellon Sports Analytics Conference (CMSAC) is committed to providing an atmosphere in which personal respect and intellectual growth are valued and the free expression and exchange of ideas are encouraged. Consistent with this commitment, it is CMSAC policy that all participants in CMSAC activities enjoy a welcoming environment free from unlawful discrimination, harassment, and retaliation. We strive to be a community that welcomes and supports people of all backgrounds and identities. This includes, but is not limited to, members of any race, ethnicity, culture, national origin, color, immigration status, social and economic class, educational level, sex, sexual orientation, gender identity and expression, age, size, family status, political belief, religion, and mental and physical ability.

All CMSAC participants (including, but not limited to, attendees, statisticians, data scientists, sports analysts, students, registered guests, staff, contractors, sponsors, exhibitors, and volunteers) in the conference or any other related activity, whether official or unofficial, agree to comply with all rules and conditions of the activities. Your registration for or attendance at the 2022 Carnegie Mellon Sports Analytics Conference indicates your agreement to abide by this policy and its terms.


Expected Behavior

- Model and support the norms of professional respect necessary to promote the conditions for healthy exchange of scientific ideas.

- Speak and conduct yourself professionally; do not insult or disparage other participants.

- Be conscious of hierarchical structures in the sports analytics and/or broader statistics/data science community, specifically the existence of stark power differentials among students, junior analysts/statisticians, and senior analysts/statisticians—noting that fear of retaliation from those in senior-level positions can make it difficult for students or those in junior level positions to express discomfort, rebuff unwelcome advances, and report violations of the conduct policy.

- Be sensitive to body language and other non-verbal signals and respond respectfully.


Unacceptable Behavior

- Violent threats or language directed against another person

- Discriminatory jokes and language

- Inclusion of unnecessary sexually explicit, violent, or otherwise sensitive materials in presentations

- Posting (or threatening to post), without permission, other people’s personally identifying information online, including on social networking sites

- Personal insults including, but not limited to, those using racist, sexist, homophobic, or xenophobic terms

- Unwelcome solicitation of emotional or physical intimacy such as sexual advances; propositions; sexual flirtations; sexually-related touching; and graphic gestures or comments about sex or another person’s dress, body, or sexual activities

- Advocating for, encouraging, or dismissing the severity of any of the above behaviors.


Consequences of Unacceptable Behavior

At the sole discretion of the CMSAC Program Committee, unacceptable behavior may result in removal from or denial of access to meeting facilities or activities, without refund of any applicable registration fees or costs. In addition, the CMSAC reserves the right to report violations to an individual’s employer or institution or to a law-enforcement agency. Those engaging in unacceptable behavior may also be banned from future CMSAC activities or face additional penalties.


What to Do if You Witness or Are Subject to Unacceptable Behavior

If you are being harassed, notice that someone else is being harassed, or have any other concerns relating to harassment, please contact a member of the CMSAC program committee either in person or at cmsac@stat.cmu.edu. If you witness potential harm to a conference participant, be proactive in helping to mitigate or avoid that harm; if you see or hear something that concerns you, please say something.


Process for Adjudicating Reports of Misconduct

The CMSAC will contract with an independent entity to manage and adjudicate reported violations of the conduct policy.


Note: This Code of Conduct may be revised at any time by the Carnegie Mellon Sports Analytics Conference. Questions, concerns, or comments should be directed to cmsac@stat.cmu.edu.