About The Conference

The Carnegie Mellon Sports Analytics Conference is an annual event dedicated to highlighting the latest sports research from the statistics and data science community.


Check out this year's schedule and speaker information below! You can also view information about our previous 2019, 2020, 2021, 2022, and 2023 conferences.


In-person registration is sold out, but virtual registration is still available!

Register now!

The registration deadline to attend #CMSAConference is October 25th, or potentially earlier if we run out of space! We will provide printed name tags for in-person participants. In-person attendees will have access to the poster session and networking opportunities throughout the conference. You can also register to attend virtually by the deadline of October 31st. Virtual attendees will be able to attend the workshop and speaker events via a Zoom webinar. We cannot guarantee that virtual attendees will be able to ask questions, since priority will be given to in-person attendees. See below for pricing options.

In-Person Registration (deadline is Oct 25th or earlier subject to availability)

  • High School / Undergrad / Grad students Conference: $15 (with school ID)
  • High School / Undergrad / Grad students Conference + Workshop: $20 (with school ID)
  • Non-students Conference: $50
  • Non-students Conference + Workshop: $60

Virtual Attendance Registration (deadline is Oct 31st; limited access to engagement and networking opportunities)

  • High School / Undergrad / Grad student Conference + Workshop: FREE (with school ID)
  • Non-students Conference + Workshop: $10
Registering indicates agreement to abide by the Code of Conduct.

Hotel information

We have room blocks with the Hilton Garden Inn and The Oaklander Hotel (reserve now with the respective links).

Conference Location

Carnegie Mellon University
Giant Eagle Auditorium
4909 Frew St, Pittsburgh, PA 15213
From PIT Airport

1. Head northeast on Airport Blvd
2. Keep left to stay on Airport Blvd - 0.6 mi
3. Keep left to stay on Airport Blvd - 0.7 mi
4. Continue straight to stay on Airport Blvd - 0.2 mi
5. Keep left at the fork, follow signs for I-376 E/I-79 E/Pittsburgh/Pennsylvania Turnpike E and merge onto I-376 E - 0.6 mi
6. Merge onto I-376 E - 16.4 mi
7. Keep right to stay on I-376 E - 2.1 mi
8. Take exit 72A to merge onto Forbes Ave toward Oakland - 0.3 mi
9. Merge onto Forbes Ave - 1.0 mi
10. Turn right onto Schenley Drive Extension - 449 ft
11. Turn left onto Schenley Drive - 0.2 mi
12. Turn left onto Frew St - 0.2 mi
13. Destination will be on the left


Schedule Details

Baseball Analytics Workshop (Giant Eagle Auditorium)

Conference sessions in Giant Eagle Auditorium (times are subject to change)

  • 8:00 AM

    Registration / check-in

  • 8:50 AM

    Welcome and Opening Remarks

    CMU Statistics & Data Science
  • 9:00 AM

    Bye-Bye, Bye Advantage: Estimating the competitive impact of rest differential in the National Football League

    Tom Bliss, NFL
  • 9:30 AM

    Describing Outs Above Average

    Tom Tango, MLB Advanced Media
  • 10:00 AM

    Coffee Break

  • 10:30 AM

    Data Huddle: Insights from the NFL

    Tom Bliss, Zuri Hunter, Clive-Anthony Stanberry
  • 11:00 AM

    Coffee Break

  • 11:15 AM

    SCORE Network Spotlight

  • 11:30 AM

    Introducing CMSACenter

    Ron Yurko
  • 11:45 AM

    Poster Previews

    Poster Presenters
  • 12:00 PM

    Lunch Break and Poster Session

  • 1:15 PM

    CMSACamp Student Presentations

  • 2:05 PM

    Coffee Break

  • 2:20 PM

    CMSACamp Student Presentations

  • 2:50 PM

    Features Importances in Predicting Batted Ball Outcome

    Larry Jiang, Carnegie Mellon University
  • 3:05 PM

    Coffee Break

  • 3:15 PM

    Reproducible Research Competition

    Student track finalists
  • 3:55 PM

    Coffee Break and Voting

  • 4:10 PM

    Reproducible Research Competition

    Open track finalists
  • 4:50 PM

    Coffee Break and Voting

  • 5:00 PM

    Awards and Closing Remarks

  • 5:30 to 7:30 PM

    Networking Reception (Highmark Center)


Invited Talks

Thompson Bliss

National Football League

Thompson Bliss is a Senior Manager, Football Operations, Data Scientist for the National Football League. He started at the NFL in February 2020 as a Data Scientist and was promoted to his current role in January 2024. He completed his master’s degree in Data Science at Columbia University in the City of New York in December 2019. He received a Bachelor of Science in Physics and Astronomy with minors in Mathematics and Computer Science from the University of Wisconsin-Madison in 2018.

Bye-Bye, Bye Advantage: Estimating the competitive impact of rest differential in the National Football League

The National Football League (NFL) sets its regular season schedule to optimize viewership and minimize competitive inequities. One inequity assumed to impact team performance is rest differential, defined as the relative number of days between games. Using Bayesian state space models on both game outcomes and betting market data, we estimate the competitive effect of rest differential in American football. We find that the most commonly referred to inequities -- both the bye week rest advantage and the mini-bye week rest advantage -- currently show no significant evidence of providing the rested team a competitive edge. Further, we trace a decline in the advantage of a bye week to a 2011 change to the NFL's Collective Bargaining Agreement, which represents a natural experiment to test the relevance of rest and preparation in football. Prior to the agreement, NFL teams off a bye week received a significant advantage (+2.2 points per game), but since 2011, that benefit has been mitigated.

Tom Tango

MLB Advanced Media

Tom Tango spent more than a decade as a data analyst for MLB and NHL clubs before joining MLB Advanced Media in 2016 as their first Senior Database Architect of Stats, with a focus on creating metrics for BaseballSavant.com using Statcast data. Catch Probability, Sprint Speed, Barrels, and Infield Outs Above Average are among the next generation of metrics that are intriguing fans. He is co-author of the sabermetrics book The Book: Playing the Percentages in Baseball. Among the metrics he has created are Leverage Index, FIP, and wOBA, along with developing the framework for WAR.

Describing Outs Above Average

Measuring fielding in baseball has historically lagged in comparison to batting, running, and pitching. Since the introduction of Statcast in MLB ballparks in 2015, measuring fielding has taken a giant leap forward: Outfield Catch Probability was released in 2017, with Infield Outs Above Average (OAA) in 2020. Using an intuitive mathematically-centric model, we can describe how OAA works, as well as the challenges ahead.

Baseball Analytics Workshop Speakers

Jim Albert

Jim Albert is Emeritus Distinguished University Professor at Bowling Green State University. He has interests in Bayesian modeling, statistics education, and the application of statistical thinking in sports. He is author or coauthor of Curve Ball, Visualizing Baseball, Teaching Statistics Using Baseball, and Analyzing Baseball Data with R, 3rd edition. He regularly contributes to the blog Exploring Baseball Data with R.

David Dalpiaz

David Dalpiaz is a Teaching Associate Professor for the Siebel School of Computing and Data Science at the University of Illinois at Urbana-Champaign. He teaches both undergraduate and graduate courses in data science with an emphasis on applied machine learning. His research is currently focused on accessible education and baseball analytics.

Data Huddle: Insights from the NFL

Thompson Bliss

Thompson Bliss is a Senior Manager, Football Operations, Data Scientist for the National Football League. He started at the NFL in February 2020 as a Data Scientist and was promoted to his current role in January 2024. He completed his master’s degree in Data Science at Columbia University in the City of New York in December 2019. He received a Bachelor of Science in Physics and Astronomy with minors in Mathematics and Computer Science from the University of Wisconsin-Madison in 2018.

Zuri Hunter

Zuri Hunter, a data engineer at the National Football League, merges her technical skills with a strong commitment to community service. A Computer Information Systems graduate from the illustrious Howard University, she taught herself Ruby on Rails and showcased her talents in hackathons, notably at the United States Presidential Innovation Fellows Hack the Pay Gap event. Outside of her engineering role, Zuri used to volunteer as a Technical Lead for Black Girls Code and organize Howard University’s hackathon “Bison Hacks.” Her work earned her a 2018 DCA Live Power in Tech nomination. In her downtime, she enjoys arts and crafts, ice skating, and competing in fighting game tournaments nationwide.

Clive-Anthony Stanberry

Clive-Anthony is a Data Quality Engineer in the Player Health & Safety division at the National Football League (NFL). He specializes in enhancing data accuracy and fostering collaboration among athletic trainers and management across all NFL clubs. Clive-Anthony engineered and maintains the NFL Data Quality & Compliance Portal, helping trainers and biomedical groups maintain high data quality standards. He holds an MSc in Data Science and Artificial Intelligence from the University of London, and a BSc in Mathematics and Economics from the London School of Economics. At the Carnegie Mellon Sports Analytics Conference, he looks forward to sharing his insights on leveraging data science and AI to improve player health and safety.

Predict Hits with New MLB Swing Data Competition Speaker

Larry Jiang

Ruitong "Larry" Jiang is currently a 4th-year PhD student at Carnegie Mellon University studying systems neuroscience. His research focuses on the time-of-day fluctuation of vigilance performance. Outside of his research work, he has a huge interest in learning new statistical techniques in sports analytics and enjoys experimenting with them on his research projects. He was a finalist in the 2024 Big Data Bowl and placed 5th in the Kaggle competition "Predict Hits with New MLB Swing Data". In addition, he’s a regular at PNC Park.

Features Importances in Predicting Batted Ball Outcome

In baseball, it’s hard to know what will happen to the baseball the moment it leaves the bat. The "Predict Hits with New MLB Swing Data" Kaggle competition aimed to address this challenge by leveraging detailed pitch and bat tracking data to predict the probability of different batted ball outcomes. In this competition, I derived a multitude of new features from the original dataset, expanding the feature set to over 90 variables. Using CatBoost, I identified the optimal combination of features that delivered a strong AUC score. The final model, evaluated on 95% of the test data, demonstrated even higher predictive performance. Notably, the interactions between features revealed the critical importance of how batters cover the horizontal plane of the plate in determining batted ball outcomes.

CMSACamp 2024 Student Speakers

Gabriel Eze

Centre College

I am currently a junior at Centre College, double majoring in Mathematics and Data Science. I love playing soccer and video games; they are my escape from the constant demands of being a college student, especially as an international student. I hail from Nigeria in Sub-Saharan Africa, and I am the third of four boys. I’m fortunate to have the opportunity to pursue tertiary education here in the U.S. I like to think of myself as an explorer and scientist, but it’s not until now that I’ve gotten the pathway to step outside the confines of my little castle in The Bells neighborhood, Otta. I’ve learned and grown immensely since my introduction to Western life here at Centre College in Danville, KY, and the pursuit of the American dream. I remember my father always telling me that mathematics is the easiest subject because it never lies. I quickly developed a love for algebraic equations and word problems, which sparked my interest in Data Science and Analytics. This passion drives me today, and I am excited to continue exploring the ways data can reveal truths and shape the future.

Weapons of Best Production: Predicting the Optimal Pitch Arsenal Adjustment for Superior Stuff+

Baseball is a game of constant alteration and evolution, influenced by the innovations of each generation. Players and teams are always exploring tactics to find an edge over their adversaries. Even at the pinnacle of their success, the best players will scout out new competitive advantages to improve their performance. One such advantage lies in a pitcher’s ability to adjust his pitching arsenal. Thanks to technological developments, baseball players, teams, and fans have access to unparalleled amounts of data surrounding every pitch thrown during a season. Pitchers can add a pitch and quickly gauge its effectiveness. But what if there was a way to determine a pitch’s effectiveness before it was ever thrown? The goal of our project was to construct a pitch recommendation system that suggests with conviction the best pitch for a pitcher to add to their arsenal. We approached this task by examining the characteristics of other pitches in an arsenal to generate Stuff+ predictions, which indicate the potential success of pitches not yet thrown by the pitcher. These predictions would allow players and teams to be proactive rather than reactive in making changes to pitching arsenals and strategies.

Neha Kotha

University of Pittsburgh

My name is Neha Kotha, and I am a junior at the University of Pittsburgh. I am currently studying data science with a minor in computer science and a digital media certificate. On campus, I am actively involved in the Sports Analytics Club, Women in Computer Science Club, and Smart Women Securities, an investment portfolio club. I also serve as a campus ambassador for the Forté Foundation, an organization that promotes and supports More Women Leading, and as a board member of the Bodybuilding Club at Pitt. I currently work as a Stadium Event and Operations Intern at the Pittsburgh Steelers/Acrisure Stadium, where I work Pitt and Steelers home games. I aspire to work in the sports analytics or sports media analytics world in the future.

Weapons of Best Production: Predicting the Optimal Pitch Arsenal Adjustment for Superior Stuff+

Danny Nolan

Bucknell University

Danny Nolan is a senior at Bucknell University majoring in Statistics and minoring in Economics and Classics. On campus, Danny has worked for the Math Department for parts of three years, both as a research assistant and a course grader. He is also a captain of Bucknell’s Ultimate Frisbee team. Danny is an avid Philadelphia sports fan, and he hopes to use his enthusiasm for sports and his aptitude for statistics to work in sports analytics after earning his undergraduate degree.

Weapons of Best Production: Predicting the Optimal Pitch Arsenal Adjustment for Superior Stuff+

Ian A. Pérez

University of Arizona

Ian is a senior at the University of Arizona pursuing a dual degree: a BS in Mathematics, and a BA in Spanish Translation and Interpretation, with minors in Sociology and Data Science. Outside of academia he is passionate about sports and social justice. He is the vice president of the University of Arizona Sports Analytics Club and works providing translation services for the Hispanic-serving volunteer programs at the University of Arizona's Cooperative Extension. He is interested in applying statistical methods to social science research and after graduation he plans to pursue a PhD in Statistics or Sociology.

Killer Defense: Evaluating Individual Defensive Contributions on the Penalty Kill

Despite the offense’s player advantage and increased goalscoring rate making the power play’s importance widely recognized, the penalty kill side of play has not attracted as much attention from the analytics community as the offensive side has. Similarly, research on women’s ice hockey has lagged behind that on the men’s game. This project addresses both of those gaps by analyzing defensive contributions on the penalty kill in international women’s ice hockey. Using both player-tracking and event-level data, we use XGBoost to build a pass-completion probability model. The model allows us to evaluate the impact of individual defensive players on the penalty kill and analyze team-wide offensive risk-taking tendencies on the power play.

Frithjof Sanger

Carnegie Mellon University

Frithjof Sanger is a junior at Carnegie Mellon University pursuing a dual degree in Statistics and Machine Learning and in Business Administration with a concentration in Finance. He is passionate about work at the intersection of the quantitative sciences, computer science, and applied areas such as sports analytics and finance. He is particularly interested in the fields of applied data science, AI/ML, sports analytics, computational finance, portfolio management, and software engineering.

Killer Defense: Evaluating Individual Defensive Contributions on the Penalty Kill

Christina Vu

Texas Christian University

Christina Vu is a senior at Texas Christian University studying Mathematics and Economics. Her current interest is in Bayesian statistics and its applications to sports analytics and data-driven decision-making. After graduation, she plans to pursue a Ph.D. in Statistics and a career in quantitative research. Outside the classroom, she is a volunteer pianist who enjoys sharing her love of music with her community.

Killer Defense: Evaluating Individual Defensive Contributions on the Penalty Kill

Liam Jennings

Robert Morris University

Liam Jennings is a senior at Robert Morris University majoring in Statistics and Data Science with a minor in Sport Management and pursuing a Master's in Data Science. In addition to his coursework, he is the Sport Analytics Club Co-Founder and President, Analytics Club Vice President, an Honors First-Year Seminar Program (FYSP) Mentor, and Men's Club Volleyball Captain. During his free time, he enjoys competing in data challenges, watching sports with friends, and dunking on his apartment mini-hoop. After graduation, he is excited to start a career in data science, with aspirations to be in the sport industry.

The Wrong Stuff

In the fast-paced environment of Major League Baseball, Stuff+ serves as a vital metric for evaluating the ‘nastiness’ of a pitch. The metric analyzes the physical characteristics of a pitch including velocity, spin, extension, and movement. A Stuff+ value of 100 means a pitch is considered league average; anything above or below 100 is considered above or below average respectively. With the emergence of Stuff+ as a common metric used in pitcher evaluation, we wanted to evaluate how accurately the model predicted a pitcher’s success and how appropriately it weighed the factors that go into it. We were curious to know if there were commonalities between the pitches that Stuff+ tended to over or undervalue, telling us it was not accounting for variables that played a role in a pitcher’s effectiveness. Our goal is to provide insights to improve the Stuff+ model so it becomes a more reliable metric for player evaluation. Additionally, we can help pitchers develop effective pitches by understanding which physical pitch characteristics indicate successful outcomes.

Belle Schmidt

St Olaf College

Belle Schmidt is a junior at St Olaf College majoring in Mathematics with concentrations in Statistics and Data Science, and Management. At St Olaf, she is heavily involved in the athletic community. She is a member of the Varsity softball team, does data analytics for the football team, and is the president of the Sports Analytics Club. She also participates in intramural sports like broomball, volleyball, and basketball. Outside of academics and athletics, she enjoys listening to music, trying new restaurants, and traveling. After graduation, Belle hopes to work as a data analyst in the sports industry.

The Wrong Stuff

Cole Siniawski

Denison University

Cole Shegan Siniawski is a junior at Denison University double majoring in Data Analytics and Economics. He is extremely passionate about hockey and sports analytics, and he continues to pursue work and research in hockey-focused sports analytics. In his free time, you can find him at the ice rink coaching youth hockey and playing adult league hockey, in addition to watching hockey games with friends. After graduation, he plans on pursuing a career in data analytics, whether it is doing sports analytics for a professional team or financial analytics.

Analyzing Consistency Among NHL Forwards

In hockey, consistency, the ability to maintain a high level of performance, is of great importance. Consistency is vital because it allows teams to make well-informed decisions about player investment that eventually lead to on-ice team success. Our main goal was to measure the consistency among NHL forwards from an offensive perspective. Through our analysis of the league’s seasonal data from the past eight seasons, we developed a measure of offensive consistency and identified the top players in the league on that basis. The data suggests that teams pay players based on their offensive production and therefore might be undervaluing forwards who are more consistent.

Lucca Ferraz

Rice University

Lucca Ferraz is a junior at Rice University pursuing a double major in Sport Analytics and Statistics as well as minors in Data Science and Financial Computation and Modeling. During the school year he works as a Recruiting, Scouting, and Analytics intern for Rice Football, and he previously interned for the agency Montgomery Sports Group as a Basketball Analytics Intern. In his free time, Lucca enjoys playing pickup basketball, chess, and the tuba. His long-term career goal is to work in the analytics department of a professional sports team.

Examining batted passes in the NFL: A hierarchical approach to explaining variance of an unlikely event

Batted passes occur when a quarterback throws the ball and a defender hits it down, typically around the line of scrimmage. In this project, we attempted to isolate the most important variables that explain the variance of batted passes in the NFL. Oftentimes shorter quarterback prospects are dismissed due to the notion that they “cannot see over the defense” and that their passes will get batted down more often than taller quarterbacks. One main motivation for this project was determining whether quarterback height actually matters in explaining batted passes or if this is simply an example of unfounded bias against short QBs.


Reproducible Research Competition

Open-Track Methods Finalists

Ryan Brill (view paper)

University of Pennsylvania

Ryan Brill is a fifth (and final) year PhD student studying applied math & statistics at the University of Pennsylvania. He is broadly interested in decision making under uncertainty and real-world applications of statistics and data science. He also works part-time for the Dodgers and was a 2022 NFL Big Data Bowl Finalist. Outside of work, he has recently been enjoying watching Industry season 3, reading about moral psychology, and exploring New York City.

The winner of the NFL draft is not necessarily cursed: Exploring the discrepancy between NFL draft expected value curves and the observed trade market

Football analysts traditionally value a future draft pick position by its expected performance or surplus value. But these expected value curves do not match the valuation implied by the observed trade market. One takeaway is that general managers are making terrible trades on average. An alternative explanation is that they are using some other value function that captures an essential piece of the puzzle missing from previous analyses. We are partial to the latter explanation. In particular, traditional analyses don’t consider how variance in performance outcomes changes over the draft. Because variance decays convexly across the draft, eliteness (e.g., right tail probability) decays much more steeply than expected value. We suspect general managers value performance nonlinearly, placing exponentially higher value on players as their eliteness increases. This is because elite players have an outsize influence on winning the Super Bowl. Thus, in this paper we consider nonlinear draft value curves that capture the outsize influence of elite players. Such nonlinear value functions produce steeper draft value curves that more closely resemble the observed trade market.
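
As a rough illustration of the claim that right-tail probability can fall off much faster than expected value, here is a minimal sketch. It assumes player performance is normally distributed, which is an assumption made for this example and not something the abstract states. The pick parameters, the threshold, and the function name `elite_probability` are all hypothetical.

```python
from statistics import NormalDist


def elite_probability(mean: float, sd: float, threshold: float) -> float:
    """P(performance > threshold) under a normal model (an illustrative assumption)."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(threshold)


# Hypothetical picks: (expected value, standard deviation) in arbitrary units.
early_pick = (1.0, 1.0)
late_pick = (0.5, 0.5)
ELITE_THRESHOLD = 3.0  # hypothetical cutoff for an "elite" outcome

# Ratio of expected values versus ratio of right-tail ("eliteness") probabilities.
mean_ratio = late_pick[0] / early_pick[0]
tail_ratio = (elite_probability(*late_pick, ELITE_THRESHOLD)
              / elite_probability(*early_pick, ELITE_THRESHOLD))

print(f"expected value ratio (late/early): {mean_ratio:.3f}")
print(f"elite probability ratio (late/early): {tail_ratio:.6f}")
```

In this toy setup the later pick keeps half of the expected value but only a tiny fraction of the chance of an elite outcome, which is the qualitative pattern the abstract describes.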

Lee Kennedy-Shaffer (view paper)

Yale School of Public Health

Lee Kennedy-Shaffer is an assistant professor of biostatistics at the Yale School of Public Health. His primary research interests are in infectious disease and vaccine study design, along with the methodology of cluster-randomized trials and quasi-experiments. A lifelong Mets fan, he is also interested in using these methods to understand baseball and sports more generally, identifying what causal inference methods work in sports settings, and using sports to broaden interest in statistics to students and the wider public.

Panel Data Methods to Evaluate the Impact of Rule Changes

In recent years, several major team sports have instituted rule changes in attempts to improve game play and the viewing experience. From 2020 to 2023, Major League Baseball instituted several rule changes affecting team composition, player positioning, and game time. Understanding the effect of these rules, both on the game as a whole and on individual teams and players, is crucial for leagues, teams, players, and other relevant parties to assess their impact and either push for further changes or roll back existing rules. Panel data and quasi-experimental methods provide useful tools for causal inference in these settings. I demonstrate this potential by analyzing the effect of the 2023 shift ban at both the overall and player-specific levels. Using difference-in-differences analysis, I show that the policy increased BABIP and OBP for left-handed batters by a modest amount. For individual players, synthetic control analyses identify several players whose offensive performance (OBP, OPS, and wOBA) improved significantly because of the rule change, and other players with previously high shift rates for whom it had little effect. This work both estimates the impact of this specific rule change and demonstrates how these methods for causal inference are potentially valuable for sports analysis more broadly, at the player, team, and league levels.
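
For readers unfamiliar with difference-in-differences, the basic two-group, two-period estimate can be sketched as follows. The BABIP values are invented for illustration and are not results from the paper. Using right-handed batters as the comparison group is also an assumption of this sketch, since the abstract does not specify the comparison group.

```python
def diff_in_diff(treated_pre: float, treated_post: float,
                 control_pre: float, control_post: float) -> float:
    """Change in the treated group minus change in the comparison group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical group-level BABIP before and after a rule change.
effect = diff_in_diff(
    treated_pre=0.290, treated_post=0.300,   # e.g. left-handed batters (made-up values)
    control_pre=0.295, control_post=0.298,   # e.g. right-handed batters (made-up values)
)
print(f"difference-in-differences estimate: {effect:+.3f}")
```

The subtraction of the comparison group's change is what removes league-wide trends that would have affected both groups even without the rule change.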

Open-Track Data + Software Winner


Brandon Onyejekwe (view paper)

Brandon Onyejekwe is a recent graduate of Northeastern University with a Bachelor of Science in Data Science and a minor in Mathematics. His main research interests broadly involve solving real-world problems using machine learning methods. While an avid sports fan overall, his main passion is running: he was a captain of Northeastern’s club running team and a member of the Sports Analytics Club, and he is currently training for a half marathon. He currently works as a Data Engineer at Travelers Insurance in Hartford, CT.

Quantifying Uncertainty in Marathon Finish Time Predictions

In the middle of a marathon, a runner’s expected finish time is commonly estimated by extrapolating the average pace covered so far, assuming it to be constant for the rest of the race. These predictions have two key issues: the estimates do not consider the in-race context that can determine if a runner is likely to finish faster or slower than expected, and the prediction is a single point estimate with no information about uncertainty. We implement two approaches to address these issues: Bayesian linear regression and quantile regression. Both methods incorporate information from all splits in the race and allow us to quantify uncertainty around the predicted finish times. We utilized 15 years of Boston Marathon data (312,805 runners total) to evaluate and compare both approaches. Finally, we developed an app for runners to visualize their estimated finish distribution in real time.
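The constant-pace baseline described at the start of the abstract can be sketched in a few lines. The function names below are hypothetical, and the example split is made up; this shows only the naive extrapolation that the abstract says yields a single point estimate, not the authors' Bayesian or quantile regression models.

```python
MARATHON_KM = 42.195  # standard marathon distance in kilometres


def constant_pace_finish(elapsed_seconds, distance_km):
    """Extrapolate a finish time by assuming the average pace so far holds."""
    if not 0 < distance_km <= MARATHON_KM:
        raise ValueError("distance_km must be in (0, 42.195]")
    pace = elapsed_seconds / distance_km  # seconds per kilometre so far
    return pace * MARATHON_KM


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Hypothetical runner who reaches the halfway mark in 1:30:00.
estimate = constant_pace_finish(elapsed_seconds=5400, distance_km=21.0975)
print(format_hms(estimate))
```

The output is one number with no interval around it, which is the limitation the two regression approaches in the abstract are meant to address.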

Student-Track Data + Software Winner


Bhaskar Lalwani (view paper)

Kalinga Institute of Industrial Technology

Bhaskar is a junior pursuing a Bachelor's in Computer Science at Kalinga Institute of Industrial Technology. With a focus on deep learning and data science, he has experience working with transformer-based architectures, applying them to tasks such as optical character recognition for multiple languages and fine-tuning models for Indic languages. He is looking to gain more exposure to his interests, which include the intersection of AI, language, and linguistics. He plans to pursue an MS in CS after graduation. In his free time, Bhaskar enjoys playing the tabla and piano and reading science fiction.

KabaddiPy: A package to enable access to Professional Kabaddi Data

Kabaddi, a contact team sport of Indian origin, has seen a dramatic rise in global popularity, highlighted by the upcoming Kabaddi World Cup in 2025 with over sixteen international teams participating, alongside flourishing national leagues such as the Indian Pro Kabaddi League (230 million viewers) and the British Kabaddi League. We present the first open-source Python module that makes Kabaddi statistical data, gathered from multiple scattered sources across the internet, easily accessible. The module was developed by systematically web-scraping and collecting team-wise, player-wise, and match-by-match data. The data has been cleaned, organized, and categorized into team overviews and player metrics, each filterable by season. Players are classified as raiders and defenders, with their best strategies for attacking, counter-attacking, and defending against different teams highlighted. Our module enables continuous monitoring of rapidly growing data streams, helping researchers quickly start building on the data to answer critical questions, such as the impact of player inclusion or exclusion on team performance, scoring patterns against specific teams, and how opponents' gameplay breaks down. The data generated from Kabaddi tournaments has been sparsely used, and coaches and players rely heavily on intuition to make decisions and craft strategies. Our module can be used to build predictive models, craft strategic game plans that target specific opponents, and identify hidden correlations in the data. This open-source module has the potential to save time, encourage analytical studies of Kabaddi gameplay and player dynamics, and foster reproducible research. The data and code are publicly available: https://github.com/kabaddiPy/kabaddiPy


Aniruddha Mukherjee (view paper)

Kalinga Institute of Industrial Technology

Aniruddha Mukherjee is currently a junior majoring in Computer Science at Kalinga Institute of Industrial Technology (KIIT), where he ranks at the top of his class. He is also pursuing a BS in Data Science with the Indian Institute of Technology, Madras (IIT-M) in an online format. He has interned at various research institutions, including BITS Pilani, Tata Consultancy Services Research, and The University of Texas at Austin. He is passionate about solving problems and has explored solutions in quantitative finance, healthcare, anomaly detection, and image quality assessment, leading to presentations and publications at venues like IEEE Transactions, ACM's International Conference on AI in Finance (ICAIF'24), and Springer’s Cognitive Computation. Aniruddha’s drive to build and create impactful solutions has led him to win three hackathons hosted by Indian Institutes of Technology (IITs) and co-author two filed patents on real-world solutions using AI. He has also been working closely with SkinAI, a New Delhi-based startup, and with IIT-Kharagpur as a collaborator with the Department of Artificial Intelligence. Outside of academics, he is a Grade 8 pianist (ABRSM), plays football, tennis, and chess, and enjoys debating. Aniruddha has volunteered for Stanford as an Instructor (CS106A) to teach CS basics. He is enthusiastic about using technology and engineering to make a meaningful impact on people's lives. Looking ahead, he is interested in pursuing a Master’s in Computer Science (MSCS) followed by a PhD.

KabaddiPy: A package to enable access to Professional Kabaddi Data


Student-Track Methods Finalists


Zeke Weng (view paper)

University of Toronto

Zeke is a second-year student at the University of Toronto, studying Computer Science and Statistics with a focus on Artificial Intelligence. He is keenly interested in machine learning and aims to gain more experience in multi-agent systems, reinforcement learning, and algorithmic game theory. Zeke intends to graduate in 2026 and then pursue graduate studies back home in California. Outside of his studies, he helps lead the University of Toronto Sports Analytics student group and has his sights set on the NFL Big Data Bowl after this conference.

Boulder2Vec: Modeling Climber Performances in Professional Bouldering Competitions

In the past decade, sport climbing has grown to be a popular pastime due to its social, physical, and mental stimulation. This growth has been bolstered by its recent addition to the Summer Olympics in three formats: bouldering, speed, and lead. In particular, bouldering, a form of climbing that focuses on short, difficult routes (known as "problems") with multiple attempts, has seen the greatest growth, with 71% of new climbing gyms opening in North America being boulder-focused. Using data from professional bouldering competitions from 2008 to 2022, we train a generalized linear model to predict climber results and measure skill level. However, this approach is limited, as a single numeric coefficient per climber cannot adequately capture the intricacies of climbers’ varying strengths and weaknesses in different boulder problems. For example, some climbers might prefer more static, technical routes while other climbers may specialize in powerful, dynamic problems. To this end, we apply Probabilistic Matrix Factorization (PMF), a well-established framework commonly used in recommender systems, to automatically learn to represent the unique characteristics of climbers and problems with latent, multi-dimensional vectors. In this framework, a climber’s performance on a given problem can be predicted from the dot product of the corresponding climber vector and problem vector. Additionally, PMF effectively handles sparse datasets, such as those encountered in competitive bouldering where climbers don't attempt every problem, by extrapolating patterns from similar climbers, thus inferring information about unobserved interactions. We contrast the empirical performance of PMF to the generalized linear model approach and investigate the learned multivariate representations to gain insights into climber characteristics.
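To make the dot-product prediction concrete, here is a minimal matrix factorization sketch fitted only on observed climber-problem pairs. It minimizes a regularized squared error, which corresponds to the MAP estimate of PMF under Gaussian priors. The function names, hyperparameters, and the tiny 0/1 outcome data are all assumptions made for illustration and are not taken from the Boulder2Vec paper.

```python
import numpy as np


def fit_pmf(observations, n_climbers, n_problems, dim=2,
            lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit latent climber and problem vectors by stochastic gradient descent.

    observations: list of (climber_index, problem_index, outcome) tuples.
    Pairs that never appear in the list are simply skipped, which is how
    the factorization copes with a sparse results matrix.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_climbers, dim))  # climber vectors
    V = 0.1 * rng.standard_normal((n_problems, dim))  # problem vectors
    for _ in range(epochs):
        for c, p, y in observations:
            err = y - U[c] @ V[p]
            u_old = U[c].copy()  # keep the pre-update climber vector
            U[c] += lr * (err * V[p] - reg * U[c])
            V[p] += lr * (err * u_old - reg * V[p])
    return U, V


def predict(U, V, climber, problem):
    """Predicted performance is the dot product of the two latent vectors."""
    return float(U[climber] @ V[problem])


# Hypothetical outcomes: 1.0 = problem topped, 0.0 = not topped.
observations = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0),
                (1, 2, 1.0), (2, 1, 1.0), (2, 2, 0.0)]
U, V = fit_pmf(observations, n_climbers=3, n_problems=3)

# Climber 0 never attempted problem 2; the model still yields a prediction.
print(predict(U, V, climber=0, problem=2))
```

A practical model would likely pass the dot product through a link function suited to the outcome type; the raw dot product is kept here because it is the form the abstract describes.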


Victor Hau (view paper)

University of Toronto

Victor is a third-year Engineering Science student at the University of Toronto majoring in mathematics, statistics, and finance. As an engineering student, he is interested in applying his knowledge to exciting real-world problems such as those in sports analytics. Above all, Victor enjoys using data and modeling to drive decision-making and operations research. He plans to pursue a data-centric role in data science or another quantitative field for his co-op year as well as after graduation. In addition to his academic focuses, he is an avid follower of the NFL, NBA, and NHL (go Leafs!).

Boulder2Vec: Modeling Climber Performances in Professional Bouldering Competitions



Ethan Baron (view paper)

New York University

Ethan recently started a PhD in Computer Science at New York University working on machine learning. He completed his undergraduate degree at the University of Toronto in Engineering Science, where he led the University of Toronto Sports Analytics student group. Ethan has also worked on soccer analytics as a data scientist at Zelus Analytics, and has presented his sports analytics research at NESSIS, MathSport, and CORS. Outside of work, he is a passionate road cycling fan, and enjoys playing volleyball, basketball, and ultimate frisbee!

Boulder2Vec: Modeling Climber Performances in Professional Bouldering Competitions



Jacky Jiang (view paper)

Rice University

Hao "Jacky" Jiang is a driven student at Rice University, pursuing a B.S. in Computer Science and a B.A. in Sport Analytics. With a strong foundation in software development, machine learning, and data science, he has contributed to research projects such as wearable systems for exercise recognition at Cornell University. His internships include building scouting applications for D.C. United and enhancing recommendation algorithms at Petkeley AI Innovations. An active community member, Jacky has volunteered over 100 hours at the Houston Food Bank. Looking ahead, Jacky plans to pursue a Ph.D. in Human-Computer Interaction, aiming to deepen his expertise in the field and contribute to advancing technology for practical and user-centered applications.

GoalNet: Advancing Counterattack Prediction in Soccer through Gender-Specific Graph Neural Networks

Traditional soccer analysis tools emphasize metrics such as chances created and expected goals, leading to an over-representation of attacking players’ contributions and overlooking the pivotal roles of players who facilitate ball control and link attacks. Identifying these players could help coaches develop specific tactics and help clubs with recruiting. Examples include Rodri of Manchester City and Palhinha, who just transferred to Bayern Munich. To address this bias, we developed a model utilizing graph neural networks (GNNs) to analyze match events comprehensively. Our research aims to identify players with pivotal roles in a soccer team using GNNs, incorporating both spatial and temporal features. In our approach, each event in a soccer match is represented as a graph where nodes correspond to players and edges denote interactions. Each node encompasses various attributes, including the player’s name and historical performance metrics such as average pass completion rate. Edges capture interactions between players, such as passes and tackles, with features including pass frequency and distance. We incorporate the last k events to maintain temporal context, accounting for recent interactions. Our model is trained to predict the expected threat (xT) changes for each event, effectively attributing these changes to the contributing players based on their interactions in the previous events. We combine metrics such as degree centrality with the output of the trained GNN model to assign xT changes as credits to players more accurately. To validate the effectiveness of this method, we examined player evaluation outputs, demonstrating that this evaluation method accurately reflects player contributions. Our findings highlight the significance of these pivotal players in team dynamics, providing a more nuanced understanding of their impact on the game. This comprehensive analysis using GNNs allows for a balanced evaluation of player contributions, showcasing the indispensable roles of facilitators and initiators in soccer matches.
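The graph representation and the degree centrality metric mentioned in the abstract can be illustrated without any GNN machinery. The sketch below builds an undirected interaction graph from a list of passes and computes normalized degree centrality. The position labels and the pass list are invented, and this covers only the centrality ingredient, not the authors' model or their xT attribution.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected interaction graph.

    edges: iterable of (player_a, player_b) pairs, one per interaction.
    Each player's score is the number of distinct partners divided by
    the number of other players in the graph.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {player: 0.0 for player in neighbours}
    return {player: len(ns) / (n - 1) for player, ns in neighbours.items()}


# Hypothetical passes over the last few events, labelled by position.
passes = [("GK", "CB"), ("CB", "DM"), ("DM", "AM"),
          ("DM", "LW"), ("AM", "ST")]

centrality = degree_centrality(passes)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```

In this toy example the defensive midfielder links the most teammates, which is the kind of facilitating role the abstract argues is under-credited by goal-centric metrics.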


Jerry Cai (view paper)

Rice University

Yanxiao Cai is a research assistant and junior software engineer studying computer science at Rice University. He has gained experience across many research fields, from analyzing large-scale EHR datasets to developing machine learning models for recommendation systems and CTR prediction. Comfortable with state-of-the-art techniques such as recurrent neural networks, transformer models, and graph neural networks, Yanxiao has worked with mainstream frameworks such as PyTorch to improve the accuracy of his predictions. His interest is in machine learning in sports, especially football. Yanxiao is passionate about learning how data collection and processing happen in industry and is committed to applying machine learning to unlock insights in healthcare and sports analytics.

GoalNet: Advancing Counterattack Prediction in Soccer through Gender-Specific Graph Neural Networks



Poster Abstract Submission

We are no longer accepting poster abstracts.


Our Sponsors

Workshop Sponsor

AvenueFour

Networking Sponsor

Cleat Street

Coffee Break Sponsor

Brewers

Poster Session Sponsor

Sumer

Supporting Sponsor

Pirates

Supporting Sponsor

Teamworks

Supporting Sponsor

Penguins

Supporting Sponsor

Astros


Contact Us

The Carnegie Mellon Sports Analytics Conference is proudly hosted by the Department of Statistics & Data Science.


CMSAC Program Committee:

Carnegie Mellon Sports Analytics Club Executives
  • Mihir Mathur (President)
  • Josh Winick (Operations)
  • Jake Sherwindt
  • Abhi Varadarajan
Questions can be directed to cmsac@stat.cmu.edu.

CMSAC Activities Conduct Policy

(modeled on the ASA Activities Conduct Policy approved November 30, 2018, by the American Statistical Association Board of Directors)

The Carnegie Mellon Sports Analytics Conference (CMSAC) is committed to providing an atmosphere in which personal respect and intellectual growth are valued and the free expression and exchange of ideas are encouraged. Consistent with this commitment, it is CMSAC policy that all participants in CMSAC activities enjoy a welcoming environment free from unlawful discrimination, harassment, and retaliation. We strive to be a community that welcomes and supports people of all backgrounds and identities. This includes, but is not limited to, members of any race, ethnicity, culture, national origin, color, immigration status, social and economic class, educational level, sex, sexual orientation, gender identity and expression, age, size, family status, political belief, religion, and mental and physical ability.

All CMSAC participants (including, but not limited to, attendees, statisticians, data scientists, sports analysts, students, registered guests, staff, contractors, sponsors, exhibitors, and volunteers) in the conference or any other related activity, whether official or unofficial, agree to comply with all rules and conditions of the activities. Your registration for or attendance at the 2024 Carnegie Mellon Sports Analytics Conference indicates your agreement to abide by this policy and its terms.


Expected Behavior

- Model and support the norms of professional respect necessary to promote the conditions for healthy exchange of scientific ideas.

- Speak and conduct yourself professionally; do not insult or disparage other participants.

- Be conscious of hierarchical structures in the sports analytics and/or broader statistics/data science community, specifically the existence of stark power differentials among students, junior analysts/statisticians, and senior analysts/statisticians—noting that fear of retaliation from those in senior-level positions can make it difficult for students or those in junior level positions to express discomfort, rebuff unwelcome advances, and report violations of the conduct policy.

- Be sensitive to body language and other non-verbal signals and respond respectfully.


Unacceptable Behavior

- Violent threats or language directed against another person

- Discriminatory jokes and language

- Inclusion of unnecessary sexually explicit, violent, or otherwise sensitive materials in presentations

- Posting (or threatening to post), without permission, other people’s personally identifying information online, including on social networking sites

- Personal insults including, but not limited to, those using racist, sexist, homophobic, or xenophobic terms

- Unwelcome solicitation of emotional or physical intimacy such as sexual advances; propositions; sexual flirtations; sexually-related touching; and graphic gestures or comments about sex or another person’s dress, body, or sexual activities

- Advocating for, encouraging, or dismissing the severity of any of the above behaviors.


Consequences of Unacceptable Behavior

At the sole discretion of the CMSAC Program Committee, unacceptable behavior may result in removal from or denial of access to meeting facilities or activities, without refund of any applicable registration fees or costs. In addition, the CMSAC reserves the right to report violations to an individual’s employer or institution or to a law-enforcement agency. Those engaging in unacceptable behavior may also be banned from future CMSAC activities or face additional penalties.


What to Do if You Witness or Are Subject to Unacceptable Behavior

If you are being harassed, notice that someone else is being harassed, or have any other concerns relating to harassment, please contact a member of the CMSAC program committee either in person or at cmsac@stat.cmu.edu. If you witness potential harm to a conference participant, be proactive in helping to mitigate or avoid that harm; if you see or hear something that concerns you, please say something.


Process for Adjudicating Reports of Misconduct

The CMSAC will contract with an independent entity to manage and adjudicate reported violations of the conduct policy.


Note: This Code of Conduct may be revised at any time by the Carnegie Mellon Sports Analytics Conference. Questions, concerns, or comments should be directed to cmsac@stat.cmu.edu.