Since October 2015, I have been at Stats Perform and currently serve as Chief Scientist. My goal is to maximize the value of the 35+ years' worth of sports data we have. Previously, I was at Disney Research Pittsburgh for 5 years, where I conducted research into automatic sports broadcasting using large amounts of spatiotemporal tracking data. Before that, I was a Postdoctoral Researcher at the Robotics Institute at Carnegie Mellon University/Department of Psychology at University of Pittsburgh, conducting research on automatic facial expression recognition. I received my BEng(EE) from USQ and my PhD from QUT, Australia in 2003 and 2008 respectively. I was a co-author of the best paper at the 2016 MIT Sloan Sports Analytics Conference and in 2017 & 2018 was co-author of the best-paper runner-up at the same conference. Additionally, I have won best paper awards at the INTERSPEECH (2007) and WACV (2014) international conferences. My main research interests are in artificial intelligence and interactive machine learning in sporting domains. (www.patricklucey.com)
About The Conference
Now in its fourth year, the Carnegie Mellon Sports Analytics Conference is dedicated to highlighting the latest sports research from the statistics and data science community.
Interested in presenting your research at CMSAC? Submit an abstract using the form below! And if you are using publicly available data then consider entering our third annual Reproducible Research Competition!
Stay tuned for more information about the upcoming #CMSAC20. Check out our 2017, 2018, and 2019 conferences.
Registration is sold out!
Registration (until Oct 24th)
- Students: $20
- Non-students: $50
Registering indicates agreement to abide by the Code of Conduct.
Schedule Details
All times displayed are in EDT.
- 11:00 AM
  Welcome and Opening Remarks
  Rebecca Nugent and Carnegie Mellon Sports Analytics Club
- 11:05 AM
  Keynote Address: Patrick Lucey
  Stats Perform
- 11:50 AM
  Break
- 12:00 PM
  Player Chemistry: Striving for a Perfectly Balanced Soccer Team
  Lotte Bransen
- 12:25 PM
  Points, rating & ranking systems in professional tennis
  Paul van Staden
- 12:50 PM
  Virtual Poster Session
- 2:15 PM
  Measuring Spatial Allocative Efficiency in Basketball
  Nathan Sandholtz
- 2:40 PM
  Racial Bias in Drafting and Development: The NHL’s Black Quarterback Problem
  Chris Watkins
- 3:05 PM
  Break
- 3:15 PM
  Workshop: Introduction to machine learning with the tidyverse and nflfastR data
  Tom Mock
- 4:45 PM
  Live Podcast: Chilling With Charlie with Doug Fearing
All times displayed are in EDT.
- 11:00 AM
  Welcome and Opening Remarks
  Rebecca Nugent and Carnegie Mellon Sports Analytics Club
- 11:05 AM
  Keynote Address: Christie Aschwanden
- 11:50 AM
  Break
- 12:00 PM
  Redefining the Penalty Kick: Does the Punishment Fit the Crime?
  CMSACamp: Bria Cratty and Jack de la Parra
- 12:10 PM
  Evaluating Parametric Methods for Modeling European Soccer Team Goals
  CMSACamp: Thea Sukianto and Zhiwei Xiao
- 12:20 PM
  Quantifying Passing: Using NBA Tracking Data to Create an Expected Assist Model
  CMSACamp: Raj Dasani, James Hyman, Alex LaGarde, and Caleb Peña
- 12:30 PM
  High Anticipation: Exploring Trends Between Public Perception and Player Value
  CMSACamp: Alana Willis, Fiona Dunn, and Sahana Rayan
- 12:40 PM
  A Puck Above the Rest: Exploring the Effects of New Data on 2020 NHL Draft Decisions
  CMSACamp: Ashley Mullan and Lucy Ward
- 12:50 PM
  Draft Decisions in Uncertain Times: Valuing and Simulating NHL Draft Picks
  CMSACamp: Jill Reiner and Meg Ellingwood
- 1:00 PM
  Virtual Poster Session
- 2:00 PM
  Live Podcast: Too Many Men with Alexandra Mandrycky
- 2:45 PM
  Break
- 3:00 PM
  Reproducible Research Competition Finalists
- 3:00 PM
  Comparing Free-Throw Forms Among NBA Players Through 3D Similarity Measures
  Student track finalist: Paul Ibrahim
- 3:20 PM
  Enhancing Public Data Availability and Analysis of Olympic Sports: The Case of College Swimming
  Student track finalist: Matthew Flancer
- 3:40 PM
  Grinding the Bayes: A Hierarchical Modeling Approach to Predicting the NFL Draft
  Open track finalist: Benjamin Robinson
- 4:00 PM
  Bang the Can Slowly: An Investigation into the 2017 Houston Astros
  Open track finalist: Gregory Matthews
- 4:30 PM
  Workshop: Leveling Up With The Tidyverse (And Hockey Data)
  Meghan Hall
Conference Keynotes
Depth vs Coverage: Maximizing the Value of Tracking and Event Data for Better Recruitment
Over the last decade, most of the major innovations in sports analytics have centered on utilizing player tracking data. Tracking data has enabled tactical events to be automatically detected, which could not previously be done reliably by humans (e.g., on-ball/off-ball screens in basketball and formations in soccer). Tracking data has also enabled the measurement of the quality/difficulty of executing an event (e.g., expected goal/pass/possession value in basketball and soccer), which has allowed for better player and team performance evaluations. Additionally, tracking data has enabled new applications to emerge for both analysis and media purposes such as play retrieval, play simulation (i.e., ghosting), and automatic broadcasting. In the first part of the talk, I will summarize the innovations we have created over the last decade. Although the benefits of utilizing tracking data are clear, when it comes to making recruitment decisions, tracking data is seldom used due to the “lack of coverage”. The small tracking footprint is due to issues that current tracking systems have in scaling (i.e., they currently have to be installed in every arena, and they can’t go back in time). In the second part of the talk, I will talk about circumventing this issue by utilizing computer vision to collect player tracking and event data from broadcast video, highlighting our recent work in collecting tracking data from thousands of historical college basketball games. Even though the ideal state is to have tracking data for every game that has ever been played, there is still enormous value in having event data (i.e., having the time, location and player information logged for every event). In the third and final part of the presentation, I will talk about how we can approximate “tracking-like” metrics from event data and show how they can be utilized to make better player recruitment decisions in soccer.
Christie Aschwanden is the author of GOOD TO GO: What the Athlete in All of Us Can Learn From the Strange Science of Recovery and co-host of EMERGING FORM, a podcast about the creative process. She’s the former lead science writer at FiveThirtyEight and was previously a health columnist for The Washington Post. Christie is a frequent contributor to The New York Times. She’s also been a contributing editor for Runner’s World and a contributing writer for Bicycling. Her work appears in dozens of publications, including Discover, Slate, Consumer Reports, New Scientist, More, Men’s Journal, Mother Jones, NPR.org, Smithsonian and O, the Oprah Magazine. A lifetime athlete, Christie has raced in Europe and North America on the Team Rossignol Nordic ski racing squad. She lives with her husband and numerous animals on a small winery and farm in western Colorado. In her spare time, she enjoys trail running, bicycling, skiing, reading novels, digging in the garden and raising heritage poultry. (christieaschwanden.com).
Telling Stories with Data
How data can inform journalism
Data can provide a powerful tool for explanatory and investigative journalism. This interactive session will demonstrate how journalists use data to find, tell and report stories. I’ll walk through how several stories came together, how I used data to tell the story, and how I worked with FiveThirtyEight’s data team to compile, analyze and visualize data on several reporting projects.
Reading List from Christie Aschwanden
- Science Isn’t Broken
- Failure is Moving Science Forward
- Podcast: Bad Incentives Are Blocking Better Science
- Not Even Scientists Can Easily Explain P-values
- Statisticians Found One Thing They Can Agree On: It’s Time To Stop Misusing P-Values
- How To Tell Good Studies From Bad? Bet On Them
- 200 Researchers, 5 Hypotheses, No Consistent Answers
- What it’s like to be a woman at Sloan
Conference Speakers
Lotte Bransen is a Lead Data Scientist at SciSports, where she leads the Data Analytics team that develops analytical tools to derive actionable insights from soccer data. An avid soccer player herself, Lotte primarily works on developing machine learning models to measure the impact of soccer players’ in-game actions and decisions on the courses and outcomes of matches.
Player Chemistry: Striving for a Perfectly Balanced Soccer Team
Soccer scouts typically ignore the team balance and team chemistry when evaluating potential signings for their teams. Instead, they focus on the individual qualities of the players in isolation. To overcome this limitation of their recruitment process, this talk takes a first step towards objectively providing insight into the question: How well does a team of soccer players gel? In this talk I will introduce two chemistry metrics that measure the offensive and defensive chemistry for a pair of players, respectively. The offensive chemistry metric measures the pair's joint performance in terms of scoring goals, whereas the defensive chemistry metric measures their joint performance in preventing their opponents from scoring goals. Finally, I will explain how these chemistry metrics can be used to build a perfectly balanced soccer team.
Nathan Sandholtz has recently begun a postdoctoral fellowship at the University of Toronto, where he is working on inverse optimization applications in sport with Timothy Chan. In August, he completed his PhD in statistics from Simon Fraser University under the supervision of Derek Bingham and Luke Bornn. Nate is originally from Provo, Utah where he will be returning after the postdoc to join the statistics faculty at Brigham Young University.
Measuring Spatial Allocative Efficiency in Basketball
Every shot in basketball has an opportunity cost; one player’s shot eliminates all potential opportunities from their teammates for that play. For this reason, player-shot efficiency should ultimately be considered relative to the lineup. This aspect of efficiency, the optimal way to allocate shots within a lineup, is the focus of our paper. Allocative efficiency should be considered in a spatial context since the distribution of shot attempts within a lineup is highly dependent on court location. We propose a new metric for spatial allocative efficiency by comparing a player’s field goal percentage (FG%) to their field goal attempt (FGA) rate in the context of both their four teammates on the court and the spatial distribution of their shots. Leveraging publicly available data provided by the National Basketball Association (NBA), we estimate player FG% at every location in the offensive half court using a Bayesian hierarchical model. Then, by ordering a lineup’s estimated FG%s and pairing these rankings with the lineup’s empirical FGA rate rankings, we detect areas where the lineup exhibits inefficient shot allocation. Lastly, we analyze the impact that sub-optimal shot allocation has on a team’s overall offensive potential, demonstrating that inefficient shot allocation correlates with reduced scoring.
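The rank-pairing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `rank_gap` quantity, the sign convention, and the numbers for the five hypothetical players are all assumptions made for illustration, and the Bayesian FG% estimation is replaced by fixed values for a single court region.

```python
from typing import Dict


def rank_desc(values: Dict[str, float]) -> Dict[str, int]:
    """Rank players 1..n, with rank 1 for the largest value.

    Ties are broken by insertion order; a real analysis would need a tie rule.
    """
    ordered = sorted(values, key=lambda name: values[name], reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def rank_gap(fg_pct: Dict[str, float], fga_rate: Dict[str, float]) -> Dict[str, int]:
    """FGA-rate rank minus FG% rank for each player in one court region.

    Positive: the player shoots less often than their efficiency rank suggests.
    Negative: the player shoots more often than their efficiency rank suggests.
    """
    fg_rank = rank_desc(fg_pct)
    fga_rank = rank_desc(fga_rate)
    return {player: fga_rank[player] - fg_rank[player] for player in fg_pct}


# Hypothetical lineup values for a single court region (made-up numbers).
fg_pct = {"A": 0.45, "B": 0.40, "C": 0.38, "D": 0.35, "E": 0.30}
fga_rate = {"A": 0.10, "B": 0.30, "C": 0.25, "D": 0.20, "E": 0.15}

gaps = rank_gap(fg_pct, fga_rate)
print(gaps)
```

In this made-up region, player A is the most efficient shooter but takes the fewest shots, so A has the largest positive gap; a gap of zero for every player would mean the shot ordering matches the efficiency ordering.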
Paul van Staden
University of Pretoria
Paul J. van Staden is a senior lecturer in the Department of Statistics at the University of Pretoria, South Africa. He earned his PhD in Mathematical Statistics from the University of Pretoria in 2014. His research focuses on statistical distribution theory, in particular the construction of generalized families of probability distributions in the quantile statistical universe. With respect to sports analytics, his research interests include team and player performance measures in cricket, modeling of score distributions in rugby and cricket, and rating and ranking systems. He has furthermore collaborated in various research projects ranging from tongue protrusion and bite mark analysis in forensic pathology to oral and dental diseases in cheetahs.
Points, rating & ranking systems in professional tennis
The Association of Tennis Professionals (ATP) for men’s tennis and the Women's Tennis Association (WTA) use rolling 52-week points-accumulation rating systems. These systems rank tennis players according to points awarded for their performances in tournaments without taking strength of opponents or score differences into account. Because of the coronavirus pandemic, the 2020 tennis calendar had to be revised with tennis tournaments cancelled or postponed. Consequently the ATP and the WTA froze their rankings on 16 March 2020 and have subsequently introduced revised rating systems, which are still based on accumulation of points. This paper develops a points-exchange rating system for tennis with the rating points of the competing players adjusted by equal, opposite amounts after each tennis match based on the margin of victory in that match. Since margin of victory cannot be directly calculated from traditional tennis scores, a points system for tennis matches is also presented in which tennis scores are converted into performance points. The proposed points and rating systems are illustrated with the 2019 Wimbledon Championships.
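To make the "equal, opposite amounts" idea concrete, here is a minimal Python sketch of a generic points-exchange update. It is not the rating formula from the paper: the function name, the scaling constant `k`, and the choice to compare the observed margin with the rating difference are assumptions made for illustration, and the conversion of tennis scores into performance points is taken as given through the `margin` input.

```python
import math
from typing import Tuple


def exchange_points(rating_a: float, rating_b: float, margin: float,
                    k: float = 0.1) -> Tuple[float, float]:
    """Return updated ratings after one match (hypothetical update rule).

    margin > 0 means player A beat player B by that many performance points.
    The expected margin is assumed to be the current rating difference, and
    the transfer is proportional to how far the result deviated from it.
    """
    expected_margin = rating_a - rating_b
    transfer = k * (margin - expected_margin)
    # Equal and opposite adjustments: what A gains, B loses.
    return rating_a + transfer, rating_b - transfer


before = (100.0, 90.0)
after = exchange_points(*before, margin=20.0)
print(after)
# The total number of rating points in the system is unchanged.
print(math.isclose(sum(before), sum(after)))
```

Because every adjustment is mirrored, the sum of the two ratings stays the same after each match, which is the defining property of a points-exchange system as opposed to a points-accumulation one.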
Chris Watkins is an analyst and contributing writer to sites such as HockeyGraphs, RawCharge and Blue Seat Blogs. In addition, he is the occasional host of the Let's Do That Hockey podcast, which explores the intersection of analytics, on-ice strategy, and race and society within the sport of hockey.
Racial Bias in Drafting and Development: The NHL’s Black Quarterback Problem
Coming soon!
Workshops
Meghan Hall is a higher education data professional by day and an amateur hockey analyst by night. She contributes to Hockey-Graphs and also creates tutorials and resources, particularly for beginners, for R and Tableau because she loves helping other people learn these tools to make their work easier. She has presented her hockey analysis work, on league-wide trends in goalie pulling and features of the penalty kill, at the Seattle Hockey Analytics Conference, the Rochester Institute of Technology Sports Analytics Conference, the Ottawa Hockey Analytics Conference, and the Columbus Blue Jackets Hockey Analytics Conference. She also presented virtually at Hockey (Analytics) Night in Canada on how and why to learn a programming language.
Leveling Up With The Tidyverse (And Hockey Data)
In this workshop targeted to beginner and intermediate R users, we'll go through a sample hockey analysis question and discuss techniques from the tidyverse and beyond (including user-defined functions, RMarkdown, custom ggplot2 themes, and data manipulation with tidyr and dplyr) to learn how to level up your R programming and make your analysis more efficient and more reproducible.
Tom Mock works in Customer Success at RStudio, helping empower their High Technology and Sports customers to fully leverage open-source Data Science with R and Python via RStudio's Professional software. Tom holds an MS in Exercise Physiology from FAU and a PhD in Neurobiology from UNTHSC. He founded #TidyTuesday to help newcomers and seasoned vets improve their Tidyverse skills. In his spare time, he gives back to the open source community by writing guided tutorials (mostly with nflfastR data) at themockup.blog.
Introduction to machine learning with the tidyverse and nflfastR data
This workshop provides a gentle introduction to supervised machine learning: concepts, methods, and R code. Participants will learn how to apply a few common methods on open-source NFL data via nflfastR. Along the way, I'll introduce several core tidymodels packages, which provide a grammar for modeling that makes it easy to do the right thing, and harder to accidentally do the wrong thing.
Live Interview: Too Many Men with Alexandra Mandrycky
Too Many Men is a podcast co-hosted by Sara Civian, Shayna Goldman, and Alison Lukan. It is a hockey and sports adjacent show focused on giving voice to the real thoughts of some women in the sports world. Irreverent, honest, and funny, these women share what's on their mind, who's on their sh*t list, and more.
Alexandra Mandrycky is the Director of Hockey Strategy and Research for the Seattle Kraken. She and her group facilitate the use of data and technology throughout the Hockey Operations department. Alexandra previously spent four seasons as Hockey Operations Analyst with the Minnesota Wild. Alexandra graduated from Georgia Tech in 2013 with a B.S. in Industrial Engineering.
Sara Civian is a Staff Writer for The Athletic, covering the Hurricanes. Before The Athletic she covered the Boston Bruins for WEEI 93.7-FM. She spent the five years prior at Penn State, where she served as managing editor for Onward State.
Live Interview: Chilling with Charlie with Doug Fearing
The world's best dog interviews people that she finds interesting in sports analytics. Chilling With Charlie started with a simple premise: wanting to share people's journeys and what got them into sports analytics in the first place. Charlie hopes to be an avenue whereby up-and-coming sports analytics professionals can come on and promote their work, themselves and their ideas. By sharing these conversations with a wider audience, Charlie hopes to encourage more people to enter the sports analytics community.
Doug Fearing is the Founder and President of Zelus Analytics, an Austin-based sports analytics startup creating the world’s best sports intelligence platform. Prior to founding Zelus, he spent four seasons building the R&D department for the L.A. Dodgers, reaching the World Series in 2017 and 2018. Doug has held tenure-track faculty positions at Harvard Business School and UT Austin’s McCombs School of Business. He received his Ph.D. in Operations Research from MIT and his B.S. in Computer Science at CMU.
Robert Nguyen is a PhD student in statistics at the University of New South Wales. He loves kicking the footy with his pet dog (Charlie), watching his AFL team win (Eagles) and encouraging reproducible sports analytics in the Australian sports space.
CMSACamp 2020 Student Speakers
Bria Cratty (view slides)
University of Miami
My name is Bria Cratty, and I am a senior at the University of Miami studying individualized general business with a minor in psychology. I am also in the Dean’s Scholar Program, an accelerated program that allows me to complete my undergraduate degree in three years and a Master of Science in Business Analytics in my fourth year. This summer I worked with Jack de la Parra on our final project for the Carnegie Mellon Sports Analytics Camp. After graduation, I plan to start a career in strategy analytics and/or consulting.
Redefining the Penalty Box: Does the Punishment Fit the Crime?
In soccer, awarding a penalty kick to a player fouled near the goal is a long-standing practice intended to keep the game as fair as possible. However, accumulated match statistics indicate that penalty kicks are converted to goals at a much higher rate than shots taken during regular play. Previous studies have shown many inconsistencies in how and when fouls are called and the potentially monumental impact a penalty kick can have on the outcome of a match. This identified a need to redefine the current penalty kick to create a scoring opportunity more comparable to the opportunity players would have had in the absence of the foul. Using StatsBomb open data at the event level for the 2018 Men’s World Cup, we used a generalized additive model approach to predict expected goals for different distances and angles to the goal. The goal was to change the location of the penalty kick so that the chances of scoring were closer to the chances during regular play. We proposed replacing the current penalty kick spot with a penalty kick arc. The arc would position the kick-taker based on where the foul occurred within the box and would extend the distance from which the kick is taken. We believe this new method of penalty kicks will allow for a fairer experience while keeping the traditional aspects of the game fairly constant.
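The two predictors named in the abstract, distance and angle to the goal, can be computed from shot coordinates as in the following minimal Python sketch. It is an illustration only, not the authors' code: the function names are hypothetical, and the pitch geometry (a 120 x 80 coordinate grid with the goal centered at x = 120, y = 40 and posts at y = 36 and y = 44) is an assumption about the StatsBomb coordinate convention that should be checked against the data specification. Fitting the generalized additive model itself is not shown.

```python
import math

# Assumed pitch geometry (see the lead-in): attacking goal on the line x = 120.
GOAL_X = 120.0
POST_Y_LOW = 36.0
POST_Y_HIGH = 44.0
GOAL_CENTER_Y = (POST_Y_LOW + POST_Y_HIGH) / 2


def shot_distance(x: float, y: float) -> float:
    """Straight-line distance from the shot location to the goal center."""
    return math.hypot(GOAL_X - x, GOAL_CENTER_Y - y)


def shot_angle(x: float, y: float) -> float:
    """Angle in radians subtended by the goal mouth at the shot location."""
    to_high_post = math.atan2(POST_Y_HIGH - y, GOAL_X - x)
    to_low_post = math.atan2(POST_Y_LOW - y, GOAL_X - x)
    return abs(to_high_post - to_low_post)


# Under the assumed geometry, the penalty spot is 12 units from the goal line.
penalty_spot = (108.0, 40.0)
print(shot_distance(*penalty_spot), shot_angle(*penalty_spot))

# A wider location at the same depth sees a narrower goal mouth.
wide_spot = (108.0, 60.0)
print(shot_distance(*wide_spot), shot_angle(*wide_spot))
```

Moving the kick along an arc, as the abstract proposes, changes both quantities at once, which is why a model over distance and angle is a natural way to compare candidate kick positions.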
Raj Dasani (view slides)
UC Berkeley
My name is Raj Dasani and I am a 3rd year student at UC Berkeley studying Business and Data Science. I currently work for the women's basketball team at Cal as a data and film analyst and am aspiring to work in sports analytics whether for a professional basketball team or a sports-themed organization. Outside of academics and career, I play the trumpet as part of the University Marching Band!
Quantifying Passing: Using NBA Tracking Data to Create an Expected Assist Model
In the game of basketball, the success of a possession hinges not just on the quality of a shot but also on the quality of the preceding passes. Although much research has been done modeling the expected outcome of a shot, much less is devoted to the value added from a pass. This paper attempts to bridge that gap by offering an Expected Assists model. Using publicly available tracking data from the 2015-16 NBA season, we employ a combination of a rules-based approach and a Generalized Additive Model to identify when a pass occurs. Due to heavy computational limitations, our research concentrates on all Los Angeles Clippers games from December 2015. We then develop a shot model that projects the chance of a player making a shot given his location and the defense around him. We incorporate this into a passing model that takes into consideration the timing and location in order to create an expected points added metric for each pass. This metric allows us to identify passers who are creating good shooting opportunities for their teammates without penalizing them when their teammates are unable to convert. Isolating an individual player’s play-making from team success is a key component of equitable player evaluation.
Fiona Dunn (view slides)
Kenyon College
Fiona is a junior at Kenyon College where she is studying Economics with a minor in Statistics. She is an avid Boston sports fan and is a member of the Kenyon Varsity Women’s Soccer team.
High Anticipation: Exploring Trends Between Public Perception and Player Value
Being a professional basketball player is one of the many ways one can be at the forefront of the public’s eye. The way they play, their actions off the court, and their social media presence are all factors that can affect how a player is perceived by the public. This research aims to identify trends in the relationship between public perception and a player’s performance in a game. The research focuses on the 2018 NBA first-round draft class. The data consists of a series of data scrapes from Reddit, Wikipedia, and Google to gauge public perception, and NBA box score data to measure player performance. Hierarchical clustering was the best type of analysis to accurately display our public perception data, which consisted of sentiment and popularity metrics. Two clusters were created: players who were not very popular and had low positive sentiment, and players who were more popular and had higher positive sentiment. An XGBoost classification model, which achieved an AUC of 0.887, was trained to predict which cluster a player would fall into based on basketball performance variables such as number of 3-point goals and assists. The results showed that points scored by a player and minutes on the court were given the most importance by the model. These results indicate a relationship between a player’s performance and public perception, but confirming this would require a larger sample and more complex modeling.
Meg Ellingwood (view slides)
Kenyon College
Meg is a senior at Kenyon College majoring in Psychology and Anthropology with a minor in Statistics and a concentration in Scientific Computing. She is an avid hockey fan (go Blue Jackets!), and hopes to pursue a career in statistics.
Draft Decisions in Uncertain Times: Valuing and Simulating NHL Draft Picks
The NHL draft lottery is complicated even at the best of times, but it is especially so this year, when the Pittsburgh Penguins must decide what to do with a conditional first-round pick traded to the Minnesota Wild. Using publicly available data on player statistics and team performance, we built two models. The first model is an NHL draft slot value curve, which predicts a player’s contribution to his team based on his draft position, while the second model predicts a team’s performance in the upcoming season based on the previous season’s statistics. We used this second model to simulate league-wide season rankings and then first-round draft pick orders to generate a distribution of possible outcomes. Combining these two models, we created a one-number index to compare the value of this year’s pick with the probable value of next year’s pick. Based on our modeling and simulation, we propose a solution to this dilemma that maximizes the Penguins' chances of future success.
James Hyman (view slides)
Syracuse University
James is a senior sport analytics and neuroscience major at Syracuse, where he will be pursuing a master’s in applied data science next fall.
Quantifying Passing: Using NBA Tracking Data to Create an Expected Assist Model
Alex Lagarde (view slides)
Elon University
Alex Lagarde is a junior at Elon University, where he is majoring in statistics with a concentration in data analytics. After graduation, he plans to pursue a Master's Degree and a career in data analytics, with interests in sports and business intelligence.
Quantifying Passing: Using NBA Tracking Data to Create an Expected Assist Model
Ashley Mullan (view slides)
University of Scranton
I am a junior at the University of Scranton, where I am a double major in Applied Mathematics and Philosophy with a concentration in Data Science. My interests include applications of statistics to the humanities.
A Puck Above the Rest: Exploring the Effects of New Data on 2020 NHL Draft Decisions
Our work aimed to understand how the addition of new season data to existing player records would impact a player's place in the National Hockey League draft. Previous authors used generalized additive models in a similar context to measure time on ice as an indicator of player success, and many used the relative age of the player as one of their predictor variables. Our custom model incorporates both a logistic model and a generalized additive model to compare players' draft-year statistics with their statistics from one year later. The indicator of a player’s performance was the percentage of a team’s points per game that he scored. The model outputs the product of the player's probability of being successfully drafted and the player's expected NHL contribution given that he is drafted. Future work may benefit from considering different subsets of players and assessing league strength to generalize the results.
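One way to read the model output described above is as an unconditional expected contribution. The symbols here are introduced for illustration only: D = 1 if the player is drafted and C is his NHL contribution. Under the added assumption that an undrafted player contributes nothing (an assumption not stated in the abstract), the law of total expectation gives:

```latex
\mathbb{E}[C]
  = \Pr(D = 1)\,\mathbb{E}[C \mid D = 1] + \Pr(D = 0)\cdot 0
  = \Pr(D = 1)\,\mathbb{E}[C \mid D = 1]
```

If the logistic component supplies the draft probability and the additive component supplies the conditional contribution (the abstract does not say which component does which), their product is this expectation.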
Jack de la Parra (view slides)
Bucknell University
My name is Jack de la Parra and I am a junior at Bucknell University studying Mathematical Economics with a concentration in Statistics. I worked alongside Bria Cratty this summer at the Carnegie Mellon Sports Analytics Summer Research Program. I have always had an interest in mathematics and statistics and look to begin a career in sports analytics, data or actuarial science.
Redefining the Penalty Box: Does the Punishment Fit the Crime?
Caleb Peña (view slides)
California State University, Fullerton
Caleb Peña is a Research and Development intern at NHL Seattle. He is currently finishing his senior year at California State University, Fullerton where he is majoring in mathematics with a concentration in probability and statistics.
Quantifying Passing: Using NBA Tracking Data to Create an Expected Assist Model
Sahana Rayan (view slides)
Purdue University
I am a junior at Purdue University majoring in Applied Statistics and Computer Science. Over the last 4 years, I have been working on machine learning and data science projects that have piqued my interest in a variety of applications of data.
High Anticipation: Exploring Trends Between Public Perception and Player Value
Being a professional basketball player is one of the many ways one can be at the forefront of the public’s eye. The way players perform, their actions off the court, and their social media presence are all factors that can affect how they are perceived by the public. This research aims to determine trends in the relationship between public perception and a player’s performance in a game. The research focuses on the 2018 NBA first-round draft class. The data consist of a series of data scrapes from Reddit, Wikipedia, and Google to gauge public perception, along with NBA box score data to measure player performance. Hierarchical clustering was the best type of analysis to accurately display our public perception data, which consisted of sentiment and popularity metrics. Two clusters were created: players who were not very popular with low positive sentiment and players who were more popular with higher positive sentiment. An XGBoost classification model was trained with a 0.887 AUC score to predict which players would fall into each cluster based on basketball performance variables like number of 3-point goals and assists. The results showed that points scored by a player and minutes on the court were given the most importance by the model. These results indicate a relationship between a player’s performance and public perception, but confirming this would require an expansion of the sample and more complex modeling.
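For readers unfamiliar with the AUC figure quoted above, it can be read as the probability that a randomly chosen player from one cluster receives a higher model score than a randomly chosen player from the other. A minimal sketch of that pairwise definition follows; the function name and the toy labels and scores are illustrative and are not taken from the study.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = "more popular" cluster, 0 = "less popular" cluster.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two clusters.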
Jill Reiner (view slides)
Denison University
Jill Reiner is a junior at Denison University majoring in Data Analytics with a concentration in Economics and a minor in Mathematics. With respect to sports analytics, she is interested in draft strategy and prospect analysis in all sports, especially hockey.
Draft Decisions in Uncertain Times: Valuing and Simulating NHL Draft Picks
The NHL draft lottery is complicated even at the best of times, but it is especially so this year, when the Pittsburgh Penguins must decide what to do with a conditional first-round pick traded to the Minnesota Wild. Using publicly available data on player statistics and team performance, we built two models. The first model is an NHL draft slot value curve, which predicts a player’s contribution to his team based on his draft position, while the second model predicts a team’s performance in the upcoming season based on the previous season’s statistics. We used this second model to simulate league-wide season rankings and then first-round draft pick orders to generate a distribution of possible outcomes. Combining these two models, we created a one-number index to compare the value of this year’s pick with the probable value of next year’s pick. Based on our modeling and simulation, we propose a solution to this dilemma that maximizes the Penguins' chances of future success.
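The combination of a simulated pick distribution with a slot value curve can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: team strengths are hypothetical projected point totals, season-to-season uncertainty is modelled as Gaussian noise, picks are ordered strictly from worst to best finish (the lottery is ignored), and the value curve passed in is a placeholder.

```python
import random
from collections import Counter


def simulate_pick_distribution(team_strength, team, n_sims=10_000,
                               noise_sd=5.0, seed=0):
    """Return {pick_number: probability} for `team`.

    Each simulation adds Gaussian noise to every team's projected points,
    then orders picks from worst to best simulated finish (no lottery).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_sims):
        simulated = {t: pts + rng.gauss(0.0, noise_sd)
                     for t, pts in team_strength.items()}
        order = sorted(simulated, key=simulated.get)  # worst team picks first
        counts[order.index(team) + 1] += 1
    return {pick: c / n_sims for pick, c in sorted(counts.items())}


def expected_pick_value(pick_distribution, value_curve):
    """Probability-weighted value of a pick whose slot is uncertain."""
    return sum(p * value_curve(pick) for pick, p in pick_distribution.items())


# Hypothetical three-team league and placeholder value curve.
strength = {"A": 70.0, "B": 85.0, "C": 100.0}
dist = simulate_pick_distribution(strength, "C", n_sims=2000, seed=1)
value = expected_pick_value(dist, lambda pick: 1.0 / pick)
```

Comparing this expected value for next year's uncertain pick against the known value of this year's slot gives a single number on which to base the decision, in the spirit of the index described in the abstract.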
Thea Sukianto (view slides)
Boise State University
Thea Sukianto is a senior undergraduate student in Applied Mathematics at Boise State University. Her main research focus is spatio-temporal statistics with applications in agriculture and seismology. In sports analytics, she is interested in simulating game outcomes and constructing rating systems.
Evaluating Parametric Methods for Modeling European Soccer Team Goals
There is no doubt that soccer is one of the most watched and favored sports around the world. With this background, soccer betting has become a popular event for soccer fans and gamblers. To bet on the right team, we desire to know which team would score more goals in a specific match, i.e., we want to know which team has the best rating. In the realm of parametric models, zero-inflation and overdispersion have created challenges in using the traditional Poisson goal count-based model for ratings. As an alternative, previous research has proposed a negative binomial GLM to address overdispersion and zero-inflated Poisson to address zero-inflation. However, comparative studies for determining the best probability distribution to model such ratings are extremely rare. In this study, we compare the performance of four GLMs: Poisson, negative binomial, and their zero-inflated analogues, on predicting match outcome in various European leagues. Through maximum likelihood estimation and match simulation using model parameters, we found that the zero-inflated Poisson GLM yields the best prediction performance on Premier League matches during the 2015-16 season. Moreover, through the multi-category Brier score comparison, we found that this result could potentially be extended to other European leagues as well.
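The ingredients named in this abstract (Poisson and zero-inflated Poisson goal distributions, simulated match outcomes, and the multi-category Brier score) can be sketched in a few lines. The example is a minimal illustration with fixed, made-up scoring rates and independent goal counts; the study itself fits GLMs by maximum likelihood, which is not reproduced here, and all function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: extra mass `pi` at zero, Poisson otherwise."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def outcome_probs(home_pmf, away_pmf, max_goals=15):
    """(home win, draw, away win), assuming independent goal counts."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = home_pmf(h) * away_pmf(a)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise the truncated tail
    return home / total, draw / total, away / total


def brier_score(probs, outcome_index):
    """Multi-category Brier score for one match (0 is a perfect forecast)."""
    return sum((p - (1.0 if i == outcome_index else 0.0)) ** 2
               for i, p in enumerate(probs))


# Illustrative rates only: Poisson home side, zero-inflated away side.
probs = outcome_probs(lambda k: poisson_pmf(k, 1.5),
                      lambda k: zip_pmf(k, 1.1, 0.1))
score_if_home_win = brier_score(probs, 0)
```

Averaging `brier_score` over a set of matches gives one number per model, which is how competing goal distributions can be ranked; some definitions additionally divide by the number of categories.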
Lucy Ward (view slides)
University of Wyoming
Lucy Ward is a fourth year undergraduate completing a degree in statistics at the University of Wyoming. She plans to pursue graduate studies and a career in statistics, with a focus on public health and medical research.
A Puck Above the Rest: Exploring the Effects of New Data on 2020 NHL Draft Decisions
Our work aimed to understand how the addition of new season data to existing player records would impact a player's place in the National Hockey League draft. Previous authors used generalized additive models in a similar context to measure time on ice as an indicator of player success, and many used the relative age of the player as one of their predictor variables. Our custom model incorporates both a logistic model and a generalized additive model to compare players' draft-year statistics with their statistics from one year later. The indicator of a player’s performance was the percentage of his team’s points per game that he scored. The model outputs the product of the player's probability of being drafted and the player's expected NHL contribution given that he is drafted. Future work may benefit from considering different subsets of players and assessing league strength to generalize the results.
Alana Willis (view slides)
Winston-Salem State University
Alana Willis is a graduating senior at Winston-Salem State University in North Carolina studying mathematics with a concentration in statistics. She currently works as a Teaching Assistant and as treasurer of the Math and Stat Club, and has served as a manager for the Women's Basketball Team. She hopes to become a data analyst for the WNBA.
High Anticipation: Exploring Trends Between Public Perception and Player Value
Being a professional basketball player is one of the many ways one can be at the forefront of the public’s eye. The way players perform, their actions off the court, and their social media presence are all factors that can affect how they are perceived by the public. This research aims to determine trends in the relationship between public perception and a player’s performance in a game. The research focuses on the 2018 NBA first-round draft class. The data consist of a series of data scrapes from Reddit, Wikipedia, and Google to gauge public perception, along with NBA box score data to measure player performance. Hierarchical clustering was the best type of analysis to accurately display our public perception data, which consisted of sentiment and popularity metrics. Two clusters were created: players who were not very popular with low positive sentiment and players who were more popular with higher positive sentiment. An XGBoost classification model was trained with a 0.887 AUC score to predict which players would fall into each cluster based on basketball performance variables like number of 3-point goals and assists. The results showed that points scored by a player and minutes on the court were given the most importance by the model. These results indicate a relationship between a player’s performance and public perception, but confirming this would require an expansion of the sample and more complex modeling.
Zhiwei Xiao (view slides)
University of Michigan
Senior student studying Mathematics and Statistics at the University of Michigan, interested in bioinformatics and machine learning research.
Evaluating Parametric Methods for Modeling European Soccer Team Goals
There is no doubt that soccer is one of the most watched and favored sports around the world. With this background, soccer betting has become a popular event for soccer fans and gamblers. To bet on the right team, we desire to know which team would score more goals in a specific match, i.e., we want to know which team has the best rating. In the realm of parametric models, zero-inflation and overdispersion have created challenges in using the traditional Poisson goal count-based model for ratings. As an alternative, previous research has proposed a negative binomial GLM to address overdispersion and zero-inflated Poisson to address zero-inflation. However, comparative studies for determining the best probability distribution to model such ratings are extremely rare. In this study, we compare the performance of four GLMs: Poisson, negative binomial, and their zero-inflated analogues, on predicting match outcome in various European leagues. Through maximum likelihood estimation and match simulation using model parameters, we found that the zero-inflated Poisson GLM yields the best prediction performance on Premier League matches during the 2015-16 season. Moreover, through the multi-category Brier score comparison, we found that this result could potentially be extended to other European leagues as well.
Reproducible Research Competition
Open Track Finalists
Gregory Matthews (view paper)
Loyola University Chicago
Dr. Gregory J. Matthews is an associate professor of statistics and director of the data science program at Loyola University Chicago. He received his Ph.D. in statistics from the University of Connecticut in 2011 and completed a postdoctoral fellowship in the School of Public Health at the University of Massachusetts Amherst in 2014. In 2016, he, along with Ben Baumer and Shane Jensen, won the SABR Conference Research Award for Contemporary Baseball Analysis for his work on openWAR, and in 2014 he won the March Machine Learning Mania Kaggle contest as part of a team with Mike Lopez.
Bang the Can Slowly: An Investigation into the 2017 Houston Astros
This manuscript is a statistical investigation into the 2017 Major League Baseball scandal involving the Houston Astros, the World Series championship winner that same year. The Astros were alleged to have stolen their opponents' pitching signs in order to provide their batters with a potentially unfair advantage. This work finds compelling evidence that the Astros' on-field performance was significantly affected by their sign-stealing ploy and quantifies the effects. The three main findings in the manuscript are: 1) the Astros' odds of swinging at a pitch were reduced by approximately 27% (OR: 0.725, 95% CI: (0.618, 0.850)) when the sign was stolen, 2) when an Astros player swung, the odds of making contact with the ball increased roughly 80% (OR: 1.805, 95% CI: (1.342, 2.675)) on non-fastball pitches, and 3) when the Astros made contact with a ball on a pitch in which the sign was known, the ball's exit velocity (launch speed) increased on average by 2.386 (95% CI: (0.334, 4.451)) miles per hour.
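The percentage changes quoted alongside each odds ratio follow directly from the ratio itself, since an odds ratio of 0.725 means the odds are multiplied by 0.725. The short sketch below shows that arithmetic and how an odds ratio shifts a probability; the 50% baseline swing probability is purely hypothetical and does not come from the manuscript.

```python
def odds_ratio_to_percent_change(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained by scaling the baseline odds by `odds_ratio`."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio
    return odds / (1.0 + odds)


swing_change = odds_ratio_to_percent_change(0.725)    # about -27.5
contact_change = odds_ratio_to_percent_change(1.805)  # about +80.5

# Hypothetical baseline: a batter who swings at half of all pitches.
swing_prob_with_sign = apply_odds_ratio(0.5, 0.725)
```

Note that a 27% reduction in odds is not a 27-percentage-point drop in probability; the size of the probability change depends on the baseline.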
Benjamin Robinson (view paper)
Grinding the Mocks
Benjamin Robinson is a data scientist living in Washington, DC and the creator of Grinding the Mocks, a project that tracks how NFL prospects fare in mock drafts. You can follow him on Twitter @benj_robinson and find the Grinding the Mocks project at grindingthemocks.com.
Grinding the Bayes: A Hierarchical Modeling Approach to Predicting the NFL Draft
Using the 2018 NFL Draft as a case study, this paper extends my initial work (Robinson 2020) on the efficacy of using mock draft data to forecast player-level draft outcomes. By using the same data and applying a more rigorous analytical approach (Bayesian hierarchical modeling with Markov Chain Monte Carlo simulations), methods are developed that allow NFL decision makers to more accurately gauge when a prospect is likely to be selected in the draft. This log-adjusted measure of Expected Draft Position (EDP) explains about 85 percent of variance in actual log-adjusted draft outcomes. Finally, I discuss the implications of using these metrics to inform draft strategy and compare how EDP relates to on-field production.
Student Track Finalists
Matthew Flancer (view paper)
University of Pittsburgh
Enhancing Public Data Availability and Analysis of Olympic Sports: The Case of College Swimming
While popular team sports have experienced growth in publicly available data and analysis over the last several years, this is not the case with Olympic sports. National Olympic committees are reportedly using data to make decisions, but “public analytics” have not followed suit. Part of the reason can be attributed to the lack of readily available open datasets for these sports. This work aims to fill this gap by developing an open source application for downloading and analyzing data from college swimming. More specifically, the application obtains data from swimcloud.com and processes them into a machine-readable format. Furthermore, we provide an interactive application for visualizing and analyzing the data with a focus on two specific applications: (a) swimmer progression across seasons, and (b) tapering during the season in terms of achieving optimal performance in their respective conference finals. We hope that this work will lead to more public interest and analysis in swimming and Olympic sports in general.
Paul Ibrahim (view paper)
Cary Academy
Paul Ibrahim is a high school senior at Cary Academy in Cary, North Carolina. His areas of interest are game theory, information theory, and analytics across sports, though his research principally focuses on applications of NBA tracking data.
Comparing Free-Throw Forms Among NBA Players Through 3D Similarity Measures
In this paper, we propose a method to compare free-throw forms of various NBA players. Using publicly available SportVU tracking data from the 2015-16 NBA season [1], we identify instances of free-throw attempts and track the ball’s motion while it’s in the player’s hands in order to isolate the given player’s shot form from the data. To characterize each player’s form, we apply a multivariate kernel density estimation to the sample of the player’s free throw attempts. Furthermore, using a k-means clustering, we attempt to categorize the multivariate kernel density estimates across the sample of players, characterizing each cluster by the cluster mean. We then proceed to apply a variety of three-dimensional similarity measures to the clustered kernel density estimates, therein providing a variety of metrics by which we can assess free-throw form similarity among NBA players.
CALL FOR ABSTRACTS
In an effort to foster intellectual growth and discovery among the statistics and data science community, we gladly welcome research submissions from the public.
Submit your research project using the form by Sept 28th, indicating whether you want your submission considered for a contributed talk and/or poster. Note that there are limited spaces available, and abstracts for talks and posters will be accepted on a rolling basis until slots are filled. Final acceptance notifications will be sent out by early October.
Here's a recap of important dates and requirements to remember:
- Sept 28th: Abstract submission deadline.
- Abstracts will be selected on a rolling basis, final notification by early-October, 2020.
NOTE: This research submission form is not considered for entry into the reproducible research competition, meaning it does not require publicly available data and sharing of code (nor entry for cash prizes).
Contact Us
The Carnegie Mellon Sports Analytics Conference is proudly hosted by the Department of Statistics & Data Science and the Carnegie Mellon Sports Analytics Club.
CMSAC Program Committee:
Carnegie Mellon Sports Analytics Club Executives
- Shravan Ramamurthy
- Toby Junker
Questions can be directed to cmsac@stat.cmu.edu.
CMSAC Activities Conduct Policy
(modeled on the ASA Activities Conduct Policy approved November 30, 2018 by American Statistical Association Board of Directors)
The Carnegie Mellon Sports Analytics Conference (CMSAC) is committed to providing an atmosphere in which personal respect and intellectual growth are valued and the free expression and exchange of ideas are encouraged. Consistent with this commitment, it is CMSAC policy that all participants in CMSAC activities enjoy a welcoming environment free from unlawful discrimination, harassment, and retaliation. We strive to be a community that welcomes and supports people of all backgrounds and identities. This includes, but is not limited to, members of any race, ethnicity, culture, national origin, color, immigration status, social and economic class, educational level, sex, sexual orientation, gender identity and expression, age, size, family status, political belief, religion, and mental and physical ability.
All CMSAC participants (including, but not limited to, attendees, statisticians, data scientists, sports analysts, students, registered guests, staff, contractors, sponsors, exhibitors, and volunteers) in the conference or any other related activity, whether official or unofficial, agree to comply with all rules and conditions of the activities. Your registration for or attendance at the 2020 Carnegie Mellon Sports Analytics Conference indicates your agreement to abide by this policy and its terms.
Expected Behavior
- Model and support the norms of professional respect necessary to promote the conditions for healthy exchange of scientific ideas.
- Speak and conduct yourself professionally; do not insult or disparage other participants.
- Be conscious of hierarchical structures in the sports analytics and/or broader statistics/data science community, specifically the existence of stark power differentials among students, junior analysts/statisticians, and senior analysts/statisticians—noting that fear of retaliation from those in senior-level positions can make it difficult for students or those in junior level positions to express discomfort, rebuff unwelcome advances, and report violations of the conduct policy.
- Be sensitive to body language and other non-verbal signals and respond respectfully.
Unacceptable Behavior
- Violent threats or language directed against another person
- Discriminatory jokes and language
- Inclusion of unnecessary sexually explicit, violent, or otherwise sensitive materials in presentations
- Posting (or threatening to post), without permission, other people’s personally identifying information online, including on social networking sites
- Personal insults including, but not limited to, those using racist, sexist, homophobic, or xenophobic terms
- Unwelcome solicitation of emotional or physical intimacy such as sexual advances; propositions; sexual flirtations; sexually-related touching; and graphic gestures or comments about sex or another person’s dress, body, or sexual activities
- Advocating for, encouraging, or dismissing the severity of any of the above behaviors.
Consequences of Unacceptable Behavior
At the sole discretion of the CMSAC Program Committee, unacceptable behavior may result in removal from or denial of access to meeting facilities or activities, without refund of any applicable registration fees or costs. In addition, the CMSAC reserves the right to report violations to an individual’s employer or institution or to a law-enforcement agency. Those engaging in unacceptable behavior may also be banned from future CMSAC activities or face additional penalties.
What to Do if You Witness or Are Subject to Unacceptable Behavior
If you are being harassed, notice that someone else is being harassed, or have any other concerns relating to harassment, please contact a member of the CMSAC program committee either in person or at cmsac@stat.cmu.edu. If you witness potential harm to a conference participant, be proactive in helping to mitigate or avoid that harm; if you see or hear something that concerns you, please say something.
Process for Adjudicating Reports of Misconduct
The CMSAC will contract with an independent entity to manage and adjudicate reported violations of the conduct policy.
Note: This Code of Conduct may be revised at any time by the Carnegie Mellon Sports Analytics Conference. Questions, concerns, or comments should be directed to cmsac@stat.cmu.edu.