Executive Summary

Between 2009 and 2021, over 7,200 horses died or were euthanized due to racing-related injuries (Fobar, 2023). Using horse profile data and horse tracking data from the NYRA, we hoped to identify horses who under-raced between 2019 and 2021, cluster movement profiles for horses who raced in 2019 New York races, and discover whether certain profiles were more associated with injured horses.

By fitting a negative binomial model on horse profile data and performing residual analysis, we discovered that at least 251 horses under-raced between 2019 and 2021. Additionally, clustering horse movement profiles revealed that a horse’s speed profile is most associated with its injury status: specifically, greater variation in speed is more associated with injury.

Introduction

Between 2009 and 2021, over 7,200 horses died or were euthanized due to racing-related injuries (Fobar, 2023). In June 2023, Churchill Downs in Kentucky suspended racing after twelve horses sustained fatal injuries at the track during the spring season (Romero, 2023). Several horse racing associations in the United States, such as the New York Racing Association (NYRA), are invested in lowering the prevalence of race-related injuries by understanding what increases injury risk (NYRA Safety, 2020).

At the same time, horse racing data has significantly advanced in recent years. In 2022, the New York Racing Association (NYRA) and New York Thoroughbred Horsemen’s Association (NYTHA) co-sponsored the first-ever Big Data Derby, where horse tracking data was released for public analysis. The released data includes several metrics that had been previously unavailable in horse racing, such as the horse’s longitude and latitude every 0.25 seconds in the race (Addison Howard, 2022).

With the newly-available tracking data, we hope to discover whether certain movement patterns are common for horses who became injured in 2019. In particular, through this project, we aim to contribute the following:

  1. Identify horses who under-raced between 2019 and 2021, expanding the subset of injured horses beyond those who had a severe injury reported.
  2. Cluster movement profiles for horses who raced in a New York race in 2019 using the newly available horse tracking data.
  3. Discover whether certain movement clusters are more likely to be associated with horses who became injured in 2019.

We hope that analyzing the movement profiles of horses with the tracking data will help NYRA (and other racing associations) accomplish their goal of reducing the occurrence of race-related injuries and fatalities in horses by seeing whether certain movement profiles are riskier for horses.


Data

Horse Profiles:

The horse profiles data set includes details about each horse who was tracked in a NYRA race during 2019. In the final data set, there are 4,589 unique horses and 40 variables. Each observation in the data set represents a horse and their relevant information, such as the number of races they competed in during 2019, summary statistics for their days between races, and their injury status. We used three sources to collect data about each tracked horse. First, we ascertained severe injury and fatality data through Joe Appelbaum at NYRA. Similarly, we received a list of all horses who started at a NYRA track between 2019 and 2021 from Joe Appelbaum. The relevant variables from the horse profiles data set include:

  • If Injury Reported: whether the horse became severely injured at a NYRA track.

  • Fatal: whether the horse’s reported severe injury was fatal or not.

  • Ever Under-Raced: whether the horse under-raced given their age (in years) and the calendar year (the method used to classify a horse as “under-racing” is described in the methods section below).

Horse Tracking Data:

The horse tracking data set includes information about each tracked NYRA race during 2019. The original data set was publicized as part of the 2022 NYTHA/NYRA Big Data Derby (Addison Howard, 2022) and includes 25 variables for 4,589 unique horses who competed in at least one of 1,991 races for a total of 5,162,881 observations. Each observation in the data set represents a single horse’s position at a particular frame within a race, represented as longitude and latitude. With the help of data cleaning functions from Brendan Kumagai and his team (Stokes et al., 2022), we converted horse positions to Cartesian coordinates (in meters) and produced horse trajectories for each of the following relevant metrics:

  • Speed: a horse’s speed (in meters per second) in the current frame.

  • Acceleration: a horse’s acceleration (in meters per second squared) in the current frame.

  • Lateral Movement: a horse’s side-to-side movement (in meters) from the previous frame to the current frame.
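As a rough illustration of how speed and acceleration follow from positions sampled every 0.25 seconds, the sketch below uses simple backward differences on Cartesian coordinates. It is not the cleaning code from Stokes et al. (2022), which we did not reproduce here; the function names and the sample coordinates are hypothetical.

```python
import math

FRAME_SECONDS = 0.25  # the tracking data records one position every 0.25 s


def speeds(xs, ys, dt=FRAME_SECONDS):
    """Speed (m/s) per frame from Cartesian positions in meters.

    Uses a backward difference, so the first frame has no value (None).
    """
    out = [None] if xs else []
    for k in range(1, len(xs)):
        step = math.hypot(xs[k] - xs[k - 1], ys[k] - ys[k - 1])
        out.append(step / dt)
    return out


def accelerations(speed, dt=FRAME_SECONDS):
    """Acceleration (m/s^2) per frame as the backward difference of speed.

    The first two frames have no value because speed itself starts at frame 1.
    """
    out = [None] * min(2, len(speed))
    for k in range(2, len(speed)):
        out.append((speed[k] - speed[k - 1]) / dt)
    return out


# Hypothetical horse moving along the x-axis (positions in meters).
xs = [0.0, 3.0, 7.0, 11.0]
ys = [0.0, 0.0, 0.0, 0.0]
v = speeds(xs, ys)
a = accelerations(v)
```

Raw finite differences of this kind amplify positional noise, so in practice some smoothing of the trajectories is usually applied before differencing.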


Exploratory Data Analysis (EDA)

Before identifying under-racing horses through an expected counts model, we visualized some key variables with the NYRA Start List Data Set; namely, we visualized the relationship between horse age and race count and the relationship between calendar year and race count to have a better sense of how these variables are associated with a horse’s race count.

Figure 1: Exploratory Data Analysis for Expected Race Count Model

Figure 1 Key Takeaways: The left side of Figure 1 depicts the relationship between a horse’s age (in years) and their number of races, fit with a simple linear regression model. From the regression line, we see that as a horse’s age (in years) increases, the expected number of races the horse will compete in increases as well for horses between 1 and 10 years old. The plot does not give us any indication that we should treat a horse’s age as a factor rather than a continuous variable. The right side of Figure 1 illustrates the relationship between the calendar year and each horse’s number of races in that year. The change in the median number of races from 2019 to 2020 differs from the change from 2020 to 2021, which confirms that we should treat calendar year as a factor (rather than a continuous variable) in our expected race count model.

Before computing summary statistics and clustering horse movement profiles, we visualized horse trajectories from the 2019 NYRA Horse Tracking Data Set; namely, we visualized the position, speed, acceleration, lateral movement, cumulative lateral movement, and STRAIN for horses who ran at the Aqueduct on April 19th, 2019. These visualizations provide a better understanding of common patterns of horse movement throughout a race.

Figure 2: Horse Tracking Data from Race 1 at The Aqueduct on April 19th, 2019

Figure 2 Key Takeaways: Figure 2 depicts the movement of six horses during Race 1 at the Aqueduct Racetrack on April 19th, 2019. It is a one-turn, 1600-meter race that culminates at the vertical finish line on the upper side of the track.

Figure 3: Speed and Acceleration Trajectories from the Aqueduct on April 19th, 2019

Figure 3 Key Takeaways: The left side of Figure 3 depicts the speed trajectories for each horse who raced at the Aqueduct Racetrack on April 19th, 2019. Typical speed curves are characterized by a steep increase at the start of the race followed by a slow decrease as the race goes on, as seen in Races 1, 2, and 6. However, Races 3, 5, and 9 deviate from this pattern and are defined instead by a plateau in speed as the race progresses. The right side of Figure 3 depicts the acceleration trajectories for each horse (which correspond to their speed trajectories). Typical acceleration curves are characterized by a spike at the start of the race followed by an immediate decrease. Volatility around 0 \(\frac{m}{s^2}\) ensues as horses jockey for position. We hypothesize that wide fluctuations in velocity and acceleration may be associated with increased injury risk.

Figure 4: Lateral Movement Trajectories from the Aqueduct on April 19th, 2019

Figure 4 Key Takeaways: The left side of Figure 4 depicts the lateral movement trajectories for each horse who raced at the Aqueduct Racetrack on April 19th, 2019. Typical lateral movement trajectories sometimes include a spike at the start of the race as horses move to the inner rail, but the widest oscillations ensue in the middle and final stages of a race as horses jockey for position. The right side of Figure 4 depicts the cumulative lateral movement trajectories for each horse (which correspond to their lateral movement trajectories). These curves are strictly increasing, and though they tend to move together, some horses show significant deviation in cumulative lateral movement accumulated throughout a race, which we hypothesize may be associated with increased injury risk.

Figure 5: Strain Trajectories from the Aqueduct on April 19th, 2019

Figure 5 Key Takeaways: Figure 5 illustrates the average strain rate trajectories for each horse who raced at the Aqueduct Racetrack on April 19th, 2019. Typical strain trajectories include a spike at the start of the race as horses move to the inner rail, rapidly decreasing the distance between one another. As the race goes on, strain rate oscillates around zero, as the shape of the pack does not change over the middle portion of the race. Interestingly, we see some patterns begin to emerge: Race 4 depicts one horse’s strain rate increasing for a brief moment while the other horses’ strain rates decrease; this horse is possibly making a move to get into the lead. Conversely, Race 3 depicts a slow shift by all the horses in the race from little-to-no strain to positive strain during the last 25-30 seconds of the race. This could indicate horses jockeying for position during a sprint to the end; the less distance between horses, the more likely it is for one horse to bump into another, potentially increasing the risk of injury.


Methods

Identifying Horses who Under-Raced

Given that only 122 of 4,589 horses (2.66%) had reported severe injuries, considering severely injured horses alone would be unlikely to provide enough injury signal to discover movement trends in injured horses. To generate more injury signal, we posited that horses who under-raced relative to other horses of their age in a calendar year had more subtle injuries. We could analyze only 931 unique horses for under-racing, because we restricted the analysis to horses who raced in 2019 and the start list data for such horses was limited. Therefore, only about 20% of the tracked horses could be evaluated for under-racing.

We chose to control for horse age (in years) and the calendar year because the expected number of races for a 2-year-old horse would be very different from the expected race count for a 10-year-old horse in the same calendar year. Similarly, we would expect the race count for a horse that is 4 years old in 2019 to differ from that of a horse who is 4 years old in 2020, given the COVID-19 pandemic. Therefore, we developed a model of expected race counts based on the horse’s age (in years) and the calendar year (ranging from 2019 to 2021) to expand our set of injured horses.

Given we wanted to find an expected count, we considered using both Poisson and Negative Binomial models. Due to the presence of overdispersion in our data, we ultimately opted to use the Negative Binomial model to determine the expected number of races for each horse given their age and the calendar year (Negative Binomial Regression). Per Figure 1, we treated horse’s age as continuous and the calendar year as a factor.
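A quick way to see why overdispersion matters: a Poisson model assumes the variance of the counts equals their mean, so a variance-to-mean ratio well above 1 points toward a Negative Binomial model. The sketch below is only a marginal check on made-up race counts; it is an assumption about how such a check could look, not the diagnostic actually used in this analysis.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    return variance(counts) / mean(counts)


# Hypothetical race counts for nine horses in one calendar year.
race_counts = [1, 1, 2, 2, 3, 5, 8, 12, 14]
ratio = dispersion_ratio(race_counts)
overdispersed = ratio > 1.0  # a ratio far above 1 suggests overdispersion
```

A model-based version of this check would compare the Pearson chi-square statistic of a fitted Poisson regression to its residual degrees of freedom.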

Hence, we fit the following Negative Binomial model:

\[\log(\mathbb{E}[\text{Race Count}]) = \beta_0 + \beta_1(\text{Age in Years}) + \beta_2(\text{Race Year} = 2020) + \beta_3(\text{Race Year} = 2021)\]

Then, we calculated the standardized residual for each horse.

The standardized residual for the \(i^{th}\) observation is defined as follows:

\[i^{th}\text{ Standardized Residual} = \frac{y_i - \hat{y}_i}{SE}\]

where SE is the standard error of the residuals, \(y_i\) is the actual number of races for the \(i^{th}\) observation, and \(\hat{y}_i\) is the predicted number of races for the \(i^{th}\) observation (Brannick).

Finally, we classified a horse as under-racing if their standardized residual fell below -1 for any combination of their age and the calendar year that appeared in the data set.
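The classification rule can be sketched as follows. The coefficients and the horse records are hypothetical placeholders (they are not the fitted values from our model), and SE is taken here to be the sample standard deviation of the raw residuals, which is one reading of the definition above.

```python
import math
from statistics import stdev


def expected_races(age, year, coef):
    """Expected race count under the log link: exp(b0 + b1*age + year effect).

    2019 is the reference year, so it contributes no year effect.
    """
    eta = coef["intercept"] + coef["age"] * age + coef["year"].get(year, 0.0)
    return math.exp(eta)


def standardized_residuals(observed, fitted):
    """(y_i - yhat_i) / SE, with SE assumed to be the std. dev. of the residuals."""
    raw = [y - f for y, f in zip(observed, fitted)]
    se = stdev(raw)
    return [r / se for r in raw]


# Hypothetical coefficients, NOT the fitted values from the report.
coef = {"intercept": 1.0, "age": 0.1, "year": {2020: -0.3, 2021: -0.1}}

# Hypothetical (age, calendar year, observed race count) records.
rows = [(3, 2019, 4), (3, 2020, 3), (5, 2021, 4), (4, 2019, 1), (6, 2020, 5)]

fitted = [expected_races(age, year, coef) for age, year, _ in rows]
resid = standardized_residuals([n for _, _, n in rows], fitted)
under_raced = [z < -1 for z in resid]  # True where the horse raced far less than expected
```

Because a horse can appear once per calendar year, a horse-level flag would be true if any of that horse's age and year records falls below -1.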

Motivation and Implementation of STRAIN

In materials science, strain measures the deformation of a material under stress as the change in its length relative to its original length (Callister & Rethwisch, 2018). Formally, let \(D(t)\) be the distance between any two points of interest within a material at time \(t\) and \(D_0\) be the initial distance between those same points. The strain for a given material at time \(t\) is defined as

\[\epsilon(t)=\frac{D(t)-D_0}{D_0}.\]

This metric is a ratio of two quantities of the same unit and is therefore unitless.

Accordingly, strain rate measures the change in the deformation of a material with respect to time. Mathematically, the strain rate of a material is expressed as the derivative of its strain, that is:

\[\epsilon'(t)=\frac{d \epsilon}{dt}=\frac{v(t)}{D_0},\]

where \(v(t)\) is the velocity at which two points of interest within a material are moving away from or towards each other. Unlike strain, this metric is measured in units of inverse time, typically inverse seconds.

Motivated by its scientific definition and previous work analyzing NFL pass rushing (Nguyen et al., 2023), we draw an analogy between strain rate and horse racing. Just as strain rate is a measure of deformation in materials science, horses in a race apply deformation to one another in an effort to break away from the pack and win the race. Each horse can be viewed as a “particle”, with strain rate calculated pairwise between each horse.

In order to apply this metric to horse racing, we draw inspiration from previous work analyzing NFL pass rushing (Nguyen et al., 2023) and introduce slight modifications to how the concept is usually defined. Let \((x_{ijt}, y_{ijt})\) be the \((x, y)\) location on the track of horse \(j=1,2,...,J_i\) at frame \(t=1,2,...,T_i\) for race \(i=1,2,...,n\).

  • The distance between any two distinct horses \(j\) and \(j^*\) at frame \(t\) during race \(i\) is

\[s_{ijj^*}(t)=\sqrt{(x_{ijt}-x_{ij^*t})^2+(y_{ijt}-y_{ij^*t})^2}.\]

  • The velocity at which two distinct horses \(j\) and \(j^*\) are moving away from or towards each other at frame \(t\) during race \(i\) (negative when the distance between them is shrinking) is

\[v_{ijj^*}(t)=s'_{ijj^*}(t)=\frac{ds_{ijj^*}(t)}{dt}.\]

  • The STRAIN between two distinct horses \(j\) and \(j^*\) at frame \(t\) during race \(i\) is

\[STRAIN_{ijj^*}(t)=\frac{-v_{ijj^*}(t)}{s_{ijj^*}(t)}.\]

To distinguish this metric from strain and strain rate in materials science, we write it in capital letters (STRAIN) for the remainder of this report.

Similarly to previous work on pass rushing analysis, we negate the velocity \(v_{ijj^*}(t)\) between horses to ensure STRAIN increases as the distance between horses decreases. Additionally, in place of a fixed initial distance \(D_0\), we use the current distance \(s_{ijj^*}(t)\) between horses, which is updated frame-by-frame, enabling the calculation of STRAIN for each frame throughout a race.

Since distance and velocity are observed discretely at 4 frames per second (one frame every 0.25 seconds), a point estimate for our STRAIN metric between distinct horses \(j\) and \(j^*\) at frame \(t\) during race \(i\) is

\[\widehat{STRAIN_{ijj^*}(t)}=\frac{-\frac{s_{ijj^*}(t)-s_{ijj^*}(t-1)}{0.25}}{s_{ijj^*}(t)}.\]

This measure increases in two ways: 1) the rate at which two horses are moving towards each other increases, and 2) the distance between two horses decreases. Both of these are indicators that a horse is moving towards the pack and running under higher stress.
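As a worked example with hypothetical values (chosen for illustration, not taken from the data), suppose two horses are 10 m apart at frame \(t-1\) and 9 m apart at frame \(t\). Then

\[\widehat{STRAIN_{ijj^*}(t)}=\frac{-\frac{9-10}{0.25}}{9}=\frac{4}{9}\approx 0.44\ \text{s}^{-1}.\]

With the same closing speed of 4 m/s but the horses only 3 m apart at frame \(t-1\) and 2 m apart at frame \(t\), the estimate becomes

\[\widehat{STRAIN_{ijj^*}(t)}=\frac{-\frac{2-3}{0.25}}{2}=\frac{4}{2}=2\ \text{s}^{-1},\]

which illustrates that STRAIN grows as the gap between horses shrinks, even when the closing speed is unchanged.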

Additionally, since we observe STRAIN between every pair of horses in a race, we can compute the total STRAIN for a horse at each frame. Let \(Z_i\) represent the set of opposing horses in race \(i\). Formally, the total STRAIN for a horse \(j\) at frame \(t\) during race \(i\) is

\[STRAIN_{ij}(t) = \sum_{j^* \in Z_i} \widehat{STRAIN_{ijj^*}(t)}.\]

Additionally, to control for the number of horses in a race, we also consider the average total STRAIN of a horse, denoted by \(\overline{STRAIN}\). Let \(Z_i\) represent the set of opposing horses in race \(i\). For a horse \(j\) at frame \(t\) during race \(i\), average total STRAIN is

\[\overline{STRAIN_{ij}}(t) = \frac{1}{|Z_i|}\sum_{j^* \in Z_i} \widehat{STRAIN_{ijj^*}(t)}.\]

Both of these metrics are useful for evaluating the magnitude of STRAIN experienced by horses in a race and provide novel features for predicting future injury.
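The definitions above can be sketched in a few lines. This is a minimal illustration under stated assumptions: positions are already in Cartesian meters, the set of opposing horses is taken to be every other horse in the race, and the function names and coordinates are hypothetical rather than drawn from our code or data.

```python
import math

FRAME_SECONDS = 0.25  # time between consecutive frames


def distance(p, q):
    """Euclidean distance between two (x, y) positions in meters."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def pairwise_strain(prev_j, cur_j, prev_k, cur_k, dt=FRAME_SECONDS):
    """Point estimate of STRAIN between two horses at the current frame.

    Negates the finite-difference rate of change of distance and divides
    by the current distance. Undefined if the two positions coincide.
    """
    s_prev = distance(prev_j, prev_k)
    s_cur = distance(cur_j, cur_k)
    return -((s_cur - s_prev) / dt) / s_cur


def total_and_average_strain(prev, cur, horse):
    """Total and average STRAIN for one horse against all other horses.

    prev and cur map horse id -> (x, y) at frames t-1 and t.
    """
    opponents = [h for h in cur if h != horse]
    values = [pairwise_strain(prev[horse], cur[horse], prev[h], cur[h])
              for h in opponents]
    total = sum(values)
    return total, total / len(opponents)


# Hypothetical three-horse example: A closes on B while running level with C.
prev = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 5.0)}
cur = {"A": (1.0, 0.0), "B": (10.0, 0.0), "C": (1.0, 5.0)}
total, avg = total_and_average_strain(prev, cur, "A")
```

Here horse A contributes positive STRAIN with respect to B (the gap shrinks from 10 m to 9 m) and zero with respect to C (the gap stays at 5 m), so the average is half the total.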

Principal Component Analysis

Principal component analysis (PCA) is an unsupervised learning method used for dimensionality reduction (Timmerman, 2003). It transforms the data to a new set of variables, referred to as principal components (PCs), which are uncorrelated with one another and ordered by the amount of variance they explain in the original data set (Shlens, 2014). In this context, we did not apply PCA directly to the NYRA tracking data. We first computed numerous summary statistics for features in the data, such as median horse speed, average horse acceleration, maximum strain rate, and range of lateral movement. We then performed PCA on the set of summary statistics for each feature separately. Using a 90% threshold for the cumulative proportion of variance explained, we retained the top variable from each corresponding principal component.

| Feature | Number of Summary Statistics (as determined from PCA) |
| --- | --- |
| Speed | 4 |
| Acceleration | 4 |
| Lateral Movement | 4 |
| Strain Rate | 5 |
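One way to read the selection rule described above is: keep the smallest number of principal components whose cumulative variance explained reaches 90%, then take the variable with the largest absolute loading on each retained component. The sketch below implements that reading with NumPy on simulated data; the function name, the column names, and the data are all hypothetical, and the exact rule we applied may differ in detail.

```python
import numpy as np


def select_summary_stats(X, names, threshold=0.90):
    """Pick one variable per retained PC (largest absolute loading).

    X is an (n_horses, n_stats) array; names labels its columns.
    """
    # Standardize columns so PCA operates on the correlation structure.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()
    # Smallest number of components whose cumulative share reaches the threshold.
    k = min(int(np.searchsorted(cum, threshold)) + 1, len(names))
    chosen = []
    for pc in range(k):
        idx = int(np.argmax(np.abs(eigvecs[:, pc])))
        if names[idx] not in chosen:             # avoid picking a variable twice
            chosen.append(names[idx])
    return chosen, cum[:k]


# Simulated speed summaries for 200 hypothetical horses.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=200),    # nearly redundant with column 0
    base[:, 1],
    base[:, 2],
])
names = ["median_speed", "mean_speed", "min_speed", "cv_speed"]
chosen, cum = select_summary_stats(X, names)
```

Running the selection separately on the speed, acceleration, lateral movement, and strain rate summaries mirrors the per-feature approach described in the text.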

Clustering Horse Trajectories

We utilized a Gaussian Mixture Model (GMM) to categorize the 4,588 horses into different clusters based upon their values for the relevant summary statistics (note that one horse was omitted from the analysis, as it has no information available for its races in 2019). A Gaussian Mixture Model is a probabilistic model that assumes the data is generated from a mixture of Gaussian distributions. Its advantages include flexibility (employing Bayesian information criterion, a penalized likelihood measure, to determine the number of clusters), soft clustering (probabilistic approach, as opposed to deterministic), and data generation (Fraley & Raftery, 2002).

| Cluster Number | Number of Horses |
| --- | --- |
| 1 | 1210 |
| 2 | 1011 |
| 3 | 880 |
| 4 | 490 |
| 5 | 138 |
| 6 | 759 |
| 7 | 85 |
| 8 | 15 |
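To make the idea of soft clustering concrete, the sketch below computes the posterior membership probabilities for a single observation under a one-dimensional, two-component Gaussian mixture, along with a BIC helper. The mixture parameters are invented for illustration; the model in this report was fit to several summary statistics at once, so this is a simplified stand-in rather than our fitted model.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, mus, sigmas):
    """Posterior probability that x was generated by each mixture component."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'lower is better' form: -2 log L + p log n.

    Some references use the opposite sign, in which case larger is better.
    """
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical mixture over the coefficient of variation of speed:
# a large low-variation component and a small high-variation component.
weights, mus, sigmas = [0.95, 0.05], [0.12, 0.40], [0.03, 0.08]
post = responsibilities(0.44, weights, mus, sigmas)
```

In this made-up example, a horse with a speed coefficient of variation of 0.44 is assigned almost entirely to the small high-variation component, even though that component has a prior weight of only 5%.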

Results

Horses who Under-Raced

Figure 6 shows that all the coefficients in our fitted model have a p-value of approximately 0, meaning that age (in years) and the calendar year (compared to 2019) are significant predictors of expected race counts for a horse. One caveat is that our standardized residuals appear right-skewed rather than symmetric around zero (Figure 6). However, the distribution of standardized residuals still has a mean of 0 and a variance of approximately 1, with 95.5% of the standardized residuals falling within two standard deviations of the mean. Therefore, we consider our Negative Binomial model appropriate for predicting the expected race count for each combination of age and calendar year in the data set.

Figure 6: Evaluation Visualizations for Expected Race Count Model

Of the 931 horses we were able to analyze for under-racing given their age and the calendar year, we found that 251 unique horses under-raced between 2019 and 2021. Of those 251 horses, 75 under-raced more than once. Only one horse who under-raced had a severe injury reported in 2019, and the year they under-raced was 2021. No horses with a standardized residual less than -1 had a fatal injury in 2019 (Figure 7).

Figure 7: Standardized Residuals versus Fitted Values for Race Count given the Horse’s Age and the Calendar Year

Examining the horses with severe injuries in 2019 further, we found that horses with severe injuries were more likely to race more than their expected count in 2019 (Figure 8). In contrast, in 2020 and 2021, these horses were more likely to race less than expected but not significantly so, given only one severely injured horse had a standardized residual less than -1 during 2020 and 2021 (Figures 7 and 8). These results give us some evidence that horses who become severely injured are actually more likely to over-race than under-race relative to other horses their age in a given calendar year.

Figure 8: Comparison Between Racing More or Less than Expected for Horses who were Severely Injured in 2019

Cluster Membership for Horses who Under-Raced or had a Reported Injury

Figure 9 below shows the proportion of horses in each cluster who suffered an injury for each of speed, acceleration, strain rate, and lateral movement. We found that the 8th cluster of the speed movement profile has a proportion of injured horses that is close to 50%. Considering there are only 15 horses in this cluster (less than one percent of all horses in the data), there is evidence that Cluster 8 represents a potentially harmful speed profile. Similar conclusions can be drawn for Cluster 8 of the acceleration profiles: a sizable proportion of injuries (approximately 25%) with a small sample size (27 horses). However, unlike the aforementioned clusters, Cluster 5 of the speed profiles contains a larger share of horses (138 horses, or 3% of all horses in the data) and a relatively high proportion of injuries (> 10%).

Figure 9: Clustering of Horse Movement Profiles, Colored by Injury

Upon further investigation, it’s clear that Speed Cluster 8 displays greater variation in horse speed than other clusters (Figure 9). Specifically, Cluster 8 has the greatest range of average speed and the greatest coefficient of variation of speed. Furthermore, Cluster 8 has the lowest average median speed and the lowest average minimum speed, which could signal horses who are fatigued or hurt prior to suffering serious injuries.

| Cluster | Average Median Speed (m/s) | Average Min. Speed (m/s) | Average Max. Speed (m/s) | Average CV Speed | Average Race Distance (m) | Average Weight Carried |
| --- | --- | --- | --- | --- | --- | --- |
| 8 | 16.10 | 1.24 | 21.13 | 0.44 | 1419.43 | 120.92 |
| 5 | 16.32 | 4.39 | 19.33 | 0.17 | 1345.17 | 120.22 |
| 3 | 16.82 | 2.70 | 19.49 | 0.13 | 1409.98 | 120.10 |
| 1 | 17.48 | 3.91 | 19.84 | 0.12 | 1304.27 | 120.11 |
| 7 | 17.10 | 4.29 | 22.16 | 0.12 | 1434.67 | 119.92 |
| 6 | 16.67 | 6.42 | 19.24 | 0.11 | 1386.32 | 119.64 |
| 2 | 17.10 | 3.67 | 18.93 | 0.10 | 1695.62 | 120.70 |
| 4 | 17.39 | 7.56 | 18.98 | 0.09 | 1567.78 | 120.47 |
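For reference, the per-horse speed summaries behind a table like this can be computed directly from a speed trajectory. The sketch below uses two made-up trajectories and assumes the coefficient of variation is the sample standard deviation divided by the mean; whether a sample or population standard deviation was used in our summaries is not specified here.

```python
from statistics import mean, median, stdev


def speed_summary(speeds):
    """Median, minimum, maximum, and coefficient of variation of speed (m/s)."""
    return {
        "median": median(speeds),
        "min": min(speeds),
        "max": max(speeds),
        "cv": stdev(speeds) / mean(speeds),  # assumed definition: sd / mean
    }


# Hypothetical per-frame speeds (m/s) for two horses.
steady = [16.0, 17.0, 17.5, 17.0, 16.5]
erratic = [2.0, 18.0, 21.0, 9.0, 17.0]

steady_stats = speed_summary(steady)
erratic_stats = speed_summary(erratic)
```

The erratic trajectory has both a lower minimum and a much larger coefficient of variation, which is the same qualitative pattern that separates Cluster 8 from the other speed clusters.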

Figure 10 displays the proportion of horses in each cluster who raced less than expected in 2019, 2020, or 2021 for each of the speed, acceleration, strain rate, and lateral movement profiles. The relative proportion of under-racing horses (our proxy for more subtle injuries) is consistent both within and between clusters, suggesting that these trajectories are not indicative of potentially harmful profiles. It should be noted that some clusters have a proportion that is approximately zero. This is likely due to those clusters having a higher proportion of injured horses, rather than to age, which the expected race count model already accounts for. Horses who suffer an injury are often kept out of races for an extended period of time, so they are racing as expected, given that they suffered an injury.

Figure 10: Clustering of Horse Movement Profiles, Colored by Under-Racing


Conclusion

In our project, we used horse profile data and horse tracking data from the NYRA to identify horses who under-raced between 2019 and 2021, cluster movement profiles for horses who raced in 2019 New York races, and discover whether certain profiles were more associated with injured horses.

By fitting a negative binomial model on horse profile data and performing residual analysis, we discovered that at least 251 horses under-raced between 2019 and 2021. Additionally, clustering horse movement profiles revealed that a horse’s speed profile is most associated with its injury status: specifically, greater variation in speed is more associated with injury.

We hope that our work on the movement patterns that result in horses being susceptible to injury helps our external partners at the NYRA better meet their goal of optimizing race horse welfare.

Limitations

During this project, our main limitations came from data availability. In particular, calculating some of the frame-specific metrics from the tracking data required immense computational power. One such example is lateral movement, which we could only obtain for 1600-meter races because Brendan Kumagai graciously shared his processed data with us. Therefore, we could only analyze lateral movement for horses who ran a 1600-meter race during 2019, reducing our data set to 1678 horses, compared to the data set of 4638 horses we had for strain and speed.

Along similar lines, we could only analyze 931 out of the 4638 horses for under-racing because the NYRA Start Data was limited. Hence, we likely underestimate the true number of horses who under-raced between 2019 and 2021, which in turn limited our ability to identify injury trends in our movement profile clusters.

Future Work

Below are some ideas for future work based on our project:

  • Collecting more starts information to close the gap between the number of horses tracked in 2019 and the number of horses analyzed for under-racing to generate more injury signal.

  • Computing lateral movement for all distances, rather than just 1600m races, to provide more information about lateral movement and its relationship to injury risk.

  • Using a multilevel model to incorporate the distribution of horses as a random effect to capture some of the variation in the data without losing information.


Acknowledgements

We thank Dr. Ron Yurko and Quang Nguyen for all their guidance and encouragement during this project. Additionally, we are thankful to our external partners at NYRA, particularly Joe Appelbaum and Davis Klein, for offering us domain-specific advice. We are also thankful to Brendan Kumagai and his Big Data Derby 2022 team for sharing their data cleaning code and processed data set. Finally, we would like to express our gratitude to Meg Ellingwood, Shamindra Shrotriya, and the rest of SURE 2023 for the opportunity to conduct sports analytics research this summer.


References

Addison Howard, J. A., inversion. (2022). Big data derby 2022. Kaggle. https://kaggle.com/competitions/big-data-derby-2022
Brannick, Michael. Multiple Regression; Research Methods. http://faculty.cas.usf.edu/mbrannick/regression/Part3/DiagnosticNarrative.html.
Callister, W. D. & Rethwisch, D. G. (2018). Materials science and engineering: An introduction (10th ed.). John Wiley & Sons, Inc.
Fobar, R. (2023). Why horse racing is so dangerous. National Geographic. https://www.nationalgeographic.com/animals/article/horse-racing-risks-deaths-sport
Fraley, C. & Raftery, A. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97, 611–631. https://doi.org/10.2307/3085676
Negative Binomial Regression. https://stats.oarc.ucla.edu/stata/dae/negative-binomial-regression/#:~:text=The%20form%20of%20the%20model,3)%20%2B%20b3math.
Nguyen, Q., Yurko, R. & Matthews, G. J. (2023). Here comes the STRAIN: Analyzing defensive pass rush in American football with player tracking data. https://arxiv.org/abs/2305.10262
NYRA safety. (2020). New York Racing Association (NYRA). https://www.nyrainc.com/about/nyra-safety
Romero, D. (2023). Churchill Downs, home of Kentucky Derby, suspends racing after 12 horses die. CNBC. https://www.cnbc.com/2023/06/02/churchill-downs-home-of-kentucky-derby-suspends-racing-after-12-horses-die.html
Shlens, J. (2014). A tutorial on principal component analysis. Educational, 51.
Stokes, T., Kroetch, K., Bagga, G., Welsh, L. & Kumagai, B. (2022). Bayesian velocity models for dynamic horse racing valuation. In GitHub repository. github.com/brenkumi/canadian-pharoah; GitHub.
Timmerman, M. (2003). Principal component analysis (2nd ed.). I. T. Jolliffe. Journal of the American Statistical Association, 98, 1082–1083. https://doi.org/10.2307/30045356

  1. Pomona College

  2. University of Florida

  3. North Carolina State University