
Analysis of Ultra-Marathon Runners and Races

Srinivasan Sathiamurthy, Rebecca Derham, Nihaar Gupta, Cameron Casey

Introduction

We chose to analyze the race and ultra_rankings data, a large dataset describing ultra-long-distance running events (generally 150-175 km) and the performances of runners in those races between 2012 and 2021. The data is split into a race dataset and a rank dataset. The race dataset describes the races themselves: the race id, event name, race name, the city and country in which the race took place, the date of the event, the start time in GMT, the type of event (solo, relay, or team), the distance in kilometers, the total elevation gain and loss, the number of aid stations, and the number of participants. The rank dataset describes runners’ performances, which can be matched to races through the race id, and also includes each runner’s name, age, gender, nationality, rank in the corresponding race, and race time both in an HH/MM/SS format and in seconds. To simplify the geographic features, we mapped runners’ nationalities and race countries to continental regions. In addition, we kept only solo races, since we wanted to study performance at the individual level, and only races of at least 150 km, since just 81 of the 1207 races in the dataset were shorter than 150 km and these outlying distances were of vastly different magnitudes from the rest.

When looking at these data sets, we were especially curious about which factors might affect runners’ times. It is well known that in many other sports athletes tend to perform better in their home stadiums, so we began our analysis by asking whether ultra-marathon runners tend to perform better in races in their local geographic region. After analyzing the structure of our data set, we looked into which features of race conditions, such as the season of the race, elevation gain/loss, number of participants, distance, and number of aid stations, influenced runners’ speed (measured in km per hour). Our third question looked at which characteristics of the runners themselves, such as gender, age, and nationality, influenced their performance (in terms of both speed and rank). Although this completed our main analysis, we observed along the way that runners’ speed was distributed bimodally, so as a final question we used clustering methods to determine whether there is underlying structure in our data that corresponds to groups of runners with different speeds.

1. Do athletes tend to perform better in races in their home country/region?

When answering this question, one should consider the nature of these ultra-distance races. Because race conditions (such as distance, weather, and controlled environments) are not as standardized as they are in traditional shorter races, the essence of the competition is more a contest between runners than a race against the clock. Additionally, since the distances vary but are of similar magnitude, average speed is a better measure of runners’ performance across race conditions and distances than overall time; we anticipate that a runner’s strategy doesn’t differ too drastically between 150 km and 175 km races, as nobody can sprint 150 km. As such, we are presented with two promising metrics for judging runners’ performance: their rank, and their average speed in km per hour. To determine which metric is more useful, it is informative to consider the underlying distribution of runners’ regions, as the plot below shows:
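As a minimal sketch of the speed metric described above, the following Python snippet converts a distance in kilometers and a finish time in seconds into an average speed in km per hour. The function name and the two runners are hypothetical and are not taken from the dataset.

```python
def average_speed_kmh(distance_km: float, time_seconds: float) -> float:
    """Average speed in km/h given a distance in km and a finish time in seconds."""
    if time_seconds <= 0:
        raise ValueError("time_seconds must be positive")
    return distance_km / (time_seconds / 3600.0)


# Hypothetical runners: different distances and finish times, same pace.
runner_a = average_speed_kmh(150.0, 30 * 3600)  # 150 km in 30 hours
runner_b = average_speed_kmh(175.0, 35 * 3600)  # 175 km in 35 hours
print(runner_a, runner_b)
```

Both hypothetical runners come out at the same speed even though their finish times differ by five hours, which is why speed is easier to compare across races of different lengths than raw time.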

It is worth noting that runners’ nationalities are distributed quite unevenly across the regions: the majority of runners hail from Europe and Central Asia (a closer look at the data set reveals that most of these are in fact from Europe alone), the next largest group comes from North America (mostly the United States), followed by East Asia and the Pacific. The other four regions are vastly underrepresented. As such, we opted to measure runners’ performance by average speed, since the disproportionate representation of regions would skew most reasonably interpretable metrics based on rank. For instance, in a race of 2900 individuals, if 10 of them were French and the rest were Americans, and the French occupied ranks 21-30, then the mean rank of the Americans would be far worse (numerically much larger) than that of the French, even though Americans occupied the top 20 spots. With that in mind, here is a graph depicting runners’ mean speeds in races in each of the seven regions, grouped by the region of the runners’ nationalities:
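The rank example above can be checked with a few lines of Python. This sketch takes the ten French runners to hold the ten places directly behind the top twenty Americans (ranks 21 through 30); the numbers are the illustrative ones from the text, not values from the dataset.

```python
from statistics import mean

n_runners = 2900
french_ranks = list(range(21, 31))  # ten runners in places 21..30
french_set = set(french_ranks)
american_ranks = [r for r in range(1, n_runners + 1) if r not in french_set]

french_mean = mean(french_ranks)
american_mean = mean(american_ranks)

# A larger mean rank is a worse average placing.
print(f"French mean rank:   {french_mean:.1f}")
print(f"American mean rank: {american_mean:.1f}")
```

The American mean rank lands near the middle of the field simply because Americans make up almost the entire field, which is the skew that motivates using speed instead of rank.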

At first glance, there is no clear home bias in performance. In fact, the only region for which a home bias could plausibly hold is the Middle East and North Africa, perhaps because runners from that region are used to training in extreme heat that runners from other regions rarely face. Another interesting observation is that runners from Sub-Saharan Africa have much higher mean speeds than runners from any other region. Similar reasoning might apply here: in these races there were apparently no competitors from the Middle East and North Africa, which may leave Sub-Saharan African runners better suited to the climates of South Asia than runners from any other region. However, we must take these conclusions with a grain of salt, as there are many confounding variables we have not considered in this graph, such as whether runners from certain regions are simply slower than those from other regions, whether only the best runners tend to travel to races outside their own regions, and whether local runners are disproportionately represented in the races with more favorable conditions. In addition, there are so many bars here, and so many comparisons we could make in search of an interesting result, that we are essentially exposed to the multiple testing problem, and these “interesting” conclusions may just be due to chance.
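To make the multiple testing concern concrete, the sketch below computes the chance of at least one false positive across many comparisons, together with a Bonferroni-adjusted threshold. It assumes independent tests at a 0.05 level, and the count of 49 comparisons (one per combination of seven runner regions and seven race regions) is an illustrative assumption rather than a count taken from the plot.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


n_comparisons = 7 * 7  # assumed: one comparison per (runner region, race region) bar
fwer = familywise_error_rate(n_comparisons)
adjusted = bonferroni_alpha(n_comparisons)
print(f"family-wise error rate: {fwer:.3f}")
print(f"Bonferroni per-test alpha: {adjusted:.5f}")
```

Under these assumptions, a spurious "significant" difference somewhere among the bars is more likely than not, so any single striking bar deserves a stricter threshold before it is interpreted.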

2. Which features of the race influence runners’ speed?

We now analyze the relationship between race factors and runners’ speed. Before we do so, let us first look at the distribution of runners’ speeds over all the races in our data set, measured in kilometers per hour:

We see that most of the speeds are between 4 and 6 km per hour, with a long tail of faster runners. Two outliers were discarded in making this plot: speeds of 168 and 38 km per hour, which are not physically plausible for a runner (an amusing thought is that a vehicle of some sort got mixed in with our data). More importantly, the graph shows that the distribution of runners’ speed is bimodal, with one peak below 5 km per hour and another above 5 km per hour, which raises the question of which features, if any, might be contributing to this bimodality. We return to this in our fourth question, after we have looked at all of the features in our data set that could influence runners’ speeds. For now, we note that the distribution of speeds is concentrated enough that the residuals of linear regression models of speed on race features are plausibly close to normally distributed, which is the assumption we rely on when applying linear regression and drawing statistical conclusions from it.

We now proceed to look at the relationship between race-specific features and runners’ speed:

From the leftmost subplot, we see that the total elevation gain of a race is strongly negatively correlated with a runner’s speed, which makes sense intuitively – we all know from experience how much harder it is to run up a hill than on a flat surface. Distance has only a slight effect, which makes sense because the races in our data set all have distances of similar magnitude, so there shouldn’t be much variation in pacing. Number of aid stations per kilometer has a pretty strong positive correlation with runners’ speed, which also makes sense since hydration and fueling are crucial components for long distance running performance.

To look further at the relationship between speed and elevation gain, distance, and the number of aid stations per kilometer, we first check whether our visual observations (a negative correlation between speed and the first two variables, and a positive correlation with the third) are statistically significant. We do so by fitting a linear regression model of runners’ speed against each of these variables separately, and then examining the coefficients.

First, a linear regression model of speed against distance:

##               Estimate   Std. Error   t value Pr(>|t|)
## (Intercept) 23.7027189 0.1613014201  146.9467        0
## distance    -0.1133324 0.0009809375 -115.5347        0

Next, speed against elevation gain:

##                    Estimate   Std. Error   t value Pr(>|t|)
## (Intercept)     7.383263611 8.977695e-03  822.4008        0
## elevation_gain -0.000333449 1.188389e-06 -280.5891        0

And finally, speed against the number of aid stations per kilometer:

##                     Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept)         4.878055 0.008163199 597.56659  0.000000e+00
## aid_stations_per_km 3.157071 0.106576232  29.62266 5.243411e-192

Indeed, each of the slopes has the expected sign (negative for elevation gain and distance, positive for aid stations per kilometer) and a p-value far below 0.05; hence each slope is statistically significantly different from zero.
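For readers who want to see where the columns in the tables above come from, here is a minimal Python sketch of simple linear regression with one predictor. The function name and the five data points are hypothetical, chosen only to resemble a speed versus elevation gain relationship. The last lines recompute the reported t value for elevation gain from the table as Estimate divided by Std. Error.

```python
import math


def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x, with the slope's SE and t value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return {"intercept": intercept, "slope": slope,
            "se_slope": se_slope, "t_slope": slope / se_slope}


# Hypothetical races: elevation gain in meters, mean speed in km/h.
elevation_gain = [1000, 3000, 5000, 7000, 9000]
speed = [7.1, 6.3, 5.8, 5.0, 4.4]
fit = simple_ols(elevation_gain, speed)
print(fit)

# Consistency check on the reported table: t value = Estimate / Std. Error.
t_reported = -0.000333449 / 1.188389e-06
print(round(t_reported, 2))
```

With roughly 100,000 observations, even very small slopes produce enormous t values, so the size of the estimate (and its units) matters more for interpretation than the p-value alone.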

However, playing devil’s advocate, one might posit that elevation gain and distance are themselves heavily correlated, in which case one of the two negative correlations with speed could be redundant for the purposes of causal inference. This is plausible, since longer races may accumulate more elevation change simply because they cover more uneven terrain.

Unfortunately, there does appear to be a positive correlation between elevation gain and the length of the race in kilometers, as shown by the regression line’s positive slope. A quick check (not shown here for brevity) confirms that the slope is statistically significant. This suggests that one of these features might influence the relationship between the other and speed, or that a third factor, such as the quality of the terrain, might influence both variables’ relationships with speed. However, the points on this scatter plot (colored by speed) do show that races with shorter distances and less elevation gain tend to have faster speeds than those with longer distances and more elevation gain, so it is likely that some relationship remains.

In addition to features pertaining to race conditions, one might also consider whether the region a runner is from, or the region in which the race takes place, influences a runner’s speed.

Looking at the above chart, we see that a runner’s home region doesn’t have much association with speed. The race location does, however: races in the Middle East & North Africa or North America tend to have faster speeds than those in Latin America or East Asia & Pacific. This may be due to factors such as typical weather in each region; for example, races in North America are likely run in cooler weather than those in Latin America, which would allow runners to be faster. These observations are nonetheless prone to confounding factors such as region representation, which we discussed in more detail in our analysis of the first question.

Our thoughts about weather in each region influencing runners’ performance led us to our next question: whether there is a relationship between the season in which the event took place and runners’ speed (a crude proxy for how weather conditions might affect speed). To make the link between season and weather stronger, we encoded races in the only Southern hemisphere region (Sub-Saharan Africa) with the opposite season to the one used for the other regions.
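The season encoding described above can be sketched as follows. This assumes seasons are assigned by calendar month (December through February as winter, and so on) and that regions are labeled with strings such as "Sub-Saharan Africa"; both the month boundaries and the names used here are assumptions, since the report does not state how seasons were derived.

```python
from datetime import date

# Assumed month-based seasons for the Northern hemisphere.
NORTHERN_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}
OPPOSITE_SEASON = {"winter": "summer", "summer": "winter",
                   "spring": "fall", "fall": "spring"}
# Regions treated as Southern hemisphere, following the choice described in the text.
SOUTHERN_REGIONS = {"Sub-Saharan Africa"}


def season_for(race_date: date, region: str) -> str:
    """Season of a race, flipped for regions treated as Southern hemisphere."""
    season = NORTHERN_SEASON[race_date.month]
    if region in SOUTHERN_REGIONS:
        season = OPPOSITE_SEASON[season]
    return season


print(season_for(date(2019, 7, 15), "Europe & Central Asia"))
print(season_for(date(2019, 7, 15), "Sub-Saharan Africa"))
```

Keeping the flipped regions in a set makes it easy to revisit the hemisphere assignment later without touching the rest of the encoding.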

It appears that speeds are slightly faster in spring and winter than in summer and fall. All the boxes overlap, however, so we cannot draw strong conclusions from this graph alone. An ANOVA test nonetheless tells us that mean speed does vary significantly between seasons, as it rejects the null hypothesis that mean speed is the same in every season with a p-value of less than 2e-16:

##                Df Sum Sq Mean Sq F value Pr(>F)    
## season          3  13106    4369    2020 <2e-16 ***
## Residuals   99541 215286       2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although the ANOVA test does not tell us specifically which seasons tend to have faster runners, based on our box plot we can tentatively say that runners tend to be faster in winter and spring. This aligns with our hypothesis that cooler weather allows racers to run faster, as winter and spring tend to be the coolest seasons. However, there may be some confounding effects; for example, the best ultra-marathon runners may tend to run more races in winter and spring.
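As a sketch of what the ANOVA table summarizes, the following Python function computes the one-way F statistic as the ratio of the between-group mean square to the within-group mean square. The three small groups are hypothetical. The final lines recompute F from the rounded sums of squares and degrees of freedom in the table above, which only approximately reproduces the reported value because the table entries are rounded.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group MS divided by within-group MS."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical speeds (km/h) in three groups.
toy_f = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(toy_f)

# Recomputing F from the rounded table entries: (Sum Sq / Df) for season over residuals.
f_from_table = (13106 / 3) / (215286 / 99541)
print(round(f_from_table))
```

A significant F only says that at least one season differs; a follow-up pairwise comparison would be needed to say which seasons differ from which.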

3. Which qualities of a runner make them more likely to win races?

A simple question one could ask pertaining to this overarching question is how the proportion of runners who win races varies based on the runner’s home region.

This plot shows us that runners from South Asia are by far the most likely to win races, with over 7.5% of runners from South Asia winning their race. South Asia is followed by the Middle East and Africa, with Europe and North America producing the fewest winners in proportion to the total number of runners from these regions. This mostly aligns with our earlier observations about which regions tend to have faster runners. It also makes sense that in regions with fewer resources, such as Africa, one must be a very good runner to compete in these highly difficult events, while in Europe and North America ultra-marathons may be both more popular among non-professional athletes and more accessible, since more races take place in these two regions.

We next considered the distribution of winners vs non-winners by age and gender:

We see that the distribution of ages peaks at a much younger age for winners than for non-winners (around 40 vs 50). This effect is especially noticeable for females, as the peak of the age density curve for female winners is much higher than the peak for female non-winners. It makes sense that winners tend to be younger, as older runners face more injuries and tend to be weaker in general. A Kolmogorov-Smirnov test confirms that the age distributions of winners and non-winners (not split by gender) are indeed different, as it rejects the null hypothesis that the two samples come from the same distribution with a p-value less than 2.2e-16.

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  winner_ages and non_winner_ages
## D = 0.25236, p-value < 2.2e-16
## alternative hypothesis: two-sided
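The D value reported above is the largest vertical gap between the two empirical cumulative distribution functions. A minimal Python sketch of that statistic follows; the function name and the two small age samples are hypothetical, and the p-value computation is omitted.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the maximum gap between the two ECDFs."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    d = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at an observed value.
    for value in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, value) / len(a_sorted)
        ecdf_b = bisect_right(b_sorted, value) / len(b_sorted)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Hypothetical ages, not taken from the dataset.
toy_winner_ages = [31, 35, 38, 40, 42]
toy_non_winner_ages = [39, 44, 47, 50, 52, 55]
print(ks_statistic(toy_winner_ages, toy_non_winner_ages))
```

D ranges from 0 (identical empirical distributions) to 1 (no overlap at all), so the reported value of about 0.25 indicates a moderate shift that is highly significant mainly because of the sample size.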

Another interesting relationship to explore is how the age of winners varies across races of differing elevation gain and distance. We can examine this relationship via a heat map:

Based on the colors, it appears as if there is not a significant relationship between the age of winners and either distance or elevation gain. However, the contour lines show that the winners of shorter races with less elevation gain tend to be slightly older than the winners of the other races.

4. Are there any significant clusters of runners or races by features?

We now take our analysis a step further, and ask whether there is structure in our data that can be clustered into groups of data points with different mean speeds. In doing so, we consider distance, elevation gain, age, gender, and speed. We decided to leave out the regions of runners and races: we observed earlier in question 2 that runners’ regions do not seem to have much association with speed, and race region carries too many confounding factors to interpret cleanly. Since this is still quite a few features, we first reduced our data to two dimensions using multidimensional scaling (MDS). We then rescaled the MDS features and determined how many clusters to use when applying k-means with the following elbow plot:

We see that the elbow of the plot is roughly around k=6, so we infer that the data most naturally breaks into 6 clusters, as opposed to the two we posited earlier from the bimodal nature of runners’ speeds.
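An elbow plot of this kind charts the total within-cluster sum of squares against k. The sketch below is a small standard-library k-means written for illustration only; it is not the implementation used for the report, and the three synthetic blobs stand in for the rescaled MDS coordinates. The random initialization is seeded, which is one way to make a clustering run repeatable.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def kmeans_inertia(points, k, seed=0, n_iter=100):
    """Run Lloyd's algorithm from a seeded random start; return within-cluster SS."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Synthetic 2-D data: three blobs, generated with a fixed seed.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

inertia_by_k = {k: kmeans_inertia(points, k, seed=1) for k in range(1, 7)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 1))
```

Because a single random start can settle in a poor local optimum, practical implementations usually repeat the fit from several starts and keep the best one; the elbow is then read off as the k beyond which the within-cluster sum of squares stops dropping sharply.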

After reducing our features into the two dimensions found by MDS, and coloring points by the cluster labels found by K-means, we have the following graph:

In our plot, we see five large groups, plus a sixth far above them whose points hug a line parallel to the x-axis. An examination of our dataset reveals that the data points in this sixth cluster are exactly those with ages reported as 0 years. Thus, the sixth cluster separates out individuals with invalid age values.

We now look at which features seem to best separate the remaining five clusters, and their relationship with mean speed (by cluster):

##   clusters distance elevation_gain      age    gender    speed
## 1        1 161.7085       3067.975 41.66162 0.8850000 6.569407
## 2        2 163.4324       6699.171 27.10526 0.8818182 5.391977
## 3        3 163.2317       6469.903 45.95631 0.7889447 5.321734
## 4        4 162.1469       3552.108 57.35385 0.8333333 5.966642
## 5        5 167.5004       9960.685 43.08936 0.8579882 4.221491
## 6        6 166.7416       9821.219 57.16774 0.8250000 3.783099

Lower elevation gain is associated with higher mean speed across the clusters, which is in line with our earlier analysis. The ages, however, are quite scattered (except for the 6th cluster, which consists of invalid values to begin with). We also observe that distance and the proportion of males in each cluster remain relatively uncorrelated with speed across clusters. In summary, elevation gain appears to be the feature that most determines the structure of the clusters, as it exhibits the most variation and is best correlated with the average speed of the clusters.

Note: The clustering algorithm has a random component, so slightly different results are produced each time the HTML file is created. The results described here are those we obtained and interpreted on our first run, and they are the ones that occur most of the time.

Conclusion and Next Steps

Summary of Findings

In summary, we looked at several features of races and characteristics of runners which influence speeds and rankings. We found that the number of aid stations has the greatest association with fast runners, followed by the amount of elevation gain in the race and then, to a much lesser extent, the race’s distance. We found that the region a runner is from and the region the race occurs in do not have as much of an effect as these other factors, but that speed does vary between seasons to a statistically significant degree. In terms of which characteristics of runners are correlated with winning races, we found that winners tend to be younger than non-winners, especially in longer races with more elevation gain, and that runners from South Asia win at a very high rate (over 7.5% of them won their race) compared to runners from other regions. Lastly, we investigated whether there are any significant clusters in the data, and found that the data points were mostly grouped by elevation gain and age when we ran MDS and k-means.

Future Work

For future work, we’d like to start with the most pressing unanswered question: why exactly is the distribution of speed bimodal? To answer this, we would likely need a statistical tool that could tell us which regions of the speed distribution are best correlated with underlying features, or with misrepresentation bias in the distribution of those features, as it was difficult for us to spot any patterns visually. We could also extend our analysis in question four of which features best define the data’s clustering by fitting a random forest regression model to predict runners’ speeds and identifying the feature with the highest importance score, that is, the largest mean decrease in node impurity (for a regression forest this is typically based on residual sum of squares rather than the Gini index used for classification). Additionally, instead of primarily considering runners’ speeds, we could use rank-based statistical measures to study the influence of features on runners’ ranks as well, such as the Wilcoxon rank sum test between partitions of our data defined by our features’ values.
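As a rough illustration of the rank-based comparison mentioned above, the following Python sketch computes the Wilcoxon rank sum W for one sample and the corresponding Mann-Whitney U, using midranks for ties. The function name and the example values are hypothetical, and the sketch stops short of computing a p-value.

```python
def rank_sum_statistics(sample_x, sample_y):
    """Return (W, U) for sample_x: its rank sum in the pooled data and U = W - n(n+1)/2."""
    pooled = sorted([(v, 0) for v in sample_x] + [(v, 1) for v in sample_y])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = midrank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x = len(sample_x)
    u = w - n_x * (n_x + 1) / 2
    return w, u


# Hypothetical finishing ranks for two groups of runners.
print(rank_sum_statistics([1, 2, 3], [4, 5, 6]))
print(rank_sum_statistics([1, 2], [2, 3]))
```

U counts (with ties as halves) how many pairs have the first sample's value above the second's, so values near zero or near the product of the sample sizes indicate that one group consistently ranks ahead of the other.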