Our project examines a dataset of Olympic athlete participation from the first modern Olympics in 1896 through the 2016 Games.
This dataset consists of 2 CSV files, athlete_events.csv and noc_regions.csv. The file athlete_events.csv covers Olympic athletes from 1896 to 2016, listed by name, along with other variables describing each athlete and the event they competed in. It has 271116 rows and 15 columns, although one of the columns is just a unique ID for each row and one is the athlete's name. Each observation (row) corresponds to an event participation record. That is, each row indicates a certain athlete’s participation in a certain Olympic event at a certain Olympic Games. The variables in the file are as follows:
ID: a unique identifier for each row in the dataset
Name: the name of the athlete who participated in the event
Sex (Categorical): the gender of the athlete, M or F
Age (Numerical): the age of the athlete in years
Height (Numerical): the height of the athlete in centimeters
Weight (Numerical): the weight of the athlete in kilograms
Team (Categorical): the team the athlete was on at the time of the event
NOC (Categorical): the National Olympic Committee abbreviation of the athlete
Games (Categorical): the Olympic Games the athlete participated in, in ‘Year Season’ format
Year (Numerical): the year of participation by the athlete
Season (Categorical): the season in which the Olympic Games occurred, either Summer or Winter
City (Categorical): the city in which the Olympic Games were held
Sport (Categorical): the sport the athlete competed in
Event (Categorical): the event the athlete competed in
Medal (Categorical): what kind of medal the athlete received, if any. Either Gold, Silver, Bronze, or NA for no medal
Additionally, since we have the height and weight of each athlete, we decided to also calculate BMI so that we could compare athletes' builds on a single scale rather than by height or weight alone (it would not make much sense to compare the weights of athletes who are barely 5 feet tall with those who are over 6 feet tall across sports). We did so using the calculation \(BMI = \dfrac{weight (kg)}{[height (cm)]^2} \times 10,000\), where the factor of 10,000 converts square centimeters to square meters, and added this attribute as another variable in our data frame.
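As a worked illustration of the formula above, here is a minimal Python sketch. It is not the code used for our analysis. The function name `bmi_from_cm_kg`, the example athlete, and the choice to represent a missing height or weight as `None` are all assumptions made for the example.

```python
def bmi_from_cm_kg(height_cm, weight_kg):
    """Return BMI from height in centimeters and weight in kilograms.

    Assumption for this sketch: a missing measurement is passed as None,
    in which case BMI is undefined and None is returned.
    """
    if height_cm is None or weight_kg is None:
        return None
    # weight / height^2 with height in cm, scaled by 10,000 to get kg per m^2
    return weight_kg / (height_cm ** 2) * 10_000


# Hypothetical athlete (not taken from the dataset): 180 cm tall, 80 kg
example_bmi = bmi_from_cm_kg(180, 80)
print(round(example_bmi, 2))
```

The result is the same as dividing the weight in kilograms by the square of the height in meters, which is the more common way of writing BMI.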
The file noc_regions.csv contains a small amount of information mapping National Olympic Committee abbreviations to modern-day countries. It has 231 rows and 3 columns. Each row corresponds to a unique country, and when used in tandem with athlete_events.csv it can match each athlete to a modern-day country. The variables in the file are as follows:
NOC (Categorical): the National Olympic Committee abbreviation of a modern-day country
region (Categorical): the country for the NOC abbreviation
notes: general notes about the region, not relevant to our analysis
When looking at the data, we decided to explore research questions about which factors affect athlete BMI, whether there is a relationship between BMI and Olympic success, how Olympic success differs by country, and how Olympic participation has changed over time.
The following graph begins to address whether there are factors that affect athlete BMI. We first look at whether there is a potential correlation between Olympic athlete BMI and age. The scatterplot below contains a point for every athlete, colored by sex, along with a regression line and density contours that show where the bulk of the data lies.
Here we are analyzing BMI by age for Olympic athletes. We would expect Olympic athletes over the years to have normal, healthy BMIs of around 20. Much of the data (as seen from the contours) is centered around age 25 and BMI 21, but there is also a significant amount of variation. Some athletes even fall in the most severely obese range of the BMI scale, probably because BMI does not account for body proportions that are unusual outside elite sport, such as high muscle mass and low body fat.
There is also a generally positive trend in this data: in general, BMI goes up with age. This is interesting and raises the question of why. Are older athletes not as healthy, or have they had more time and experience to accumulate muscle mass? It also looks like males have higher BMIs than females, especially among the outliers.
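To make the idea of a regression line of BMI on age concrete, the following Python sketch fits an ordinary least-squares line by hand. It is an illustration only, not the code behind our plot. The function name `fit_line` and the age and BMI values are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up ages (years) and BMIs, chosen only to illustrate a positive trend
ages = [18, 22, 25, 30, 35]
bmis = [20.5, 21.0, 21.8, 22.4, 23.1]

slope, intercept = fit_line(ages, bmis)
print(f"BMI changes by about {slope:.3f} per year of age")
```

A positive slope corresponds to the upward trend described above. The slope alone says nothing about why BMI rises with age.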
To look further into the factors that affect BMI, we wanted to see how BMI differs by sex. To do this, we plotted a time series line graph of the average BMI of each sex.
##
## Welch Two Sample t-test
##
## data: BMI_Male and BMI_Female
## t = 26.256, df = 60.179, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.121736 2.471667
## sample estimates:
## mean of x mean of y
## 23.51301 21.21631
We see that BMI fluctuates over time and that the average BMIs of the two sexes tend to rise and fall together. However, we also see that the average BMI for men is consistently about 2 points higher than that for women. We also ran a Welch two-sample t-test on whether average BMI across time differs between the two sexes. At \(\alpha = .05\) the difference is significant: our p-value was less than \(2.2 \times 10^{-16}\), the sample means were 23.51 for men and 21.22 for women, and the 95 percent confidence interval for the difference was 2.12 to 2.47.
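For readers who want to see what a Welch two-sample t-test computes, here is a minimal Python sketch of the test statistic and the Welch-Satterthwaite degrees of freedom. It is not the code that produced the output above, and it stops short of computing a p-value. The function name `welch_t` and the two short lists of average BMIs are made up for illustration.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's two-sample t-test with unequal variances."""
    nx, ny = len(x), len(y)
    # Squared standard error contributed by each sample (sample variance / n)
    vx = variance(x) / nx
    vy = variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Made-up average BMIs, one value per Games, for each sex
bmi_male = [23.4, 23.6, 23.5, 23.7, 23.3]
bmi_female = [21.1, 21.4, 21.0, 21.3, 21.2]

t_stat, dof = welch_t(bmi_male, bmi_female)
print(f"t = {t_stat:.3f}, df = {dof:.3f}")
```

Because the degrees of freedom come from an approximation, they are usually not a whole number, which is why the output above reports df = 60.179.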
Additionally, we see that the average BMI for men reached a maximum of about 24.5 (sometime around 1895-1900) and then dipped to a minimum of about 23.0 (around 1905-1910) over a span of about 15 years, before fluctuating around the middle of those two values. For women, on the other hand, we see a comparable swing that came later and played out over a longer time span: a maximum of about 21.75 (around 1955-1960) followed by a drop to about 20.75 (around 1985-1990).
We also wanted to analyze whether average BMI varies across countries, to further our research into factors affecting BMI. To do so, we plotted a spatial map of BMI across the world, restricted to the participating countries for which we could map the BMI.
After mapping BMI across countries to the best of our ability, we found that the majority of countries had average athlete BMIs in the 20-25 range, with a few outliers in places like Ethiopia and Kyrgyzstan. For example, the average Ethiopian BMI could be considered on the underweight side, and the average Kyrgyzstan BMI on the overweight side. This could point towards a specialization in sports that favor either a higher strength-to-weight ratio (such as running events) or a higher amount of muscle mass (such as weightlifting).
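The country-level averages described above depend on matching each athlete's NOC abbreviation to a region and then averaging BMI within each region. The following Python sketch shows one way this could look. It is an illustration rather than our analysis code: the function name `mean_bmi_by_region`, the two-entry lookup, the BMI values, and the placeholder code "XXX" are all made up.

```python
from collections import defaultdict

# Illustrative excerpt of an NOC-to-region lookup (not the full noc_regions.csv)
noc_to_region = {"ETH": "Ethiopia", "KGZ": "Kyrgyzstan"}

# Made-up (NOC, BMI) pairs; "XXX" stands in for a code with no region match
records = [("ETH", 19.0), ("ETH", 20.0), ("KGZ", 25.0), ("XXX", 22.0), ("KGZ", None)]


def mean_bmi_by_region(records, noc_to_region):
    """Average BMI per region, skipping unmatched NOCs and missing BMIs."""
    totals = defaultdict(lambda: [0.0, 0])  # region -> [sum of BMI, count]
    for noc, bmi in records:
        region = noc_to_region.get(noc)
        if region is None or bmi is None:
            continue  # cannot be placed on the map, or no BMI to average
        totals[region][0] += bmi
        totals[region][1] += 1
    return {region: total / count for region, (total, count) in totals.items()}


averages = mean_bmi_by_region(records, noc_to_region)
print(averages)
```

Rows whose NOC has no match are dropped in this sketch, which mirrors the caveat that only countries that could be mapped appear on the map.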
For our next research question, we wanted to know whether there is a relationship between BMI and Olympic success. To try to answer this, we looked at the distribution of BMI by the medal each athlete received: Gold, Silver, Bronze, or No Medal. We faceted this plot by sex because men and women usually have differing BMIs, as we have shown previously.
There is virtually no difference between the boxplots of BMI across medal types. This suggests that, in the aggregate, BMI is not a significant factor in Olympic success.
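The comparison that the boxplots make visually can also be made numerically by computing the quartiles of BMI within each medal group. The Python sketch below is an illustration only. The function name `bmi_summary_by_medal` and the BMI values are made up, and a missing medal is assumed to be represented as `None`. Plotting tools differ in how they interpolate quartiles, so the numbers may not match a given boxplot exactly.

```python
from collections import defaultdict
from statistics import quantiles

# Made-up (Medal, BMI) pairs; None means the athlete did not win a medal
rows = [
    ("Gold", 22.0), ("Gold", 23.0), ("Gold", 24.0),
    ("Silver", 22.5), ("Silver", 23.0), ("Silver", 23.5),
    (None, 21.5), (None, 22.5), (None, 24.5),
]


def bmi_summary_by_medal(rows):
    """Return the quartiles of BMI for each medal group."""
    groups = defaultdict(list)
    for medal, bmi in rows:
        label = medal if medal is not None else "No Medal"
        groups[label].append(bmi)
    summary = {}
    for label, values in groups.items():
        # n=4 splits the data into quartiles: lower quartile, median, upper quartile
        q1, q2, q3 = quantiles(values, n=4)
        summary[label] = {"q1": q1, "median": q2, "q3": q3}
    return summary


summary = bmi_summary_by_medal(rows)
for label, stats in summary.items():
    print(label, stats)
```

If the medians and quartiles are close across the groups, that is the numerical counterpart of boxplots that look nearly identical.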
Next, we looked at Olympic success and how it differs by country. This graph attempts to analyze how Olympic success has changed over time for different countries. The time series line graphs below highlight the top ten countries (by medal count) for each season over time. The individual points show where data exists.
With this graph, we wanted to see how the medal counts for different countries change over time. Since the number of events has grown over time, most of the countries show a general upward trend in medals won. It’s also clear that there are far more medals to be won at the Summer Olympic Games than at the Winter Games. The Summer Games also started much earlier than the Winter Games, which were first held in 1924.
Of course, the Olympic Games are very politically influenced events, and the effects of those politics can be seen here if we look a bit more closely. The Olympic Games in 1940 and 1944 were canceled due to WWII, and you can see there are no data points during those years. Additionally, 1980 was an interesting year because of the American boycott in which the U.S. protested the Soviet invasion of Afghanistan. The U.S. convinced some other countries not to attend, and you can see there are no data points for the U.S. in the Summer Olympics of 1980. There are plenty of other interesting cases throughout history where ongoing global events like wars (and now pandemics) show up in this medal count data by country.
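The medal counts behind a plot like this can be thought of as a tally of medal-winning rows per country, season, and year. The following Python sketch is illustrative and is not our analysis code. The function name `medal_counts` and the rows are made up, and a missing medal is assumed to be `None`. Because each row is one athlete's participation in one event, a tally like this counts a team medal once per team member.

```python
from collections import Counter

# Made-up participation rows: (NOC, Year, Season, Medal); None means no medal
rows = [
    ("USA", 1976, "Summer", "Gold"),
    ("USA", 1976, "Summer", None),
    ("USA", 1984, "Summer", "Silver"),
    ("URS", 1980, "Summer", "Gold"),
    ("URS", 1980, "Summer", "Bronze"),
]


def medal_counts(rows):
    """Count medal-winning rows per (NOC, Season, Year)."""
    counts = Counter()
    for noc, year, season, medal in rows:
        if medal is not None:
            counts[(noc, season, year)] += 1
    return counts


counts = medal_counts(rows)

# A country with no rows at a given Games has no key at all,
# which is what appears as a missing point in a line plot.
usa_1980_present = ("USA", "Summer", 1980) in counts
print(counts)
print("USA has a 1980 Summer entry:", usa_1980_present)
```

In this made-up data the missing 1980 entry for the U.S. is what the boycott looks like in tabular form: no participation rows, so no count to plot.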
We wanted to further explore whether Olympic success differs by country, so a slightly more detailed look at the levels of success was needed. To accomplish this, we plotted Olympic season, medal type, and country using a stacked barplot faceted by country. To keep the plot readable and the countries comparable, we only plotted the top ten medal-winning countries over time.
This plot makes it clear that some top countries win far more medals at Summer Olympic Games than they do at Winter Games. However, for the countries whose totals are not as far apart between the two seasons (such as Italy), there are some interesting differences in the distribution of medal types by season. The US, for example, has won a higher proportion of silver medals at Winter Games than at Summer Games, where it has won a higher proportion of gold medals. Canada, on the other hand, has won a higher proportion of gold medals at Winter Games than at Summer Games, where more of its medals have been bronze. Russia has also won a higher proportion of gold medals at Winter Games than at Summer Games, where its medal distribution is more even.
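Comparing proportions of medal types, as opposed to raw counts, amounts to dividing each medal count by the country's total for that season. The Python sketch below illustrates that step. It is not our analysis code. The function name `medal_shares` and the rows are made up and do not reflect any country's real results.

```python
from collections import Counter, defaultdict

# Made-up rows: (NOC, Season, Medal); None means no medal
rows = [
    ("CAN", "Winter", "Gold"),
    ("CAN", "Winter", "Gold"),
    ("CAN", "Winter", "Silver"),
    ("CAN", "Summer", "Bronze"),
    ("CAN", "Summer", "Bronze"),
    ("CAN", "Summer", "Gold"),
    ("CAN", "Summer", None),
]


def medal_shares(rows):
    """Share of each medal type within each (NOC, Season), ignoring non-medal rows."""
    counts = defaultdict(Counter)
    for noc, season, medal in rows:
        if medal is not None:
            counts[(noc, season)][medal] += 1
    shares = {}
    for key, by_medal in counts.items():
        total = sum(by_medal.values())
        shares[key] = {medal: n / total for medal, n in by_medal.items()}
    return shares


shares = medal_shares(rows)
for key, by_medal in shares.items():
    print(key, by_medal)
```

Working with shares makes seasons comparable even when a country wins many more medals in one season than in the other.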
This plot is definitely informative given the fact that we were really interested in medal counts for different countries. We knew that there were countries who won a lot of medals, but we wanted to learn if that was just a result of Summer Olympics, or even if those medals were mostly bronze, silver, or gold.
Finally, we wanted to explore the number of Olympic event participants for each country over time. Doing this for every country would be difficult to do in a meaningful way on one plot, so we decided to display only the 10 countries with the highest number of participants overall. We also faceted by season, as this shows a smoother and more meaningful trend of participants over time. This relates to both of our research questions on Olympic success by country and Olympic participation over time.
The Summer Olympics have much higher numbers of event participants overall than the Winter Olympics, yet both see a general increase in the number of participants over time. The only countries that see a general decrease are the US, France, and Great Britain, each of which had a peak of over 900 event participants.
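Restricting a plot to the countries with the most participants involves two steps: ranking countries by their total number of participation rows, then counting rows per year for only those countries. The Python sketch below illustrates both steps. It is not our analysis code. The function names `top_countries` and `participants_per_year` and the rows are made up, and the example keeps 2 countries rather than 10 to stay small.

```python
from collections import Counter

# Made-up participation rows: (NOC, Year), one row per athlete-event
rows = [
    ("USA", 1996), ("USA", 1996), ("USA", 2000),
    ("FRA", 1996), ("FRA", 2000),
    ("ITA", 2000),
]


def top_countries(rows, n):
    """Return the n NOCs with the most participation rows overall."""
    totals = Counter(noc for noc, _ in rows)
    return [noc for noc, _ in totals.most_common(n)]


def participants_per_year(rows, keep):
    """Count participation rows per (NOC, Year) for the NOCs in keep."""
    counts = Counter()
    for noc, year in rows:
        if noc in keep:
            counts[(noc, year)] += 1
    return counts


top = top_countries(rows, 2)
per_year = participants_per_year(rows, set(top))
print(top)
print(per_year)
```

Splitting the rows by season before counting would give one such table per facet.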
Overall, we were able to glean some very interesting results from our analysis of the Olympic data. First, we looked at the factors that affect an athlete’s BMI and learned that a few make a difference: an athlete’s age, sex, and country of representation all had a discernible relationship with their BMI, as shown in our graphs. Next, we examined whether BMI affects athlete success at the Olympics. Here we did not see meaningful differences between the BMI distributions for any category of medal, or for athletes who won no medal at all. Based on this, it looks like BMI really doesn’t matter much when it comes to winning! We then looked into how medal wins differ across countries, focusing on the top 10 medal-winning countries. Most of those countries show a general upward trend in medals won. However, we also found that their medal distributions differ across seasons, most notably for the US, Canada, and Russia. Finally, we looked at Olympic participation by country and found that the number of participants in the Olympics has generally increased over time, with the exception of a few countries. Overall, our analysis of this dataset has revealed fascinating insights that further inform our knowledge of the Olympics, the athletes involved, and their countries.