A Deep Dive Into The Tokyo Olympics in 2021

Description of the Dataset and its Variables

The “2021 Tokyo Olympics” dataset can be found at and contains information on over 11,000 athletes across 47 disciplines and 743 teams/countries. The dataset is split into subsets covering athletes, coaches, gender, medals, and competing teams. To supplement the data and add complexity, we also included a dataset of macroeconomic information for each country. Variables in the Olympics data include discipline, athlete names, coaches, athletes’ nationalities, and athlete genders. The macroeconomic data include GDP, population, GNIPC (Gross National Income per Capita), and MFLP (Male/Female Labor Participation).

Before answering our questions, we prepared the data as follows. Because country names were not spelled consistently across the datasets, we first cleaned the data by aligning every country under one name. We then joined the athlete, medals, GDP, population, GNIPC, and MFLP datasets by country into one large dataset, which allowed us to connect each athlete with their country and its economic state. We dropped the coaches sub-dataset, as we believed it was unnecessary for our overall purpose. We removed rows with missing values entirely, since the sample remained large without them. Finally, we grouped the countries into their respective continents. Ultimately, we have the following variables to aid us in answering our questions:

  • Event: Men’s/Women’s event (categorical)

  • Team (NOC) (Nationality): The nation that the athlete is playing under (categorical)

  • Gender: The gender of the sport’s event (categorical)

  • Discipline: The sport that is being played (categorical)

  • GDP: Gross Domestic Product of the country (US Dollars) (quantitative)

  • Population: Population of the country (quantitative)

  • GNIPC: Gross National Income per Capita of the country, a proxy for living standards (US Dollars) (quantitative)

  • MFLP: The number of females in the labor force for every 100 males in the labor force of the country (quantitative)

  • Continent: The continent in which the country is located (categorical)

  • Count: The number of athletes playing for the country (quantitative)

  • Total: Total count of medals gained by the country (quantitative)
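The cleaning and joining steps described above can be sketched in a few lines. This is an illustrative Python sketch, not the code used for the report (whose output is from R). The alias table, the helper names `canonical`, `normalise`, and `inner_join`, and all values are assumptions made up for the example.

```python
# Hypothetical alias table: maps spelling variants to one canonical country name.
ALIASES = {
    "United States of America": "United States",
    "USA": "United States",
    "People's Republic of China": "China",
}

def canonical(name):
    """Return the canonical country name, trimming stray whitespace."""
    name = name.strip()
    return ALIASES.get(name, name)

def normalise(table):
    """Re-key a {country: value} table by canonical country name."""
    return {canonical(k): v for k, v in table.items()}

def inner_join(**tables):
    """Keep only countries present in every table with no missing value."""
    tables = {name: normalise(t) for name, t in tables.items()}
    shared = set.intersection(*(set(t) for t in tables.values()))
    rows = {}
    for country in sorted(shared):
        row = {name: t[country] for name, t in tables.items()}
        if all(v is not None for v in row.values()):  # drop rows with missing values
            rows[country] = row
    return rows

# Toy tables standing in for athlete counts, medal totals, and GDP (made-up values).
athlete_counts = {"United States of America": 3, "People's Republic of China": 2, "Japan": 2}
medal_totals = {"USA": 5, "China": 4, "Japan": 3}
gdp = {"United States": 2.1e13, "China": 1.4e13, "Japan": None}  # None marks a missing value

merged = inner_join(Count=athlete_counts, Total=medal_totals, GDP=gdp)
for country, row in merged.items():
    print(country, row)
```

In this toy run the third country is dropped because its GDP is missing, which mirrors the row-wise removal of missing values described above.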

As a note, the Tokyo Olympics took place in 2021, while the macroeconomic data was most recently updated in 2019 or 2020. Thus, our macroeconomic information may be one to two years out of date, but this should not be a major concern, as it amounts to only a year or two of difference in the country’s overall economic state.

The Three Research Questions

As mentioned, with our Olympics dataset, we have three questions we would like to answer through data visualizations. Specifically:

  • Do countries with higher GDP perform better at the Olympics?
  • Are continents better at different disciplines?
  • Does the gender distribution of athletes sent by each country reflect its male/female labor participation rate difference?

Research Question 1: Do countries with higher GDP perform better at the Olympics?

The first research question we posed was whether countries with higher GDP perform better at the Olympics, where performance is defined by the number of medals the country won. Here, we specifically used the GDP of the country, its total medals, and its number of athletes.

From our initial scatterplot, we note that the higher a country’s GDP, the more athletes it tends to send to the 2021 Olympics. The relationship appears curvilinear rather than linear. This makes intuitive sense: the higher a country’s GDP, the more resources it can spend on training centers, coaches, and other support in order to produce stellar athletes.

We also wished to see a choropleth map of the GDP per athlete. Since a country that sends more athletes has more chances of winning medals, we found it more informative to look at GDP per athlete. We thus created a new variable: the GDP of the country divided by the number of athletes that country sent. The resulting choropleth shows that major countries like the US, China, and Russia have the greatest GDP per athlete. Parts of Africa and Asia also have a high GDP per Olympic athlete. The choropleth ultimately gives insight into which countries are putting more resources into the Olympics, as mentioned before.
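As a minimal sketch of the derived variable, the following Python snippet divides GDP by delegation size. The team labels and numbers are made up for illustration, and the helper name `gdp_per_athlete` is an assumption, not a name from the original analysis.

```python
# Toy values only; not the figures from the dataset.
countries = [
    {"Team": "A", "GDP": 2.0e12, "Count": 400},
    {"Team": "B", "GDP": 5.0e10, "Count": 5},
    {"Team": "C", "GDP": 3.0e11, "Count": 0},  # no athletes sent: ratio is undefined
]

def gdp_per_athlete(row):
    """GDP divided by delegation size, or None when no athletes were sent."""
    if row["Count"] <= 0:
        return None
    return row["GDP"] / row["Count"]

for row in countries:
    row["GDP_per_athlete"] = gdp_per_athlete(row)
    print(row["Team"], row["GDP_per_athlete"])
```

In this toy data, team B has a far smaller GDP than team A but a higher GDP per athlete, because it sends only five athletes. A ratio like this can therefore be high for small delegations as well as for large economies.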

Given this background on which countries have higher or lower GDP per athlete, we produce a scatterplot of GDP per athlete against the number of medals the country won. We also coded the number of athletes as a categorical variable to show how delegation size relates to medals, alluding to our first graph. There does not seem to be a completely discernible pattern. Note that quite a few countries won many medals despite a small GDP per athlete; these countries also sent a large number of athletes. Conversely, some countries have large GDP per athlete values but rarely win many medals, which also seems connected to the fact that they send few athletes. Ultimately, most countries win few medals, and the countries that do win many send many more athletes, although their GDP per athlete may be lower than others’.
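Coding the athlete count as a categorical variable amounts to binning it. The sketch below shows one way to do that in Python; the cut points and labels are hypothetical, since the report does not state which ones were used.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; the report does not state the ones it used.
CUTS = [10, 50, 200]
LABELS = ["fewer than 10", "10 to 49", "50 to 199", "200 or more"]

def size_band(count):
    """Map a delegation size to an ordered category label."""
    return LABELS[bisect_right(CUTS, count)]

for n in (3, 10, 49, 50, 613):
    print(n, size_band(n))
```

Each cut point is the lower bound of the next band, so a delegation of exactly 10 falls in "10 to 49".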

We recognize that countries with a higher GDP usually win more medals, but it’s important to recognize that the number of athletes matters as much as GDP. Since GDP is connected to athlete count, a higher GDP can lead to better performance, as shown by the countries’ performances at the Tokyo 2021 Olympics.

Research Question 2: Are continents better at different disciplines?

In a similar vein, we now want to explore whether disciplines are dominated by different continents; for instance, are North American/European countries better at Ice Hockey than Asian countries? To answer this question, we’ll be using the Athletes dataset, which contains each athlete’s name, nationality, competing discipline, and the continent of their nationality.

Though the process of determining who gets to compete in the Olympics is complex, part of the decision depends on the performance/world ranking of athletes. Hence, we can infer that if more athletes come from a particular continent, then that continent tends to be better at the respective discipline on average. For this research question, we’ll be looking more closely at the distribution of continent and discipline.

In our Athletes data, there were 46 disciplines in total. Due to the high number of disciplines, we chose to explore the three with the most competing athletes. Because these disciplines have the highest number of competing athletes, they allow for a more diverse set of participating countries.

The bar plot above shows that Athletics, Swimming, and Football have the highest number of competing athletes. However, since Football is a team sport, only 24 countries actually participated, whereas Rowing had 78 participating countries. So, for a more interesting result, we will explore which continents competed in Athletics, Swimming, and Rowing.
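The selection above relies on two different counts per discipline: the number of athletes and the number of distinct countries. A minimal Python sketch of both counts follows, using made-up records rather than the real Athletes data.

```python
from collections import Counter, defaultdict

# Toy athlete records as (discipline, country) pairs; not the real data.
athletes = [
    ("Football", "A"), ("Football", "A"), ("Football", "A"),
    ("Football", "B"), ("Football", "B"),
    ("Rowing", "A"), ("Rowing", "B"), ("Rowing", "C"), ("Rowing", "D"),
    ("Swimming", "A"), ("Swimming", "C"),
]

# Number of athletes per discipline.
athletes_per_discipline = Counter(discipline for discipline, _ in athletes)

# Set of distinct countries per discipline.
countries_per_discipline = defaultdict(set)
for discipline, country in athletes:
    countries_per_discipline[discipline].add(country)

# Print disciplines from most to fewest athletes, with their country counts.
for discipline, n_athletes in athletes_per_discipline.most_common():
    print(discipline, n_athletes, len(countries_per_discipline[discipline]))
```

In the toy data, Football has the most athletes but the fewest countries per athlete, which is the same reason given above for swapping it out for Rowing.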

Before describing the plot, we note that 11 athletes were not assigned to a continent: 7 competing in Athletics and 4 competing in Swimming.

In our faceted bar plot, we clearly see that athletes are not equally distributed across continents. Something interesting to note is that the three most common continents for all three disciplines are Europe, Asia, and the Americas.

In all three disciplines, European athletes were most common. For Athletics, athletes from the Americas were the next most common, followed by Asia. Similarly, for Rowing, athletes from the Americas were the next most common, followed by Asia. In Swimming the order was reversed: Asia was the second most common, followed by the Americas.

Though the difference in the distribution of competing athletes looks statistically significant, we will conduct a chi-square goodness-of-fit test for each discipline. For all three tests, our significance level is \(0.05\). The null hypothesis for each test is that athletes in the discipline are equally distributed across the five continents; the alternative hypothesis is that they are not equally distributed.

## 
##  Chi-squared test for given probabilities
## 
## data:  table(athletics$Continent)
## X-squared = 945.84, df = 4, p-value < 2.2e-16
## 
##  Chi-squared test for given probabilities
## 
## data:  table(rowing$Continent)
## X-squared = 432.43, df = 4, p-value < 2.2e-16
## 
##  Chi-squared test for given probabilities
## 
## data:  table(swimming$Continent)
## X-squared = 260.76, df = 4, p-value < 2.2e-16

As shown, all three tests had a p-value that was approximately zero (below 2.2e-16). Hence, we have sufficient evidence to reject the null hypothesis and conclude that athletes are not equally distributed across continents for Athletics, Rowing, and Swimming.
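The output header “Chi-squared test for given probabilities” indicates a goodness-of-fit test, which for a one-way table defaults to equal expected counts in every category. The statistic can be reproduced by hand, as in the following Python sketch. The counts are made up, and the function name `chi_square_uniform` is an assumption for illustration.

```python
# Made-up counts of athletes per continent for one discipline.
observed = {"Africa": 30, "Americas": 60, "Asia": 50, "Europe": 100, "Oceania": 10}

def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts."""
    total = sum(counts.values())
    expected = total / len(counts)  # same expected count in every category
    statistic = sum((obs - expected) ** 2 / expected for obs in counts.values())
    return statistic, len(counts) - 1  # degrees of freedom = categories - 1

statistic, df = chi_square_uniform(observed)

# Chi-square critical value for df = 4 at the 0.05 significance level.
CRITICAL_DF4_ALPHA_05 = 9.488
print(statistic, df, statistic > CRITICAL_DF4_ALPHA_05)
```

With five continents there are 4 degrees of freedom, matching `df = 4` in the output above. A statistic above the critical value leads to rejecting the hypothesis of equal distribution.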

To observe which continents occur more or less frequently, we provide a mosaic plot below.

For Athletics, it seems like there are significantly more athletes from the Americas and Africa and fewer athletes from Oceania than expected under independence. For Rowing, there are significantly more athletes from Europe and Oceania and fewer athletes from Asia, the Americas, and Africa than expected under independence. For Swimming, there are significantly more athletes from Asia than expected under independence.
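“Expected under independence” refers to the cell counts implied by the row and column totals of the discipline-by-continent table. Shaded mosaic plots are typically colored by the Pearson residual of each cell. The Python sketch below computes those residuals for a made-up table. The counts are not the real data, and the cutoff of 2 is a common convention assumed here, not a setting confirmed by the report.

```python
import math

continents = ["Africa", "Americas", "Asia", "Europe", "Oceania"]

# Made-up athlete counts: one row per discipline, columns follow `continents`.
table = {
    "Athletics": [40, 60, 30, 60, 10],
    "Rowing": [5, 15, 10, 60, 10],
    "Swimming": [15, 25, 40, 50, 20],
}

row_totals = {d: sum(counts) for d, counts in table.items()}
col_totals = [sum(table[d][j] for d in table) for j in range(len(continents))]
grand_total = sum(row_totals.values())

def expected_count(discipline, j):
    """Count expected in a cell if discipline and continent were independent."""
    return row_totals[discipline] * col_totals[j] / grand_total

def pearson_residual(discipline, j):
    """(observed - expected) / sqrt(expected) for one cell."""
    expected = expected_count(discipline, j)
    return (table[discipline][j] - expected) / math.sqrt(expected)

# Report cells whose residual exceeds the assumed cutoff of 2 in absolute value.
for discipline in table:
    for j, continent in enumerate(continents):
        r = pearson_residual(discipline, j)
        if abs(r) > 2:
            print(f"{discipline:<10} {continent:<9} {r:+.2f}")
```

A positive residual means more athletes than independence would predict, a negative one means fewer. This is why a continent can dominate the raw bar plot of a discipline yet show no excess in the mosaic plot: its share is large in every discipline.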

Based on the bar plots, one would interpret that there are significantly more athletes from Europe than from any other continent in all three disciplines. However, from the mosaic plot, it seems like there are significantly more European athletes than expected only in Rowing. Thus, from our visualizations, it seems that some continents do perform better than others in particular disciplines.

Research Question 3: Does the gender distribution of athletes sent by each country reflect its male/female labor participation rate difference?

The essence of this question is seeing whether a country with more ‘restrictive’ values for women would also send fewer women to the Olympics. Our immediate hypothesis is that in some of these countries, especially some more conservative and predominantly Muslim countries, women have both less access to sports/talent development and less encouragement to participate in sporting events such as the Olympics. On the other hand, more egalitarian countries, such as those in Scandinavia, might have a more equitable mix of male and female athletes in their delegations. Furthermore, we think it might be possible that some countries with traditionally matriarchal societies may actually send significantly more women than men.

An issue to consider is the existence of country inclusivity measures in some events, mostly under the “athletics” umbrella as well as swimming. These measures essentially invite 1-2 athletes from small countries into qualifying rounds for these events, even if a given country has no athlete that would qualify for the Olympics on merit of performance alone. This results in many small countries having 1-2 athletes, which produces high variance in the gender distribution. We thus work with countries with 5 or more athletes, and do a subset analysis of countries sending over 20 athletes. Although we lose many observations (206 -> 84), especially from less egalitarian countries, both sets of results are extremely similar and robust.
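The size filters described above can be sketched as follows. This is an illustrative Python example on made-up delegations. The helper names `summarise` and `subset` are assumptions, and `percent_women` is computed here as the share of women in the delegation, which is one plausible reading of the variable name in the regression tables.

```python
# Toy delegations as (country, women, men); not the real counts.
delegations = [("A", 1, 1), ("B", 0, 2), ("C", 12, 18), ("D", 3, 4), ("E", 150, 150)]

def summarise(women, men):
    """Delegation size and share of women, in percent."""
    total = women + men
    return {"total": total, "percent_women": 100 * women / total}

def subset(rows, min_athletes):
    """Keep only delegations with at least `min_athletes` athletes."""
    return {c: summarise(w, m) for c, w, m in rows if w + m >= min_athletes}

main_sample = subset(delegations, 5)    # 5 or more athletes
large_sample = subset(delegations, 21)  # over 20 athletes

print(sorted(main_sample), sorted(large_sample))
```

Countries A and B show why the filter matters: with two athletes each, the share of women can only be 0, 50, or 100 percent, so a single invited athlete swings the ratio completely.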

## 
## All Countries
## ===========================================================
##                                    Dependent variable:     
##                                ----------------------------
##                                           `2019`           
##                                     (1)            (2)     
## -----------------------------------------------------------
## percent_women                     0.420***                 
##                                   (0.085)                  
##                                                            
## gender_diff                                     0.391***   
##                                                  (0.085)   
##                                                            
## Constant                         53.987***      63.091***  
##                                   (3.895)        (2.365)   
##                                                            
## -----------------------------------------------------------
## Observations                        169            169     
## R2                                 0.127          0.113    
## Adjusted R2                        0.122          0.107    
## Residual Std. Error (df = 167)     17.697        17.840    
## F Statistic (df = 1; 167)        24.243***      21.184***  
## ===========================================================
## Note:                           *p<0.1; **p<0.05; ***p<0.01

## 
## Large Countries (N > 20)
## ==========================================================
##                                   Dependent variable:     
##                               ----------------------------
##                                          `2019`           
##                                    (1)            (2)     
## ----------------------------------------------------------
## percent_women                    0.532***                 
##                                  (0.129)                  
##                                                           
## gender_diff                                    0.373***   
##                                                 (0.122)   
##                                                           
## Constant                        48.304***      63.576***  
##                                  (6.145)        (3.476)   
##                                                           
## ----------------------------------------------------------
## Observations                        81            81      
## R2                                0.178          0.106    
## Adjusted R2                       0.167          0.095    
## Residual Std. Error (df = 79)     15.444        16.104    
## F Statistic (df = 1; 79)        17.095***      9.370***   
## ==========================================================
## Note:                          *p<0.1; **p<0.05; ***p<0.01

Here, we take World Bank data on the relative labor participation rates of males and females (abbreviated MFLP). MFLP measures how many females are in the labor force for every 100 males in the labor force of a given country. We conjecture that countries which are more egalitarian in their labor force will also send a more gender-balanced team of athletes, while countries in which men are the predominant workers will on average send a team with relatively more men. We see this is the case, with a clear positive association between MFLP and the number of women athletes relative to male athletes/total athletes (slope on gender_diff: \(\hat{\beta} = 0.39^{***}\)). Small countries sending only a few athletes have much higher variance in these ratios; after accounting for this by considering only the subset of countries which sent more than 20 athletes, we still have nearly identical results (slope on gender_diff: \(\hat{\beta} = 0.37^{***}\)).
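Each column of the regression tables above is a simple linear regression with one predictor, so the reported coefficient is a slope, listed separately from the Constant (intercept) row. The Python sketch below fits such a line by ordinary least squares. The data are made up, and the function name `ols` is an assumption for illustration.

```python
def ols(x, y):
    """Simple least-squares fit of y = intercept + slope * x; returns R^2 too."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Made-up values: share of women in the delegation (x) and MFLP (y).
percent_women = [20, 30, 40, 50, 60]
mflp = [60, 70, 68, 80, 82]

slope, intercept, r2 = ols(percent_women, mflp)
print(round(slope, 3), round(intercept, 3), round(r2, 3))
```

The slope is read as the change in MFLP associated with a one-point increase in the share of women athletes. The intercept is the fitted MFLP for a delegation with no women.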

This map is another way to visualize the same data; the map format allows us to identify MFLP and athlete gender ratio values for specific countries. We note that Middle Eastern countries where religion and culture dictate heavy male dominance in labor also correspond strongly to male-skewed Olympic delegations, while more egalitarian societies (e.g. Scandinavian countries) send more balanced delegations. Finally, we observe that a number of traditionally matriarchal societies (Saharan Africa) also send female-skewed delegations.

We can conclude that the gender ratios of Olympic delegations are associated with, and on average reflect, the M/F labor participation gaps in the given countries. This can be attributed to many shared factors, including culture, religion, and societal structure/support for different activities amongst men and women.