Introduction

In a world where digitalization intertwines regions, countries, and cities, it is important not only to keep a pulse on development within individual communities but also to compare and contrast them with one another. However, with the abundance of information made readily available by the internet, it is easy to get caught up in small details and overlook larger development metrics.

Our project focuses on tracking the relationships between metrics central to understanding the quality of life within major global regions. To do so, we explore trends in GDP across regions and income groups over time, examine the relationship between education levels and poverty rates, and investigate the relationship between environmental factors such as carbon dioxide emissions and health indicators such as the death rate. With this project, we hope to explore complex socio-economic issues at a global scale and understand the dynamics shaping the world’s development trajectory.

Dataset Description

Our dataset comes from Carnegie Mellon University’s Statistics and Data Science Repository and compiles a selection of 40 interesting variables for 266 countries and regions across nine years, from 2013 to 2021. With variables covering aspects of socio-economic development such as income, education, health, and environmental factors, the dataset supports comprehensive exploratory data analysis (EDA) and visualization. Of the 46 available variables, we focus on 9 in particular:

  1. Region: Geographic region the country is in (categorical)
  2. Country Name: Name of country or region (categorical)
  3. GDP: Gross domestic product, in current US$ (quantitative)
  4. Income: Adjusted net national income, in constant 2015 US$ (quantitative)
  5. CompulsoryEducation: Compulsory education, duration in years (quantitative)
  6. Poverty: Percentage of population who are multidimensionally poor (quantitative)
  7. CO2Emissions: CO2 emissions in metric tons per capita (quantitative)
  8. Death: Death rate, crude per 1,000 people (quantitative)
  9. Year: Year the values were observed in (ordinal)

Research Questions

Our project centers around three questions:

  1. How do GDP and income change across regions over time?

  2. What is the relationship, if there is one, between the duration of compulsory education and the poverty rate?

  3. What is the relationship, if there is one, between carbon dioxide emissions and the death rate?

Question 1

To explore how GDP and income differ across regions, we charted the bivariate relationships between GDP and Region and between GNI and Region, since these variables were available in our dataset. Although these are two separate graphs, they show very similar trends, which makes sense because GDP and GNI are closely related metrics that differ mainly by net income received from abroad. In both graphs, total GDP and total GNI over the 2013 to 2021 period are by far the greatest in East Asia & Pacific, Europe & Central Asia, and North America, which is consistent with how highly developed those regions are. Because more developed regions show higher overall levels of GDP and GNI, this pattern lends itself to the further questions explored in the rest of this report.

This graph presents GDP by region for each year from 2013 to 2021, with each region represented by a different color and one bar per year. East Asia & Pacific, denoted in red, shows consistently high GDP figures, maintaining a significant lead over other regions throughout the observed period. North America, in blue, also demonstrates considerable economic size, with a relatively stable GDP that shows slight fluctuations but no drastic changes. On the other hand, regions such as Sub-Saharan Africa, in pink, and South Asia, in purple, display much lower GDP figures, indicating a smaller economic scale relative to other areas. There is a clear disparity between these regions and economic powerhouses like East Asia & Pacific and North America. Overall, the graph highlights both the vast differences in GDP across regions and the stability or growth trends within each region over the nine-year span.

Question 2

To examine the relationship between compulsory education and poverty rate, we graphed the regional means of both variables for 2013-2021. In the graph of mean poverty rate, the Sub-Saharan Africa and Latin America & Caribbean regions have the highest poverty rates. An issue with our dataset is that we do not have any poverty rate data for North America, as the WDI dataset regards every North American country as highly developed. Interestingly, many countries that are geographically in North America are classified in the Latin America & Caribbean region instead, which could help explain this gap. Meanwhile, in terms of mean compulsory education, North America and Latin America have the highest compulsory education requirements, while Sub-Saharan Africa has the lowest. The differences between these graphs suggest that there might be a relationship worth studying between the two variables. Latin America appears to be an exception, but the diversity and unusual definition of the region in this dataset could make it an outlier rather than representative of the overall trend.
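The regional means described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code behind the report's graphs (the report's own output appears to come from R). The rows are hypothetical values, not figures from the dataset, and `mean_by_region` is a helper name introduced here. The point it shows is that a region with no observed values, as with Poverty for North America, should produce a missing mean rather than zero.

```python
def mean_by_region(rows, value_key):
    """Mean of value_key per Region, skipping missing (None) values.

    A region whose values are all missing maps to None instead of 0,
    so it is not mistaken for a region with a zero rate.
    """
    values = {}
    for row in rows:
        region_values = values.setdefault(row["Region"], [])
        if row[value_key] is not None:
            region_values.append(row[value_key])
    return {
        region: (sum(vals) / len(vals) if vals else None)
        for region, vals in values.items()
    }


# Hypothetical rows for illustration only (not values from the dataset).
rows = [
    {"Region": "Sub-Saharan Africa", "Poverty": 60.0},
    {"Region": "Sub-Saharan Africa", "Poverty": 40.0},
    {"Region": "Latin America & Caribbean", "Poverty": 10.0},
    {"Region": "Latin America & Caribbean", "Poverty": None},
    {"Region": "North America", "Poverty": None},
]

regional_means = mean_by_region(rows, "Poverty")
for region, mean_value in regional_means.items():
    print(region, mean_value)
```

The same helper could be applied to CompulsoryEducation by passing a different `value_key`.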

From the heat map, we notice that we do not have data for all countries. Most of Africa has no dots, which we note as a major limitation of our data. One trend is that Europe has mostly yellow dots, indicating that compulsory education is in the middle range of 8-12 years. We see a range of circle sizes in Europe, so we cannot conclude whether there is a pattern of lower poverty with higher education in Europe specifically. However, in Southeast Asia the dots are significantly smaller, indicating lower poverty rates. In large countries like Mexico, the US, Canada, Russia, and India, we see large dots, which point to higher poverty, with varying levels of compulsory education. This suggests that in countries with larger economies, larger education systems, and more immigration, factors other than compulsory education contribute to poverty. An interesting feature of the heat map is the notably high compulsory education in Latin America. Overall, we do not have enough information to confirm a region-to-region trend between compulsory education and poverty; however, we can see that poverty varies across countries with different economies and populations.

Question 3

The graph illustrates the relationship between CO2 emissions (log-transformed) and death rates across various global regions. Each region is represented by a distinct color and has its own scatter of points along the axes. A log transformation is applied to CO2 emissions because they can vary by orders of magnitude between countries and regions. This transformation reduces skewness, lessens the effect of outliers, and makes the scale more linear, which allows for better comparison and correlation analysis. From the spread of the points, it seems that for many regions there is no clear trend or correlation between the transformed CO2 emissions and death rates, as the points are widely dispersed with no distinct pattern. Notably, Sub-Saharan Africa has a cluster of points at the lower end of CO2 emissions, indicating lower levels of industrialization, but the death rate seems high and consistent regardless of emissions. In regions such as the Middle East & North Africa and South Asia, there seems to be a decrease in the death rate as CO2 emissions increase. A possible reason is that higher CO2 emissions accompany urbanization: with more urbanization and better infrastructure, these regions may see improved mortality rates.
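The effect of the log transformation described above can be seen in a small sketch. It is written in Python with the standard library only and uses hypothetical emission values, not figures from the dataset. Values that are each ten times larger than the previous one are unevenly spaced on the raw scale but evenly spaced on the log scale, which is why the largest emitters no longer dominate the axis.

```python
import math

# Hypothetical per-capita emissions, each ten times the previous value
# (illustration only, not values from the dataset).
emissions = [0.05, 0.5, 5.0, 50.0]

# Natural log, matching the log() used in the regression formula.
logged = [math.log(value) for value in emissions]

# Gaps between consecutive values on each scale.
raw_gaps = [b - a for a, b in zip(emissions, emissions[1:])]
log_gaps = [b - a for a, b in zip(logged, logged[1:])]

print("raw gaps:", raw_gaps)  # grow tenfold at each step
print("log gaps:", log_gaps)  # all equal to log(10)
```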

This graph highlights the complexity of the relationship between environmental factors and health outcomes and suggests that factors other than CO2 emissions may be more influential on death rates in certain regions.

## 
## Call:
## lm(formula = Death ~ log(CO2Emissions), data = filtered_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3579 -1.5993 -0.3892  1.4758  7.9497 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.99249    0.06865 116.430  < 2e-16 ***
## log(CO2Emissions) -0.24229    0.04373  -5.541  3.5e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.503 on 1667 degrees of freedom
## Multiple R-squared:  0.01808,    Adjusted R-squared:  0.01749 
## F-statistic:  30.7 on 1 and 1667 DF,  p-value: 3.497e-08

Although there appears visually to be no clear trend between the transformed CO2 emissions and death rates, our regression analysis provides statistically significant evidence of a negative association between the two. The estimated slope is -0.242, and with a p-value of 3.497e-08 the result is significant at the alpha = 0.05 level. Because this is an observational regression, it shows association rather than a causal effect. We also note the low multiple R-squared (0.018) and adjusted R-squared (0.017) values, which mean the model explains less than 2% of the variation in death rates. This supports the idea that factors other than CO2 emissions may be more influential on death rates in certain regions.
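To make the size of the fitted relationship concrete, the coefficients reported in the regression output above can be plugged into the model equation. The sketch below is in Python (standard library only) and is an illustration, not part of the original analysis. The constants are copied from the output, `predicted_death_rate` is a helper name introduced here, and `math.log` is the natural log, as is `log()` in the R formula.

```python
import math

# Estimates copied from the regression output above.
INTERCEPT = 7.99249
SLOPE = -0.24229
R_SQUARED = 0.01808


def predicted_death_rate(co2_per_capita):
    """Fitted death rate (per 1,000 people) at a given level of
    CO2 emissions in metric tons per capita."""
    return INTERCEPT + SLOPE * math.log(co2_per_capita)


# Change in the fitted death rate when emissions double:
# slope * (log(2x) - log(x)) = slope * log(2), whatever x is.
doubling_effect = SLOPE * math.log(2)

print(f"fitted rate at 1 t per capita:  {predicted_death_rate(1):.2f}")
print(f"fitted rate at 10 t per capita: {predicted_death_rate(10):.2f}")
print(f"change per doubling of CO2:     {doubling_effect:.3f}")
print(f"share of variance explained:    {R_SQUARED:.1%}")
```

Under this fit, doubling per-capita emissions corresponds to a drop of roughly 0.17 deaths per 1,000 people, which is small compared with the residual standard error of 2.503.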

Although our regression analysis above showed significant evidence supporting our initial hypothesis of a correlation between CO2 emissions and death rate, the low adjusted R-squared value inspired us to look at other potential reasons for an increase or decrease in deaths. A quick filter of the dataset shows that the top four instances of battle deaths in any year for any country all came from Syria in the years 2013-2016, with 2013 having the highest count. This led us to calculate, as seen above in the time series plot, the proportion of total deaths in Syria that was due to battle deaths for each year. To estimate the total number of deaths for a year, we divided the total population by 1,000 and multiplied the result by the corresponding death rate, since the death rate is the number of deaths per 1,000 people. The number of battle deaths is already available in the original dataset, so dividing it by the estimated total number of deaths gives the proportion of deaths attributed to battle for each year.
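The calculation described above can be sketched in a few lines. This is a minimal Python illustration rather than the code used for the plot, and the input numbers are hypothetical, not Syria's actual figures. `battle_death_share` is a helper name introduced here.

```python
def battle_death_share(population, death_rate_per_1000, battle_deaths):
    """Proportion of a year's total deaths attributed to battle deaths.

    Total deaths are estimated as (population / 1000) * death rate,
    because the death rate is expressed per 1,000 people.
    """
    total_deaths = population / 1000 * death_rate_per_1000
    if total_deaths <= 0:
        raise ValueError("estimated total deaths must be positive")
    return battle_deaths / total_deaths


# Hypothetical inputs for illustration only (not values from the dataset).
example_share = battle_death_share(
    population=20_000_000,
    death_rate_per_1000=5.0,
    battle_deaths=25_000,
)
print(f"{example_share:.2%}")
```

With these made-up inputs, the estimated total is 100,000 deaths, so 25,000 battle deaths correspond to a share of 25%.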

It should be noted that these years fall within the Syrian civil war, which began in 2011 and explains why the battle death counts are so high during this period. In our data, battle deaths are highest in 2013, which is why we see a clear decreasing trend in the proportion over time as fewer deaths were attributed to battle.

Conclusion

Our analysis offers a nuanced view of global development indicators. GDP and GNI are highest in highly developed regions such as East Asia & Pacific, Europe & Central Asia, and North America, underlining the link between economic development and these metrics. Poverty rates are highest in Sub-Saharan Africa and Latin America & Caribbean, while the absence of poverty data for North America in the WDI dataset limits how completely development can be compared across regions. The high compulsory education requirements in North America contrast sharply with those in Sub-Saharan Africa, indicating a potential avenue for further research into the impact of education on economic development.

However, the relationship between environmental and health indicators is less clear-cut. While the regression analysis of CO2 emissions and death rates shows statistical significance, the low R-squared values suggest that CO2 emissions are not the predominant factor affecting death rates, pointing to a more complex interplay of variables that govern regional health outcomes. Collectively, these insights emphasize the multifaceted nature of development indicators and the importance of considering a broad range of factors when evaluating the progress and challenges of different regions. Further research that addresses data gaps and explores these complex relationships is crucial for a more comprehensive understanding of global development.

Limitations and Further Research

Some limitations of our study include missing data, a time span of only nine years (2013 to 2021), and the lack of daily data for variables such as carbon dioxide emissions. The effects of these limitations are as follows: missing data may cause inaccuracies in our conclusions, more years of data would allow for more precise observations and conclusions, and the lack of daily data restricts our time series plots to annual values.

Further research could expand on the role of urbanization and development using statistics such as birth rate (Birth), deaths from communicable diseases (DeathCD), deaths from non-communicable diseases (DeathNCD), battle deaths (BattleDeaths), and more. It would be interesting to track development according to sociological stages of urbanization, especially in more developed regions such as North America and Europe & Central Asia.

Additionally, for the second research question, a large confounding factor is that although some less developed countries have very high compulsory education requirements alongside high poverty rates, many of these countries actually have very low educational completion rates. For example, Latin America’s educational completion rate is 46.4%, so even though the legal requirement of 16 years of education appears quite strict, the low completion rate implies that an undereducated population might contribute to high poverty rates. This is a complicated relationship, but it illustrates that many factors beyond a strict compulsory education cutoff can influence poverty. It would be interesting to explore further factors in the future and to examine the completion rates of different countries and regions to understand this trend.