Examining the Influence of Socioeconomic Factors on Premature Death Rates Across Racial Groups

Authors

Tasnim Rida

Lissandro Alvarado

Oreoluwa Williams

Naomi James

Princess Allotey [TA Advisor]

Published

July 26, 2024


Introduction

A person’s health is determined more by their zip code than by genetics or family history (Morgan, 2019). The conditions in which people live, work, learn, play, and worship, and the systems that shape daily life, broadly deemed Social Determinants of Health (SDOH), are deciding factors in health outcomes. Within these broad influences, socioeconomic disparities in income, education, and employment play a critical role in shaping health outcomes. Moreover, historical and ongoing biases have systematically hindered socioeconomic progress for racial minorities.

For instance, low socioeconomic status has been associated with a range of adverse health outcomes. Lower levels of education are associated with poorer health literacy, affecting an individual’s ability to make informed health decisions and navigate the healthcare system effectively. Similarly, unemployment or low-paying jobs can lead to chronic stress, poor mental health, and limited access to resources needed for a healthy lifestyle.

These factors contribute significantly to health disparities; this analysis specifically focuses on the health outcome of premature deaths. Broadly defined, premature death is when death occurs before the average age of death for a given population (National Cancer Institute, 2011). Given that premature deaths are often preventable deaths, examining this measure in conjunction with socioeconomic factors can reveal specific areas of improvement for the United States healthcare system.

Our county-level analysis focuses on the following question: Do income inequality, unemployment, and high school completion rates affect the number of premature deaths of certain racial groups at the county level? We hypothesize that these socioeconomic factors will emerge as significant predictors of premature deaths.

Data

The dataset is from the 2024 County Health Rankings Data from the University of Wisconsin Population Health Institute (2023 Measures County Health Rankings & Roadmaps, n.d.). This dataset ranks counties within each state based on various health outcomes and health factors. Our variables of interest are as follows:

Predictor Variables

  • Income Inequality: Ratio of household income at the 80th percentile to household income at the 20th percentile
  • Unemployment: Percentage of the population ages 16 and older unemployed but seeking work
  • High School Completion: Percentage of adults ages 25 and over with a high school diploma or equivalent
  • Percent Population: Percentages of the population by county
    • One column per racial group: American Indian or Alaska Native (AIAN), Asian and Pacific Islander (AAPI), Non-Hispanic Black, Hispanic, Native Hawaiian or Other Pacific Islander (NHOPI), and Non-Hispanic White

Response Variable

The response variable we are interested in predicting is years of premature death. Before our analysis, we removed counties with missing values for premature death rate.

  • Premature Death: Years of potential life lost before age 75 per 100,000 population (age-adjusted to standardize age comparisons).

Exploratory Data Analysis

We began with exploratory data analysis of the response variable as well as the socioeconomic variables of interest.

Premature Death

Figure 1: Choropleth map of total years lost before 75 years old per 100,000 people per county in the U.S.

Figure 1 shows the years of potential life lost per county in the US. There appears to be a regional pattern: counties with higher premature death are concentrated primarily around Appalachia, the Deep South, and the Northern Plains, regions characterized by higher levels of poverty and socioeconomic challenges. On the other hand, lighter counties with fewer premature deaths tend to be concentrated on the West Coast, Northeast, and Upper Midwest, areas that tend to have fewer health disparities.

Racial Categories

Figure 2: Bar plot of median rate of premature death per race.

There is a clear population size difference between races in the United States. Since the race with the largest population will inherently have the highest premature death count, we normalized by population size. We first estimated the population of each racial group in each county by multiplying that group's percentage of the county population by the county's overall population. We then obtained the normalized premature death value by dividing the number of premature deaths by the estimated population of the respective racial group.
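The normalization described above can be sketched in a few lines of Python. This is only an illustration: the function names and the county values are hypothetical, and the group share is assumed to be stored as a proportion between 0 and 1 rather than a percentage.

```python
def group_population(total_population, group_share):
    """Estimated number of county residents in one racial group.

    group_share is assumed to be a proportion (0-1), not a percentage.
    """
    return total_population * group_share


def normalized_premature_death(premature_deaths, total_population, group_share):
    """Premature death value divided by the group's estimated population."""
    population = group_population(total_population, group_share)
    if population <= 0:
        return None  # no residents recorded for this group in the county
    return premature_deaths / population


# Hypothetical county, for illustration only (not taken from the dataset).
example_population = group_population(50_000, 0.02)
example_rate = normalized_premature_death(150.0, 50_000, 0.02)
print(example_population, example_rate)
```

Returning `None` for an empty group avoids a division by zero in counties where a group has no recorded residents.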

Graphing the median rate of premature death per county, as shown in Figure 2, we found that AIAN populations tend to experience the most premature death, with a median rate of 4 years lost for every 100,000 people in this group. This is followed by NHOPI and Black populations. This suggests that premature death rates differ by race.

Socioeconomic Factors

Finally, we used scatter plots to visualize the relationship between each of the specified socioeconomic factors and premature death. The red line indicates the linear model line of best fit.

Figure 3: Scatter plots with linear regression lines of best fit with standard errors displaying the relationship between premature death rates and three socioeconomic factors: income inequality rate, unemployment rate, and high school completion rate.

Figure 3 shows reasonably linear relationships between each of the predictor variables and premature death. The relationship between income inequality and premature death appears to be positive, indicating that as the income inequality in a given county increases, the years of premature death also increase. Similarly, the association between unemployment and premature death is positive; as the unemployment percentage in a given county increases, the years of premature death also increase. In contrast, high school completion is negatively associated with premature death; as the high school completion percentage in a given county increases, the years of premature death decrease.

Intuitively, the socioeconomic predictors income inequality, unemployment, and high school completion are related to each other. However, using a correlation plot in further exploratory data analysis revealed no significant concerns about multicollinearity between the predictor variables.
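A pairwise correlation check of this kind can be sketched as follows. The predictor values below are made up for illustration, and the 0.8 cutoff is an assumed rule of thumb, not a threshold taken from our analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def high_correlation_pairs(columns, threshold=0.8):
    """Return predictor pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical predictor values for five counties (illustration only).
predictors = {
    "income_inequality": [4.1, 5.0, 4.4, 6.2, 3.9],
    "unemployment": [0.03, 0.06, 0.04, 0.05, 0.07],
    "hs_completion": [0.91, 0.85, 0.88, 0.90, 0.84],
}
flagged_pairs = high_correlation_pairs(predictors)
print(flagged_pairs)
```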

Methods

In order to answer our research question, we built several statistical models to explore the relationship between premature death and our socioeconomic factors of interest: income inequality, unemployment, high school completion, and percent populations of each race.

Multiple Linear Regression

The results of our exploratory data analysis showed us that the relationships between the predictor variables and the response variable are reasonably linear, so we chose to begin our modeling with a multiple linear regression model:

Y = \beta_{0} + \beta_{1}X_{1} + \dots + \beta_{n}X_{n} + \epsilon

where Y is the response variable, \beta_{0} is the intercept, \beta_{1}, \dots, \beta_{n} are the coefficients, X_{1}, \dots, X_{n} are the predictor variables, and \epsilon is the error term.

The model above estimates a linear relationship between multiple predictor variables and a response variable. Multiple linear regression estimates parameters from data using the least squares method, which minimizes the sum of the squares of the residuals (the differences between the observed and predicted values). The assumptions of multiple linear regression are linearity between the predictors and the response, normally distributed errors, equal error variance (homoscedasticity), and independence between observations.

Using multiple linear regression in this case will enable us to estimate the coefficients for our socioeconomic predictor variables, providing insight into how they influence premature death. While linear regression is simple to implement and easily interpretable, it is not able to capture complex, nonlinear relationships in data.
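To make the least squares idea concrete, here is a minimal pure-Python sketch that solves the normal equations for a small made-up dataset. It is not the code used in our analysis; the helper names and data are hypothetical, and in practice a statistics library would handle this step.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(X, y):
    """Least squares coefficients [b0, b1, ..., bn] from the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones for the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Made-up data generated exactly from y = 2 + 3*x1 - 1*x2.
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [3, 7, 6, 11, 9]
beta = fit_ols(X, y)
print(beta)
```

Because the example data contain no noise, the recovered coefficients match the generating equation up to floating-point error.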

Gradient Boosted Tree

We next modeled our data with a tree-based method in order to investigate any complex relationships the multiple linear regression could not estimate. Decision trees are non-parametric models that partition the predictor space into regions. Regression trees (the type of decision tree we chose) model the relationship between predictor variables and a continuous response variable. To build a regression tree, the data are split with recursive binary splitting until a stopping criterion (such as maximum depth) has been reached. A single deep tree can overfit the training data, which is one motivation for combining many small trees through boosting.

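We decided to gradient boost our decision tree to improve the accuracy of our predictions. The boosting algorithm sequentially builds multiple trees, where each tree in the sequence slightly improves upon the predictions of the previous trees. The algorithm achieves this by fitting each new tree to the residuals of the current ensemble, so that later trees concentrate on the observations that are still poorly predicted. Because boosting can itself overfit, hyperparameters such as the number of trees and the learning rate are tuned with cross-validation.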
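The residual-fitting idea can be sketched with depth-one trees (stumps) on a single predictor. This is a simplified illustration under squared-error loss, not the XGBoost algorithm itself, which adds regularization and other refinements; the data, function names, and default settings are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimizes squared error on the residuals."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(y) / len(y)  # start from the mean prediction
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, xi):
    """Sum the shrunken contributions of every stump on top of the base value."""
    total = model["base"]
    for threshold, left_mean, right_mean in model["stumps"]:
        total += model["learning_rate"] * (left_mean if xi <= threshold else right_mean)
    return total


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up nonlinear data (illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.0, 2.0, 5.0, 9.0, 9.0, 10.0, 14.0]
model = boost(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [predict(model, xi) for xi in x])
print(baseline_mse, boosted_mse)
```

The learning rate shrinks each tree's contribution, which is why many small steps are needed; the comparison here is on training data only and says nothing about out-of-sample error.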

Model Selection

We evaluated these models using cross-validation, which estimates out-of-sample error and guards against selecting an overfit model, and compared their root mean squared errors (RMSE). The RMSE is calculated by finding the differences between predicted values and actual values, squaring the errors, taking the mean, and then finding the square root. RMSE is commonly used to evaluate model performance, with lower values of RMSE indicating better predictive accuracy.
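The RMSE calculation and the fold structure of cross-validation can be sketched as below. The helper names, the mean-only baseline model, and the data are hypothetical, and the folds here are contiguous for simplicity, whereas real analyses usually shuffle observations first.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: sqrt of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds of n rows."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validated_rmse(x, y, k, fit, predict):
    """Fit on each training split and score RMSE on the held-out fold."""
    scores = []
    for train, test in k_fold_indices(len(y), k):
        model = fit([x[i] for i in train], [y[i] for i in train])
        preds = [predict(model, x[i]) for i in test]
        scores.append(rmse([y[i] for i in test], preds))
    return scores


def fit_mean(xs, ys):
    """Baseline 'model': remember the training mean of the response."""
    return sum(ys) / len(ys)


def predict_mean(model, xi):
    return model


# Made-up data (illustration only).
x = list(range(10))
y = [2.0 * v for v in x]
fold_scores = cross_validated_rmse(x, y, 5, fit_mean, predict_mean)
print(fold_scores)
```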

Ultimately, we selected the model with the best predictive performance on the test set (the model with the lowest RMSE). The final model we selected outperforms the other models we evaluated in predicting premature death rates in each county based on specified socioeconomic factors.

Results

We began by fitting the multiple linear regression and a gradient boosted tree. We then proceeded to model evaluation by calculating RMSE values to select the best model for predicting premature death rate.

Multiple Linear Regression

The results of fitting the multiple linear regression yielded significant coefficient estimates for all predictor variables (p < 0.05). Examination of the diagnostic plots revealed that the model reasonably fulfills the assumptions of multiple linear regression.

Figure 4: Bar chart of multiple linear regression model coefficients with standard errors where orange indicates positive coefficients and beige indicates negative coefficients.

Figure 4 supports our hypothesis that income inequality, unemployment, high school completion, and race are significant predictors of premature death. The interpretations of our socioeconomic factors of interest are as follows:

  • In a given county, as income inequality increases (\beta_{1}=721.71, 95% CI:[604, 839]), years of premature death increases (holding all else constant).

  • An increase in unemployment percentages is associated with an increase (\beta_{2}=28451.64, 95% CI:[21323, 35580]) in premature deaths (holding all else constant).

  • Holding all else constant, as high school completion percentages in a given county increase (\beta_{3}=-22523.18, 95% CI:[ -24403, -20644]), premature deaths decrease.

The coefficients of the racial predictor variables have interesting results; the model estimates positive coefficients for the NHOPI, AIAN, and Black racial groups while estimating negative coefficients for the Hispanic and AAPI racial groups. We interpret one positive and one negative coefficient below:

  • For a given county, as the percent population of NHOPI increases (\beta_{8}=40158.40, 95% CI:[20341, 59975]), the years of premature death increases (holding all else constant).

  • An increase in the percent population of Hispanic (\beta_{7}=-4649.65, 95% CI:[-5331, -3968]) is associated with a decrease in years of premature death, all else equal.

The negative coefficients for underrepresented racial groups, particularly Hispanic populations, are surprising. Notably, the 95% confidence intervals displayed indicate significance, as none of the intervals include zero.
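As a rough check on how these intervals relate to the estimates and standard errors in Table A1, the sketch below rebuilds the income inequality interval using a normal approximation. This is an assumption on our part: the reported intervals likely use a t quantile, so other rows may differ from this approximation by a few units. The function names are hypothetical.

```python
from statistics import NormalDist


def t_statistic(estimate, standard_error):
    """Test statistic for the null hypothesis that the coefficient is zero."""
    return estimate / standard_error


def confidence_interval(estimate, standard_error, level=0.95):
    """Normal-approximation interval: estimate +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * standard_error, estimate + z * standard_error


# Income inequality row from Table A1: estimate 721.71, SE 60.01.
lower, upper = confidence_interval(721.71, 60.01)
t_value = t_statistic(721.71, 60.01)
excludes_zero = lower > 0 or upper < 0
print(round(lower), round(upper), round(t_value, 2), excludes_zero)
```

An interval that excludes zero corresponds to rejecting the null hypothesis of no association at the matching significance level.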

Finally, we also examined variable importance to identify which variables are most important for making accurate predictions. Based on Figure 5, the variable with the highest importance is the percent population of AIAN. The variable with the lowest importance is the percent population of NHOPI, which notably has the largest coefficient estimate. Interestingly, the socioeconomic factor with the most importance is high school completion.

Figure 5: Bar plot of variable importance for the multiple linear regression model using values calculated from the vip package.
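One way to see where such a ranking could come from is to sort the predictors by the absolute value of their t-statistics from Table A1. This assumes the importance scores for a linear model are based on the absolute t-statistic, which is our understanding of the vip package's default for linear models and should be treated as an assumption rather than a confirmed detail of the analysis.

```python
# t-statistics copied from Table A1 (intercept omitted).
t_statistics = {
    "Income Inequality": 12.03,
    "Unemployment": 7.83,
    "High School Completion": -23.50,
    "Percent Asian and Pacific Islander": -14.17,
    "Percent Black": 15.07,
    "Percent American Indian or Alaska Native": 32.00,
    "Percent Hispanic": -13.37,
    "Percent Native Hawaiian or Other Pacific Islander": 3.97,
}

# Assumed importance score: |t|, sorted from most to least important.
ranked = sorted(t_statistics, key=lambda term: abs(t_statistics[term]), reverse=True)
for term in ranked:
    print(f"{term}: {abs(t_statistics[term]):.2f}")
```

Under this assumption the ordering agrees with the text: percent AIAN ranks first, high school completion second, and percent NHOPI last, even though NHOPI has the largest coefficient, because its standard error is also very large.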

Gradient Boosted Tree

Next, we fit a gradient boosted tree to our selected variables: premature deaths, income inequality, unemployment, high school completion, and each racial group variable. After using the XGBoost package to fit the model with hyperparameters tuned by 5-fold cross validation, our results revealed insights that contrast with those from our multiple linear regression analysis.

Figure 6: Scatter plot of observed premature death rates versus the predicted premature death rates from the gradient boosted tree model; the red dashed line has a slope of one, representing perfect prediction accuracy.

The plotted points in Figure 6 fall mostly along the red dashed line, indicating the predictions made by the gradient boosted tree are fairly similar to the original observed values.

We also examined variable importance, as we did with our multiple linear regression. The variable with the highest importance is now high school completion, in contrast with the multiple linear regression model’s variable with the highest importance (percent population of AIAN). The variable with the lowest importance is percent population of NHOPI, the same as the multiple linear regression model.

Figure 7: Bar plot of variable importance for the gradient boosted tree model using values calculated from the vip package.

Model Evaluation

We used cross validation to test the performance of the linear model and gradient boosted tree. In Figure 8, each black dot represents a single test fold in the cross validation process, while the red dot represents the mean RMSE. The error bars represent the standard error, indicating the variability of the mean RMSE.

The linear model's predictions have an RMSE of approximately 2243 potential years lost per 100,000 people, while the gradient boosted tree has an RMSE of 1846. This means that the linear model's predictions typically deviate from the observed premature death rate by about 2243 potential years lost per 100,000 people, while the gradient boosted tree's predictions deviate by about 1846. Additionally, the error bars for the gradient boosted model are shorter than those for the linear model, indicating less variability in the mean RMSE. This suggests that the gradient boosted tree is the more reliable model, and thus it is the final model we selected.

Figure 8: Dot plot with error bars comparing the RMSE of the gradient boosted tree and the multiple linear regression model.
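The red dot and error bars in a plot like this can be computed from the per-fold RMSE values as a mean and a standard error. The fold values below are placeholders for illustration, not the values from our cross validation, and the function name is hypothetical.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Mean of the fold scores and the standard error of that mean."""
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Placeholder fold RMSE values (illustration only).
fold_rmse = [10.0, 12.0, 11.0, 13.0, 14.0]
mean_rmse, se_rmse = mean_and_standard_error(fold_rmse)
print(mean_rmse, se_rmse)
```

A shorter error bar means the fold scores are more tightly clustered around their mean, which is the sense in which one model's estimated error is less variable than the other's.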

Discussion

The results of our analysis have offered us several key insights into socioeconomic factors influencing premature death for certain racial groups. We constructed two models and performed model evaluation to select our final model: the gradient boosted tree.

The RMSE for the gradient boosted tree (1846) is about 18% lower than the RMSE for the multiple linear regression model (2243). Thus, we can use the gradient boosted tree to predict years of premature death more accurately from the percent population of each racial category included in the model (AIAN, AAPI, Black, Hispanic, and NHOPI), income inequality, unemployment, and high school completion.

Further analysis of the gradient boosted tree included variable importance, through which we determined that high school completion is the most important variable, while the percent of the population identifying as NHOPI is the least important. This finding contrasts with the multiple linear regression results, where high school completion was the second most important variable. This difference in variable importance suggests that the gradient boosted tree captured relationships in the data that the multiple linear regression did not. Multiple linear regression instead identified percent population AIAN as the most important variable, which can potentially be explained by the low percentages of the AIAN racial group in the United States. Notably, both the gradient boosted tree and the multiple linear regression identified NHOPI as the least important variable, likely due to the similarly low percentages of the NHOPI racial group in the United States.

Limitations

There were several limitations to this analysis. The county-level analysis does not provide individual-level insights, which limits our ability to understand specific disparities within a county. Additionally, the dataset does not include a racial breakdown of income inequality, unemployment, and high school completion; thus, we were unable to model complex relationships between race and the other socioeconomic factors of interest. Furthermore, the lack of detailed racial breakdown data prevented us from using racial group as a categorical predictor variable. This limitation makes our final model harder to interpret regarding the influence of race on premature deaths. Finally, the variable importance scores from the multiple linear regression model assess each variable individually; they do not account for combinations of variables that could yield greater predictive accuracy.

Future Work

Given the limitations of our study, future research should focus on obtaining racial breakdown data for socioeconomic factors of interest. Exploring the relationships between income inequality, unemployment, and high school completion with premature deaths for each racial group could uncover new insights and trends, allowing for the development of interaction models to explore these complex relationships thoroughly.

Furthermore, future research can also incorporate more demographic factors such as age and gender to further understand potential relationships to predict premature deaths. Finally, increasing sample size of the data set used to model these relationships can increase reliability and accuracy of the analysis.

References

  • 2023 Measures | County Health Rankings & Roadmaps. (n.d.). https://www.countyhealthrankings.org/health-data/county-health-rankings-measures

  • Morgan, K. (2019, April 22). Story from Blue Cross Blue Shield Association: Up to 60% of our health is determined by zip code. USA TODAY. https://www.usatoday.com/story/sponsor-story/blue-cross-blue-shield-association/2019/04/22/up-60-our-health-determined-zip-code/3542001002/#:~:text=Social%20determinants%20of%20health%20%E2%80%93%20how%20our%20environments

  • National Cancer Institute. (2011, February 2). https://www.cancer.gov/publications/dictionaries/cancer-terms/def/premature-death

  • U.S. Department of Labor. (2021). FOREIGN-BORN WORKERS: LABOR FORCE CHARACTERISTICS — 2015. https://www.bls.gov/news.release/pdf/forbrn.pdf

Appendix

Table A1: Multiple Linear Regression Model Summary
Term                                                Estimate     SE         t        p
(Intercept)                                         25393.56     989.98     25.65    <0.001
Income Inequality                                   721.71       60.01      12.03    <0.001
Unemployment                                        28451.64     3635.69    7.83     <0.001
High School Completion                              -22523.18    958.60     -23.50   <0.001
Percent Asian and Pacific Islander                  -25517.53    1800.40    -14.17   <0.001
Percent Black                                       5034.43      334.07     15.07    <0.001
Percent American Indian or Alaska Native            17405.13     543.84     32.00    <0.001
Percent Hispanic                                    -4649.65     347.74     -13.37   <0.001
Percent Native Hawaiian or Other Pacific Islander   40158.40     10106.88   3.97     <0.001

Figure A2: Diagnostic plots for multiple linear regression modeling.