Premature Death Prediction Model Using Health Behavior Data

Introduction

Question: How can we predict premature deaths using county-level health behavior data?
Question: Which particular health behaviors are most influential in modeling premature deaths?

The County Health Rankings data reflect a community health approach that emphasizes the many variables that determine how long and how well we live. These data help communities understand how healthy their residents are and which health behaviors they exhibit.

Motivation

Premature mortality is a measure of length of life that focuses on preventable deaths by giving more weight to the deaths of younger people. In the County Health Rankings data set, a death contributes more the further below age 75 it occurs, and deaths at or above age 75 are not counted. This weighting allows researchers to analyze which deaths are occurring because of life factors, such as poor health behaviors or physical environment, instead of old age. For this analysis, it keeps the focus on counties where residents exhibit high rates of poor health behaviors rather than on counties where the population is simply elderly.

Health behaviors are a leading cause of illness and death in the United States. Efforts to improve public health require information on the prevalence of health behaviors in populations. Studies have concluded that approximately half of all deaths in the United States can be attributed to factors such as smoking, physical inactivity, poor diet, and alcohol use. Identifying which health behaviors contribute most to premature deaths is useful for health program planning and evaluation.

Data

Our data set was gathered from the County Health Rankings Model.¹ This model contains community health information across 2,853 US counties that emphasizes the many factors that influence how long and how well we live. Specifically, we focused on county-level premature deaths and health behavior factors for our analysis.

Health behaviors are actions individuals take that affect their health. They include actions that lead to improved health, such as eating well and being physically active, and actions that increase one’s risk of disease, such as smoking, excessive alcohol intake, and risky sexual behavior. Below we provide descriptions of each variable within every health behavior category.

Data Description

Response

Premature Death: Years of potential life lost (YPLL) before age 75 per 100,000 population (2018-2020)
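
As a rough illustration of how this rate is built up, the sketch below computes YPLL per 100,000 population from a list of ages at death. It assumes the standard definition (each death before age 75 contributes 75 minus the age at death, and deaths at or after 75 contribute nothing) and ignores any age adjustment applied by the data provider. The function name and the toy numbers are hypothetical.

```python
def ypll_rate(ages_at_death, population, cutoff=75, per=100_000):
    """Years of potential life lost before `cutoff`, per `per` residents."""
    # Each death before the cutoff contributes (cutoff - age) years;
    # deaths at or after the cutoff contribute nothing.
    years_lost = sum(cutoff - age for age in ages_at_death if age < cutoff)
    return years_lost * per / population


# Toy county (hypothetical numbers): four deaths among 50,000 residents.
# The death at age 80 is ignored; the others contribute 50, 15, and 1 years.
rate = ypll_rate([25, 60, 74, 80], population=50_000)
print(rate)
```

Because younger deaths contribute more years, two counties with the same number of deaths can have very different YPLL rates.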

Predictors

Alcohol

Excessive Drinking: Percentage of adults reporting binge or heavy drinking (2019)
Alcohol-Impaired Driving Deaths: Percentage of driving deaths with alcohol involvement (2016-2020)

Diet and Exercise

Adult Obesity: Percentage of the adult population that reports a body mass index (BMI) greater than or equal to 30 kg/m² (2019)
Food Environment Index: Index of factors that contribute to a healthy food environment, from 0 (worst) to 10 (best) (2019)
Physical Inactivity: Percentage of adults age 18 and over reporting no leisure-time physical activity (age-adjusted) (2019)
Access to Exercise Opportunities: Percentage of population with adequate access to locations for physical activity (2010 & 2021)
Food Insecurity: Percentage of population who lack adequate access to food (2019)
Limited Access to Healthy Foods: Percentage of population who are low-income and do not live close to a grocery store (2019)

Tobacco Use

Adult Smoking: Percentage of adults (age 18 and older) who are current smokers (2019)

Sexual Activity

Sexually Transmitted Infections (STI): Number of newly diagnosed chlamydia cases per 100,000 population (2019)

Population

County Class: County description categorized by population (Urban \(\geq\) 200k, Rural \(\leq\) 50k, 50k \(<\) Suburban \(<\) 200k) (2020)²
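
The population thresholds can be expressed as a small helper. This is only a sketch of the rule as stated here; the function name is hypothetical and is not taken from our analysis code.

```python
def county_class(population):
    """Map a county population to the Urban / Suburban / Rural classes."""
    if population >= 200_000:
        return "Urban"
    if population <= 50_000:
        return "Rural"
    return "Suburban"  # strictly between 50k and 200k


for pop in (12_000, 50_000, 99_000, 200_000):
    print(pop, county_class(pop))
```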

State	County	Description	YPLL	Obesity %	Excessive Drinking %	Smoking %	STI
Alabama	Crenshaw	Rural	12024.580	35.6	15.21043	24.0	798.7
Arkansas	Garland	Suburban	11824.024	36.2	17.77078	21.0	476.9
California	Placer	Urban	4579.361	24.6	21.71065	11.0	288.2
Louisiana	Acadia	Suburban	11119.991	42.5	21.98766	24.7	556.0
Oklahoma	Osage	Rural	8957.526	38.8	16.51351	22.9	417.3

Exploratory Data Analysis

We started by exploring the marginal distribution of our response variable, YPLL. Because we are working with county-level data, each YPLL observation represents a single county, and each histogram bin spans 1,000 years. From the histogram, we noted that YPLL is positively skewed and centered at approximately 8,000 years.

Next we examined our predictors’ relationships with the response variable. To do this comprehensively yet efficiently, we created pairs plots and a correlation matrix. For this report, we’ve included only the scatter plots between our response and predictors that we believe capture the most significant relationships. For the explanatory variables that showed a significant relationship with YPLL, the relationship was largely linear with some curvature at the ends.

Given that county class is our only categorical predictor, we created a violin plot overlaid on a boxplot to show the relationship between county class and YPLL. Interestingly, we found that both the median YPLL and its variability increase as population decreases. The higher median could be due to poorer access to high-quality healthcare in rural areas than in cities, among other factors. The greater variation in YPLL, however, could be explained by the instability of estimates from smaller populations, where outliers affect county averages more than they do in larger populations. That is, in some rural counties the sample sizes are too small, resulting in an unstable average YPLL.

To check for collinearity, we plotted the predictors against each other. We found that the diet and exercise variables were positively correlated with each other, which could complicate inference from our models. We also saw that excessive drinking had an inverse relationship with food insecurity. On the other hand, alcohol-impaired driving deaths and smoking appeared unrelated: their scatterplot resembled a random scatter with zero slope.
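
To make the correlation check concrete, here is a minimal sketch of a Pearson correlation matrix computed over the five sample rows from the table in the Data Description. Five counties are far too few to support any conclusion, so this only illustrates the calculation; the helper name `pearson` is ours for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# The five sample counties shown in the data table (illustration only).
sample = {
    "YPLL": [12024.580, 11824.024, 4579.361, 11119.991, 8957.526],
    "Obesity %": [35.6, 36.2, 24.6, 42.5, 38.8],
    "Excessive Drinking %": [15.21043, 17.77078, 21.71065, 21.98766, 16.51351],
    "Smoking %": [24.0, 21.0, 11.0, 24.7, 22.9],
    "STI": [798.7, 476.9, 288.2, 556.0, 417.3],
}

names = list(sample)
corr = {(a, b): pearson(sample[a], sample[b]) for a in names for b in names}

# Print one row of the matrix per variable.
for a in names:
    print(a.ljust(22), " ".join(f"{corr[(a, b)]:6.2f}" for b in names))
```

Pairs of predictors with correlations near 1 or -1 are the ones most likely to make individual coefficients hard to interpret.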

Methods

Given the relatively linear relationships discovered in our EDA, we explored linear regression, generalized additive models (GAMs), and regularization techniques. Additionally, to explore potential interactions between predictors, we built a flexible tree-based model using Extreme Gradient Boosting (XGBoost).

Linear Regression

We fit a linear regression model using all 11 of our predictors to serve as a baseline for model selection. Given the generally linear relationships between YPLL and our predictors, we expected this model to perform relatively well compared to more flexible modeling techniques.

Regularization Techniques

We explored Lasso, Ridge, and Elastic Net models, which penalize large coefficients to discourage overly complex fits and reduce the risk of overfitting. Because our diet- and exercise-related predictors are collinear, and Lasso can drop one of a group of highly collinear variables more or less arbitrarily, we also fit Ridge and Elastic Net models.

The gamma parameter determines the mix between a relaxed and a regularized fit. We tuned gamma with 10-fold cross validation, choosing the value that minimized CV error.
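
For reference, the three penalties can be written as one objective. The notation below follows the parameterization used by the glmnet package, which we assume here for illustration; the symbols \(\lambda\) (penalty strength) and \(\alpha\) (lasso/ridge mix) are not defined elsewhere in this report.

\[
\hat{\beta}(\lambda, \alpha) = \arg\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2 + \lambda \left[ \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right]
\]

Setting \(\alpha = 1\) gives the Lasso, \(\alpha = 0\) gives Ridge, and values in between give the Elastic Net. Under the same assumed convention, the relaxed fit blends the penalized coefficients with an unpenalized refit \(\tilde{\beta}\) on the selected variables:

\[
\hat{\beta}_{\text{relaxed}}(\gamma) = \gamma \, \hat{\beta}(\lambda, \alpha) + (1 - \gamma) \, \tilde{\beta}, \qquad 0 \leq \gamma \leq 1
\]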

Generalized Additive Model

GAMs are transparent, interpretable, and flexible models that we expected to perform at least as well as linear regression because they can capture both linear and nonlinear relationships with splines. Furthermore, built-in regularization penalties control the smoothness of the predictor functions to reduce overfitting.

Using backward stepwise variable selection, we began with a full GAM using all predictors. We then removed the least significant predictor based on its F statistic, which in GAM output measures a feature’s impact on the output. After each removal, we calculated the model’s root mean squared error (RMSE) using cross validation. We continued this process until we had exhausted all the predictors and selected the model with the lowest RMSE.
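
The selection loop described above can be sketched as follows. The two callables, `cv_rmse` and `least_significant`, are hypothetical stand-ins for the cross-validated GAM fit and the F-statistic ranking; the toy scoring functions below are invented so the example runs on its own and do not reflect our results.

```python
def backward_stepwise(features, cv_rmse, least_significant):
    """Backward elimination: drop one feature per step, keep the best subset.

    cv_rmse(subset) -> float and least_significant(subset) -> feature name
    are hypothetical callables supplied by the caller.
    """
    current = list(features)
    best_subset, best_score = list(current), cv_rmse(current)
    while len(current) > 1:
        current.remove(least_significant(current))
        score = cv_rmse(current)
        if score < best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-ins (not real results): pretend only "a" and "b" are useful.
importance = {"a": 3.0, "b": 2.0, "c": 0.2, "d": 0.1}


def toy_rmse(subset):
    extra = len(set(subset) - {"a", "b"})    # noise features add a little error
    missing = len({"a", "b"} - set(subset))  # dropping a useful feature adds a lot
    return 10.0 + 0.5 * extra + 4.0 * missing


def toy_least_significant(subset):
    return min(subset, key=importance.get)


subset, score = backward_stepwise(importance, toy_rmse, toy_least_significant)
print(subset, score)
```

Tracking the best subset seen so far, rather than stopping at the first increase in RMSE, matches the description of running the process until the predictors are exhausted.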

This stepwise technique led us to create a simplified GAM with 7 features: Physical Inactivity, Obesity, Food Insecurity, Excessive Drinking, Smoking, STI, and County Class Descriptions.

Note that a GAM treats categorical variables (e.g., County Class) essentially as intercepts, with one class serving as the baseline. The GAM fit is therefore shifted by a constant that depends on the county's class.
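
Written out, the structure described above looks roughly like the following. This is a sketch that assumes an identity link (a Gaussian response), which the report does not state explicitly; the symbols are introduced here for illustration.

\[
\mathbb{E}[\text{YPLL}_i] = \beta_0 + \sum_{k} \beta_k \, \mathbf{1}\{\text{class}_i = k\} + \sum_{j=1}^{p} f_j(x_{ij})
\]

Here \(k\) ranges over the non-baseline county classes, \(\beta_k\) is the constant shift for class \(k\) relative to the baseline, and each \(f_j\) is a smooth spline function of the \(j\)-th numeric predictor.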

XGBoost

XGBoost is an implementation of gradient boosting machines with additional features for regularization and parallelization for faster computing times. This tree-based model can potentially capture interactions among our predictors that our linear regression and GAM may have overlooked.

To tune the model we performed grid search cross validation on the following parameters: number of trees, tree depth, max features per tree, sampling proportion, gamma (regularization), and learning rate. We used squared error as our cost function.
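
A grid search of this kind can be sketched in a few lines. The parameter names, the grid values, and the `toy_cv_error` function are all hypothetical; in practice the error function would fit the boosted model and return its cross-validated squared error.

```python
from itertools import product


def grid_search(grid, cv_error):
    """Return the parameter combination with the lowest cross-validated error."""
    names = list(grid)
    best_params, best_err = None, float("inf")
    # Try every combination of the candidate values.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        err = cv_error(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err


# Hypothetical grid: names and values are illustrative only.
grid = {
    "n_trees": [100, 300],
    "max_depth": [2, 4],
    "learning_rate": [0.05, 0.1],
}


def toy_cv_error(params):
    # Placeholder for "fit the model and return its CV error".
    return (
        abs(params["n_trees"] - 300) / 100
        + abs(params["max_depth"] - 4)
        + abs(params["learning_rate"] - 0.05)
    )


best, err = grid_search(grid, toy_cv_error)
print(best, err)
```

The number of model fits grows multiplicatively with each added parameter, which is why grids are usually kept coarse.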

Results

We first compared the performance of our linear regression model using all 11 features with the lasso, ridge, and elastic net techniques. Using 5-fold stratified cross validation, we evaluated how well each model predicted YPLL, with RMSE as our evaluation metric. Stratified cross validation ensures that the relative frequency of counties within each group (i.e., state) is approximately preserved in each training and validation fold.
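
The fold assignment and the metric can be sketched as below. This is a minimal standard-library illustration under our own assumptions, not the code used for the analysis; the function names and the toy list of states are hypothetical.

```python
import random
from collections import defaultdict
from math import sqrt


def stratified_folds(groups, k=5, seed=0):
    """Assign each observation a fold so every group is spread evenly across folds."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    folds = [None] * len(groups)
    offset = 0
    for members in by_group.values():
        rng.shuffle(members)
        # Deal the group's members round-robin across the k folds.
        for pos, idx in enumerate(members):
            folds[idx] = (offset + pos) % k
        # Start the next group where this one stopped to keep fold sizes balanced.
        offset += len(members)
    return folds


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Toy example: 10 counties in one state and 5 in another.
states = ["AL"] * 10 + ["CA"] * 5
folds = stratified_folds(states, k=5)
print(folds)
```

With these toy inputs, every fold receives two of the ten "AL" counties and one of the five "CA" counties, so each fold mirrors the overall state mix.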

Although regularization is intended to reduce overfitting and variance, our linear regression model outperformed all three regularization techniques on the holdout data within intervals. This led us to explore a non-parametric approach, the GAM, that can still preserve linear relationships in the data.

We initially built a GAM on the same 11 predictors as the linear regression model. After performing feature selection, we also constructed a simplified GAM using only 7 of the most important features. We then tuned an XGBoost model and performed 5-fold stratified cross validation to compare the performances of the linear regression, full GAM, simplified GAM and XGBoost using RMSE.

The simple GAM (7 features) outperforms the full GAM (11 features) and linear regression within intervals. Between the two GAMs, the simplified model’s better performance is likely due to reduced model variance compared to the full model. However, the average performance of the simplified GAM and the XGBoost model is nearly identical. We therefore selected the simple GAM over XGBoost because it has fewer parameters, which makes it easier to explain. For practical use, it is also easier for users to interact with a predictive model by inputting 7 features rather than 11.

When plotting fitted values against actual YPLL values on a holdout set, we found that our simple GAM predicts well for the majority (90%) of the data, where YPLL < 12,500. However, once YPLL > 12,500, the model tends to underestimate these observations. This is a common phenomenon in regression, where the largest observations are underestimated.

Partial dependence plots (PDPs) allow us to isolate a single predictor’s effect on the average predicted response. Because GAM splines generally do not have a constant slope, it is useful to visualize how each predictor’s relationship with the response varies. PDP relationships can be causally interpreted for the GAM because the response is explicitly modeled as a function of our features (this does not mean there are necessarily causal relationships in the real world).

For instance, the smoking PDP shows a positive and relatively linear relationship between the percentage of smokers and the predicted YPLL. The slope varies across values of smoking but always remains positive. The relationship also becomes more warped at the tails, where data points are less frequent (we shouldn’t overinterpret regions with sparse data).

A limitation of interpreting PDPs is that features are assumed to be independent. In practice, this assumption is often violated. In our data, predictors relating to diet and exercise (e.g., Obesity, Physical Inactivity, and Limited Access to Healthy Foods) are correlated with one another, violating the independence assumption. Therefore, where independence is violated, we should not interpret PDP relationships as causal, even for the GAM itself.
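
To make the partial dependence calculation concrete, here is a minimal sketch. The prediction function and its coefficients are invented for illustration and are not our fitted GAM; the two rows reuse smoking and obesity values from the sample table.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in every row."""
    curve = []
    for value in grid:
        # Overwrite the feature in every row, leaving the other features as observed.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


# Hypothetical fitted model: additive in smoking and obesity (made-up coefficients).
def toy_predict(row):
    return 4000 + 200 * row["smoking"] + 50 * row["obesity"]


rows = [
    {"smoking": 24.0, "obesity": 35.6},
    {"smoking": 11.0, "obesity": 24.6},
]
curve = partial_dependence(toy_predict, rows, "smoking", [10, 20, 30])
print(curve)
```

Note that the loop pairs every grid value of smoking with every observed obesity value, including combinations that may never occur together in real counties. That is exactly where the independence assumption enters.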

Discussion

We’ve built a Shiny App that hosts the predictive modeling and some EDA capabilities. Users can interact with the app by inputting values for our features to return a predicted YPLL based on our simple GAM. The app also allows users to explore the geographical distribution of YPLL based on race/ethnicity across different regions in the US (e.g., South, West, Midwest, Northeast). Users can similarly explore the distribution of all our predictor variables across US regions.

Here is the link to the Shiny App.

Conclusions

Our simplified GAM outperforms our full GAM, linear regression, and regularized models in predicting premature deaths.
Our simplified GAM performs comparably to XGBoost but provides a more interpretable modeling approach.

Limitations

The years in which our response and predictor variables were collected vary. This complicates drawing direct inference about health behaviors’ relationship with premature deaths. For prediction purposes, we believe it is valid to use variables from different years because measurements within a county are correlated over time: health behaviors, such as the proportion of tobacco smokers, shouldn’t vary significantly within a couple of years in the same county.
Due to anonymity constraints, we used county-level health data rather than individual-level data. However, data aggregated at the county level significantly underestimate variability and do not give us insight into the demographic makeup behind our predictor variables. With individual-level data, we could provide more precise insights and model premature deaths across racial groups and other demographic features.

Future Work

Given the scope of this project, we focused on health behaviors’ relationship with premature deaths. However, health behaviors capture only some of the factors that influence premature deaths. In the future, building a more comprehensive model using other health factors (e.g., clinical care, socioeconomic factors) to predict premature deaths could create a more complete picture.
In cohort studies, health data are collected from individuals within a specific area or demographic. Using a cohort study, a predictive model could show the effects of health behaviors on premature mortality at the individual level, and we could account for more specific health measures, such as blood pressure and waist circumference.

References

Defining rural population: Guidance portal. Defining Rural Population | Guidance Portal. (2020). Retrieved July 25, 2022, from link.

Explore health rankings: County health rankings model. County Health Rankings & Roadmaps. (n.d.). Retrieved July 25, 2022, from link.

Where is the term “urban county” defined within the CDBG program. HUD Exchange. (2019). Retrieved July 25, 2022, from link.

¹ County Health Rankings & Roadmaps
² Guidance Portal (2020); HUD Exchange (2019)