Executive Summary
This study seeks to identify social factors associated with wealth
inequality that predict adverse events from substance use. The
study uses 2023 data from the University of Wisconsin Population
Health Institute’s County Health Rankings & Roadmaps. The four
predictor variables were unemployment, mental health providers,
primary care physicians, and food insecurity. The two response
variables were drug overdose deaths and alcohol-impaired driving deaths.
Preliminary exploratory data analysis was conducted to understand which
geographic regions had the highest concentration of the selected
variables as well as which variables had the strongest relationships
with the response variables. Given the seemingly linear relationships
explored in the exploratory data analysis, linear regression,
regularization techniques, and a random forest model were compared, and
the random forest model produced the most accurate predictions. From
there, a variable importance measure revealed that the most important
predictors for drug overdose deaths are poor mental health days,
insufficient sleep, and mental health providers, whereas the most
important non-demographic predictor for alcohol-impaired driving deaths
is unemployment. With this information, UnitedHealth Group (UHG) can
implement targeted solutions to help reduce fatalities from substance
use and expand into further studies in which demographic and
individualized data are used.
Introduction
Question: Are there demographic and social factors that
predict drug overdoses and alcohol-related incidents (e.g., driving
accidents)?
Drug overdoses and alcohol-impaired driving affect not only the
well-being of an individual but also that of the community. UnitedHealth
Group is a company that provides healthcare coverage along with
other health-oriented services. Because the company’s primary focus is
the well-being of its customers, understanding and addressing the
possible causes of an issue as widespread as substance use can help
reduce its effects and benefit UnitedHealth Group customers. After
preliminary exploratory data analysis, we hypothesized that unemployment,
food insecurity, primary care physicians, and mental health providers
could be indicators of drug overdoses and alcohol-impaired driving. We
used the County Health Rankings data to test these hypotheses and to
identify other indicators.
Motivation
As the COVID-19 pandemic placed massive pressure on the U.S. economy,
the wealth gap between the rich and the poor intensified. In the years
following, this economic inequality has continued to grow wider as “the
rich get richer and the poor get poorer”. The poverty rate in the U.S.
has also increased following a steady decline in the 2010s. At the peak
of the pandemic in 2020, the poverty rate rose by roughly a percentage
point, from 10.5% in 2019 to 11.4% (Khattar, 2022). It then increased
again to 11.6% in 2021 and to 14.4% in 2022 (Huq, 2022). At the same
time, retail alcohol sales rose significantly during 2020, an increase
of 20% over 2019 (Castaldelli-Maia,
2021). The U.S. population also dealt with the growing opioid epidemic
(U.S., 2023). With these trends in mind, the U.S. healthcare system is
challenged with treating and reducing substance abuse and overdose.
Because wealth inequality, poverty, and alcohol and drug sales have all
followed the same increasing pattern, it is important to understand
whether specific factors influence the use and abuse of
substances. We therefore chose a set of predictor variables that
we believe are affected by the wealth gap. With substantive
results, UHG can tailor its efforts to help reduce substance-related
incidents.
Data
The data come from the University of Wisconsin Population
Health Institute’s County Health Rankings & Roadmaps (“How healthy”,
2023).
The data cover counties in every state, plus the District of
Columbia. The measures are intended to describe the current
and future health status of each county’s population. They
include statistics on the direct medical standing of county
populations, referred to as health outcomes, and on the supporting
environmental factors that may contribute to them, referred to as health
factors. Demographic data are also included to provide
context for each county.
For this report we focused primarily on social factors, that is, health
factors that could be correlated with drug overdoses,
alcohol-impaired driving, and substance use in general. From these
measures we selected specific predictor variables
and response variables.
Predictor Variables
Four predictor variables were chosen as the center of focus for this
study. They are as listed below:
Unemployment: Percentage of population ages 16 and older
unemployed but seeking work
Mental Health Providers: Number of mental health care providers
per 100,000 of the population
Primary Care Physicians: Number of primary care physicians per
100,000 of the population
Food insecurity: Percentage of population who lack adequate
access to food
Response Variables
The following variables were chosen to be compared against the
predictor variables as they represent substance-related fatalities:
Alcohol-Impaired Driving Deaths: Percentage of driving deaths
with alcohol involvement
Drug overdose deaths: Number of drug poisoning deaths per 100,000
population
Exploratory Data Analysis & Data Summary
The following choropleth maps showcase regions of the continental
United States where the selected response and predictor variables are
concentrated.
Drug Overdose Deaths

Drug overdose deaths appear to have the highest concentration in
states such as West Virginia and Maryland, among others.
Alcohol-Impaired Driving Deaths

Alcohol-impaired driving deaths appear to have the highest
concentration in states such as Montana and North Dakota, among
others.
Unemployment

Unemployment appears to have the highest concentration in states such
as California and New Mexico, among others.
Mental Health Providers

Mental health providers appear to be concentrated in states such as
Massachusetts, among others.
Primary Care Physicians

Primary care physicians appear to be concentrated in states in the New
England region, among others.
Food Insecurity

Food insecurity appears to be concentrated in the Southern region, in
states such as Mississippi, among others.
Unemployment
Rising unemployment rates may contribute to higher substance use
rates because of the economic and mental effects of job loss. Job loss
is a major risk factor for addiction and abuse, which can lead to
increased drug- and alcohol-related deaths. We hypothesized that a
higher rate of unemployment would be associated with an increase in
substance abuse deaths.


Unemployment vs Drug Overdose Deaths
The ‘unemployment’ vs ‘drug overdose deaths’ scatter plot shows
a positive relationship between the two variables for the majority of
the data: as the unemployment percentage increases, the
drug overdose death rate tends to increase as well. The line and
confidence band in the plot were generated by a smoothing spline, which
is a flexible machine learning estimator. We use the lines and bands as
visual aids to highlight qualitative trends, such as positive or
negative relationships, in the scatter plots themselves. Outliers
in the data set weaken the fitted relationship and
make the trend harder to read from the graph.
Unemployment vs Alcohol Impaired Driving Deaths
The ‘unemployment’ vs ‘alcohol impaired driving deaths’ scatter plot
does not show a clear relationship between the two variables: as
the unemployment percentage increases, alcohol-impaired
driving deaths do not consistently increase.
Mental Health Providers
Mental health providers offer interventions and support that target
underlying factors that contribute to both drug overdose and
alcohol-impaired driving deaths. We hypothesized that more mental health
providers in a specific area would lead to fewer deaths due to substance
abuse. When reading the graphs, note that few counties have an adequate
number of mental health providers.


Mental Health Providers vs Drug Overdose Deaths
The ‘mental health providers’ vs ‘drug overdose deaths’ scatter plot
does not show a strong relationship between the two variables. Because
the majority of the data set has less than 1% mental health providers,
the true impact of these providers on drug overdose deaths cannot be
seen.
Mental Health Providers vs Alcohol Impaired Driving Deaths
The ‘mental health providers’ vs ‘alcohol impaired driving deaths’
scatter plot does not show a strong relationship between the two
variables.
Primary Care Physicians
Primary Care Physicians are important because they help to provide
essential information and treatment that can save the lives of patients
dealing with substance abuse. We hypothesized that more primary care
physicians in a certain area would lead to a decrease in drug overdose
and alcohol impaired driving deaths.


Primary Care Physicians vs Drug Overdose Deaths
The ‘primary care physicians’ vs ‘drug overdose deaths’ scatter plot
shows a weak negative relationship between the two variables: as
physician availability increases, drug overdose deaths appear to
decrease.
Primary Care Physicians vs Alcohol Impaired Driving Deaths
The ‘primary care physicians’ vs ‘alcohol impaired driving deaths’
scatter plot shows a weak, minimal relationship between the two
variables.
Food Insecurity
Food insecurity is an important variable to observe because it can
cause an individual to develop habits that lead to drug overdose and
alcohol impaired driving deaths. These habits can arise from the
consequences that come with having an inadequate amount of food to eat.
These consequences include stress and malnutrition which can cause an
individual to use extreme methods to cope. We hypothesized that the
greater the food insecurity in a certain area, the greater the number of
substance abuse deaths.


Food Insecurity vs Drug Overdose Deaths
The ‘food insecurity’ vs ‘drug overdose deaths’ scatter plot shows a
strong positive relationship between the two variables: as
the food insecurity percentage increases, drug overdose deaths
increase as well.
Food Insecurity vs Alcohol Impaired Driving Deaths
The ‘food insecurity’ vs ‘alcohol impaired driving deaths’ scatter
plot shows a weak, minimal relationship between the two variables based
on the data points on the graph.
Methods
Before we can find the best factors to predict drug overdose deaths
and alcohol impaired driving deaths, we need to determine which method
of prediction works best with the data. To accomplish this, we explored
linear regression, regularization techniques, and a random forest.
Linear Regression
Linear regression assumes a linear relationship between the predictor
variables and outcome variables and estimates this relationship by
minimizing the sum of squared errors (least squares). Linear regression
is a very common
model in statistics and machine learning. It has many benefits,
including ease of interpretability, but it is often too inflexible to
capture nuanced relationships between variables.
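In symbols, the least squares criterion can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear in the original analysis: $y_i$ is the outcome for county $i$, $x_i$ is its vector of $p$ predictors, and $\beta_0$, $\beta$ are the intercept and coefficients.

```latex
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
```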
Regularization
Regularization techniques can improve on linear regression by
reducing the impact of overfitting and collinearity. Both lasso and
ridge regression alter the linear regression objective by introducing a
penalty on the size of the coefficients. Lasso regression can shrink the
coefficients of less relevant variables to exactly zero, effectively
excluding them, whereas ridge regression shrinks all coefficients toward
zero without eliminating any, which stabilizes the estimates when
predictors are correlated. Elastic net regression combines the lasso and
ridge penalties.
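One common way to write the three penalized objectives is sketched below. The symbols are assumptions for illustration, not taken from the original analysis: $\lambda \ge 0$ sets the overall penalty strength and $\alpha \in [0, 1]$ sets the mix between the two penalties. Software packages differ in how they scale these terms.

```latex
\begin{aligned}
\text{Lasso:} \quad
  & \min_{\beta_0,\,\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
    + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Ridge:} \quad
  & \min_{\beta_0,\,\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
    + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Elastic net:} \quad
  & \min_{\beta_0,\,\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
    + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j|
    + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right)
\end{aligned}
```

Setting $\alpha = 1$ recovers the lasso and $\alpha = 0$ recovers ridge regression in this parameterization.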
Random Forest
The root of the random forest is the decision tree, which repeatedly
partitions the data into increasingly similar subgroups based on
conditions on the predictors until a stopping criterion is reached. A
random forest averages the predictions of many decision trees, each
grown on a random resample of the training data,
thereby reducing overfitting of the data. Due to their accuracy,
robustness, and ease of use, random forests are among the most popular
machine learning tools in use today.
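For a regression outcome such as a death rate, the averaging step can be written compactly. The notation is an assumption for illustration: $B$ is the number of trees and $T_b$ is the tree grown on the $b$-th resample of the training data.

```latex
\hat{f}_{\mathrm{RF}}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)
```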
Comparing the predictive models
The figure shows the prediction error for the five estimators
described above, which we can use to decide which model to focus on
subsequently. For this purpose, we calculated the out-of-sample root
mean squared error of each estimator using cross-validation with five
folds. The point and whiskers show the average root mean squared error
and 95% confidence interval for each estimator. The random
forest has the lowest average error of the five.

As the random forest model outperformed linear regression and
regularization by producing lower root mean squared error values, we
will use a random forest model to determine the best predictors for drug
overdose deaths and alcohol impaired driving deaths.
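The five-fold comparison described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code used in the study: the data are synthetic, and the two estimators (`fit_mean` and `fit_line`) are simple hypothetical stand-ins for the five models compared in the report.

```python
import math
import random


def fit_mean(xs, ys):
    """Baseline estimator: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """One-predictor least squares fit (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cv_rmse(fit, xs, ys, k=5, seed=0):
    """Return the out-of-sample RMSE of one estimator on each of k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint held-out sets
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - model(xs[i])) ** 2 for i in held_out]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return scores


# Synthetic stand-in data (NOT the County Health Rankings data).
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 1.5 * xi + rng.gauss(0, 1) for xi in x]

results = {
    name: cv_rmse(fit, x, y)
    for name, fit in [("mean_only", fit_mean), ("linear", fit_line)]
}
for name, scores in results.items():
    print(name, round(sum(scores) / len(scores), 3))
```

Each estimator is scored only on counties it did not see during fitting, which is what makes the errors comparable across models. The sketch reports the fold average only; the confidence intervals shown in the figure are not computed here.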
Data Cleaning
Because the many missing values in the data set would distort the
predictive model, we removed every variable with more than 50 missing
values. This left a data set of 2,635 counties with 63 variables
each.
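The column filter can be sketched as follows. This is an illustrative sketch only: the table layout (a dictionary of columns with `None` marking a missing value), the function name `drop_sparse_columns`, and the toy column names are all assumptions, not the study's actual code or data.

```python
def drop_sparse_columns(table, max_missing=50):
    """Keep only columns with at most `max_missing` missing (None) values."""
    return {
        name: values
        for name, values in table.items()
        if sum(v is None for v in values) <= max_missing
    }


# Hypothetical toy table with 100 rows; column names are illustrative only.
toy = {
    "unemployment": [4.0] * 100,                   # 0 missing values
    "food_insecurity": [None] * 10 + [12.0] * 90,  # 10 missing values
    "sparse_measure": [None] * 60 + [1.0] * 40,    # 60 missing values
}

cleaned = drop_sparse_columns(toy)
print(sorted(cleaned))
```

In this toy table only `sparse_measure` exceeds the threshold of 50 missing values, so it is the only column removed.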
Results
One output from the random forest is a variable importance ranking.
This ranking shows which variables matter most in making
predictions, such as predicting drug overdose deaths; each variable’s
importance is determined by how much the random forest’s
predictive ability deteriorates in the absence of the given
variable.
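One common way to approximate "the absence of a variable" is permutation importance: shuffle one column so it no longer carries information, then measure how much the prediction error grows. The report does not state which importance variant was used, so the sketch below is an assumption. The data are synthetic, and `model` is a hypothetical stand-in for an already fitted random forest.

```python
import math
import random


def rmse(model, rows, ys):
    """Root mean squared error of `model` on the given rows."""
    errs = [(y - model(r)) ** 2 for r, y in zip(rows, ys)]
    return math.sqrt(sum(errs) / len(errs))


def permutation_importance(model, rows, ys, n_features, seed=0):
    """Increase in RMSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = rmse(model, rows, ys)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the outcome
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(rmse(model, shuffled, ys) - baseline)
    return importances


# Synthetic stand-in data: the outcome depends on feature 0 only.
rng = random.Random(1)
rows = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(500)]
ys = [3.0 * r[0] + rng.gauss(0, 0.1) for r in rows]


def model(r):
    # Hypothetical already-fitted model; a random forest would go here.
    return 3.0 * r[0]


importance = permutation_importance(model, rows, ys, n_features=2)
print([round(v, 3) for v in importance])
```

Shuffling feature 0 raises the error sharply, while shuffling feature 1 changes nothing because the model never uses it. Ranking variables by this increase gives an ordering of the kind discussed in this section.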

The most important predictor of drug overdose deaths is poor mental
health days, and the number of mental health providers (ranked 3rd) is
the strongest predictor among our four focal variables. Given that
frequent mental distress (ranked 9th) is
also in the top 10 predictors, a focus on improving access to mental
health care could help reduce the risk for drug overdose deaths.

Concerningly, four of the top five predictors for alcohol-impaired
driving deaths are demographic factors, with American Indian or Alaska
Native by far the most important predictor. Of the four variables
we focused on, unemployment is by far the most important in
predicting alcohol-impaired driving deaths. A dual focus on decreasing
unemployment and increasing minority access to resources to combat
alcohol-related issues could decrease the incidence of these events.
Conclusion
The random forest outperformed both linear regression and the
regularization techniques. The most important predictors for drug
overdose deaths are poor mental health days, insufficient sleep, and
mental health providers. The most important predictors for
alcohol-impaired driving deaths are American Indian or Alaska Native,
unemployment, and Asian.
Limitations
The data set provided aggregate, county-level figures for each predictor
variable. It contained demographic breakdowns for only a few
variables, none of which were among our variables of interest. Therefore,
demographic breakdowns could not be explored as part of this study.
Additionally, the figures are influenced by state-level
effects, which may exaggerate the differences
between counties in different states. This can make comparisons
of counties across states less reliable than comparisons within a
state. Finally, for the unemployment variable, the
statistical model used to produce the data can vary from state to state.
Future Work
For this study, we primarily hypothesized possible socioeconomic
predictors of substance use, specifically focusing on factors that
to some degree reflect the wealth gap in the
country. Future studies could analyze other
socioeconomic factors, with an emphasis on race, marital
status, political ideology, and completion of higher education. This
study could also be adapted to look at demographic categories within
each predictor variable to develop more specific solutions to help the
most vulnerable communities.
References
Castaldelli-Maia, J. M., Segura, L. E., & Martins, S. S. (2021,
November). The concerning increasing trend of alcohol beverage sales
in the U.S. during the COVID-19 pandemic. Alcohol (Fayetteville,
N.Y.). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8421038/
How healthy is your county?: County Health Rankings. County
Health Rankings & Roadmaps. (2023). https://www.countyhealthrankings.org/
Huq, S. (2022, July 25). 3.4 million more children in poverty in
February 2022 than December 2021. Columbia University Center on
Poverty and Social Policy. https://www.povertycenter.columbia.edu/news-internal/monthly-poverty-february-2022
Khattar, R., Pathak, A., Schweitzer, J., Khan, A., & Chang, R.
(2022, December 15). Data on poverty in the United States.
Center for American Progress. https://www.americanprogress.org/data-view/poverty-data/?yearFilter=2021&national=2021
U.S. Department of Health and Human Services. (2023, July 10).
Drug overdose death rates. National Institutes of Health. https://nida.nih.gov/research-topics/trends-statistics/overdose-death-rates