Intro

1. COVID-19 Vaccination:

COVID-19 has been around for almost three years and has affected many people around the world.
Every person in this room has either been exposed to the virus or been in contact with a person who has the virus.
To alleviate these problems, vaccines were developed and administered worldwide as a means of providing protection against the virus.

How do vaccination rates affect COVID deaths, and why do some counties with high vaccination rates still have high COVID death rates?

2. Access to Healthcare:

The motivation and purpose for this research is to highlight the issues and external factors that have impacted people’s access to clinical care, which ultimately determines the status of people’s health in specific areas.
According to webmd.com, “Your mental health plays a huge role in your general well-being. Being in a good mental state can keep you healthy and help prevent serious health conditions.
A study found that positive psychological well-being can reduce the risks of heart attacks and strokes. On the other hand, poor mental health can lead to poor physical health or harmful behaviors. Depression has been linked to many chronic illnesses. These illnesses include diabetes, asthma, cancer, cardiovascular disease, and arthritis.
Mental health conditions can also make dealing with a chronic illness more difficult. The mortality rate from cancer and heart disease is higher among people with depression or other mental health conditions.” This suggests a strong link between mental and physical health: poor mental health can contribute to chronic illness.

Do higher unemployment rates or higher uninsured rates lead to a larger share of people having poor or fair health?
Does the number of mental health providers in an area affect how many poor mental health days its residents report?

3. Socioeconomic Factors:

The motivation and purpose for this research is to develop a health account that could potentially be used to evaluate the access and quality of direct medical and public health services in a county.
Inequities in access to and quality of health care contribute directly to disparities in health and to inequities across socioeconomic factors, also known as the social determinants of health.
The subgroups are determined by demographic factors. Their potential interaction with the medical and public services is tracked through modeling their socioeconomic factors against their death rates and the sum of survivors at the county level.
Measuring the health of the population under this assumption reduces to calculating death rates, measuring the health of individual survivors, and aggregating across individuals at the county level.
Thus, we will look at the rates of years of potential life lost and preventable hospital stays to examine the health of a population or subgroup within a county and the sum (or, equivalently, the average) of the health of survivors in the population or subgroup.

Do income inequality, unemployment and high school completion rates affect the number of premature deaths of certain racial groups at the county level?
Do income inequality, unemployment and high school completion rates affect the number of preventable hospital stays of certain racial groups at the county level?

Data

Dataset Descriptions:

The County Health Rankings dataset has 3,193 observations; each row represents a county in the United States with publicly available data.
It has 256 features, a mixture of continuous, discrete, and categorical variables.

For the COVID data we consulted two main web sources: data.cdc.org for vaccine data and covid19.census.org for COVID data.

CDC Data: This source provides vaccination rates for three groups: people who are fully vaccinated with two doses, people who are vaccinated with a single dose, and people who are fully vaccinated with two doses and a booster shot. The rates are reported at the county level. The raw data had about 1,824,893 rows, which we aggregated by state and county, bringing that number to about 3,123 observations.
Covid Census Data: This source contains COVID-19 deaths per county, overall and for each ethnic group (Black, White, Hispanic, Asian, etc.), across the United States. It also contains information from the 2019 census showing how many people lived in each county during that year. There were a total of 1,128 observations.
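The aggregation of the CDC rows by state and county described above can be sketched as follows. This is a minimal Python illustration, not the code used to produce this report. The dated rows, the field layout, and the rule of keeping the most recent value per county are all assumptions, since the report does not state how the rows were combined.

```python
# Hypothetical dated records: (state, county, report date, % fully vaccinated).
rows = [
    ("AL", "Autauga County", "2022-01-01", 40.1),
    ("AL", "Autauga County", "2022-06-01", 44.7),
    ("AL", "Baldwin County", "2022-01-01", 48.0),
    ("AL", "Baldwin County", "2022-06-01", 51.6),
    ("AK", "Anchorage Municipality", "2022-06-01", 68.4),
]

def latest_per_county(records):
    """Collapse many dated rows into one value per (state, county).

    Assumption: the most recent report is the value kept for each county.
    """
    latest = {}
    for state, county, date, pct in records:
        key = (state, county)
        # ISO-formatted dates compare correctly as plain strings.
        if key not in latest or date > latest[key][0]:
            latest[key] = (date, pct)
    return {key: pct for key, (_, pct) in latest.items()}

county_rates = latest_per_county(rows)
print(len(rows), "rows ->", len(county_rates), "counties")
```

Grouping on the (state, county) pair rather than the county name alone matters because several states share county names.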

Variables:

COVID death rate: the number of COVID deaths in a county divided by the total number of deaths in that county.
Vaccination rate: the percentage of people in a county under each of three vaccination scenarios: vaccinated with one dose, fully vaccinated with two doses of the COVID-19 vaccine, and fully vaccinated with a booster shot.

EDA

Urban Rural Description [Covid-19]

This histogram shows how often each urban-rural description appears in our data.

##                         County State Total_Deaths Covid_Deaths
## 1       Anchorage Municipality    AK         6122          693
## 2 Fairbanks North Star Borough    AK         1448          175
## 3    Matanuska-Susitna Borough    AK         1781          221
## 4               Autauga County    AL         1309          171
## 5               Baldwin County    AL         5915          600
## 6               Calhoun County    AL         4147          589

##   State           County Fully_Vaccinated Only_One_dose Vacinated+Booseter
## 1    SC Abbeville County             40.5          45.6               46.7
## 2    LA    Acadia Parish             53.4          59.7               36.7
## 3    VA  Accomack County             74.2          83.6               46.1
## 4    ID       Ada County             66.5          72.3               49.8
## 5    IA     Adair County             49.7          53.5               60.0
## 6    KY     Adair County             39.6          44.6               43.6
##   Census2019
## 1      24527
## 2      62045
## 3      32316
## 4     481587
## 5       7152
## 6      19202

##                         County State Fully_Vaccinated Vacinated+Booseter
## 1       Anchorage Municipality    AK             68.4               46.4
## 2 Fairbanks North Star Borough    AK             60.8               38.4
## 3    Matanuska-Susitna Borough    AK             41.2               40.5
## 4               Autauga County    AL             44.7               35.0
## 5               Baldwin County    AL             51.6               36.8
## 6               Calhoun County    AL             47.5               37.1
##   Only_One_dose Covid_Deaths Total_Deaths Census2019
## 1          77.0          693         6122     288000
## 2          69.5          175         1448      96849
## 3          46.7          221         1781     108317
## 4          56.4          171         1309      55869
## 5          65.1          600         5915     223234
## 6          57.9          589         4147     113605
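The third table above looks like the result of joining the deaths table and the vaccination table. A minimal Python sketch of such a join follows; it is not the code used for this report. The dictionary layout and the choice of (State, County) as the join key are assumptions. The rows are copied from the tables above.

```python
# Rows copied from the tables above; keys are (State, County) pairs.
deaths = {
    ("AK", "Anchorage Municipality"): {"Total_Deaths": 6122, "Covid_Deaths": 693},
    ("AL", "Autauga County"): {"Total_Deaths": 1309, "Covid_Deaths": 171},
}
vaccination = {
    ("AK", "Anchorage Municipality"): {"Fully_Vaccinated": 68.4, "Census2019": 288000},
    ("AL", "Autauga County"): {"Fully_Vaccinated": 44.7, "Census2019": 55869},
    ("IA", "Adair County"): {"Fully_Vaccinated": 49.7, "Census2019": 7152},
    ("KY", "Adair County"): {"Fully_Vaccinated": 39.6, "Census2019": 19202},
}

def inner_join(left, right):
    """Keep only keys present in both tables and merge their columns."""
    return {key: {**left[key], **right[key]} for key in left if key in right}

merged = inner_join(deaths, vaccination)
for (state, county), row in merged.items():
    print(state, county, row)
```

Joining on the county name alone would wrongly combine the two Adair Counties (IA and KY), which is why the sketch keys on both state and county. An inner join also drops counties that appear in only one source.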

This plot shows three linear regression lines corresponding to three vaccination scenarios in the US.
Blue represents people vaccinated with only one dose, and green represents people who are fully vaccinated with a booster shot.
Red represents people who are fully vaccinated with two doses of the coronavirus vaccine.

##              County State Fully_Vaccinated Vacinated+Booseter Only_One_dose
## 71  Imperial County    CA             95.0               41.5          95.0
## 686   Queens County    NY             87.7               39.1          95.0
## 840  Montour County    PA             80.3               57.4          92.1
## 939  Cameron County    TX             80.4               39.9          95.0
## 947  El Paso County    TX             82.6               39.7          95.0
## 960  Hidalgo County    TX             76.9               36.0          95.0
## 974 Maverick County    TX             95.0               33.3          95.0
## 998     Webb County    TX             95.0               27.0          95.0
##     Covid_Deaths Total_Deaths Census2019 COVID-19 Deaths over Total Deaths
## 71           825         3430     181215                         0.2405248
## 686         8859        39468    2253858                         0.2244603
## 840          782         3585      18230                         0.2181311
## 939         2417        10117     423163                         0.2389048
## 947         4296        19985     839238                         0.2149612
## 960         3838        16310     868707                         0.2353158
## 974          250         1131      58722                         0.2210433
## 998         1186         5010     276652                         0.2367265
##     COVID-19 Deaths over Population COVID-19 Deaths per Capita
## 71                     1.327290e-06                   240.5248
## 686                    9.958938e-08                   224.4603
## 840                    1.196550e-05                   218.1311
## 939                    5.645692e-07                   238.9048
## 947                    2.561386e-07                   214.9612
## 960                    2.708805e-07                   235.3158
## 974                    3.764234e-06                   221.0433
## 998                    8.556835e-07                   236.7265
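The derived columns in the table above can be checked by hand. The Python sketch below (not the code used for this report) recomputes them for Imperial County, CA from the printed values. It suggests, tentatively, that the column labelled "COVID-19 Deaths over Population" equals the death share divided by population rather than deaths divided by population, and that "COVID-19 Deaths per Capita" equals the death share multiplied by 1,000. The deaths-per-100,000 figure at the end is an added comparison and does not appear in the report.

```python
# Values for Imperial County, CA, copied from the table above.
covid_deaths = 825
total_deaths = 3430
population = 181215  # Census2019 column

# Share of all deaths attributed to COVID-19
# ("COVID-19 Deaths over Total Deaths").
death_share = covid_deaths / total_deaths

# These two expressions appear to reproduce the other printed columns.
share_over_population = death_share / population
share_times_1000 = death_share * 1000

# Added for comparison (not in the report): deaths per 100,000 residents.
deaths_per_100k = covid_deaths / population * 100_000

print(f"{death_share:.7f}")
print(f"{share_over_population:.6e}")
print(f"{share_times_1000:.4f}")
print(f"{deaths_per_100k:.1f}")
```

If a population-based rate is what the analysis intends, dividing deaths directly by population (as in the last line) may be the more conventional choice.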

Do higher unemployment rates or higher uninsured rates lead to a larger share of people having poor or fair health?

Variables:

Unemployment raw value (Percentage of population ages 16 and older unemployed but seeking work)
Uninsured raw value (Percentage of population under age 65 without health insurance)
Poor or Fair Health raw value (Percentage of adults reporting fair or poor health, age-adjusted)

EDA:

Hypothesis - States with higher unemployment rates or higher uninsured rates will have larger percentages of people with poor or fair health.

Unemployment Rating Map for all 50 states

Uninsured Ratings Map for all 50 states

Poor or Fair Health Ratings Map for all 50 states

Results:

Higher uninsured rates are associated with higher poor or fair health rates, but higher unemployment rates are not.

Question - Does the number of mental health providers in an area affect the number of poor mental health days in its population?

Variables:

Mental health providers raw value (Ratio of population to mental health providers)
Poor mental health days raw value (Average number of mentally unhealthy days reported in past 30 days, age-adjusted)

Hypothesis - Fewer mental health providers in an area will likely lead to residents having more poor mental health days.

Scatter plot displaying Mental Health Providers vs. Poor Mental Health Days

Results

Yes, this suggests that areas with fewer mental health providers tend to have more poor mental health days.

Question - Do income inequality, unemployment and high school completion rates affect the number of premature deaths of certain racial groups at the county level?

Hypothesis - Better income inequality ratios, unemployment rates and high school completion rates will be associated with a lower Years of Potential Life Lost rate, but confounding variables will not allow us to definitively conclude that these socioeconomic variables lead to decreased Years of Potential Life Lost rates per racial group.

Variables:

income: Ratio of household income at the 80th percentile to income at the 20th percentile.

unemployment: Percentage of population ages 16 and older unemployed but seeking work.

hs_completion: Percentage of adults ages 25 and over with a high school diploma or equivalent.

premature_deaths: Years of potential life lost before age 75 per 100,000 population (age-adjusted). This is a count of the number of years instead of the number of deaths.

mental_health: Ratio of population to mental health providers.

primary_care: Ratio of population to primary care physicians.

EDA:

The number of years of potential life lost (premature death years) across American counties is highest among the Native American population, followed by the Black population, and lowest among the Asian/Pacific Islander population.

Visualization of Pivot_Longer for Years of Potential Life Lost

This histogram shows the disproportionate distribution of premature death years across the five racial groups we are examining in our EDA.
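The "pivot longer" reshaping behind this visualization turns one column per racial group into one row per county and group. The Python sketch below illustrates the idea and is not the code used for this report. The column names follow the `premature_*` names that appear in the output later in this report; the values are made up, and dropping missing values is an assumption.

```python
# Hypothetical wide rows: one column per racial group, None where missing.
wide = [
    {"fips": "01001", "premature_Black": 11000.0, "premature_White": 8500.0,
     "premature_AIAN": None},
    {"fips": "01003", "premature_Black": 9800.0, "premature_White": 7900.0,
     "premature_AIAN": 12000.0},
]

def pivot_longer(rows, prefix="premature_", names_to="race",
                 values_to="premature_deaths"):
    """Turn each prefixed column into its own row, skipping missing values."""
    long_rows = []
    for row in rows:
        # Columns without the prefix identify the county and are repeated.
        id_cols = {k: v for k, v in row.items() if not k.startswith(prefix)}
        for col, value in row.items():
            if col.startswith(prefix) and value is not None:
                long_rows.append({**id_cols,
                                  names_to: col[len(prefix):],
                                  values_to: value})
    return long_rows

tidy = pivot_longer(wide)
for r in tidy:
    print(r)
```

In the long form, each county contributes one row per racial group with a reported value, so race can be used as a single grouping variable in a plot or model.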

When looking at the income inequality variable, we see that an increase in income inequality is associated with an increase in years of life lost (premature death years) for all racial groups except the Asian/Pacific Islander population.

When looking at unemployment, an increase in the percentage of the population ages 16 and older that is unemployed seems to be associated with a steady increase in years of potential life lost (premature death years).

When looking at High School completion rates, an increase in the percentage of adults 25 and older in a county with high school diplomas seems to be associated with a decrease in years of potential life lost (premature death years).

County Plot Premature Deaths

## [1] "Name"                      "Premature death raw value"
## [3] "5-digit FIPS Code"         "fips"                     
## [5] "premature_deaths"

Second Research Question for Race:

Do income inequality, unemployment and high school completion rates affect the number of preventable hospital stays of certain racial groups at the county level?

Hypothesis:

Better income inequality ratios, unemployment rates and high school completion rates will be associated with a lower number of Preventable Hospital Stays, but confounding variables will not allow us to definitively conclude that these socioeconomic variables lead to decreased numbers of Preventable Hospital Stays per racial group.

Ranking the Selected Variables by State

## [1] "hs_completion" "income"        "unemployment"  "primary_care" 
## [5] "mental_health"

Visualization of Pivot_Longer for Preventable Stays

##  [1] "county"               "fips"                 "preventable_AIAN"    
##  [4] "hs_completion"        "income"               "unemployment"        
##  [7] "state"                "primary_care"         "mental_health"       
## [10] "preventable_Asian"    "preventable_Black"    "preventable_Hispanic"
## [13] "preventable_White"    "uninsured"            "premature"

## # A tibble: 3,142 × 18
##    fips  abbr  county.x     pop_2…¹ count…² preve…³ hs_co…⁴ income unemp…⁵ state
##    <chr> <chr> <chr>          <dbl> <chr>     <dbl>   <dbl>  <dbl>   <dbl> <chr>
##  1 01001 AL    Autauga Cou…   55347 Autaug…      NA   0.885   5.09  0.0273 AL   
##  2 01003 AL    Baldwin Cou…  203709 Baldwi…    3128   0.908   4.39  0.0273 AL   
##  3 01005 AL    Barbour Cou…   26489 Barbou…      NA   0.732   5.98  0.0380 AL   
##  4 01007 AL    Bibb County    22583 Bibb C…      NA   0.791   5.00  0.0306 AL   
##  5 01009 AL    Blount Coun…   57673 Blount…      NA   0.805   4.43  0.0267 AL   
##  6 01011 AL    Bullock Cou…   10696 Bulloc…      NA   0.747   5.63  0.0363 AL   
##  7 01013 AL    Butler Coun…   20154 Butler…      NA   0.850   5.01  0.0365 AL   
##  8 01015 AL    Calhoun Cou…  115620 Calhou…      NA   0.844   4.84  0.0354 AL   
##  9 01017 AL    Chambers Co…   34123 Chambe…      NA   0.816   4.80  0.0293 AL   
## 10 01019 AL    Cherokee Co…   25859 Cherok…      NA   0.816   4.83  0.0291 AL   
## # … with 3,132 more rows, 8 more variables: primary_care <dbl>,
## #   mental_health <dbl>, preventable_Asian <dbl>, preventable_Black <dbl>,
## #   preventable_Hispanic <dbl>, preventable_White <dbl>, uninsured <dbl>,
## #   premature <dbl>, and abbreviated variable names ¹pop_2015, ²county.y,
## #   ³preventable_AIAN, ⁴hs_completion, ⁵unemployment
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

County Plot for Preventable Hospital Stays

## [1] "Name"                                
## [2] "Preventable hospital stays raw value"
## [3] "5-digit FIPS Code"                   
## [4] "fips"                                
## [5] "preventable_stays"

Methods

Random Forest Modeling

A random forest model was initially chosen because its RMSE was lower than that of the KNN model and the multivariate linear regression.
For the larger model, the predictions for our outcome variables (premature deaths and preventable hospital stays) were plotted against the observed values (on the y-axis) to see how well the model predicted each outcome given our explanatory variables.
We used 50 trees for the random forest and the following features: Race, Income Inequality, Unemployment Rate, High School Completion Rate, Ratio of Primary Care Physicians, Ratio of Mental Health Providers, Number of Uninsured Adults, and either the preventable hospital stays or the premature deaths raw value, depending on the response variable (Years of Potential Life Lost per 100,000 population or Preventable Hospital Stays per 100,000 population).
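The model choice described above rests on comparing RMSE across candidate models. A minimal Python sketch of that comparison follows; it is not the code used for this report, and the observed values, predictions, and model labels are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out values and predictions from three candidate models.
observed = [8.0, 9.0, 10.0, 11.0]
predictions = {
    "random_forest": [8.5, 9.0, 9.5, 11.0],
    "knn": [7.0, 9.0, 11.0, 11.0],
    "linear_regression": [9.0, 10.0, 9.0, 10.0],
}

scores = {name: rmse(observed, pred) for name, pred in predictions.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest RMSE:", best)
```

RMSE is in the same units as the response, so the comparison is only meaningful when every model is scored on the same data and the same scale.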

Random Forest Modeling by Race

Although we do not have race-specific data for the input (explanatory) variables in our model, we fit separate random forest models on the aforementioned features for each race (Native American, Asian/Pacific Islander, Black, Hispanic, and White).

Here the motivation was to see:
- If certain explanatory variables were more important to certain races versus other races.
- How well the model predicted certain counties with certain confounding factors (population size, presence of a Native American reservation, or other systematic factors).

Random Forest Modeling + A Population Variable

The motivation for joining a table with county population data to the county health rankings was to see if the random forest model would perform better with this variable.

Logistic Regression

The goal of the logistic regression analysis is to see whether we can predict, based on past COVID data, the probability of a county having high COVID rates in the future.
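As a rough illustration of how a logistic regression turns predictors into a probability and then into a 0/1 label, consider the Python sketch below. It is not the model fitted for this report. The intercept, the single coefficient, its sign, the choice of predictor, and the 0.5 threshold are all hypothetical.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))

def classify(probability, threshold=0.5):
    """Label a county 1 (high COVID rate) or 0 (not high)."""
    return 1 if probability >= threshold else 0

# Hypothetical fitted values: an intercept and one coefficient for the
# percentage of residents who are fully vaccinated.
intercept = 2.0
coefficients = [-0.05]

for pct_vaccinated in (30.0, 50.0, 70.0):
    p = predict_probability(intercept, coefficients, [pct_vaccinated])
    print(pct_vaccinated, round(p, 3), classify(p))
```

The threshold is a modeling choice: lowering it flags more counties as high risk at the cost of more false alarms.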

Results

Random Forest with Years of Potential Life Lost as the response variable

Random Forest with A Population Variable

##  [1] "county"             "fips"               "premature_AIAN"    
##  [4] "hs_completion"      "income"             "unemployment"      
##  [7] "state"              "primary_care"       "mental_health"     
## [10] "premature_Asian"    "premature_Black"    "premature_Hispanic"
## [13] "premature_White"    "uninsured"          "preventable"

## # A tibble: 3,142 × 18
##    fips  abbr  county.x     pop_2…¹ count…² prema…³ hs_co…⁴ income unemp…⁵ state
##    <chr> <chr> <chr>          <dbl> <chr>     <dbl>   <dbl>  <dbl>   <dbl> <chr>
##  1 01001 AL    Autauga Cou…   55347 Autaug…      NA   0.885   5.09  0.0273 AL   
##  2 01003 AL    Baldwin Cou…  203709 Baldwi…      NA   0.908   4.39  0.0273 AL   
##  3 01005 AL    Barbour Cou…   26489 Barbou…      NA   0.732   5.98  0.0380 AL   
##  4 01007 AL    Bibb County    22583 Bibb C…      NA   0.791   5.00  0.0306 AL   
##  5 01009 AL    Blount Coun…   57673 Blount…      NA   0.805   4.43  0.0267 AL   
##  6 01011 AL    Bullock Cou…   10696 Bulloc…      NA   0.747   5.63  0.0363 AL   
##  7 01013 AL    Butler Coun…   20154 Butler…      NA   0.850   5.01  0.0365 AL   
##  8 01015 AL    Calhoun Cou…  115620 Calhou…      NA   0.844   4.84  0.0354 AL   
##  9 01017 AL    Chambers Co…   34123 Chambe…      NA   0.816   4.80  0.0293 AL   
## 10 01019 AL    Cherokee Co…   25859 Cherok…      NA   0.816   4.83  0.0291 AL   
## # … with 3,132 more rows, 8 more variables: primary_care <dbl>,
## #   mental_health <dbl>, premature_Asian <dbl>, premature_Black <dbl>,
## #   premature_Hispanic <dbl>, premature_White <dbl>, uninsured <dbl>,
## #   preventable <dbl>, and abbreviated variable names ¹pop_2015, ²county.y,
## #   ³premature_AIAN, ⁴hs_completion, ⁵unemployment
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

## # A tibble: 5,639 × 15
##    fips  abbr  county.x     pop_2…¹ count…² hs_co…³ income unemp…⁴ state prima…⁵
##    <chr> <chr> <chr>          <dbl> <chr>     <dbl>  <dbl>   <dbl> <chr>   <dbl>
##  1 01001 AL    Autauga Cou…    10.9 Autaug…   0.885   5.09  0.0273 AL    4.68e-4
##  2 01001 AL    Autauga Cou…    10.9 Autaug…   0.885   5.09  0.0273 AL    4.68e-4
##  3 01003 AL    Baldwin Cou…    12.2 Baldwi…   0.908   4.39  0.0273 AL    7.02e-4
##  4 01003 AL    Baldwin Cou…    12.2 Baldwi…   0.908   4.39  0.0273 AL    7.02e-4
##  5 01003 AL    Baldwin Cou…    12.2 Baldwi…   0.908   4.39  0.0273 AL    7.02e-4
##  6 01005 AL    Barbour Cou…    10.2 Barbou…   0.732   5.98  0.0380 AL    3.22e-4
##  7 01005 AL    Barbour Cou…    10.2 Barbou…   0.732   5.98  0.0380 AL    3.22e-4
##  8 01007 AL    Bibb County     10.0 Bibb C…   0.791   5.00  0.0306 AL    5.36e-4
##  9 01007 AL    Bibb County     10.0 Bibb C…   0.791   5.00  0.0306 AL    5.36e-4
## 10 01009 AL    Blount Coun…    11.0 Blount…   0.805   4.43  0.0267 AL    2.07e-4
## # … with 5,629 more rows, 5 more variables: mental_health <dbl>,
## #   uninsured <dbl>, preventable <dbl>, race <chr>, premature_deaths <dbl>, and
## #   abbreviated variable names ¹pop_2015, ²county.y, ³hs_completion,
## #   ⁴unemployment, ⁵primary_care
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

## Ranger result
## 
## Call:
##  ranger(premature_deaths ~ ., data = new_rf_df, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      5639 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.07069812 
## R squared (OOB):                  0.6813557
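The two summary numbers in the output above are linked. Assuming the usual definition of R squared as 1 - MSE / Var(y) (an assumption about how the out-of-bag value is computed), the Python sketch below recovers the implied variance of the response and shows why R squared can be negative. It is an illustration, not the code used for this report.

```python
import math

def r_squared(mse, variance):
    """R^2 as the share of variance explained: 1 - MSE / Var(y)."""
    return 1.0 - mse / variance

def implied_variance(mse, r2):
    """Invert the definition above to recover Var(y) from MSE and R^2."""
    return mse / (1.0 - r2)

# Values copied from the ranger output above.
oob_mse = 0.07069812
oob_r2 = 0.6813557

variance = implied_variance(oob_mse, oob_r2)
print("RMSE:", round(math.sqrt(oob_mse), 4))
print("implied Var(y):", round(variance, 4))

# A model whose MSE exceeds Var(y) gets a negative R^2, meaning it predicts
# worse than always guessing the mean of the response.
print(r_squared(mse=1.2 * variance, variance=variance))
```

Under this reading, the negative R squared values reported for some of the preventable hospital stays models later in this report indicate models that do no better than predicting the mean. The scale of the MSE depends on how the response was transformed before modeling, which the output does not state.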

Random Forest per Race

## Ranger result
## 
## Call:
##  ranger(premature_deaths ~ ., data = pt_NA, num.trees = 50, importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      331 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.1424433 
## R squared (OOB):                  0.4745165

## [1] "Maverick County"  "Blaine County"    "Roosevelt County" "Benewah County"  
## [5] "Cass County"      "Woodbury County"  "Neshoba County"

## [1] "TX" "OK" "MT" "ID" "MN" "IA" "MS"

## Ranger result
## 
## Call:
##  ranger(premature_deaths ~ ., data = pt_Asian, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      371 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.09979983 
## R squared (OOB):                  0.3048306

## [1] "Garfield County"           "Washington County"        
## [3] "Outagamie County"          "Matanuska-Susitna Borough"
## [5] "Johnson County"            "Hawaii County"            
## [7] "Washington County"

## [1] "OK" "AR" "WI" "AK" "TX" "HI" "UT"

## Ranger result
## 
## Call:
##  ranger(premature_deaths ~ ., data = pt_Black, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      1319 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.0672776 
## R squared (OOB):                  0.3322235

## [1] "Chaves County"    "Logan County"     "Miller County"    "Jefferson County"
## [5] "Saline County"    "Jackson County"   "Lafayette County"

## [1] "NM" "WV" "GA" "OH" "IL" "TX" "MO"

## Ranger result
## 
## Call:
##  ranger(premature_deaths ~ ., data = pt_Hispanic, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      860 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.06681635 
## R squared (OOB):                  0.3476807

## [1] "Las Animas County" "Rio Arriba County" "Carbon County"    
## [4] "Anderson County"   "Quay County"       "Otero County"     
## [7] "Conejos County"

## [1] "CO" "NM" "UT" "TX" "NM" "CO" "CO"

## Ranger result
## 
## Call:
##  ranger(premature_deaths ~ ., data = pt_White, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      2758 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.0422079 
## R squared (OOB):                  0.5256403

## [1] "Petersburg city"  "Alpine County"    "Covington city"   "Dolores County"  
## [5] "McCormick County" "Union County"     "Monroe County"

## [1] "VA" "CA" "VA" "CO" "SC" "FL" "AR"

Results

We found that our model cannot predict well for certain outlier counties, likely because of random effects from variables we cannot measure, such as the number of chance deaths in smaller counties.
Including population in our random forest improved the predictive performance of the model, but only very slightly.
Per race, the explanatory power of the current socioeconomic and access/quality of clinical care variables varies, which may be because confounding variables also help explain which counties have high years of potential life lost. It may be the case that there are systematic differences on a county-by-county basis across the different racial groups that we cannot measure or do not have access to.

Random Forest Modeling with Population variable and Preventable Stays

## Ranger result
## 
## Call:
##  ranger(preventable_stays ~ ., data = new_rf_df2, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      6070 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.2463433 
## R squared (OOB):                  0.2326381

## Ranger result
## 
## Call:
##  ranger(preventable_stays ~ ., data = longer_NA, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      394 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.5744974 
## R squared (OOB):                  -0.03572857

## Ranger result
## 
## Call:
##  ranger(preventable_stays ~ ., data = longer_Asian, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      415 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.3241022 
## R squared (OOB):                  0.1340968

## Ranger result
## 
## Call:
##  ranger(preventable_stays ~ ., data = longer_Black, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      1438 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.2828635 
## R squared (OOB):                  -0.004559765

## Ranger result
## 
## Call:
##  ranger(preventable_stays ~ ., data = longer_Hispanic, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      1055 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.3937324 
## R squared (OOB):                  0.05917461

## Ranger result
## 
## Call:
##  ranger(preventable_stays ~ ., data = longer_White, num.trees = 50,      importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      2768 
## Number of independent variables:  11 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.1114373 
## R squared (OOB):                  0.2995497

## Results

We found that our model cannot predict well for certain outlier counties because of random effects from variables we cannot measure, such as the number of random accidents in smaller counties.
Including population in our random forest improved the model's predictive performance, but only very slightly.
The explanatory power of the current socioeconomic and access/quality of clinical care variables varies by race (for example, the OOB R squared is about 0.30 for the White subset but about 0 for the Black subset), likely because of confounding variables that help explain county-level preventable hospital stays.
Compared to Years of Potential Life Lost, preventable hospital stays are lower across counties in America regardless of race. However, our model predicts preventable hospital stays less well than it predicts premature deaths (Years of Potential Life Lost).

It may be the case that there are systematic differences on a county-by-county basis across the different racial groups that we just cannot measure or that we don’t have access to.
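
One way to read the "R squared (OOB)" values in the per-race output above is sketched below. It assumes the usual out-of-bag definition for regression forests (one minus the OOB mean squared error divided by the sample variance of the response); that definition is our assumption about the reported statistic and is not stated in the output itself.

```latex
R^2_{\mathrm{OOB}} \;=\; 1 - \frac{\mathrm{MSE}_{\mathrm{OOB}}}{\widehat{\mathrm{Var}}(y)}
\qquad\Longrightarrow\qquad
\widehat{\mathrm{Var}}(y) \;=\; \frac{\mathrm{MSE}_{\mathrm{OOB}}}{1 - R^2_{\mathrm{OOB}}}

% White subset:  0.1114373 / (1 - 0.2995497)      \approx 0.1591
% Black subset:  0.2828635 / (1 - (-0.004559765)) \approx 0.2816
```

Under this reading, the negative value for the Black subset means the OOB error (about 0.2829) is slightly larger than the implied variance of the response (about 0.2816), so the forest does no better there than predicting the mean. For the White subset, the forest accounts for roughly 30% of the variance.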

Logistic Regression

##    [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##   [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 1
##   [75] 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0
##  [112] 0 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
##  [149] 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [223] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
##  [260] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
##  [297] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
##  [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [371] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1
##  [408] 0 1 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
##  [445] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 0 0
##  [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
##  [556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
##  [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 1 0
##  [630] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 1
##  [667] 1 0 0 1 1 0 0 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 1 1 0 1 0 0 0
##  [704] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [741] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [778] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1
##  [815] 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0
##  [852] 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [889] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [926] 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
##  [963] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0
## [1000] 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1
## [1037] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0
## [1074] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1111] 0 0 0 0 0 0 0 0 0 0 0 0 0

## 
## Call:
## glm(formula = high_indicator ~ `COVID-19 Deaths over Population`, 
##     family = "binomial", data = comb_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0025  -0.6780  -0.3785  -0.0892   5.7338  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -0.3861     0.1322  -2.921  0.00349 ** 
## `COVID-19 Deaths over Population`  -2.2480     0.2573  -8.737  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 971.83  on 1122  degrees of freedom
## Residual deviance: 813.32  on 1121  degrees of freedom
## AIC: 817.32
## 
## Number of Fisher Scoring iterations: 7

## 
## Call:
## glm(formula = high_indicator ~ Covid_Deaths, family = "binomial", 
##     data = comb_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7497  -0.4645  -0.4373  -0.4266   2.2133  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.399e+00  1.159e-01 -20.688  < 2e-16 ***
## Covid_Deaths  3.540e-04  5.635e-05   6.282 3.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 809.01  on 1122  degrees of freedom
## Residual deviance: 759.37  on 1121  degrees of freedom
## AIC: 763.37
## 
## Number of Fisher Scoring iterations: 5

This logit regression plot is for the COVID-19 deaths / total population data, with vaccination rates converted to a binary indicator: counties with high vaccination rates (above 70%) are given a value of 1, and counties with low vaccination rates (below 70%) are given a value of 0.
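
The binary conversion and logistic fit described above can be sketched as follows, using synthetic data rather than the real county data. The column `vax_pct` and all generated values are hypothetical. The names `comb_data`, `Covid_Deaths`, and `high_indicator` are reused from the output above. Treating a rate of exactly 70% as "low" is an assumption, since the text does not say how ties are handled.

```r
# Synthetic stand-in for comb_data; every value here is made up for illustration.
set.seed(1)
n <- 200
comb_data <- data.frame(
  vax_pct      = runif(n, 30, 95),        # hypothetical: percent fully vaccinated
  Covid_Deaths = rpois(n, lambda = 150)   # synthetic death counts
)

# Counties above the 70% threshold get 1, all others get 0
# (assumption: exactly 70% counts as low).
comb_data$high_indicator <- as.integer(comb_data$vax_pct > 70)

# Same model form as the glm() call shown in the output above.
fit <- glm(high_indicator ~ Covid_Deaths, family = "binomial", data = comb_data)

# Odds ratio for one additional death, and fitted probability for each county.
odds_ratio <- exp(coef(fit)["Covid_Deaths"])
pred_prob  <- predict(fit, type = "response")
```

Because the synthetic predictor is generated independently of the indicator, the fitted odds ratio here carries no meaning. The sketch only shows the mechanics of the threshold and the model call.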

## Calibration Plot for Logistic Regression

A calibration plot was made to show how well the logistic regression is performing.

Calibration plot of the predicted data for model assessment. The goal is to eventually fine-tune the data (i.e., take into account the effect of outliers on the skewness).
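
A calibration plot of this kind can be built by binning the predicted probabilities and comparing each bin's mean prediction with the observed proportion of 1s. The sketch below uses synthetic data. The helper `calibration_table()`, the variables `x` and `y`, and the choice of ten quantile bins are all hypothetical and are not taken from the original analysis.

```r
# Synthetic outcome and predictor; values are made up for illustration.
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))

fit  <- glm(y ~ x, family = "binomial")
pred <- predict(fit, type = "response")

# Hypothetical helper: group observations into quantile bins of predicted
# probability and summarise predicted vs. observed rates in each bin.
calibration_table <- function(pred, obs, bins = 10) {
  breaks <- unique(quantile(pred, probs = seq(0, 1, length.out = bins + 1)))
  bin <- cut(pred, breaks = breaks, include.lowest = TRUE)
  data.frame(
    mean_predicted = as.numeric(tapply(pred, bin, mean)),
    observed_rate  = as.numeric(tapply(obs, bin, mean)),
    n              = as.numeric(table(bin))
  )
}

calib <- calibration_table(pred, y)

# A well-calibrated model has points close to the dashed 45-degree line.
plot(calib$mean_predicted, calib$observed_rate,
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)
```

Points far from the diagonal indicate bins where the model over- or under-predicts, which is one place the effect of outlier counties would show up.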

## Prediction Model

##  [1] "County"                            "State"                            
##  [3] "Fully_Vaccinated"                  "Vacinated+Booseter"               
##  [5] "Only_One_dose"                     "Covid_Deaths"                     
##  [7] "Total_Deaths"                      "Census2019"                       
##  [9] "COVID-19 Deaths over Total Deaths" "COVID-19 Deaths over Population"  
## [11] "COVID-19 Deaths per Capita"        "high_indicator"

Further model assessment of our predictions shows a few outliers, which turned out to be a few counties in Texas.

Discussion

Conclusion:

Vaccination matters in preventing COVID-19 deaths.
Racial stratification increases socioeconomic disadvantage and other risk factors for poor health among minorities (Native American, Black, etc.) compared to Whites.
While some socioeconomic factors like income inequality can begin to capture the inequitable distribution of wealth, power, and opportunity that creates this lasting racial stratification, we simply do not have the measurable variables needed to better predict exactly what is causing high death rates or low survival rates across race and location.
Certainly, some systematic differences exist for certain marginalized groups on a county-by-county basis.

The Limitations for the County Health Ranking dataset include:

Limited demographic data for different racial groups
Too many raw value variables, so we had to convert some of the variables to percentages when doing our EDA

Future Work:

For the COVID-19 data, we would like to fit a Poisson regression to predict COVID-19 deaths and to identify American counties with high COVID-19 transmission rates.
For access to care, we would like to see policy recommendations researched and developed from the data analysis that was done for the access to care variables. This analysis can lead to practical applications, such as deciding where to place new health facilities (for example, mental health facilities) on a county-by-county basis to best serve under-resourced areas.
For socioeconomic factors, we would like to see what types of operations or clinic visits are most prevalent among different socially or economically marginalized groups in the country. From there, we can see where the healthcare system fails certain racial groups. We would also like to dive deeper into systematic issues that might be captured by variables we did not have at our disposal.
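
For the Poisson regression proposed above for the COVID-19 data, one possible setup is a log-link count model with county population as an offset, so that the coefficients describe death rates rather than raw counts. This is a sketch on synthetic data, not a fitted result. The data frame `counties`, the column `vax_pct`, and the simulated effect sizes are hypothetical. `Covid_Deaths` and `Census2019` reuse column names from the data listing above.

```r
# Synthetic counties; every value here is made up for illustration.
set.seed(7)
n <- 300
counties <- data.frame(
  Census2019 = round(runif(n, 5e3, 5e5)),  # synthetic population
  vax_pct    = runif(n, 30, 95)            # hypothetical: percent fully vaccinated
)

# Simulate deaths with a death rate that falls as vaccination rises.
true_rate <- exp(-6 - 0.01 * counties$vax_pct)
counties$Covid_Deaths <- rpois(n, lambda = counties$Census2019 * true_rate)

# Poisson regression with a population offset: models deaths per resident.
fit <- glm(Covid_Deaths ~ vax_pct + offset(log(Census2019)),
           family = poisson(link = "log"), data = counties)

# Multiplicative change in the death rate per one-point rise in vax_pct,
# and the expected number of deaths for each county.
rate_ratio      <- exp(coef(fit)["vax_pct"])
expected_deaths <- predict(fit, type = "response")
```

With real county data, overdispersion would be worth checking before relying on the Poisson standard errors. A quasi-Poisson or negative binomial model is a common fallback.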

The Effects of Social and Economic Factors on Death, Health and Survival
