How do vaccination rates affect COVID-19 deaths, and why do some counties with high vaccination rates still have high COVID-19 death rates?
The motivation and purpose for this research is to highlight the issues and external factors that have impacted people’s access to clinical care, which ultimately determines the status of people’s health in specific areas.
According to webmd.com, “Your mental health plays a huge role in your general well-being. Being in a good mental state can keep you healthy and help prevent serious health conditions.
A study found that positive psychological well-being can reduce the risks of heart attacks and strokes. On the other hand, poor mental health can lead to poor physical health or harmful behaviors. Depression has been linked to many chronic illnesses. These illnesses include diabetes, asthma, cancer, cardiovascular disease, and arthritis.
Mental health conditions can also make dealing with a chronic illness more difficult. The mortality rate from cancer and heart disease is higher among people with depression or other mental health conditions.” This shows that there is a strong link between mental and physical health: poor mental health can lead to chronic illness.
Do higher unemployment or uninsured rates lead to a larger share of people reporting poor or fair health?
Does the number of mental health providers in an area affect how many poor mental health days its residents report?
The motivation and purpose for this research is to develop a health account that could potentially be used to evaluate the access and quality of direct medical and public health services in a county.
Inequities in access to and quality of health care contribute directly to disparities in health and to inequities across socioeconomic factors, also known as the social determinants of health.
The subgroups are determined by demographic factors. Their potential interaction with the medical and public services is tracked through modeling their socioeconomic factors against their death rates and the sum of survivors at the county level.
Measuring the health of the population under this assumption reduces to calculating death rates, measuring the health of individual survivors, and aggregating across individuals at the county level.
Thus, we will look at the rates of years of potential life lost and preventable hospital stays to examine the health of a population or subgroup within a county and the sum (or, equivalently, the average) of the health of survivors in the population or subgroup.
Do income inequality, unemployment and high school completion rates affect the number of premature deaths of certain racial groups at the county level?
Do income inequality, unemployment and high school completion rates affect the number of preventable hospital stays of certain racial groups at the county level?
The County Health Rankings dataset has 3,193 observations, and each row represents a county in the United States with publicly available data.
The dataset consists of 256 features, a mixture of continuous, discrete and categorical variables.
For the COVID-19 data we consulted two main website sources: data.cdc.org for vaccine data and covid19.census.org for COVID-19 data.
CDC Data: This source provided vaccination rates for people who are fully vaccinated with two doses, people vaccinated with a single dose, and people fully vaccinated with two doses plus a booster shot, all of which we used. Each data point was reported at the county level. The raw data had about 1,824,893 rows, which were aggregated by state and county, bringing the total to about 3,123 observations.
COVID Census Data: This dataset contains COVID-19 deaths per county and for each ethnic group (Blacks, Whites, Hispanics, Asians, etc.) across the United States. It also contains information from the 2019 census showing how many people lived in each county during that year. There were a total of 1,128 observations.
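As a rough illustration of the aggregation and join described above, the following sketch collapses repeated CDC rows to one row per state and county and then merges them with census counts. The two earlier vaccination values (48.9 and 39.0) are hypothetical, and taking the maximum per county is an assumed aggregation rule, not necessarily the one used here. The Adair County populations come from the sample output in this report.

```r
# Hypothetical CDC rows: two reporting dates per county
cdc <- data.frame(
  State = c("IA", "IA", "KY", "KY"),
  County = rep("Adair County", 4),
  Fully_Vaccinated = c(48.9, 49.7, 39.0, 39.6)
)

# Assumed rule: keep the highest reported rate per State/County pair
cdc_county <- aggregate(Fully_Vaccinated ~ State + County, data = cdc, FUN = max)

# Census counts for the same two counties
census <- data.frame(
  State = c("IA", "KY"),
  County = c("Adair County", "Adair County"),
  Census2019 = c(7152, 19202)
)

# Join on BOTH keys: county names repeat across states
comb <- merge(cdc_county, census, by = c("State", "County"))
print(comb)
```

Joining on `County` alone would match each Adair County row to both states, which is why both keys are used.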
COVID death rate: the number of COVID-19 deaths in a particular county divided by the total deaths in that county.
Vaccination rate: the vaccination rates under three scenarios: people vaccinated with one dose, people vaccinated with two doses of the COVID-19 vaccine, and people who are fully vaccinated with a booster shot.
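A minimal sketch of the death rate definition, using the first three counties from the combined table in this report. The column names `covid_share_of_deaths` and `covid_deaths_per_100k` are illustrative, and the per-100,000 rate is an extra population-based measure shown for comparison, not the definition given above.

```r
counties <- data.frame(
  County = c("Anchorage Municipality",
             "Fairbanks North Star Borough",
             "Matanuska-Susitna Borough"),
  Covid_Deaths = c(693, 175, 221),
  Total_Deaths = c(6122, 1448, 1781),
  Census2019 = c(288000, 96849, 108317)
)

# COVID death rate as defined above: COVID-19 deaths over all deaths
counties$covid_share_of_deaths <- counties$Covid_Deaths / counties$Total_Deaths

# Assumed alternative: COVID-19 deaths per 100,000 residents (2019 census)
counties$covid_deaths_per_100k <- counties$Covid_Deaths / counties$Census2019 * 1e5

print(counties[, c("County", "covid_share_of_deaths", "covid_deaths_per_100k")])
```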
##                         County State Total_Deaths Covid_Deaths
## 1       Anchorage Municipality    AK         6122          693
## 2 Fairbanks North Star Borough    AK         1448          175
## 3    Matanuska-Susitna Borough    AK         1781          221
## 4               Autauga County    AL         1309          171
## 5               Baldwin County    AL         5915          600
## 6               Calhoun County    AL         4147          589
##   State           County Fully_Vaccinated Only_One_dose Vacinated+Booseter Census2019
## 1    SC Abbeville County             40.5          45.6               46.7      24527
## 2    LA    Acadia Parish             53.4          59.7               36.7      62045
## 3    VA  Accomack County             74.2          83.6               46.1      32316
## 4    ID       Ada County             66.5          72.3               49.8     481587
## 5    IA     Adair County             49.7          53.5               60.0       7152
## 6    KY     Adair County             39.6          44.6               43.6      19202
##                         County State Fully_Vaccinated Vacinated+Booseter
## 1       Anchorage Municipality    AK             68.4               46.4
## 2 Fairbanks North Star Borough    AK             60.8               38.4
## 3    Matanuska-Susitna Borough    AK             41.2               40.5
## 4               Autauga County    AL             44.7               35.0
## 5               Baldwin County    AL             51.6               36.8
## 6               Calhoun County    AL             47.5               37.1
##   Only_One_dose Covid_Deaths Total_Deaths Census2019
## 1          77.0          693         6122     288000
## 2          69.5          175         1448      96849
## 3          46.7          221         1781     108317
## 4          56.4          171         1309      55869
## 5          65.1          600         5915     223234
## 6          57.9          589         4147     113605
This plot shows three linear regression lines, one for each of three vaccination scenarios in the US.
Blue represents people vaccinated with only one dose, and green represents people who are fully vaccinated with a booster shot.
Red represents people who are fully vaccinated with two doses of the coronavirus vaccine.
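The three fitted lines can be approximated with one simple linear regression per scenario. The sketch below uses only the six sample counties printed in this report, so the slopes are illustrative and are not the values behind the plot. Using COVID-19 deaths over total deaths as the response is an assumption.

```r
counties <- data.frame(
  Fully_Vaccinated = c(68.4, 60.8, 41.2, 44.7, 51.6, 47.5),
  `Vacinated+Booseter` = c(46.4, 38.4, 40.5, 35.0, 36.8, 37.1),
  Only_One_dose = c(77.0, 69.5, 46.7, 56.4, 65.1, 57.9),
  Covid_Deaths = c(693, 175, 221, 171, 600, 589),
  Total_Deaths = c(6122, 1448, 1781, 1309, 5915, 4147),
  check.names = FALSE  # keep the "+" in the original column name
)

# Assumed response: share of all deaths attributed to COVID-19
counties$death_share <- counties$Covid_Deaths / counties$Total_Deaths

scenarios <- c("Only_One_dose", "Fully_Vaccinated", "Vacinated+Booseter")

# Fit one line per scenario and keep its slope
slopes <- sapply(scenarios, function(col) {
  fit <- lm(counties$death_share ~ counties[[col]])
  unname(coef(fit)[2])
})
print(slopes)
```

A negative slope would mean that, in this small sample, counties with higher vaccination in that scenario tend to have a lower COVID-19 share of deaths.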
##              County State Fully_Vaccinated Vacinated+Booseter Only_One_dose
## 71  Imperial County    CA             95.0               41.5          95.0
## 686   Queens County    NY             87.7               39.1          95.0
## 840  Montour County    PA             80.3               57.4          92.1
## 939  Cameron County    TX             80.4               39.9          95.0
## 947  El Paso County    TX             82.6               39.7          95.0
## 960  Hidalgo County    TX             76.9               36.0          95.0
## 974 Maverick County    TX             95.0               33.3          95.0
## 998     Webb County    TX             95.0               27.0          95.0
##     Covid_Deaths Total_Deaths Census2019 COVID-19 Deaths over Total Deaths
## 71           825         3430     181215                         0.2405248
## 686         8859        39468    2253858                         0.2244603
## 840          782         3585      18230                         0.2181311
## 939         2417        10117     423163                         0.2389048
## 947         4296        19985     839238                         0.2149612
## 960         3838        16310     868707                         0.2353158
## 974          250         1131      58722                         0.2210433
## 998         1186         5010     276652                         0.2367265
##     COVID-19 Deaths over Population COVID-19 Deaths per Capita
## 71                     1.327290e-06                   240.5248
## 686                    9.958938e-08                   224.4603
## 840                    1.196550e-05                   218.1311
## 939                    5.645692e-07                   238.9048
## 947                    2.561386e-07                   214.9612
## 960                    2.708805e-07                   235.3158
## 974                    3.764234e-06                   221.0433
## 998                    8.556835e-07                   236.7265
Unemployment raw value (Percentage of population ages 16 and older unemployed but seeking work)
Uninsured raw value (Percentage of population under age 65 without health insurance)
Poor or Fair Health raw value (Percentage of adults reporting fair or poor health, age-adjusted)
Hypothesis - States with higher unemployment rates or a higher share of uninsured people will have larger percentages of people in poor or fair health.
Higher uninsured rates are associated with higher rates of poor or fair health, but higher unemployment is not.
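One way to test a hypothesis of this shape is a linear regression with both predictors. The sketch below runs on synthetic data only: the variable names, ranges and effect sizes are assumptions, and the outcome is deliberately built to depend on the uninsured share alone so the output mimics the pattern reported above. It does not reproduce the County Health Rankings results.

```r
set.seed(7)
n <- 300

# Synthetic county-level predictors (proportions)
uninsured <- runif(n, 0.03, 0.30)
unemployment <- runif(n, 0.02, 0.12)

# Synthetic outcome: depends on uninsured only, plus noise
poor_fair_health <- 0.10 + 0.5 * uninsured + rnorm(n, sd = 0.02)

fit <- lm(poor_fair_health ~ uninsured + unemployment)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(fit)$coefficients
print(round(coefs, 4))
```

On real data, a small p-value for `uninsured` together with a large one for `unemployment` would be consistent with the finding above, though it would show association and not causation.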
# Poor mental health days in a particular population
Hypothesis - Fewer mental health providers in an area will likely lead to residents having more poor mental health days.
Yes, this shows that having fewer mental health providers is associated with people having more poor mental health days.
Hypothesis - Better income inequality ratios, unemployment rates and high school completion rates will be associated with a lower Years of Potential Life Lost rate, but confounding variables will not allow us to conclude definitively that these socioeconomic variables lead to decreased Years of Potential Life Lost rates per racial group.
income: Ratio of household income at the 80th percentile to income at the 20th percentile.
unemployment: Percentage of population ages 16 and older unemployed but seeking work.
hs_completion: Percentage of adults ages 25 and over with a high school diploma or equivalent.
premature_deaths: Years of potential life lost before age 75 per 100,000 population (age-adjusted). This is a count of the number of years instead of the number of deaths.
mental_health: Ratio of population to mental health providers.
primary_care: Ratio of population to primary care physicians.
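To make the `premature_deaths` definition concrete, here is a small worked sketch of a crude Years of Potential Life Lost rate. The function name `ypll_rate`, the ages and the population are hypothetical, and the calculation omits the age adjustment applied to the published measure.

```r
# Crude YPLL per 100,000: years lost before the cutoff age, summed over deaths
ypll_rate <- function(ages_at_death, population, cutoff = 75) {
  years_lost <- pmax(cutoff - ages_at_death, 0)  # deaths at 75+ add nothing
  sum(years_lost) / population * 1e5
}

# Hypothetical county: four deaths in a population of 10,000
ages <- c(30, 60, 74, 80)
rate <- ypll_rate(ages, population = 10000)
print(rate)
```

The death at age 30 contributes 45 years while the death at age 80 contributes none, which is why the measure weights early deaths more heavily than a simple death count does.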
## [1] "Name" "Premature death raw value"
## [3] "5-digit FIPS Code" "fips"
## [5] "premature_deaths"
Hypothesis:
## [1] "hs_completion" "income" "unemployment" "primary_care"
## [5] "mental_health"
## [1] "county" "fips" "preventable_AIAN"
## [4] "hs_completion" "income" "unemployment"
## [7] "state" "primary_care" "mental_health"
## [10] "preventable_Asian" "preventable_Black" "preventable_Hispanic"
## [13] "preventable_White" "uninsured" "premature"
## # A tibble: 3,142 × 18
## fips abbr county.x pop_2…¹ count…² preve…³ hs_co…⁴ income unemp…⁵ state
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 01001 AL Autauga Cou… 55347 Autaug… NA 0.885 5.09 0.0273 AL
## 2 01003 AL Baldwin Cou… 203709 Baldwi… 3128 0.908 4.39 0.0273 AL
## 3 01005 AL Barbour Cou… 26489 Barbou… NA 0.732 5.98 0.0380 AL
## 4 01007 AL Bibb County 22583 Bibb C… NA 0.791 5.00 0.0306 AL
## 5 01009 AL Blount Coun… 57673 Blount… NA 0.805 4.43 0.0267 AL
## 6 01011 AL Bullock Cou… 10696 Bulloc… NA 0.747 5.63 0.0363 AL
## 7 01013 AL Butler Coun… 20154 Butler… NA 0.850 5.01 0.0365 AL
## 8 01015 AL Calhoun Cou… 115620 Calhou… NA 0.844 4.84 0.0354 AL
## 9 01017 AL Chambers Co… 34123 Chambe… NA 0.816 4.80 0.0293 AL
## 10 01019 AL Cherokee Co… 25859 Cherok… NA 0.816 4.83 0.0291 AL
## # … with 3,132 more rows, 8 more variables: primary_care <dbl>,
## # mental_health <dbl>, preventable_Asian <dbl>, preventable_Black <dbl>,
## # preventable_Hispanic <dbl>, preventable_White <dbl>, uninsured <dbl>,
## # premature <dbl>, and abbreviated variable names ¹pop_2015, ²county.y,
## # ³preventable_AIAN, ⁴hs_completion, ⁵unemployment
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
## [1] "Name"
## [2] "Preventable hospital stays raw value"
## [3] "5-digit FIPS Code"
## [4] "fips"
## [5] "preventable_stays"
A random forest model was initially chosen because its RMSE was lower than that of the KNN model and the multivariate linear regression.
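The comparison rests on root mean squared error. A minimal sketch of that selection step follows; the helper `rmse`, the model labels and the toy numbers are all hypothetical, and this is not the code used to compare the three models here.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Toy held-out outcomes and predictions from two hypothetical models
observed <- c(1, 2, 3)
predictions <- list(
  model_a = c(1, 2, 5),
  model_b = c(1, 2, 3.5)
)

scores <- sapply(predictions, function(p) rmse(observed, p))
best <- names(which.min(scores))  # model with the lowest RMSE
print(scores)
print(best)
```

For a fair comparison, each model's RMSE should be computed on the same held-out counties.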
The predictions of the larger model for our outcome variables (premature deaths and preventable hospital stays) were plotted against the observed values on the y-axis to see how well the model predicted both outcomes given the explanatory variables in this particular model.
We used 50 trees for the random forest and the following features: race, income inequality, unemployment rate, high school completion rate, ratio of primary care physicians, ratio of mental health providers, number of uninsured adults, and either the preventable hospital stays or the premature deaths raw value, depending on the response variable (Years of Potential Life Lost per 100,000 population or Preventable Hospital Stays per 100,000 population).
Here the motivation was to see:
- Whether certain explanatory variables were more important for some races than for others.
- How well the model predicted for counties with certain confounding factors (population size, presence of a Native American reservation, or other systematic factors).
Random Forest Modeling + A Population Variable
The motivation for joining a table with county population data to the County Health Rankings data was to see whether the random forest model would perform better with this variable.
Logistic Regression
The goal of the logistic regression analysis is to see whether past COVID-19 data can be used to predict the probability that a county will have high COVID-19 rates in the future.
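A minimal sketch of this kind of model, fitted on synthetic data: a 0/1 flag is regressed on one COVID-19 measure with `glm`, and the fitted model returns a probability for new counties. The predictor name `covid_death_share`, the simulated relationship and the two new counties are assumptions for illustration only.

```r
set.seed(1)
n <- 200

# Synthetic training data: a county-level COVID-19 measure and a 0/1 flag
train <- data.frame(covid_death_share = runif(n, 0, 0.25))
train$high_indicator <- rbinom(n, 1, plogis(-4 + 30 * train$covid_death_share))

# Logistic regression: models the log-odds of the flag as linear in the predictor
fit <- glm(high_indicator ~ covid_death_share, family = binomial, data = train)

# Predicted probability of being "high" for two hypothetical new counties
new_counties <- data.frame(covid_death_share = c(0.05, 0.20))
p_high <- predict(fit, newdata = new_counties, type = "response")
print(p_high)
```

`type = "response"` returns probabilities; the default would return log-odds.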
## NULL
## [1] "county" "fips" "premature_AIAN"
## [4] "hs_completion" "income" "unemployment"
## [7] "state" "primary_care" "mental_health"
## [10] "premature_Asian" "premature_Black" "premature_Hispanic"
## [13] "premature_White" "uninsured" "preventable"
## # A tibble: 3,142 × 18
## fips abbr county.x pop_2…¹ count…² prema…³ hs_co…⁴ income unemp…⁵ state
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 01001 AL Autauga Cou… 55347 Autaug… NA 0.885 5.09 0.0273 AL
## 2 01003 AL Baldwin Cou… 203709 Baldwi… NA 0.908 4.39 0.0273 AL
## 3 01005 AL Barbour Cou… 26489 Barbou… NA 0.732 5.98 0.0380 AL
## 4 01007 AL Bibb County 22583 Bibb C… NA 0.791 5.00 0.0306 AL
## 5 01009 AL Blount Coun… 57673 Blount… NA 0.805 4.43 0.0267 AL
## 6 01011 AL Bullock Cou… 10696 Bulloc… NA 0.747 5.63 0.0363 AL
## 7 01013 AL Butler Coun… 20154 Butler… NA 0.850 5.01 0.0365 AL
## 8 01015 AL Calhoun Cou… 115620 Calhou… NA 0.844 4.84 0.0354 AL
## 9 01017 AL Chambers Co… 34123 Chambe… NA 0.816 4.80 0.0293 AL
## 10 01019 AL Cherokee Co… 25859 Cherok… NA 0.816 4.83 0.0291 AL
## # … with 3,132 more rows, 8 more variables: primary_care <dbl>,
## # mental_health <dbl>, premature_Asian <dbl>, premature_Black <dbl>,
## # premature_Hispanic <dbl>, premature_White <dbl>, uninsured <dbl>,
## # preventable <dbl>, and abbreviated variable names ¹pop_2015, ²county.y,
## # ³premature_AIAN, ⁴hs_completion, ⁵unemployment
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
## # A tibble: 5,639 × 15
## fips abbr county.x pop_2…¹ count…² hs_co…³ income unemp…⁴ state prima…⁵
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 01001 AL Autauga Cou… 10.9 Autaug… 0.885 5.09 0.0273 AL 4.68e-4
## 2 01001 AL Autauga Cou… 10.9 Autaug… 0.885 5.09 0.0273 AL 4.68e-4
## 3 01003 AL Baldwin Cou… 12.2 Baldwi… 0.908 4.39 0.0273 AL 7.02e-4
## 4 01003 AL Baldwin Cou… 12.2 Baldwi… 0.908 4.39 0.0273 AL 7.02e-4
## 5 01003 AL Baldwin Cou… 12.2 Baldwi… 0.908 4.39 0.0273 AL 7.02e-4
## 6 01005 AL Barbour Cou… 10.2 Barbou… 0.732 5.98 0.0380 AL 3.22e-4
## 7 01005 AL Barbour Cou… 10.2 Barbou… 0.732 5.98 0.0380 AL 3.22e-4
## 8 01007 AL Bibb County 10.0 Bibb C… 0.791 5.00 0.0306 AL 5.36e-4
## 9 01007 AL Bibb County 10.0 Bibb C… 0.791 5.00 0.0306 AL 5.36e-4
## 10 01009 AL Blount Coun… 11.0 Blount… 0.805 4.43 0.0267 AL 2.07e-4
## # … with 5,629 more rows, 5 more variables: mental_health <dbl>,
## # uninsured <dbl>, preventable <dbl>, race <chr>, premature_deaths <dbl>, and
## # abbreviated variable names ¹pop_2015, ²county.y, ³hs_completion,
## # ⁴unemployment, ⁵primary_care
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
## Ranger result
##
## Call:
## ranger(premature_deaths ~ ., data = new_rf_df, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 5639
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.07069812
## R squared (OOB): 0.6813557
## Ranger result
##
## Call:
## ranger(premature_deaths ~ ., data = pt_NA, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 331
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.1424433
## R squared (OOB): 0.4745165
## [1] "Maverick County" "Blaine County" "Roosevelt County" "Benewah County"
## [5] "Cass County" "Woodbury County" "Neshoba County"
## [1] "TX" "OK" "MT" "ID" "MN" "IA" "MS"
## Ranger result
##
## Call:
## ranger(premature_deaths ~ ., data = pt_Asian, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 371
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.09979983
## R squared (OOB): 0.3048306
## [1] "Garfield County" "Washington County"
## [3] "Outagamie County" "Matanuska-Susitna Borough"
## [5] "Johnson County" "Hawaii County"
## [7] "Washington County"
## [1] "OK" "AR" "WI" "AK" "TX" "HI" "UT"
## Ranger result
##
## Call:
## ranger(premature_deaths ~ ., data = pt_Black, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 1319
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.0672776
## R squared (OOB): 0.3322235
## [1] "Chaves County" "Logan County" "Miller County" "Jefferson County"
## [5] "Saline County" "Jackson County" "Lafayette County"
## [1] "NM" "WV" "GA" "OH" "IL" "TX" "MO"
## Ranger result
##
## Call:
## ranger(premature_deaths ~ ., data = pt_Hispanic, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 860
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.06681635
## R squared (OOB): 0.3476807
## [1] "Las Animas County" "Rio Arriba County" "Carbon County"
## [4] "Anderson County" "Quay County" "Otero County"
## [7] "Conejos County"
## [1] "CO" "NM" "UT" "TX" "NM" "CO" "CO"
## Ranger result
##
## Call:
## ranger(premature_deaths ~ ., data = pt_White, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 2758
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.0422079
## R squared (OOB): 0.5256403
## [1] "Petersburg city" "Alpine County" "Covington city" "Dolores County"
## [5] "McCormick County" "Union County" "Monroe County"
## [1] "VA" "CA" "VA" "CO" "SC" "FL" "AR"
We found that our model cannot predict well for certain outlier counties because of random effects from variables we cannot measure, such as the number of random deaths in smaller counties.
Including population in our random forest improved the predictive performance of the model, but only very slightly.
Per race, the explanatory power of the current socioeconomic and access/quality of clinical care variables varies, because confounding variables also help explain which counties have high years of potential life lost. It may be the case that there are systematic differences on a county-by-county basis across the different racial groups that we cannot measure or do not have access to.
## Ranger result
##
## Call:
## ranger(preventable_stays ~ ., data = new_rf_df2, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 6070
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.2463433
## R squared (OOB): 0.2326381
## Ranger result
##
## Call:
## ranger(preventable_stays ~ ., data = longer_NA, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 394
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.5744974
## R squared (OOB): -0.03572857
## NULL
## character(0)
## Ranger result
##
## Call:
## ranger(preventable_stays ~ ., data = longer_Asian, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 415
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.3241022
## R squared (OOB): 0.1340968
## NULL
## character(0)
## Ranger result
##
## Call:
## ranger(preventable_stays ~ ., data = longer_Black, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 1438
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.2828635
## R squared (OOB): -0.004559765
## NULL
## character(0)
## Ranger result
##
## Call:
## ranger(preventable_stays ~ ., data = longer_Hispanic, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 1055
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.3937324
## R squared (OOB): 0.05917461
## NULL
## character(0)
## Ranger result
##
## Call:
## ranger(preventable_stays ~ ., data = longer_White, num.trees = 50, importance = "impurity")
##
## Type: Regression
## Number of trees: 50
## Sample size: 2768
## Number of independent variables: 11
## Mtry: 3
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.1114373
## R squared (OOB): 0.2995497
## NULL
## character(0)
## NULL
## character(0)
## Results
We found that our model cannot predict well for certain outlier counties because of random effects from variables we cannot measure, such as the number of random accidents in smaller counties.
Including population in our random forest improved the predictive performance of the model, but only very slightly.
Per race, the explanatory power of the current socioeconomic and access/quality of clinical care variables varies, because confounding variables also help explain which counties have high rates of preventable hospital stays.
Compared to Years of Potential Life Lost, preventable hospital stays are lower across counties in America regardless of race. However, our model does not predict preventable hospital stays as well as it predicts premature deaths (Years of Potential Life Lost).
It may be the case that there are systematic differences on a county-by-county basis across the different racial groups that we cannot measure or do not have access to.
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 1
## [75] 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0
## [112] 0 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
## [260] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
## [297] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1
## [408] 0 1 1 1 1 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 1 0
## [630] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 1
## [667] 1 0 0 1 1 0 0 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 1 1 0 1 0 0 0
## [704] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [741] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [778] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1
## [815] 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0
## [852] 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [889] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [926] 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
## [963] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0
## [1000] 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1
## [1037] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0
## [1074] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1111] 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## Call:
## glm(formula = high_indicator ~ `COVID-19 Deaths over Population`,
## family = "binomial", data = comb_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0025 -0.6780 -0.3785 -0.0892 5.7338
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3861 0.1322 -2.921 0.00349 **
## `COVID-19 Deaths over Population` -2.2480 0.2573 -8.737 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 971.83 on 1122 degrees of freedom
## Residual deviance: 813.32 on 1121 degrees of freedom
## AIC: 817.32
##
## Number of Fisher Scoring iterations: 7
##
## Call:
## glm(formula = high_indicator ~ Covid_Deaths, family = "binomial",
## data = comb_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7497 -0.4645 -0.4373 -0.4266 2.2133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.399e+00 1.159e-01 -20.688 < 2e-16 ***
## Covid_Deaths 3.540e-04 5.635e-05 6.282 3.34e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 809.01 on 1122 degrees of freedom
## Residual deviance: 759.37 on 1121 degrees of freedom
## AIC: 763.37
##
## Number of Fisher Scoring iterations: 5
## Calibration Plot for Logistic Regression
A calibration plot was made to show how well the logistic regression performs.
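The numbers behind a calibration plot can be sketched as a table that bins the predicted probabilities and compares each bin's mean prediction with the observed event rate. The data below are synthetic, and the ten equal-width bins are an assumed choice; this is not the plot produced for this report.

```r
set.seed(42)
n <- 1000

# Synthetic predictor and 0/1 outcome
x <- runif(n)
y <- rbinom(n, 1, plogis(-2 + 3 * x))

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)  # predicted probabilities on the training data

# Group predictions into ten equal-width probability bins
bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

calib <- data.frame(
  bin = levels(bins),
  mean_predicted = as.vector(tapply(p_hat, bins, mean)),
  observed_rate = as.vector(tapply(y, bins, mean)),
  n = as.vector(table(bins))
)
calib <- calib[calib$n > 0, ]  # drop empty bins
print(calib)
```

Plotting `observed_rate` against `mean_predicted` gives the calibration curve; points close to the diagonal indicate a well-calibrated model.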
## Prediction Model
## [1] "County" "State"
## [3] "Fully_Vaccinated" "Vacinated+Booseter"
## [5] "Only_One_dose" "Covid_Deaths"
## [7] "Total_Deaths" "Census2019"
## [9] "COVID-19 Deaths over Total Deaths" "COVID-19 Deaths over Population"
## [11] "COVID-19 Deaths per Capita" "high_indicator"
Conclusion:
Vaccination matters in preventing COVID-19 deaths.
Racial stratification increases socioeconomic disadvantage and other risk factors for poor health among minorities (Native American, Black, etc.) compared to Whites.
While some socioeconomic factors like income inequality can begin to capture the inequitable distribution of wealth, power and opportunity that creates this lasting racial stratification, we simply do not have the measurable variables available to better predict exactly what is causing high death rates or low survival rates across race and location.
Certainly, some systematic differences exist for certain marginalized groups on a county-by-county basis.
The Limitations for the County Health Ranking dataset include:
Future Work:
For the COVID-19 data, we would like to use a Poisson regression to predict COVID-19 deaths as well as the American counties with high COVID-19 transmission rates.
For access to care, we would like to see policy recommendations researched and developed from the data analysis that was done for the access to care variables. Access to care analysis can lead to practical applications, such as deciding where to locate new health facilities (for example, for mental health) on a county-by-county basis, as a way to best serve these under-resourced areas.
For socioeconomic factors, we would like to see what types of operations or clinic visit types are most prevalent among different socially or economically marginalized groups in the country. From there, we can see where the healthcare system fails certain racial groups. We would also like to dive deeper into systematic issues that might be captured with variables that we did not have at our disposal.
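As a starting point for the Poisson regression proposed above, the sketch below models COVID-19 death counts with the 2019 population as an offset, so the coefficients describe death rates and not raw counts. It uses only the six sample counties printed in this report, and the choice of `Fully_Vaccinated` as the single predictor is an assumption, so the estimates are purely illustrative.

```r
counties <- data.frame(
  Fully_Vaccinated = c(68.4, 60.8, 41.2, 44.7, 51.6, 47.5),
  Covid_Deaths = c(693, 175, 221, 171, 600, 589),
  Census2019 = c(288000, 96849, 108317, 55869, 223234, 113605)
)

# Poisson regression of counts; offset(log(population)) turns it into a rate model
fit <- glm(Covid_Deaths ~ Fully_Vaccinated + offset(log(Census2019)),
           family = poisson, data = counties)

# Multiplicative change in the death rate per extra percentage point vaccinated
rate_ratio <- exp(coef(fit)["Fully_Vaccinated"])

# Expected deaths per county under the fitted model
expected <- predict(fit, type = "response")
print(rate_ratio)
print(round(expected))
```

County death counts are often more variable than a Poisson model assumes, so a quasi-Poisson or negative binomial model may be worth comparing on the full data.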