When discussing quality of life in a specific location or region, many people compare living conditions between different countries or even continents. However, comparisons of quality of life within the United States, across its states and counties, are often overlooked by the average citizen. Although there may be certain similarities, one county in the United States can differ drastically from another in aspects such as population size, access to healthcare and education, and even water quality.
In our project, we will look deeper into the measurable quality of life factors within the states and counties of the United States using a quality of life dataset from Kaggle. This includes an exploration of the different demographic and socioeconomic factors, crime and safety rates, and sociopolitical factors present within each region of the country. Our goal with this project is to identify patterns and trends that offer valuable insights into the living conditions and societal dynamics within different regions. This type of information can be invaluable for a wide range of individuals, from potential homebuyers seeking areas with good quality of life factors to politicians identifying the regions that most need improvement in order to create effective policies.
As mentioned previously, our dataset is taken from Kaggle and looks at different quality of life factors for every county in the United States. The dataset contains 3,134 unique observations, with each observation representing a single county in the United States. There are also 16 primary variables that we will take into consideration for this project. The descriptions for each of these variables are provided below:
LSTATE: The state that the county resides in (categorical)
NMCNTY: The name of the county (categorical)
FIPS: Federal Information Processing Standards code, which is a unique identifier for counties (categorical)
LZIP: A representative ZIP code for the county (categorical)
ULOCALE: Urban-centric locale code, which shows the county’s urban-rural status (categorical)
X2022.Population: Number of people in the county for 2022 (quantitative)
X2016.Crime.Rate: Number of crimes per 1000 people (quantitative)
Unemployment: Percentage of people unemployed in the county (quantitative)
2020PopulrVoteParty: Most popular political party in the county based on 2020 election results (R = Republican, D = Democrat) (categorical)
2020.PopulrMajor.: Percentage of population who voted for the most popular political party (quantitative)
AQI.Good: Air Quality Index (AQI) status, measuring the percent of days where the air quality is rated as ‘Good’ (quantitative)
Cost.of.Living: Average annual cost of living in the county (quantitative)
X2022.Median.Income: Median annual income in the county (quantitative)
Stu.Tea.Rank: County rank for student-teacher ratio (ordinal)
Diversity.Rank..Race.: County rank for racial diversity level (ordinal)
Diversity.Rank..Gender.: County rank for gender diversity level (ordinal)
For this project, our aim is to address the following three key research questions:
How do demographic and socioeconomic factors such as population size, urbanization level, median household income, and unemployment rates interact with each other?
What are the key factors influencing crime rates at the county/state level?
How do educational and quality of life factors affect a county’s popular political party?
To answer the first research question, we first wanted to examine the United States as a whole through two choropleth maps that display the population of each county and its median income for 2022. Drawing both maps at the same scale with the same color gradient makes the two variables easier to compare, especially when examining where population is high or low, where median income is high or low, and whether those areas overlap.
First, this choropleth map shows the population distribution across the United States on a logarithmic scale, which is displayed by the varying shades of orange. Darker colors represent areas with a higher log of population, highlighting the urban centers where the most people live. The coastal areas, particularly along the Northern East Coast, Southern East Coast, and the West Coast, show significantly darker shades, which suggests large metropolitan areas with dense populations. In contrast, the central part of the country, which is likely to be more rural, shows lighter shades, indicating smaller population sizes. A logarithmic scale was used for this graph to allow for a more detailed visual comparison of population sizes across a large range, because it compresses the gap between the highest and lowest values. It should also be noted that the map does not include graphics for Alaska or Hawaii.
On the other hand, this choropleth map shows the median income level for each county in the United States. Like the map above, it uses a color gradient ranging from light to dark shades of orange, indicating the range of median incomes across the country. Darker shades equate to higher median incomes, which are notably more common in areas that are likely to be more urban or suburban. These darker regions are more irregularly distributed, with greater median incomes within the cities of the East and West Coasts, which are typically regions with high costs of living and higher-paying job markets. Lighter shades cover the majority of the map, suggesting that most of the country has low to medium median incomes. This distribution shows the economic differences that often correlate with geographic and urban/rural divides.
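The effect of the logarithmic scale can be illustrated with a small sketch. The population values below are made up for illustration and are not taken from the dataset, and the use of base 10 is an assumption, since the report does not state which base was used for the map.

```r
# Hypothetical county populations spanning several orders of magnitude
# (illustrative values, not taken from the dataset).
population <- c(500, 5000, 50000, 500000, 5000000)

# On the raw scale, the largest county dominates the color range:
# every other county sits near the bottom of the gradient.
raw_share <- population / max(population)

# On the log10 scale (assumed base), each tenfold increase in
# population moves one equal step along the gradient.
log_pop <- log10(population)
log_share <- (log_pop - min(log_pop)) / (max(log_pop) - min(log_pop))

print(round(raw_share, 4))
print(round(log_share, 2))
```

With these toy values, four of the five counties fall in the lowest 10% of the raw color range, whereas the log-transformed values are evenly spaced across it.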
Now we look to analyze how median income influences the cost of living in different counties, given the level of urbanization for the counties. This can be seen through the faceted scatter plot below.
This is a series of scatter plots comparing the cost of living to the median income in 2022 across various levels of urbanization. Each plot represents a different locale category, from large cities (labeled “11-City: Large”) to remote rural areas (“43-Rural: Remote”). Each point represents an individual county, with the median income on the x-axis and the cost of living on the y-axis.
From the plots, it can be seen that as the level of urbanization decreases, from large cities to remote rural areas, the group of points tends to shift downward, indicating a general decrease in the cost of living alongside lower median incomes. Large urban areas show a wider spread of data points, indicating the diversity in income and living costs within such regions. The data seems to suggest that while income potential might be higher in larger cities, the associated cost of living is also higher, and this relationship appears to become weaker as one moves to smaller towns and rural areas.
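One way to put numbers on what the faceted plots show is to summarize each locale group separately. The sketch below uses simulated data (the values, group sizes, and the linear relationship are all assumptions made for illustration); only the column names mirror the dataset.

```r
set.seed(1)
n <- 60

# Simulated counties for two locale categories (not real data).
toy <- data.frame(
  ULOCALE = rep(c("11-City: Large", "43-Rural: Remote"), each = n),
  X2022.Median.Income = c(rnorm(n, 80000, 15000), rnorm(n, 50000, 8000))
)
# Assumed relationship: cost of living rises with income, plus noise.
toy$Cost.of.Living <- 20000 + 0.6 * toy$X2022.Median.Income +
  rnorm(2 * n, 0, 5000)

# Per-locale means and the income/cost correlation within each group.
by_locale <- split(toy, toy$ULOCALE)
summary_tbl <- data.frame(
  locale = names(by_locale),
  mean_income = sapply(by_locale, function(d) mean(d$X2022.Median.Income)),
  mean_cost = sapply(by_locale, function(d) mean(d$Cost.of.Living)),
  correlation = sapply(by_locale, function(d) {
    cor(d$X2022.Median.Income, d$Cost.of.Living)
  }),
  row.names = NULL
)
print(summary_tbl)
```

A weaker relationship in rural categories would show up here as a smaller within-group correlation.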
Finally, we want to look at a dendrogram to see if there are noticeable similarities between counties (labeled by levels of urbanization) based on unemployment rates.
The graph above is the complete linkage dendrogram for unemployment rates, with the labels at the bottom representing counties colored by level of urbanization (city: purple, town: red, suburb: blue, rural: green). The y-axis measures the level of dissimilarity between clusters, showing how different the unemployment rates are between the counties. In general, clusters that join at lower points on the y-axis are more similar to each other, and those that join at higher points are less similar.
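As a minimal sketch of how a complete linkage dendrogram is built and read, the example below clusters six hypothetical counties by unemployment rate using base R. The county labels and rates are invented for illustration.

```r
# Hypothetical unemployment rates (percent) for six counties.
unemployment <- c(A = 3.0, B = 3.2, C = 3.1, D = 7.5, E = 7.9, F = 12.0)

# Pairwise absolute differences between counties.
d <- dist(unemployment)

# Complete linkage: the distance between two clusters is the largest
# pairwise distance between their members.
hc <- hclust(d, method = "complete")

# Merge heights correspond to the y-axis of the dendrogram.
print(round(hc$height, 2))

# Cutting the tree into three clusters groups similar counties.
groups <- cutree(hc, k = 3)
print(groups)

# plot(hc) would draw the dendrogram itself.
```

Counties A, B, and C merge at very low heights because their rates are nearly identical, while the final merge happens at the largest difference in the data.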
This dendrogram indicates that there are a few large clusters and many smaller ones, which could mean that although some counties have similar unemployment rates, there is a large amount of variation across different counties. In this dendrogram, the coloring by urbanization level is small and difficult to fully look through and analyze in detail, especially due to the abundance of rural counties (represented by green). Therefore, because the clusters are mixed, it is difficult to discern noticeable similarities between counties coded by the urbanization levels in terms of unemployment rates.
Before analyzing how different economic factors affect the crime rate of a county, we’ll first analyze the distribution of crime rates across the U.S.
The above plot is a density plot of crime rate. The crime rate in the dataset is recorded as the number of crimes per 1000 people. To express it as a decimal, we divided the number of crimes per thousand by 1000; for instance, a county with a crime rate of 0.025 has approximately 25 crimes per 1000 people. The distribution appears to be skewed to the right, with a median of around 0.02. The median can be used as a rough threshold for assessing how dangerous a county is: if a county’s crime rate is above the median of about 0.02, it can be interpreted as unsafe compared to other counties, and vice versa. The normal distribution does not capture the right-tail behavior of the empirical distribution, which indicates that the distribution is not normal. This observation was somewhat expected, as there are fewer counties with a high crime rate, since every county tries to keep its crime rate low.
Now that we’ve analyzed the distribution of crime rates across the U.S., we can look into how different factors affect crime rate, beginning with cost of living.
From this heat map, we can see that a high number of counties have a cost of living of around 60,000 dollars and a crime rate of around 1.25%. More broadly, many counties with a cost of living between 50,000 and 75,000 dollars have a crime rate at or below 2.5%, while few counties in that range have a crime rate higher than 5%. In contrast, only a low number of counties with a cost of living higher than 100,000 dollars have a crime rate at or below 2.5%, and almost none have a crime rate higher than 5%. Overall, counties with a higher cost of living (above 100,000 dollars) are sparse at every crime rate level, while counties with a lower cost of living (50,000 to 75,000 dollars) are concentrated at crime rates between 1.25% and 2.5%.
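The conversion and the median threshold described above can be sketched as follows. The crime counts are invented for illustration, and the variable names are hypothetical.

```r
# Hypothetical crimes per 1000 people for seven counties.
crimes_per_1000 <- c(5, 10, 15, 20, 25, 30, 90)

# Convert to a decimal rate: 25 crimes per 1000 people becomes 0.025.
crime_rate <- crimes_per_1000 / 1000

# Use the median as the threshold for comparing counties.
threshold <- median(crime_rate)
above_median <- crime_rate > threshold

# In a right-skewed distribution the mean is pulled above the median
# by the few counties with very high crime rates.
print(threshold)
print(mean(crime_rate))
print(above_median)
```

In this toy sample, the single county at 90 crimes per 1000 people pulls the mean well above the median, which is the same right-tail pattern a density plot would show.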
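The counts behind a heat map of this kind come from binning both variables and tabulating how many counties fall in each cell. The sketch below uses invented values and assumed bin edges; the actual bins used for the plot are not stated in the report.

```r
# Hypothetical counties: annual cost of living (dollars) and
# crime rate (percent). Values are illustrative only.
cost <- c(52000, 58000, 61000, 63000, 70000, 74000, 110000, 125000)
rate <- c(1.0, 1.3, 1.2, 2.4, 5.5, 0.8, 2.0, 1.1)

# Assumed bin edges for each axis.
cost_bin <- cut(cost,
                breaks = c(50000, 75000, 100000, Inf),
                labels = c("50k-75k", "75k-100k", ">100k"),
                include.lowest = TRUE)
rate_bin <- cut(rate,
                breaks = c(0, 2.5, 5, Inf),
                labels = c("<=2.5%", "2.5-5%", ">5%"),
                include.lowest = TRUE)

# Each cell is the number of counties in that cost/rate combination,
# which is what the heat map encodes as color.
counts <- table(cost_bin, rate_bin)
print(counts)
```

Reading the table row by row mirrors the description of the heat map: a dense cell means many counties share that combination of cost of living and crime rate.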
Lastly, we will analyze how income levels and geographic location affect crime rates.
The above plot shows the two-dimensional representation produced by multi-dimensional scaling (MDS) on the variables median income and crime rate, displayed as a scatterplot. To add further detail, the points were colored by region, which was done by grouping states into their corresponding regions. The plot indicates that there is no clear difference among the regions of the United States when it comes to median income and crime rate, since there is only a single mode, which is shared among all 5 regions. The Southeast and West regions appear to be more scattered than the other regions, which may be explained by the fact that these two regions contain a large number of states and a larger population than the other regions. On the other hand, the Southwest region appears to be the most tightly clustered, which may be because it contains the fewest states.
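A minimal sketch of classical MDS in base R is shown below, using simulated values for median income and crime rate. Standardizing the variables first is an assumption made here so that income, which is on a much larger scale, does not dominate the distances; the report does not say whether this was done.

```r
set.seed(42)

# Simulated counties (illustrative values, not from the dataset).
toy <- data.frame(
  median_income = rnorm(20, mean = 65000, sd = 12000),
  crime_rate = rexp(20, rate = 50)
)

# Standardize each variable (assumed preprocessing step).
scaled <- scale(toy)

# Classical MDS on Euclidean distances, keeping two dimensions.
coords <- cmdscale(dist(scaled), k = 2)

# With two input variables and two output dimensions, the MDS
# coordinates reproduce the original pairwise distances.
max_gap <- max(abs(as.vector(dist(coords)) - as.vector(dist(scaled))))
print(max_gap)

# plot(coords) would show the configuration; points could then be
# colored by a region label.
```

Because only two variables go in, the resulting configuration is essentially a rotated and centered version of the standardized scatterplot, so clusters in the MDS plot correspond directly to clusters in income and crime rate.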
We’ll first explore how geography can affect a county’s political party preferences.
This graph is a choropleth plot representing the most popular political party in each state, based on the most popular party in each state’s counties. We can observe that even though the data is measured at the county level, states overall tend to lean much more towards one party than the other. Based on the plot, we can conclude that counties on the East and West coasts of the U.S. favor the Democratic Party and counties in the Midwest and South favor the Republican Party.
We’ll now look at how social and educational factors can affect a county’s political preferences. In particular, we’ll analyze how a county’s racial diversity ranking and student-teacher ratio relate to its political preferences.
This graph is a proportional, side-by-side barplot comparing the average student-teacher ratio rank (shown on the y-axis) of counties favoring Democrats and counties favoring Republicans. These counties are split into four diversity level groups, shown on the x-axis: Low (bottom 25% of rankings), Medium Low (bottom 25%-50%), Medium High (top 25%-50%), and High (top 25%). Looking at the barplot, we see that counties with lower diversity levels have low average student-teacher ratio rankings, while high diversity level counties have high average student-teacher ratio rankings. Therefore, we can conclude that there’s a positive relationship between racial diversity level and student-teacher ratio rankings. Looking at specific bars, we can first observe that for the counties with a low diversity level, Republican leaning counties have a lower student-teacher ratio ranking than Democrat leaning counties. We can also see that Democrat leaning and Republican leaning counties with a medium low or medium high diversity level don’t differ significantly from each other with respect to average student-teacher ratio rankings. Finally, we can observe that the Democrat leaning counties with a high diversity level have a lower average student-teacher ratio ranking than Republican leaning counties with a high diversity level.
Finally, we’ll analyze how a county’s population and income level affect its political preferences. However, before analyzing our scatter plot, we’ll do a linear regression analysis on the Democrat leaning counties and Republican leaning counties to determine if the relationship between median income and population is different for the two groups. It is important to conduct this analysis because verifying that the relationships are similar between groups will allow us to make fair comparisons between Republican and Democrat counties regarding population and income trends within each group. Because Los Angeles County heavily skews the population data, we took the log of the population to make its distribution more consistent.
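The aggregation from county-level results to a state-level party can be sketched as below. The rows are invented, and the column name `party` is a hypothetical stand-in for the dataset’s party variable.

```r
# Hypothetical county-level results (not real election data).
toy <- data.frame(
  LSTATE = c("PA", "PA", "PA", "CA", "CA", "TX", "TX", "TX"),
  party = c("R", "R", "D", "D", "D", "R", "R", "D")
)

# For each state, pick the party that wins the most counties.
# Ties would go to the party that sorts first alphabetically.
state_party <- sapply(split(toy$party, toy$LSTATE), function(p) {
  names(which.max(table(p)))
})
print(state_party)
```

Note that this counts counties rather than voters, so a state with a few very populous counties for one party and many small counties for the other would be assigned to the latter.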
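The grouping behind the barplot can be sketched by cutting the diversity rank into quartiles and averaging the student-teacher rank within each group and party. The data below is simulated, the column names are hypothetical, and the sketch assumes, for illustration only, that a larger rank value corresponds to a higher diversity level.

```r
set.seed(7)
n <- 200

# Simulated county ranks and party labels (not real data).
toy <- data.frame(
  diversity_rank = sample(1:n),
  stu_tea_rank = sample(1:n),
  party = sample(c("D", "R"), n, replace = TRUE)
)

# Split the diversity rank into four equal-sized groups.
toy$diversity_level <- cut(
  toy$diversity_rank,
  breaks = quantile(toy$diversity_rank, probs = seq(0, 1, 0.25)),
  labels = c("Low", "Medium Low", "Medium High", "High"),
  include.lowest = TRUE
)

# Average student-teacher rank for each diversity level and party:
# these are the bar heights in a side-by-side barplot.
avg <- tapply(toy$stu_tea_rank,
              list(toy$diversity_level, toy$party),
              mean)
print(round(avg, 1))
```

Since the simulated ranks are independent, the averages here should be roughly equal across groups; a pattern like the one described in the barplot would appear as averages that rise from the Low row to the High row.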
##
## Call:
## lm(formula = X2022.Median.Income ~ X2022.Population, data = qLife_data_Dem)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -110264  -11849   -3010    8485  106298
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      7.119e+04  5.678e+02  125.36   <2e-16 ***
## X2022.Population 1.212e-02  1.182e-03   10.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18880 on 1243 degrees of freedom
## Multiple R-squared:  0.078,  Adjusted R-squared:  0.07726
## F-statistic: 105.2 on 1 and 1243 DF,  p-value: < 2.2e-16
##
## Call:
## lm(formula = X2022.Median.Income ~ X2022.Population, data = qLife_data_Rep)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -50680  -9268  -1061   8151  62687
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      6.424e+04  3.277e+02 196.014   <2e-16 ***
## X2022.Population 1.247e-02  1.459e-03   8.549   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13530 on 1887 degrees of freedom
## Multiple R-squared:  0.03729,  Adjusted R-squared:  0.03677
## F-statistic: 73.08 on 1 and 1887 DF,  p-value: < 2.2e-16
Beginning with the Democrat group, the coefficient estimate for the log of the population is 6702.9, meaning that when the log of the population increases by 1, the median income for Democrat counties increases by 6702.9. The adjusted R-squared for the Democrat group is 0.2567, meaning that 25.67% of the variation seen in median income for Democrat leaning counties can be explained by population. The coefficient estimate of the log of the population for the Republican group is 2076.7, meaning that when the log of the population increases by 1, the median income for Republican counties increases by 2076.7. The adjusted R-squared for the Republican group is 0.04645, meaning that 4.645% of the variation seen in median income for Republican leaning counties can be explained by population.
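A sketch of how the two log-population regressions could be fit and compared is given below. The data is simulated with assumed slopes and noise levels, and the data frame names `dem_toy` and `rep_toy` are hypothetical; the code also assumes the natural log, which is what R’s `log()` computes by default.

```r
set.seed(123)
n <- 300

# Simulate one group of counties with an assumed slope on log(population).
make_group <- function(slope, noise_sd) {
  pop <- round(exp(runif(n, log(1000), log(5e6))))
  income <- 20000 + slope * log(pop) + rnorm(n, sd = noise_sd)
  data.frame(X2022.Population = pop, X2022.Median.Income = income)
}
dem_toy <- make_group(slope = 6000, noise_sd = 15000)
rep_toy <- make_group(slope = 2000, noise_sd = 13000)

# Separate regressions of median income on log(population).
fit_dem <- lm(X2022.Median.Income ~ log(X2022.Population), data = dem_toy)
fit_rep <- lm(X2022.Median.Income ~ log(X2022.Population), data = rep_toy)

slopes <- c(Dem = unname(coef(fit_dem)[2]), Rep = unname(coef(fit_rep)[2]))
adj_r2 <- c(Dem = summary(fit_dem)$adj.r.squared,
            Rep = summary(fit_rep)$adj.r.squared)
print(round(slopes, 1))
print(round(adj_r2, 4))

# A pooled model with an interaction term tests directly whether
# the two slopes differ.
both <- rbind(cbind(dem_toy, party = "D"), cbind(rep_toy, party = "R"))
fit_both <- lm(X2022.Median.Income ~ log(X2022.Population) * party,
               data = both)
interaction_p <- coef(summary(fit_both))["log(X2022.Population):partyR",
                                         "Pr(>|t|)"]
print(interaction_p)
```

With a natural log, an increase of 1 in the log of the population corresponds to multiplying the population by about 2.72, which is how a coefficient on the log term should be read. The interaction model is an alternative to comparing the two fits by eye: a small p-value on the interaction term suggests the slopes differ between groups.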
While the Democrat group has a higher coefficient estimate, both coefficient estimates are positive, meaning that there’s a positive relationship between population and median income for both groups. Though the percentages of variation accounted for by population differ by a large margin, this difference isn’t a significant issue since we’re only looking at one variable for this analysis. Thus, we can conclude that the relationship between median income and population is similar enough between Republican and Democrat counties to make fair comparisons of trends between the two groups.
Now that we’ve ensured that we can make fair comparisons between the two groups, we will analyze how a county’s population and income level affect its political preferences. This graph is a scatterplot displaying the log of population (shown on the x-axis) and median income (shown on the y-axis) of each county. We chose to log the population because the population of Los Angeles County heavily skews the data. We also grouped counties by their political preferences and created a smooth regression line for each of the two groups. We can observe that Republican leaning counties tend to have lower populations and median incomes than Democrat leaning counties. Based on the regression lines, we can also observe that the median income of Republican leaning counties increases with population at a slower rate than that of Democrat leaning counties, which lines up with our previous regression analysis. From this graph, we can conclude that counties with a lower population and median income tend to lean Republican and counties with a higher population and median income tend to lean Democrat.
Our goal with this analysis was to assess the quality of life in the United States, specifically exploring aspects such as median income, cost of living, unemployment rates, and crime rates, and how they might differ among different regions in the U.S. The first research question we analyzed in an attempt to achieve this goal was, “How do demographic and socioeconomic factors interact with each other?” Primarily, we found that population size and income levels followed broadly similar geographic patterns. We also observed relationships between median income, cost of living, and levels of urbanization. However, we did not see large similarities in unemployment rates among counties grouped by level of urbanization.
The second research question we wanted to investigate was, “What are the key factors influencing crime rates at the county/state level?” After creating some graphs, we found that there was no clear difference in median income and crime rate among different regions in the U.S. Additionally, we found that many counties with a cost of living under $75,000 have a crime rate from 1.25% to 2.5%, while there were very few counties with a higher cost of living at any crime rate level. We also found that the median crime rate value for all U.S. counties is about 2%, so if a certain county has a much higher crime rate than this value, it can be considered unsafe.
Our final research question was, “How do educational and quality of life factors impact a county’s popular political party?” We found that counties with lower median incomes and smaller populations are more likely to have a Republican majority, while counties with higher populations and median incomes are more likely to have a Democratic majority. We also found that counties on the East and West coasts tend to lean Democrat, and counties in the Midwest and South tend to lean Republican. Finally, we were able to see a relationship between a county’s political party and its student-teacher ratio ranking, given diversity rankings.
We found this quality of life dataset to be extremely expansive, with many different variables and factors to consider when analyzing the quality of life in the U.S. However, there are many other ways we can deepen our research and understanding of this topic. For example, we could analyze the question “How do family and home dynamics correlate with unemployment rate by region?” Specifically, we could collect data on the number of family members or specific domestic circumstances and see if they have a relationship with unemployment rate, cost of living, or other economic qualities that play into quality of life. Another important opportunity for future work with this dataset is to take healthcare into consideration. Healthcare is a crucial factor in determining quality of life, so we could analyze healthcare access and quality of healthcare per county to determine which regions of the U.S. have a higher quality of life in terms of healthcare. Additionally, we could analyze relationships between healthcare and median income, or healthcare and crime rate, to determine if these variables have a statistically significant relationship. On the other hand, a necessary limitation to reflect on is that quality of life isn’t solely determined by economic conditions or crime rates, but also by cultural and social factors that this dataset may not have been able to adequately capture. To remedy this, we could survey different people about their social relationships and friendships, as well as their personal passions or job fulfillment. This data may have substantial impacts on quality of life that we could analyze further in future projects.