Data Description

The “Hate Crimes Dataset” was obtained from the GitHub repository of FiveThirtyEight, an American website that focuses on opinion poll analysis, politics, economics, and sports blogging. It is the data behind the story “Higher Rates of Hate Crimes Are Tied To Income Inequality.” The original dataset has 51 rows and 12 columns, and each row holds the data for one state (or the District of Columbia). We exclude Hawaii from our analysis since it is not included in the U.S. state map we use in R. The data involves both categorical and quantitative variables, which supports a diverse range of data explorations. In this project, we mainly focus on four variables: the share of white residents living in poverty, the share of the population with a high school degree, the share of the population living in metropolitan areas, and the average annual hate crimes per 100k population from 2010 to 2015. We also add categorical variables to the dataset (hate crime level, Gini Index level, and metropolitan population level) to support a more comprehensive analysis.

Research Questions

There are three research questions that we intend to answer using the “Hate Crimes Dataset”:

Is there a difference in the average number of hate crimes across states in the U.S.?
If so, what are some possible factors that may contribute to this difference?
Are these factors associated with statistically significant differences in hate crimes?

Research Question 1

Our first research question is to understand the spatial distribution of average annual hate crimes by state. To achieve this, we take a look at the following areal plot.

In the average annual hate crime map above, there seem to be fewer hate crimes in the southern US and more in the western and north central US. In the northeastern US, the picture is mixed. But do states actually group by region in this way? To examine this, we use the following dendrogram, where the distance between states is measured by the difference in average annual hate crimes.

From the dendrogram, many of the southern states (colored in purple) tend to cluster together. Southern states therefore appear to have similar average annual hate crime rates, and these rates are lower than in other parts of the US. Some of the western states (colored in blue) also tend to cluster together, which indicates similar and relatively high average annual hate crime rates. These findings align with our observations from the spatial plot.
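To make the clustering idea concrete, here is a minimal Python sketch (the report's own analysis was done in R, and this is not the code behind the dendrogram). It assumes complete linkage, which the report does not specify, and uses made-up state labels and rates. The function name `complete_linkage` is hypothetical.

```python
# Hypothetical labels and rates for illustration only; not rows from the dataset.
rates = {"state_a": 0.7, "state_b": 0.9, "state_c": 3.1, "state_d": 3.4, "state_e": 1.0}

def complete_linkage(rates, n_clusters):
    """Agglomerative clustering on a single numeric variable.

    The distance between two states is |rate_i - rate_j|. The distance between
    two clusters is the largest pairwise distance between their members
    (complete linkage, an assumption made for this sketch).
    """
    clusters = [[name] for name in rates]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(abs(rates[a] - rates[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair; j > i, so deleting j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

groups = complete_linkage(rates, 2)
print(groups)
```

A dendrogram records the full sequence of merges rather than a single cut. Stopping at two clusters, as here, corresponds to cutting the tree at one height.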

Research Question 2

Our second research question is to identify which factors might contribute to the average annual hate crimes. Since several variables may be related to hate crimes, we first use principal component analysis (PCA) and plot the original variables on top of the first two principal components. We include four original variables in the PCA: the share of the population that lives in metropolitan areas, the share of adults 25 and older with a high school degree, the share of white residents who are living in poverty, and the Gini Index. We color the points by a new categorical variable, the hate crime level. Specifically, if the average annual hate crime rate is less than 2, the hate crime level is “low”. If it is larger than 3, the level is “high”, and otherwise it is “medium”.
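The hate crime level rule described above can be written as a small function. This is a Python sketch rather than the R code used for the report, and the function name `hate_crime_level` is our own. The thresholds 2 and 3 come from the text.

```python
def hate_crime_level(rate):
    """Map average annual hate crimes per 100k population to a level.

    Thresholds follow the text: below 2 is "low", above 3 is "high",
    and everything from 2 to 3 inclusive is "medium".
    """
    if rate < 2:
        return "low"
    if rate > 3:
        return "high"
    return "medium"

# Hypothetical rates for illustration only.
for rate in (1.5, 2.0, 3.0, 3.2):
    print(rate, hate_crime_level(rate))
```

Note that the boundary values 2 and 3 themselves fall in the “medium” level under this rule.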

From the PCA graph, we can see that states with a high hate crime level (colored in red) are located mainly in the lower half, especially the lower right corner. Thus, states with a high hate crime level seem to have a higher share of the population living in metropolitan areas. They also seem to have a higher Gini Index. Intuitively, this makes sense because a higher Gini Index indicates greater income inequality, and states that are more metropolitan and more unequal may breed more hate crimes. This reasoning, combined with the PCA result, motivates us to take a closer look at these two variables in our third research question.

Research Question 3.1 - Gini Index

For the first part of the third research question, we focus on whether the Gini Index is related to the hate crime rate. From our PCA, we notice that states with a higher Gini Index also seem to have a higher hate crime rate. We therefore draw a scatterplot of Gini Index vs. hate crime rate, overlaid with a linear regression fit. The graph suggests a weak positive relationship between the two variables.
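For readers who want to see what the regression line summarizes, the following Python sketch computes a least-squares slope, intercept, and Pearson correlation by hand. It is an illustration with made-up values, not the R code or data behind the scatterplot, and the function name `least_squares` is hypothetical.

```python
from math import sqrt

def least_squares(x, y):
    """Return (slope, intercept, pearson_r) for a simple linear fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / sqrt(sxx * syy)
    return slope, intercept, pearson_r

# Hypothetical values for illustration only; not rows from the dataset.
gini = [0.42, 0.44, 0.45, 0.46, 0.48]
rate = [1.0, 2.5, 1.5, 3.0, 2.0]

slope, intercept, r = least_squares(gini, rate)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

A correlation close to 0 with a positive slope would correspond to the “weak positive relationship” described above. Because least squares is sensitive to extreme points, a single outlier can change both numbers noticeably.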

We notice one very large outlier that might affect the overall relationship, so we further analyze the relationship between the Gini Index and the hate crime rate using a smoothed density plot and a statistical test. To create the smoothed density plot, we introduce a new categorical variable, the Gini Index level. Specifically, if the Gini Index is less than or equal to 0.45, the Gini Index level is “low”, and otherwise it is “high”.

From the graph, the distributions of hate crimes for the two Gini Index levels look fairly different. The center of the distribution for the low Gini Index level seems to be higher by about 1, and its density at the center is larger than that of the distribution for the high Gini Index level.

To better assess the difference, we run a t-test of whether the average hate crime rate differs between the low and high Gini Index levels. The resulting p-value is 0.7343, which is greater than 0.05, so we fail to reject the null hypothesis of equal means. Therefore, we cannot conclude that the Gini Index level is associated with the average hate crime rate.
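The sketch below shows the Gini Index level rule and the test statistic of a two-sample t-test in Python. It assumes the Welch (unequal variance) form, which is the default of R's `t.test` but is not confirmed by the report, and it uses made-up values. The function names `gini_level` and `welch_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def gini_level(gini):
    """Level rule from the text: at most 0.45 is "low", otherwise "high"."""
    return "low" if gini <= 0.45 else "high"

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(sample_a) / len(sample_a)  # squared standard error of mean a
    vb = variance(sample_b) / len(sample_b)  # squared standard error of mean b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t_stat, df

# Hypothetical (gini, hate crime rate) pairs for illustration only.
states = [(0.42, 2.1), (0.44, 3.0), (0.45, 1.2), (0.46, 2.4), (0.47, 1.8), (0.49, 3.3)]
low = [rate for gini, rate in states if gini_level(gini) == "low"]
high = [rate for gini, rate in states if gini_level(gini) == "high"]

t_stat, df = welch_t(low, high)
print(f"t={t_stat:.3f}, df={df:.2f}")
```

The p-value is then obtained by comparing the t statistic with a t distribution with `df` degrees of freedom, a step that the Python standard library does not provide and that statistical software such as R handles directly.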

Research Question 3.2 - Share of Metro Population

The second part of the third research question explores whether the share of the population living in metropolitan areas is related to the number of hate crimes. For convenience, we divide this share into two levels: low and high. We first use a stacked histogram to identify the general relationship between the metro population level and hate crimes.

We can split the stacked histogram into two parts at a hate crime rate of three. Below three, the low metro population level contributes more to the counts than the high metro population level. Above three, the high metro population level dominates the counts. This suggests a relationship between the metro population level and the number of hate crimes: when the metro population level is high, the state is more likely to have a high number of hate crimes as well. This also makes intuitive sense, because a high metro population level corresponds to cities, where the wealth gap is potentially larger and could trigger more hate crimes. To test whether our observation is statistically significant, we create a mosaic plot of the three hate crime levels against the two metro population levels.

In the mosaic plot, none of the cells is shaded, which means that no combination of hate crime level and metro population level deviates significantly from what we would expect if the two variables were independent. Hence, the plot does not give statistical support to our observation. We further conduct a chi-square test of independence, and its p-value is not significant either. We still acknowledge the trend in the stacked histogram, but we cannot back it up with a formal statistical test. Many factors may contribute to this situation: real-world data is never perfect, and it is often noisy and may carry sampling bias. Further exploration of this factor is one direction for future research.
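As an illustration of the chi-square test of independence for a table with three hate crime levels and two metro population levels, here is a minimal Python sketch. The counts are made up, the function name `chi_square_3x2` is hypothetical, and this is not the R code used for the report. For a 3 x 2 table the test has (3 - 1)(2 - 1) = 2 degrees of freedom, and for 2 degrees of freedom the chi-square p-value has the closed form exp(-x / 2), which keeps the example within the standard library.

```python
from math import exp

def chi_square_3x2(table):
    """Pearson chi-square test of independence for a 3 x 2 contingency table.

    table: three rows (hate crime levels), each with two counts (metro levels).
    Returns (statistic, p_value). The p-value formula is valid only for
    2 degrees of freedom, which is the case for a 3 x 2 table.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count under independence
            statistic += (observed - expected) ** 2 / expected
    p_value = exp(-statistic / 2)  # chi-square survival function for df = 2
    return statistic, p_value

# Hypothetical counts for illustration only: rows are low, medium, and high
# hate crime level; columns are low and high metro population level.
counts = [[12, 8], [9, 9], [4, 8]]
statistic, p_value = chi_square_3x2(counts)
print(f"chi-square={statistic:.3f}, p-value={p_value:.3f}")
```

With only about 50 states spread over six cells, some expected counts can be small, in which case the chi-square approximation is less reliable. This is one more reason to treat a non-significant result with caution.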

Conclusion

In this project, we try to understand trends in the annual hate crime rate by US state and to identify factors associated with this rate. Through the areal plot and the dendrogram, we find that the western and north central US seem to have higher hate crime rates and the southern US seems to have lower hate crime rates. Using PCA, we identify two variables that seem closely related to the hate crime rate: the Gini Index and the share of the population living in metropolitan areas.

We analyze the relationship of each of these two variables with hate crimes via multiple visualizations and statistical tests. The visualizations give mixed signals for the Gini Index (a weak positive trend in the scatterplot, but a higher center for the low Gini Index level in the density plot) and suggest that a higher share of metropolitan population is related to a higher hate crime rate. Yet according to our t-test, mosaic plot, and chi-square test, none of these relationships is statistically significant.

In all, several factors seem to be related to the hate crime rate, yet further analysis is necessary to determine which factors matter most for the annual hate crime rate in the US.