Our data covers Americans killed by police in the early part of 2015. It gives personal information about each deceased person (e.g. age, race/ethnicity, name, and whether they were armed), along with census-based information about the environment in which the death occurred (e.g. poverty rate, unemployment rate, income, and the area’s college percentage). The dataset grew out of a concern that data on police killings had become biased or flawed in order to fit a particular narrative, so a source (The Guardian) set out to present a rawer dataset that was more informative and representative than race, age, or armed status alone. The variables we used include the latitude and longitude of the death, age and race/ethnicity, economic attributes, education levels, cause of death, and whether and how the deceased was armed.
More specifically, the variables we use are:
To explore this research question, we first look at an overview of the causes of death in these police killings. We created a bar graph of the various ways people died at the hands of police. Gunshot is overwhelmingly the dominant cause of death: 88% of the victims in our dataset died from being shot by the police, and all other causes combined account for only 12% of the deaths. Given this gap, we want to explore whether death by gunshot depends on the victims’ race-ethnicity or on whether they were armed.
First, we study whether death by gunshot is associated with race-ethnicity. Because there are quite a few race-ethnicity groups in the data, we simplify the problem by re-categorizing the victims into white and minority groups. Since we are only interested in gunshot, we also split the victims into gunshot and not-gunshot groups (i.e. we combine all non-gunshot causes of death into one group). In the mosaic plot shaded by Pearson residuals displayed above, none of the blocks is colored red or blue. This suggests there was no significant difference in the probability of being shot between victims from the minority group and victims from the white group.
Next, to explore whether death by gunshot was associated with whether the victims were armed, we split the victims into armed and unarmed groups. In the mosaic plot, we again see that no block is colored red or blue. This suggests, perhaps a bit surprisingly, that unarmed victims were shot at a rate not statistically significantly different from that of armed victims. In fact, the empirical probability of gunshot is 0.8767 for the armed group and 0.8922 for the unarmed group, a difference of less than two percentage points.
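The shading in both mosaic plots is driven by Pearson residuals, so it may help to see how those residuals are computed for a 2x2 table. The sketch below is in Python and is not the code used in the report. The cell counts are illustrative assumptions, chosen only so that the two gunshot rates round to the reported 0.8767 and 0.8922; the actual counts are not given in the text. The shading threshold of 2 is also an assumption: it is a common convention (for example the default cut point in R's `mosaicplot` with `shade = TRUE`), but the report does not say which threshold its plots used.

```python
import math


def pearson_residuals(table):
    """Return (residuals, chi2) for a contingency table given as a list of rows.

    Each residual is (observed - expected) / sqrt(expected), where expected is
    the count implied by independence of rows and columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    chi2 = 0.0
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            r = (observed - expected) / math.sqrt(expected)
            res_row.append(r)
            chi2 += r * r  # the chi-square statistic is the sum of squared residuals
        residuals.append(res_row)
    return residuals, chi2


# ASSUMPTION: illustrative counts, not taken from the report.
# Rows: armed, unarmed. Columns: gunshot, not gunshot.
table = [[320, 45], [91, 11]]
SHADE_CUTOFF = 2.0  # ASSUMPTION: conventional cut point for shading a cell

residuals, chi2 = pearson_residuals(table)
for label, row in zip(("armed", "unarmed"), residuals):
    shaded = [abs(r) > SHADE_CUTOFF for r in row]
    print(label, [round(r, 3) for r in row], "shaded:", shaded)
print("chi-square statistic:", round(chi2, 3))
```

With counts like these every residual is far inside the cutoff, which is the numerical counterpart of a mosaic plot with no red or blue blocks.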
To get a sense of the distribution of economic status, we use the national bucket variable. National bucket takes an integer value from 1 to 5, where 1 means the victim comes from a county in the poorest 20% of all counties within the U.S., 2 means a county between the poorest 20% and 40%, and so on. In other words, national bucket gives the economic quintile a county belongs to. We plot the geographic locations of the police killing incidents and color the points by their national bucket values:
We can immediately tell that there are more green points than purple points, where green marks lower national bucket values and purple marks higher ones. This means the majority of victims come from poorer counties, compared to the U.S. national level. The colors of the points are distributed somewhat evenly in the middle part of the U.S. However, there seem to be roughly two clusters of purple points: one on the coast of California and the other in the Northeast. This is not entirely surprising, as these two regions are wealthier to begin with. Regardless, it tells us that the economic status of the incidents’ locations is not uniformly distributed throughout the U.S., but rather depends to some degree on geographic location.
Because many points overlap in those two clusters, we draw a contour plot whose height or level corresponds to national bucket, in order to confirm that the clusters do correspond to higher economic status than the locations of police killings elsewhere.
The levels for locations without a police killing incident were generated using a LOESS smoothing model. To make the model more flexible, it regresses national bucket on longitude, latitude, and their interaction term. (Keep in mind that because the model smooths and extrapolates, some locations are given non-integer or even negative levels, which do not fit the definition of national bucket. The contour plot is nevertheless a decent visualization of the general patterns in the data.) We see that California and the Northeast are indeed clusters of wealthier incident locations: their contours are more yellow than those of the rest of the U.S., and yellow corresponds to higher national bucket levels.
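For reference, the quintile coding that defines national bucket, and that the smoothed contour levels only approximate, can be sketched as follows. This is a minimal Python illustration rather than code from the report. The function names, the use of a 0 to 100 percentile rank as input, and the handling of exact boundaries (for example, whether a rank of exactly 20 falls in bucket 1 or 2) are all assumptions made for the example.

```python
def national_bucket(percentile_rank):
    """Map a county's income percentile rank (0-100, low = poor) to a bucket 1-5.

    ASSUMPTION: hypothetical helper; the dataset ships the bucket precomputed.
    """
    if not 0 <= percentile_rank <= 100:
        raise ValueError("percentile_rank must lie in [0, 100]")
    # Each bucket spans 20 percentile points; rank 100 is folded into bucket 5.
    return min(int(percentile_rank // 20) + 1, 5)


def is_valid_bucket(level):
    """True only for the integer values 1..5 that the definition allows."""
    return float(level).is_integer() and 1 <= level <= 5


print(national_bucket(12.5))   # a county in the poorest 20%
print(national_bucket(57.0))   # a county in the middle quintile
print(is_valid_bucket(3))      # an observed bucket value
print(is_valid_bucket(2.7))    # a smoothed contour level, not a valid bucket
print(is_valid_bucket(-0.4))   # an extrapolated level, not a valid bucket
```

The last two calls mirror the caveat above: levels produced by the smoother are useful for reading the overall pattern but are not themselves national bucket values.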
We consider the college variable, which is the share of the population aged 25 or older with a bachelor’s degree or higher. In other words, the higher the college value, the higher the education level of a district. From the histogram, we find that most police killings happened in areas with relatively low education levels (college below 0.2).
We then explored whether the education levels of the victims’ areas differed between race-ethnicity groups. In the histogram, blue bars and pink bars represent the education level distributions of the white group and the minority group, and the blue line and the red line mark the mean education level of each group. The two groups have different mean education levels and somewhat different distributions. However, from the histogram alone it is hard to tell whether the differences are significant.
From the ECDF graphs, we find that the ECDF of whites generally lies below the ECDF of the other races, grouped together as “minorities”. In other words, at a given college value a smaller share of white victims falls at or below it, so white victims tended to come from areas with higher college values. The gap between the two ECDFs appears substantial, and we test its significance with a Kolmogorov-Smirnov (KS) test below:
##
## Two-sample Kolmogorov-Smirnov test
##
## data: whites$college and minority$college
## D = 0.21159, p-value = 6.047e-05
## alternative hypothesis: two-sided
The KS test shows a significant difference between the education level distributions for white victims and for minority victims killed by the police. The \(p\)-value is 6.047e-05, well below 0.05, so there is enough evidence to reject the null hypothesis that the two distributions are the same.
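To make the test concrete, the statistic D reported above is the largest vertical gap between the two ECDFs. The sketch below computes it in plain Python, together with the standard asymptotic two-sided \(p\)-value from the Kolmogorov distribution. It is an illustration under stated assumptions, not the code behind the report, whose output appears to come from R's `ks.test`. The two samples are hypothetical college values invented for the example, and the function names are our own. R may use an exact rather than asymptotic computation for small samples, so its \(p\)-value can differ from this approximation.

```python
import math
from bisect import bisect_right


def ecdf(sample):
    """Return a function F(x) giving the share of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F


def ks_statistic(a, b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed points."""
    F_a, F_b = ecdf(a), ecdf(b)
    return max(abs(F_a(x) - F_b(x)) for x in set(a) | set(b))


def ks_asymptotic_p(d, n, m, terms=100):
    """Asymptotic two-sided p-value, Q(lam) = 2 * sum (-1)^(k-1) exp(-2 k^2 lam^2)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series converges slowly here and the p-value is essentially 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# ASSUMPTION: hypothetical college shares, not values from the dataset.
group_a = [0.10, 0.15, 0.22, 0.30, 0.41]
group_b = [0.05, 0.08, 0.12, 0.18, 0.25]

d = ks_statistic(group_a, group_b)
p = ks_asymptotic_p(d, len(group_a), len(group_b))
print("D =", round(d, 4), "asymptotic p =", round(p, 4))
```

Because the ECDFs are step functions that only change at observed values, it is enough to compare them at the pooled data points, which is what `ks_statistic` does.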
The police killings dataset provides key insights into the demographics of Americans killed by the police at the start of 2015. Using mosaic plots of gunshot against race/ethnicity and against armed status, we find no significant difference in the probability of being shot, whether or not the deceased were armed and whether or not the deceased were from minority groups. Using contour plots, we find that California and the Northeast form clusters of wealthier incident locations. Finally, using ECDF graphs and the KS test, we find a significant difference in education levels between whites and minorities among the deceased. Overall, the police killings dataset provides valuable information on the economic status, race/ethnicity, education levels, and geographic locations of the victims of police killings.
Since many variables in the dataset describe broad regions (e.g. counties) rather than individuals, our analysis of the social factors may be biased. The results would be more precise and meaningful if we could gather more specific information about the smaller neighborhood each of the deceased lived in, so that the data would relate more closely to the victims themselves. Our data is also limited and somewhat outdated, as it covers only the first half of 2015. For future work, we could study variables such as unemployment rate, personal income, and household income, which we did not cover in this report.