The Crime dataset in the Ecdat library compiles crime data for counties in North Carolina. The dataset has 630 observations of 24 variables, and each observation corresponds to one county in one observation year. Each observation includes two nominal categorical variables: region (central, west, or other) and smsa (yes/no, indicating whether the county is in an urban area). It also includes the county identifier (county) and the year of the observation, between 1981 and 1987 (year). The remaining variables are quantitative: crimes committed per person (crmrte), probability of arrest (prbarr), probability of conviction (prbconv), probability of a prison sentence (prbpris), average prison sentence in days (avgsen), police per capita (polpc), people per square mile in hundreds (density), tax revenue per capita (taxpc), percentage minority in 1980 (pctmin), weekly wage in construction (wcon), weekly wage in transportation and utilities (wtuc), weekly wage in wholesale and retail trade (wtrd), weekly wage in finance, insurance, and real estate (wfir), weekly wage in the service industry (wser), weekly wage in manufacturing (wmfg), weekly wage of federal employees (wfed), weekly wage of state employees (wsta), weekly wage of local government employees (wloc), offense mix, i.e. face-to-face/other (mix), and percentage of young males (pctymle). We also manually consulted Wikipedia’s North Carolina county mappings to determine which counties correspond to the county identifiers in the dataset.
Using this dataset, we would like to examine which variables seem to be associated with crime and how exactly they are associated. Specifically, we would like to answer three main questions:
The first idea we want to explore is related to the frequency of crime as measured by crime rate, the number of crimes committed per person. More specifically, we want to learn about the association between police per capita and crime rate, and to do this we examine the variables polpc and crmrte.
In answering this research question, we treat our data as cross-sectional. More specifically, for each county we average the covariates of interest across the time period 1981-1987. Some of our later analysis will focus on examining time trends.
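The per-county averaging described above could be sketched in base R as follows. This is a minimal illustration, not the code used for this report: the function name `county_averages` and the toy values are assumptions, and the output column names `AvgCrimeRate` and `AvgPolicePerCap` are borrowed from the regression output that appears later in this report.

```r
# Average selected covariates per county across all observation years.
# `panel` is assumed to have one row per county-year with the dataset's
# `county`, `region`, `crmrte`, and `polpc` columns.
county_averages <- function(panel) {
  avg <- aggregate(cbind(crmrte, polpc) ~ county + region,
                   data = panel, FUN = mean)
  names(avg)[names(avg) == "crmrte"] <- "AvgCrimeRate"
  names(avg)[names(avg) == "polpc"]  <- "AvgPolicePerCap"
  avg
}

# Toy panel: 2 counties observed in 3 years (values are made up).
toy <- data.frame(
  county = rep(c(1, 3), each = 3),
  year   = rep(1981:1983, times = 2),
  region = rep(c("central", "west"), each = 3),
  crmrte = c(0.030, 0.040, 0.050, 0.010, 0.020, 0.030),
  polpc  = c(0.001, 0.002, 0.003, 0.002, 0.002, 0.002)
)
toy_avg <- county_averages(toy)
print(toy_avg)
```

Grouping by both `county` and `region` keeps the region label on the averaged rows, which works here because a county's region does not change over time.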
We start our analysis by examining average crime rates. Looking at the choropleth map above, we see that the crime rate varies considerably across counties in North Carolina. However, counties in the western portion of North Carolina generally appear to have lower crime rates than those in other regions. To examine this observation more closely, we plot the density of crime rate and examine its summary statistics across counties in the West, Central, and Other regions of North Carolina.
## [1] "Central"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## 0.004581 0.022313 0.031329 0.035613 0.047485 0.098966
## [1] "Other"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.01065 0.02156 0.03123 0.03478 0.04395 0.16384
## [1] "West"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## 0.001812 0.013523 0.018305 0.019756 0.026153 0.044164
In analyzing the above plot and summary, we see that the spread in crime rate for the West region is much smaller than for the other two regions. Furthermore, the distribution of crime rate in the West is unimodal, with a peak at a crime rate around 0.01 lower than the peaks of the Central and Other regions. Moreover, the distributions of the Central and Other regions have similar means, medians, and general shapes. Why does the West region have a different distribution of crime than the other regions? We now look to understand what factors may be associated with crime rate.
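A per-region summary and density estimate of this kind could be produced along the following lines. This is a hedged sketch in base R: the helper name `summarize_by_region` and the toy values are assumptions, and the input is assumed to be a data frame of county averages with `region` and `AvgCrimeRate` columns.

```r
# Summary statistics of county-average crime rate within each region.
summarize_by_region <- function(avg) {
  lapply(split(avg$AvgCrimeRate, avg$region), summary)
}

# Toy data (made-up values), three counties per region.
toy_avg <- data.frame(
  region       = rep(c("central", "other", "west"), each = 3),
  AvgCrimeRate = c(0.02, 0.03, 0.05, 0.02, 0.03, 0.04, 0.01, 0.02, 0.03)
)
region_summary <- summarize_by_region(toy_avg)
print(region_summary)

# Kernel density estimate per region, e.g. for overlaying curves with lines().
region_density <- lapply(split(toy_avg$AvgCrimeRate, toy_avg$region), density)
```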
A factor that might be associated with crime rate is police per capita. One possibility is that crime rate decreases as police per capita increases, which would be consistent with policing deterring crime. A second possibility is that counties with higher rates of crime decide to increase the size of their police force, resulting in a positive association between crime rate and police per capita. It is unclear which pattern we will see in our data and whether the association depends on other factors. Thus, our first aim is to better understand the association between crime rate and police per capita.
The choropleth map above shows police per capita across counties. We see that for the most part, the police per capita seems to be pretty similar across counties, except for four counties with slightly larger values.
From the two choropleth maps alone, it is hard to see an overarching association between police per capita and crime rate. Although the four counties with higher police per capita have lower crime rates, counties with similar police per capita show varying crime rates. Thus, we now move to examining a scatterplot of crime rate vs police per capita.
##
## Call:
## lm(formula = AvgCrimeRate ~ AvgPolicePerCap + region + region *
##     AvgPolicePerCap, data = subset)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.060267 -0.010171 -0.003056  0.007952  0.063253
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  0.035195   0.001533  22.966  < 2e-16 ***
## AvgPolicePerCap              0.232234   0.628204   0.370 0.711747
## regionother                 -0.005210   0.001920  -2.713 0.006846 **
## regionwest                  -0.012823   0.002276  -5.635 2.65e-08 ***
## AvgPolicePerCap:regionother  2.333195   0.693562   3.364 0.000815 ***
## AvgPolicePerCap:regionwest  -1.428854   0.790589  -1.807 0.071193 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01594 on 624 degrees of freedom
## Multiple R-squared:  0.232,  Adjusted R-squared:  0.2258
## F-statistic: 37.69 on 5 and 624 DF,  p-value: < 2.2e-16
From the scatterplot above, we see three different associations between crime rate and police per capita. Namely, counties in the other region seem to show a largely positive association, counties in the west region seem to show a largely negative association, and counties in the central region seem to show no real association.
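Moreover, our regression analysis indicates that the differences in association are statistically significant. Specifically, the interaction between police per capita and the other region is statistically significant and positive (2.333195). Because the central region is the reference level, the slope for the other region is the central slope plus this interaction: 0.232234 + 2.333195 = 2.565429. In practical terms, between two counties in the other region whose police per capita differ by 0.001, the county with the larger police per capita would be expected to have a crime rate that is about 0.0026 higher on average. For the west region the interaction is negative (-1.428854), giving a slope of 0.232234 - 1.428854 = -1.196620, although this interaction is significant only at the 10% level. The slope for the central region itself (0.232234) is not significantly different from zero.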
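The region-specific slopes implied by the regression output can be checked with a few lines of arithmetic. The sketch below copies the three relevant coefficients from the output above and assumes R's default treatment coding with central as the reference level, which the coefficient names suggest; the object names are illustrative.

```r
# Coefficients copied from the regression output above.
b <- c(
  "AvgPolicePerCap"             =  0.232234,
  "AvgPolicePerCap:regionother" =  2.333195,
  "AvgPolicePerCap:regionwest"  = -1.428854
)

# Central is the reference level, so its slope is the main effect alone;
# each other region adds its interaction term to that main effect.
region_slopes <- c(
  central = unname(b["AvgPolicePerCap"]),
  other   = unname(b["AvgPolicePerCap"] + b["AvgPolicePerCap:regionother"]),
  west    = unname(b["AvgPolicePerCap"] + b["AvgPolicePerCap:regionwest"])
)
print(round(region_slopes, 6))

# Expected change in crime rate for a 0.001 difference in police per capita.
print(round(region_slopes * 0.001, 6))
```

Scaling by 0.001 is a choice made here for readability, since a one-unit difference in police per capita is far outside the range of the data.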
Thus, examining the association between police per capita and crime rate shows that a larger police presence isn’t necessarily associated with reduced crime; rather, the association depends on region.
We also wanted to better understand the severity of crime, specifically the association between probability of prison sentence and average sentence. We consider probability of prison sentence to be a proxy for strictness of regulation and average sentence to be a proxy for severity of crime, assuming more severe crimes result in longer average sentences. Our hypothesis is that a higher probability of prison sentence (i.e., stricter regulation) reduces the severity of crimes that are committed and thus reduces average sentence. We will look at these variables in the context of whether or not the county is in an urban area, as we also hypothesize that cities may have different levels of severity due to population density and lifestyle differences.
We begin by comparing probability of prison sentence and average sentence between counties in an urban area and those that are not, using side-by-side boxplots for each variable. In the plots above, both average sentence and probability of prison sentence are higher for counties in an urban area. Urban counties therefore appear to have higher crime severity, but they also appear to have stricter regulation, which suggests our original hypothesis may not be correct.
To continue examining the association, we draw a scatterplot of average sentence against probability of prison sentence with regression lines. Because the plot above is colored by whether the county is in an urban area, we can compare the association for counties in urban areas with that for counties in non-urban areas. For counties in urban areas, a higher probability of prison sentence appears to be associated with lower average sentences, which is consistent with our original hypothesis that stricter regulation goes with decreased severity of crime. However, for counties in non-urban areas, there appears to be no relationship between probability of prison sentence and average sentence.
Finally, as the previous analysis considered the data as cross-sectional, we now focus on examining time trends. We examine the time series of probability of prison sentence and of average sentence to see whether the variables follow similar trends over time that may explain the relationships between them. Counties in urban areas and non-urban areas show similar trends for both variables, with those in an urban area consistently having higher values. For average sentence, we see a trend that decreases until 1982, stagnates, and then increases starting in 1985; for probability of prison sentence, we see a decreasing trend over time.
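The per-group regression lines in such a scatterplot correspond to fitting a separate simple regression within each smsa group. A minimal sketch in base R, assuming a data frame of county averages with `smsa`, `prbpris`, and `avgsen` columns (the function name `slopes_by_smsa` and the toy values are illustrative assumptions):

```r
# Fit avgsen ~ prbpris separately for urban and non-urban counties
# and return the fitted slope for each group.
slopes_by_smsa <- function(avg) {
  fits <- lapply(split(avg, avg$smsa),
                 function(d) lm(avgsen ~ prbpris, data = d))
  sapply(fits, function(f) unname(coef(f)["prbpris"]))
}

# Toy data (made up): a flat line for "no", a negative slope for "yes".
toy <- data.frame(
  smsa    = rep(c("no", "yes"), each = 4),
  prbpris = rep(c(0.30, 0.40, 0.50, 0.60), times = 2),
  avgsen  = c(9, 9, 9, 9, 14, 13, 12, 11)
)
toy_slopes <- slopes_by_smsa(toy)
print(toy_slopes)
```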
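The yearly series behind such a time-trend plot could be computed by averaging each variable within year and smsa group. The following is a sketch under assumptions: it uses the dataset's `year`, `smsa`, `prbpris`, and `avgsen` column names, while the function name `yearly_means` and the toy values are made up for illustration.

```r
# Mean of each variable per year within urban and non-urban counties.
yearly_means <- function(panel) {
  aggregate(cbind(prbpris, avgsen) ~ year + smsa, data = panel, FUN = mean)
}

# Toy panel (made-up values): two counties per group, two years.
toy <- data.frame(
  year    = rep(c(1981, 1982), each = 4),
  smsa    = rep(c("no", "no", "yes", "yes"), times = 2),
  prbpris = c(0.40, 0.42, 0.46, 0.48, 0.38, 0.40, 0.44, 0.46),
  avgsen  = c(8, 10, 11, 13, 7, 9, 10, 12)
)
toy_trend <- yearly_means(toy)
print(toy_trend)
```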
Lastly, we wanted to look at how the various variables that are associated with crime compare between counties in urban areas vs. counties that are not.
We first conduct principal component analysis, as the dataset has many quantitative variables. We perform this to understand how many components are needed to account for most of the variation in the data. Based on the elbow plot and the summary below, the first 5 principal components account for about 52% of the variation, and roughly 17 of the 22 components are needed to account for 95% of the variation in the data.
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2037 1.4857 1.34592 1.15822 1.11731 1.03072 1.01378
## Proportion of Variance 0.2208 0.1003 0.08234 0.06098 0.05674 0.04829 0.04672
## Cumulative Proportion  0.2208 0.3211 0.40342 0.46440 0.52114 0.56943 0.61615
##                            PC8     PC9   PC10    PC11    PC12   PC13    PC14
## Standard deviation     0.96261 0.95468 0.9357 0.92292 0.89401 0.8813 0.81705
## Proportion of Variance 0.04212 0.04143 0.0398 0.03872 0.03633 0.0353 0.03034
## Cumulative Proportion  0.65827 0.69969 0.7395 0.77821 0.81454 0.8498 0.88018
##                           PC15    PC16    PC17    PC18    PC19   PC20    PC21
## Standard deviation     0.79152 0.74716 0.57913 0.55728 0.53080 0.4875 0.38454
## Proportion of Variance 0.02848 0.02537 0.01524 0.01412 0.01281 0.0108 0.00672
## Cumulative Proportion  0.90866 0.93403 0.94928 0.96339 0.97620 0.9870 0.99373
##                           PC22
## Standard deviation     0.37150
## Proportion of Variance 0.00627
## Cumulative Proportion  1.00000
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10
## 0.22075 0.10033 0.08234 0.06098 0.05674 0.04829 0.04672 0.04212 0.04143 0.03980
##    PC11    PC12    PC13    PC14    PC15    PC16    PC17    PC18    PC19    PC20
## 0.03872 0.03633 0.03530 0.03034 0.02848 0.02537 0.01524 0.01412 0.01281 0.01080
##    PC21    PC22
## 0.00672 0.00627
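The number of components needed for a given share of variance can be read off the cumulative proportions. As a small sketch, the vector below copies the per-component proportions of variance printed above; the helper name `components_needed` is an illustrative assumption, and it applies a strict threshold, so it returns the first component at which the cumulative share is at least the requested value.

```r
# Proportion of variance per component, copied from the output above.
pve <- c(0.22075, 0.10033, 0.08234, 0.06098, 0.05674, 0.04829, 0.04672,
         0.04212, 0.04143, 0.03980, 0.03872, 0.03633, 0.03530, 0.03034,
         0.02848, 0.02537, 0.01524, 0.01412, 0.01281, 0.01080, 0.00672,
         0.00627)
cum_pve <- cumsum(pve)

# Smallest number of components whose cumulative share reaches `threshold`.
components_needed <- function(cum, threshold) which(cum >= threshold)[1]

print(round(cum_pve[5], 5))              # share explained by the first 5 PCs
print(components_needed(cum_pve, 0.95))  # components needed for at least 95%
```

With 17 components the cumulative share is 0.94928, just under the threshold, which is why a strict 95% cutoff selects one more component than an approximate reading of the table.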
We now plot the first two principal components, coloring the data points by the smsa variable, which indicates whether the county is in an urban area. We can see from the plot above that low PC1 values are usually associated with urban areas and high PC1 values with non-urban areas, as shown by the majority of red data points on the right and blue data points on the left. However, the two distributions overlap in this plot, and the first two components together account for only about 32% of the variation, so there is not enough evidence to conclude that urban and non-urban areas differ clearly in their principal components.
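A comparison of principal component scores across groups could be set up as in the sketch below. It is an illustration on simulated data rather than the Crime dataset: the variables `x1` to `x3`, the group shift, and the seed are all assumptions, and standardizing with `scale. = TRUE` is assumed because the variables in this analysis are on very different scales.

```r
# PCA on standardized variables, then compare PC1 scores across groups.
set.seed(1)
n <- 40
toy <- data.frame(
  smsa = rep(c("no", "yes"), each = n / 2),
  x1   = rnorm(n),
  x2   = rnorm(n),
  x3   = rnorm(n)
)
# Shift the urban group so the groups differ along a shared direction.
urban <- toy$smsa == "yes"
toy[urban, c("x1", "x2")] <- toy[urban, c("x1", "x2")] + 1

pca <- prcomp(toy[, c("x1", "x2", "x3")], center = TRUE, scale. = TRUE)
scores <- data.frame(smsa = toy$smsa, pca$x[, 1:2])

# Range of PC1 within each group; overlapping ranges mean PC1 alone
# does not separate the groups.
pc1_range <- tapply(scores$PC1, scores$smsa, range)
print(pc1_range)
```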
To summarize, in this analysis, we examined three different research questions looking at the Frequency of Crime, Severity of Crime, and General Attributes of Crime. From examining visualizations and performing various tests, we concluded the following:
In our temporal analysis, we were limited by the length of time over which the data was recorded. Since the data covers only a seven-year period, it may be beneficial to collect data over a longer time period to get a better idea of crime trends over time.
A second limitation of our analysis was that we did not analyze causal relationships, but rather looked at associations between variables. Studying causality could provide better insights into what factors cause crime to increase or decrease, which could prove useful to regulatory authorities. Future work could look into implementing methods of causal inference, such as instrumental variables analysis.