Across the country, around 50,000 people are pulled over for a traffic stop per day. This totals nearly 20 million people per year. Though these stops are deemed to be fairly routine, there has been some disagreement over what gets people pulled over. Some highly debated factors that individuals believe have led to them getting stopped have been race, gender, location, and more. The motivation behind our project, though not a manual for avoiding the police, is to statistically show the factors that lead to individuals getting pulled over more often. The data we will use to explore this idea is from Connecticut’s Hartford Police Department recorded between October of 2013 and September of 2016.
The variables of the data are as follows:
date
: Denoted by Year-Month-Daytime
: In military time or 24 hour clock formatlocation
: The street address in Hartford, Connecticut
the stop took placelat
: The latitude, in degrees, where the stop took
placelong
: The longitude, in degrees, where the stop took
placedistrict
: The neighborhood in Hartford, Connecticut
where the stop took placesubject_age
: The age of the individual pulled oversubject_race
: The race of the individual pulled over.
Race is either Asian/Pacific islander, Black, Hispanic, White,or other
in this dataset.subject_sex
:: The sex of the individual pulled over
(Male of Female only)officer_id_hash
: The unique code denoting the police
officer conducting the stopdepartment_name
: Denotes the department name of who
carried out the stop (Only Hartford for this dataset)type
: The type of transportation mode was pulled over
(Only vehicular for this dataset)arrest_made
: True or False depending on if an arrest
was madecitation_issued
: True or False depending on if a
citation was givenwarning_isued
: True or False depending on if a warning
was givenoutcome
: Given in words whether a citation, warning,
arrest, or NA occured during the stopcontraband_found
: True or False depending on if illegal
contraband was found in the vehiclesearch_conducted
: True or False depending on if the
individual was searchedsearch_vehicle
True or False depending on if the vehicle
was searchedsearch_basis
: The officer’s reasoning for conducting a
search either consent, probable cause, or NAreason_for_stop
: The listed reason why the officer
pulled over the individual. Reasoning can either be administrative
offense, cell phone, defective lights, display of plates, equipment
violation, moving violation, registration, seatbelt violation, speed
related, stop sign, suspended license, STC violation, window tint,
other. The reasoning can also be a combination of any of the above
reasonsraw_intervention_disposition_code
: Code W, V, U, N, M,
I depending on the status of the caseraw_subject_race_code
: The race of the individual with
a single letter code (A, B, W, O)raw_subject_ethnicity_code
: The ethnicity of the
individual either N for not Hispanic or Latino or H for Hispanic or
Latino or a combinationraw_search_authorization_code
: N if no search was
authorized and O if there was an authorized search of the individual or
vehicle.
Using these variables we hope to answer three
overarching questions about police stops.
What demographic factors are associated with being pulled over more?
How does the time of day, month or even year effect the frequency of traffic stops?
What locations within Hartford Connecticut are associated with higher amounts of traffic stops?
To highlight peaks, data values, outliers, and other critical facets of the data, most of the graphs and visualizations presented are interactive.
We begin our analysis of the first question through a univariate barplot of the reasons listed for traffic stops.
# Creating a dataframe out of a frequency table to graph
reason_for_stop <- table(stop_data$reason_for_stop)
sorted_reasons <- reason_for_stop %>%
as.data.frame() %>%
arrange(desc(Freq))
# head(sorted_reasons)
# Renaming columns and subsetting to only take the top 10 rows
sorted_reasons_top10 <- sorted_reasons[1:15,]
sorted_reasons_top10 <- sorted_reasons_top10[order(-sorted_reasons_top10$Var1),]
colnames(sorted_reasons_top10) = c("Reason_for_Stop", "Number_of_People")
# Graphing
baseplot <- sorted_reasons_top10 %>%
ggplot(aes(y = reorder(Reason_for_Stop, Number_of_People), x = Number_of_People)) +
geom_bar(stat = "identity", fill = '#99c140') +
scale_x_continuous(limits = c(0, 3800)) +
xlab("Number of people stopped") +
ylab("Reason for Stop") +
ggtitle("Top 15 reasons for traffic stops in Hartford, CT")
reason_stop_plotly <- ggplotly(baseplot)
reason_stop_plotly
The primary reason for traffic stops in this dataset were due to people using their cell phones while driving (3770 people stopped), with a close runner-up being stopped due to a speed-related reason (3534 people stopped). The third reason seems to actually have been the “miscellaneous” reason for stops in the “Other” category, and other driving violations follow behind it.
We now move into an analysis of what age groups have their car searched more often, and of those searched, who is more likely to have been arrested.
# Setup
Stat_Graphics_Police_Stop_Data <- stop_data
Stat_Graphics_Police <- dplyr::filter(Stat_Graphics_Police_Stop_Data, search_vehicle == c("TRUE", "FALSE"))
# Graphing
density_plot <- Stat_Graphics_Police %>%
ggplot(aes(x = subject_age, color = search_vehicle)) +
geom_density(adjust = 0.75)+
scale_color_manual(values = c("#cc3232", "#99c140"))+
facet_grid(~arrest_made)+
labs(title = "Density Estimate by Arrests Made", color = "Vehicle Searched", x = "Driver's Age", y = "Density")
density_plotly <- ggplotly(density_plot)
density_plotly
Drivers between the age of 20 to 30 appear to be pulled over most often by the police. Of individuals arrested, young drivers below the age of 25 were more likely to be arrested after their vehicle was searched. This may indicate some form of contraband is more likely to be found in the vehicles of younger drivers. For drivers between the ages of 25 to 45, they were arrested more often without a vehicle search, likely due to some other crime found without searching the car. There is also an odd peak in arrests, regardless of whether the vehicle was searched, between the ages of 45 to 50 that may allude to a lurking variable. For drivers who were not arrested, drivers between the ages of 25 to 35 were more likely to have their car searched. There is a very large peak within drivers around the age of 25 who had their car searched that appears to be overrepesented within this data. Both plots appear to be right skewed though both have an odd peak for middle aged individuals in being pulled over.
Following this, we wanted to investigate if men or women are stopped more, and what the proportions were for the outcomes (of the stop) for each group given age.
# Creating the different age groups for the plot
stop_data.no_na <- subset(stop_data,!is.na(stop_data$outcome))
stop_data.no_na$age_groups <- ifelse(stop_data.no_na$subject_age > 80, "80+",
ifelse(stop_data.no_na$subject_age > 60, "61-80",
ifelse(stop_data.no_na$subject_age > 40, "41-60",
ifelse(stop_data.no_na$subject_age > 20, "21-40", "20 and under"))))
stop_data.no_na <- subset(stop_data.no_na,!is.na(stop_data.no_na$age_groups))
# Graphing
stacked_bar_plot <- stop_data.no_na %>%
ggplot(aes(x = subject_sex, fill = outcome)) +
geom_bar() +
facet_wrap(~age_groups) +
scale_fill_manual(values=c('#cc3232', '#99c140', '#e7b416')) +
xlab("Subject Sex") +
ylab("Number of people stopped") +
ggtitle("Subjects stopped in Hartford, CT (2013-2016)")
stacked_bar_plotly <- ggplotly(stacked_bar_plot)
stacked_bar_plotly
This plot is a stacked bar plot that’s facetted by age group. We can see from this that most traffic stops occurred to people who are in the 21-40 age group across both genders, which makes sense, since that’s probably the age of the majority of drivers across the US (and thus also in CT). Additionally, across all age groups, more men are stopped than women. Interestingly though, the approximate proportions of outcomes to stops are about equal across genders for all age groups. The majority of people get citations, less people than that get a warning, and a few are arrested.
To finish our exploration of the age variable, we wanted to see if the distribution of ages for people who were stopped was normally distributed.
ks.test(stop_data$subject_age, "pnorm")
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: stop_data$subject_age
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Given a p-value of 2.2e-16 and a significance level of 0.05, we reject the null hypothesis and find evidence that the distribution of ages is not normally distributed. This makes sense, because it seems like most people who were stopped were between 20-40, which would result in a fairly asymmetric distribution. Additionally, this age group is the one that has the most drivers on the road–as people get older, they tend to drive less, resulting in a right-skewed curve.
We then wanted to more closely examine how demographic data such as race or gender affects arrests made so we created a plot to breakdown how the proportion of arrests made (arrest made vs. no arrest made) differed based on the demographic.
# setup
police <- stop_data
# makes new dataframe for proportion of arrest
arrest_proportions <- police %>%
group_by(subject_race, subject_sex) %>%
summarise(proportion_arrest = mean(arrest_made))
# Side-by-side bar plot: Race, Gender, Arrest Made (by proportion)
sidebyside1 <- arrest_proportions %>%
ggplot(aes(x = subject_race, y = proportion_arrest, fill = subject_sex)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values=c('#99c140', '#cc3232')) +
labs(title = "Arrests by Race and Gender",
x = "Race",
y = "Proportions of Arrests",
fill = "Gender") +
theme_minimal()
sidebyside1_plotly <- ggplotly(sidebyside1)
sidebyside1_plotly
The above graph suggests that Asian and White males get arrested at a higher proportion than Asian and White females respectively. Additionally, Hispanic females get arrested at a greatly higher rate than Hispanic males.
We also wanted to examine whether there were certain races that were being disproportionately represented in terms of arrests made when compared to the demographic proportion of that race in Hartford.
# https://www.census.gov/quickfacts/fact/table/hartfordcityconnecticut/PST045222
# We used the census data on Hartford, CT from the above link to extract the proportions of each racial group that lives in Hartford, and created the following dataframe
hartford_race_df <- data.frame(
race = c("white", "black", "hispanic", "asian/pacific islander", "other"),
Hartford = c(27.8, 36.4, 2.6, 45.5, 12.6))
# Calculating racial proportions from our traffic stop dataset
race_counts <- table(police$subject_race)
race_proportions <- (race_counts / sum(race_counts)) * 100
police_race <- data.frame(
race = names(race_proportions),
Police = as.numeric(race_proportions)
)
# Merging to create a dataframe for racial proportions of Hartford vs. racial proportions of stops made
comparison_df <- merge(
police_race,
hartford_race_df,
by = "race",
all.x = TRUE
)
comparison_df$Police <- comparison_df$Police
colnames(comparison_df)[2] ="Traffic Stops"
comparison_df_long <- pivot_longer(comparison_df, cols = c("Traffic Stops", "Hartford"), names_to = "Variable", values_to = "Value")
# Graphing
race_proportions <-ggplot(comparison_df_long, aes(x = race, y = Value, fill = Variable)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Race Proportions in Traffic Stops Made vs. Hartford Census",
y = "Percentage",
x = "Race"
) +
scale_fill_manual(values = c("#99c140", "#cc3232")) +
theme_minimal()
race_proportions_plotly <- ggplotly(race_proportions)
race_proportions_plotly
Based on this comparison of race data from the Census in 2016, with our data, we can discern that Hispanics are greatly over-represented in the stops made compared to the proportion of Hispanics in Hartford. This may be due to the large Hispanic population (22%) in Hartford are not citizens or do not reside in Hartford (but drive to work in Hartford), so they may be underrepresented in the traffic stop dataset.
# Load the data
df <- stop_data
# Point-pattern map: Where stops were made (with lat/long column)
# Have the frequency of getting arrested (T/F)
# Setting up the variable
us_counties <- map_data("county")
ct_hartford <- us_counties %>%
subset(region == "connecticut") %>%
subset(subregion == "hartford")
# Removing all the observations where the district is NA
df_noDistNA <- subset(df, !is.na(df$district))
# Grabbing the observations by district and arrest made
n_arrest <- group_by(df_noDistNA, district, arrest_made) %>%
summarize(count = n())
n_arrest <- filter(n_arrest, arrest_made == TRUE)
# Calculating the average lng and lat for each district
n_arrest_lng <- df_noDistNA %>%
group_by(district) %>%
summarize(avgLng = mean(lng))
n_arrest_lat <- df_noDistNA %>%
group_by(district) %>%
summarize(avgLat = mean(lat))
n_arrest$lng <- n_arrest_lng$avgLng
n_arrest$lat <- n_arrest_lat$avgLat
library(sf)
hartford_leaf <- st_read("tl_2019_09003_roads 2/tl_2019_09003_roads.shp")
## Reading layer `tl_2019_09003_roads' from data source
## `/Users/nat/Documents/School/College/Year 2/Fall/Stat Graphics/tl_2019_09003_roads 2/tl_2019_09003_roads.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 19929 features and 4 fields
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -73.02372 ymin: 41.54472 xmax: -72.40944 ymax: 42.03817
## Geodetic CRS: NAD83
ggplot(data = ct_hartford, mapping = aes(x = long, y = lat)) +
geom_tile(aes(group = group, fill = factor(1))) +
coord_fixed(xlim = c(-72.75, -72.625), ylim = c(41.71, 41.825), expand = FALSE) +
scale_fill_manual(values = c("#DDDDDD"), guide = "none") +
geom_point(data = n_arrest, aes(x = lng, y = lat,
size=count), alpha=0.5) +
guides(fill = FALSE) +
theme(legend.title = element_text(size = 8)) +
xlab("Longitude") +
ylab("Latitude") +
ggtitle("Point-pattern map of traffic stops in Hartford")
For illustration purposes, here’s a supplemental image of the above point-pattern map graph with an approximate overlay of Hartford’s districts.
Along with the location of traffic stops and the reasons, we created a point-pattern map to display the district with the most number of arrests. We wanted to see the simplified version of the populated area of arrest. To do so, each plotted point represents a different district in Hartford, based on the district’s estimated longitude and latitude. Based on the points in the plot, we can see that the Downtown district illustrates a bigger point, implying that there are more arrests made around Downtown. Other than Downtown district, there are some surrounding districts that might show slightly high number of arrests, but Downtown is the biggest point by far.
To further support the evidence that more people are getting arrested in Downtown, we generated a heatmap to show the frequency.
# Heat map: Where arrests were made (kriging)
# Just use the part a but make it so that it is colored by district, gradient color
ggplot(data = df_noDistNA, mapping = aes(x = lng, y = lat)) +
stat_density2d(aes(fill = after_stat(density)), geom = "tile", contour = FALSE) +
coord_fixed(xlim = c(-72.75, -72.625), ylim = c(41.71, 41.825), expand = FALSE) +
theme(legend.title = element_text(size = 8)) +
scale_fill_gradient(low="orange", high="blue") +
xlab("Longitude") +
ylab("Latitude") +
ggtitle("Heat map of traffic stops in Hartford")
For illustration purposes, here’s a supplemental image of the above heat map graph with an approximate overlay of Hartford’s districts.
Based on the heatmap, we can also see that the area near the Downtown district has a higher frequency compared to the other districts. Surprisingly, the Blue Hills district and South Meadows district also seem to be showing slightly higher density.
To further explore the reasons for traffic stops in our data set, we created this interactive map with each of the stop locations in Hartford, colored by the reasons for each stop. Each point is put onto the street of the traffic stop, which can be more easily viewed by zooming in on any particular section of Hartford.
# Setup
library(sf)
police <- stop_data
hartford_leaf <- st_read("tl_2019_09003_roads 2/tl_2019_09003_roads.shp")
## Reading layer `tl_2019_09003_roads' from data source
## `/Users/nat/Documents/School/College/Year 2/Fall/Stat Graphics/tl_2019_09003_roads 2/tl_2019_09003_roads.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 19929 features and 4 fields
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -73.02372 ymin: 41.54472 xmax: -72.40944 ymax: 42.03817
## Geodetic CRS: NAD83
incident_table <- table(police$reason_for_stop)
# Convert the table to a data frame
incident_df <- as.data.frame(incident_table)
# Rename the columns for clarity
colnames(incident_df) <- c("Incident_Type", "Count")
library(RColorBrewer)
map <- leaflet() %>%
addTiles() %>%
addPolygons(data = hartford_leaf)
# Check and filter out rows with missing or invalid lat/lon values
police <- police[complete.cases(police$lat, police$lng), ]
# Define a color palette based on incident_type
pal <- colorFactor(
palette = "Set1", # You can choose a different color palette
domain = unique(police$reason_for_stop)
)
# Add circle markers for different reasons for arrest
map <- map %>%
addCircleMarkers(
data = police,
lat = ~lat,
lng = ~lng,
radius = 5, # Adjust the radius as needed
color = ~pal(reason_for_stop), # Use color palette based on incident_type
fillOpacity = 0.7,
popup = ~as.character(reason_for_stop) # Display incident type in popups
)
# Print the map
map