Introduction

Across the country, around 50,000 people are pulled over for a traffic stop per day. This totals nearly 20 million people per year. Though these stops are deemed to be fairly routine, there has been some disagreement over what gets people pulled over. Some highly debated factors that individuals believe have led to them getting stopped have been race, gender, location, and more. The motivation behind our project, though not a manual for avoiding the police, is to statistically show the factors that lead to individuals getting pulled over more often. The data we will use to explore this idea is from Connecticut’s Hartford Police Department recorded between October of 2013 and September of 2016.

The variables of the data are as follows:


Using these variables we hope to answer three overarching questions about police stops.

  1. What demographic factors are associated with being pulled over more?

  2. How does the time of day, month or even year effect the frequency of traffic stops?

  3. What locations within Hartford Connecticut are associated with higher amounts of traffic stops?


To highlight peaks, data values, outliers, and other critical facets of the data, most of the graphs and visualizations presented are interactive.

Univariate Analysis

We begin our analysis of the first question through a univariate barplot of the reasons listed for traffic stops.

# Creating a dataframe out of a frequency table to graph
reason_for_stop <- table(stop_data$reason_for_stop)
sorted_reasons <- reason_for_stop %>%
  as.data.frame() %>%
  arrange(desc(Freq))
# head(sorted_reasons)

# Renaming columns and subsetting to only take the top 10 rows
sorted_reasons_top10 <- sorted_reasons[1:15,]
sorted_reasons_top10 <- sorted_reasons_top10[order(-sorted_reasons_top10$Var1),]
colnames(sorted_reasons_top10) = c("Reason_for_Stop", "Number_of_People")

# Graphing
baseplot <- sorted_reasons_top10 %>%
  ggplot(aes(y = reorder(Reason_for_Stop, Number_of_People), x = Number_of_People)) +
  geom_bar(stat = "identity", fill = '#99c140') +
  scale_x_continuous(limits = c(0, 3800)) +
  xlab("Number of people stopped") +
  ylab("Reason for Stop") +
  ggtitle("Top 15 reasons for traffic stops in Hartford, CT")

reason_stop_plotly <- ggplotly(baseplot)
reason_stop_plotly

The primary reason for traffic stops in this dataset were due to people using their cell phones while driving (3770 people stopped), with a close runner-up being stopped due to a speed-related reason (3534 people stopped). The third reason seems to actually have been the “miscellaneous” reason for stops in the “Other” category, and other driving violations follow behind it.

Demographic Analysis of Traffic Stops: Race, Age, and Gender

We now move into an analysis of what age groups have their car searched more often, and of those searched, who is more likely to have been arrested.

# Setup
Stat_Graphics_Police_Stop_Data <- stop_data

Stat_Graphics_Police <- dplyr::filter(Stat_Graphics_Police_Stop_Data, search_vehicle == c("TRUE", "FALSE"))

# Graphing
density_plot <- Stat_Graphics_Police %>%
  ggplot(aes(x = subject_age, color = search_vehicle)) +
  geom_density(adjust = 0.75)+
  scale_color_manual(values = c("#cc3232", "#99c140"))+
  facet_grid(~arrest_made)+
  labs(title = "Density Estimate by Arrests Made", color = "Vehicle Searched", x = "Driver's Age", y = "Density")

density_plotly <- ggplotly(density_plot)
density_plotly

Drivers between the age of 20 to 30 appear to be pulled over most often by the police. Of individuals arrested, young drivers below the age of 25 were more likely to be arrested after their vehicle was searched. This may indicate some form of contraband is more likely to be found in the vehicles of younger drivers. For drivers between the ages of 25 to 45, they were arrested more often without a vehicle search, likely due to some other crime found without searching the car. There is also an odd peak in arrests, regardless of whether the vehicle was searched, between the ages of 45 to 50 that may allude to a lurking variable. For drivers who were not arrested, drivers between the ages of 25 to 35 were more likely to have their car searched. There is a very large peak within drivers around the age of 25 who had their car searched that appears to be overrepesented within this data. Both plots appear to be right skewed though both have an odd peak for middle aged individuals in being pulled over.

Following this, we wanted to investigate if men or women are stopped more, and what the proportions were for the outcomes (of the stop) for each group given age.

# Creating the different age groups for the plot
stop_data.no_na <- subset(stop_data,!is.na(stop_data$outcome))
stop_data.no_na$age_groups <- ifelse(stop_data.no_na$subject_age > 80, "80+", 
                                 ifelse(stop_data.no_na$subject_age > 60, "61-80",
                                        ifelse(stop_data.no_na$subject_age > 40, "41-60",
                                               ifelse(stop_data.no_na$subject_age > 20, "21-40", "20 and under"))))
stop_data.no_na <- subset(stop_data.no_na,!is.na(stop_data.no_na$age_groups))

# Graphing
stacked_bar_plot <- stop_data.no_na %>%
  ggplot(aes(x = subject_sex, fill = outcome)) +
  geom_bar() +
  facet_wrap(~age_groups) +
  scale_fill_manual(values=c('#cc3232', '#99c140', '#e7b416')) +
  xlab("Subject Sex") +
  ylab("Number of people stopped") +
  ggtitle("Subjects stopped in Hartford, CT (2013-2016)")

stacked_bar_plotly <- ggplotly(stacked_bar_plot)
stacked_bar_plotly

This plot is a stacked bar plot that’s facetted by age group. We can see from this that most traffic stops occurred to people who are in the 21-40 age group across both genders, which makes sense, since that’s probably the age of the majority of drivers across the US (and thus also in CT). Additionally, across all age groups, more men are stopped than women. Interestingly though, the approximate proportions of outcomes to stops are about equal across genders for all age groups. The majority of people get citations, less people than that get a warning, and a few are arrested.

To finish our exploration of the age variable, we wanted to see if the distribution of ages for people who were stopped was normally distributed.

ks.test(stop_data$subject_age, "pnorm")
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  stop_data$subject_age
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

Given a p-value of 2.2e-16 and a significance level of 0.05, we reject the null hypothesis and find evidence that the distribution of ages is not normally distributed. This makes sense, because it seems like most people who were stopped were between 20-40, which would result in a fairly asymmetric distribution. Additionally, this age group is the one that has the most drivers on the road–as people get older, they tend to drive less, resulting in a right-skewed curve.

We then wanted to more closely examine how demographic data such as race or gender affects arrests made so we created a plot to breakdown how the proportion of arrests made (arrest made vs. no arrest made) differed based on the demographic.

# setup
police <- stop_data

# makes new dataframe for proportion of arrest
arrest_proportions <- police %>%
  group_by(subject_race, subject_sex) %>%
  summarise(proportion_arrest = mean(arrest_made))

# Side-by-side bar plot: Race, Gender, Arrest Made (by proportion)
sidebyside1 <- arrest_proportions %>%
  ggplot(aes(x = subject_race, y = proportion_arrest, fill = subject_sex)) +
  geom_bar(stat = "identity", position = "dodge") + 
  scale_fill_manual(values=c('#99c140', '#cc3232')) +
  labs(title = "Arrests by Race and Gender",
       x = "Race",
       y = "Proportions of Arrests", 
       fill = "Gender") +
  theme_minimal()

sidebyside1_plotly <- ggplotly(sidebyside1)
sidebyside1_plotly

The above graph suggests that Asian and White males get arrested at a higher proportion than Asian and White females respectively. Additionally, Hispanic females get arrested at a greatly higher rate than Hispanic males.

We also wanted to examine whether there were certain races that were being disproportionately represented in terms of arrests made when compared to the demographic proportion of that race in Hartford.

# https://www.census.gov/quickfacts/fact/table/hartfordcityconnecticut/PST045222
# We used the census data on Hartford, CT from the above link to extract the proportions of each racial group that lives in Hartford, and created the following dataframe 
hartford_race_df <- data.frame(
  race = c("white", "black", "hispanic", "asian/pacific islander", "other"),
  Hartford = c(27.8, 36.4, 2.6, 45.5, 12.6))

# Calculating racial proportions from our traffic stop dataset
race_counts <- table(police$subject_race)
race_proportions <- (race_counts / sum(race_counts)) * 100
police_race <- data.frame(
  race = names(race_proportions),
  Police = as.numeric(race_proportions)
)

# Merging to create a dataframe for racial proportions of Hartford vs. racial proportions of stops made
comparison_df <- merge(
  police_race,
  hartford_race_df,
  by = "race",
  all.x = TRUE
)
comparison_df$Police <- comparison_df$Police
colnames(comparison_df)[2] ="Traffic Stops"
comparison_df_long <- pivot_longer(comparison_df, cols = c("Traffic Stops", "Hartford"), names_to = "Variable", values_to = "Value")

# Graphing
race_proportions <-ggplot(comparison_df_long, aes(x = race, y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Race Proportions in Traffic Stops Made vs. Hartford Census",
    y = "Percentage",
    x = "Race"
  ) +
  scale_fill_manual(values = c("#99c140", "#cc3232")) +
  theme_minimal()

race_proportions_plotly <- ggplotly(race_proportions)
race_proportions_plotly

Based on this comparison of race data from the Census in 2016, with our data, we can discern that Hispanics are greatly over-represented in the stops made compared to the proportion of Hispanics in Hartford. This may be due to the large Hispanic population (22%) in Hartford are not citizens or do not reside in Hartford (but drive to work in Hartford), so they may be underrepresented in the traffic stop dataset.

Location of traffic stops

# Load the data
df <- stop_data

# Point-pattern map: Where stops were made (with lat/long column)
# Have the frequency of getting arrested (T/F)

# Setting up the variable
us_counties <- map_data("county")
ct_hartford <- us_counties %>%
                  subset(region == "connecticut") %>%
                  subset(subregion == "hartford")

# Removing all the observations where the district is NA
df_noDistNA <- subset(df, !is.na(df$district))

# Grabbing the observations by district and arrest made
n_arrest <- group_by(df_noDistNA, district, arrest_made) %>%
            summarize(count = n())
n_arrest <- filter(n_arrest, arrest_made == TRUE)

# Calculating the average lng and lat for each district
n_arrest_lng <- df_noDistNA %>%
                      group_by(district) %>%
                      summarize(avgLng = mean(lng))
n_arrest_lat <- df_noDistNA %>%
                      group_by(district) %>%
                      summarize(avgLat = mean(lat))

n_arrest$lng <- n_arrest_lng$avgLng
n_arrest$lat <- n_arrest_lat$avgLat
 

library(sf)
hartford_leaf <- st_read("tl_2019_09003_roads 2/tl_2019_09003_roads.shp")
## Reading layer `tl_2019_09003_roads' from data source 
##   `/Users/nat/Documents/School/College/Year 2/Fall/Stat Graphics/tl_2019_09003_roads 2/tl_2019_09003_roads.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 19929 features and 4 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -73.02372 ymin: 41.54472 xmax: -72.40944 ymax: 42.03817
## Geodetic CRS:  NAD83
ggplot(data = ct_hartford, mapping = aes(x = long, y = lat)) +
  geom_tile(aes(group = group, fill = factor(1))) +
  coord_fixed(xlim = c(-72.75, -72.625),  ylim = c(41.71, 41.825), expand = FALSE) +
  scale_fill_manual(values = c("#DDDDDD"), guide = "none") +
  geom_point(data = n_arrest, aes(x = lng, y = lat,
                                     size=count), alpha=0.5) +
  guides(fill = FALSE) +
  theme(legend.title = element_text(size = 8)) +
  xlab("Longitude") +
  ylab("Latitude") + 
  ggtitle("Point-pattern map of traffic stops in Hartford")

For illustration purposes, here’s a supplemental image of the above point-pattern map graph with an approximate overlay of Hartford’s districts.

Along with the location of traffic stops and the reasons, we created a point-pattern map to display the district with the most number of arrests. We wanted to see the simplified version of the populated area of arrest. To do so, each plotted point represents a different district in Hartford, based on the district’s estimated longitude and latitude. Based on the points in the plot, we can see that the Downtown district illustrates a bigger point, implying that there are more arrests made around Downtown. Other than Downtown district, there are some surrounding districts that might show slightly high number of arrests, but Downtown is the biggest point by far.

To further support the evidence that more people are getting arrested in Downtown, we generated a heatmap to show the frequency.

# Heat map: Where arrests were made (kriging)
# Just use the part a but make it so that it is colored by district, gradient color

ggplot(data = df_noDistNA, mapping = aes(x = lng, y = lat)) +
  stat_density2d(aes(fill = after_stat(density)), geom = "tile", contour = FALSE) +
  coord_fixed(xlim = c(-72.75, -72.625), ylim = c(41.71, 41.825), expand = FALSE) +
  theme(legend.title = element_text(size = 8)) +
  scale_fill_gradient(low="orange", high="blue") +
  xlab("Longitude") +
  ylab("Latitude") + 
  ggtitle("Heat map of traffic stops in Hartford")

For illustration purposes, here’s a supplemental image of the above heat map graph with an approximate overlay of Hartford’s districts.

Based on the heatmap, we can also see that the area near the Downtown district has a higher frequency compared to the other districts. Surprisingly, the Blue Hills district and South Meadows district also seem to be showing slightly higher density.

To further explore the reasons for traffic stops in our data set, we created this interactive map with each of the stop locations in Hartford, colored by the reasons for each stop. Each point is put onto the street of the traffic stop, which can be more easily viewed by zooming in on any particular section of Hartford.

# Setup
library(sf)
police <- stop_data
hartford_leaf <- st_read("tl_2019_09003_roads 2/tl_2019_09003_roads.shp")
## Reading layer `tl_2019_09003_roads' from data source 
##   `/Users/nat/Documents/School/College/Year 2/Fall/Stat Graphics/tl_2019_09003_roads 2/tl_2019_09003_roads.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 19929 features and 4 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -73.02372 ymin: 41.54472 xmax: -72.40944 ymax: 42.03817
## Geodetic CRS:  NAD83
incident_table <- table(police$reason_for_stop)

# Convert the table to a data frame
incident_df <- as.data.frame(incident_table)

# Rename the columns for clarity
colnames(incident_df) <- c("Incident_Type", "Count")

library(RColorBrewer)
map <- leaflet() %>%
  addTiles() %>%
  addPolygons(data = hartford_leaf)

# Check and filter out rows with missing or invalid lat/lon values
police <- police[complete.cases(police$lat, police$lng), ]

# Define a color palette based on incident_type
pal <- colorFactor(
  palette = "Set1",  # You can choose a different color palette
  domain = unique(police$reason_for_stop)
)

# Add circle markers for different reasons for arrest
map <- map %>%
  addCircleMarkers(
    data = police,
    lat = ~lat,
    lng = ~lng,
    radius = 5,  # Adjust the radius as needed
    color = ~pal(reason_for_stop),  # Use color palette based on incident_type
    fillOpacity = 0.7,
    popup = ~as.character(reason_for_stop)  # Display incident type in popups
  )

# Print the map
map

This graph shows us the distribution of stops based on reason for stop and allows us to see the locations and reason for each stop. Maple Ave, New Britain Ave, and the area around Trinity College all have a high stop amount.

Interaction between time and traffic stops

Though some may argue you are most likely to get pulled over when you have somewhere important to be, there is science behind what times you should be on the lookout for traffic cops on the road. Our group wanted to see if you are more likely to get pulled over during the morning, afternoon, or evening. To do this we will use the time stamp recorded on every police stop made and filter it into these three categories to see what time of day has the most stops. On a 12 hour clock morning is between 5:00am to 12:00pm, afternoon is from 12:00pm to 5:00pm, and evening is from 5:00pm to 5:00am.

# Setup and Date Formatting
Stat_Graphics_Police_Stop_Data <- read_excel("Stat Graphics Police Stop Data.xlsx")

Stat_Graphics_Police_Stop_Data <- Stat_Graphics_Police_Stop_Data %>%
  mutate(correct.datetime = as.POSIXct(paste(substr(date, 0, 20), substr(time, 12, 19)), format = "%Y-%m-%d %H:%M%OS"))

Stat_Graphics_Police_Stop_Data <- Stat_Graphics_Police_Stop_Data %>%
  filter(!is.na(correct.datetime))

Stat_Graphics_Police_Stop_Data$correct.time <- format(Stat_Graphics_Police_Stop_Data$correct.datetime, "%H:%M:%S")

# Creating levels
Stat_Graphics_Police_Stop_Data <- Stat_Graphics_Police_Stop_Data %>%
  mutate(hour_of_day = as.integer(substr(correct.time, 1, 2)),
         time_of_day = case_when(
           hour_of_day >= 5 & hour_of_day <= 12 ~"Morning",
         hour_of_day > 12 & hour_of_day <= 17 ~ "Afternoon",
           TRUE ~ "Night"),
         time_of_day = fct_relevel(time_of_day, "Morning", "Afternoon", "Night"))

# Graphing
morning <- Stat_Graphics_Police_Stop_Data %>%
  ggplot(aes(x = time_of_day, fill = time_of_day)) +
  geom_bar()+
  scale_fill_manual(values = c("#cc3232", "#e7b416", "#99c140"))+
  labs(title = "Time of day most Traffic Stops Occur", x = "Time of Day Stop Occured", y = "Count", fill = " ")+
  guides(fill = FALSE)+
  theme_bw()

morning_plotly = ggplotly(morning)
morning_plotly

Through the above plot you can see there is a clear difference in each time of day. Morning has the highest number of traffic stops by a wide margin followed by afternoon, and night. Unsurprisingly night-time has the least amount of people pulled over as there are fewer people driving late at night. This could likely be due to the fact that morning commuters speed when late to work, and tend to get pulled over more for it. Another possibility is that police officers coming in to start their shift in the morning are most productive early on and start to lose interest in pulling people over as their shift goes on.

# Setup
stops_per_day<- Stat_Graphics_Police_Stop_Data %>%
  group_by(date) %>%
  summarize(n_stops = n())

# Graphing
traffic_rollingavg <- stops_per_day %>%
  ggplot(aes(x = date, y = n_stops))+
  geom_line(color = "#99c140")+
  stat_rollapplyr(width= 35, align = "left")+
  labs(title = " Rolling Average Police Stops Per Day (2013 to 2016)", x = "Date of Stop (Year-Month-Day)", y = "Number of Stops")

interactive_rollingavg <- ggplotly(traffic_rollingavg)
interactive_rollingavg

The highest peak of traffic stops occurs in spring of 2015, with similar peaks occurring in late summer to early fall of 2014 and 2016. The peaks of the data are much stronger from 2013 to 2015, and die down around the latter half 2015 into 2016. There appears to be less traffic stops occurring in general from the end of 2015 into 2016. In the above graph you can see there is some element of seasonality found within the graph due to the sinusoidal nature of the rolling average line. You can see the highest peaks of traffic stops occur during the middle of the year during spring and summer time, this is likely due to a rise in road travelers during the summer. Police officers may also be more willing to get out of their car to conduct a traffic stop in warmer weather.

Conclusion

Through the report we have reached the following conclusions about how demographics, time, and location data are correlated with getting pulled over. More men than women are stopped, but they have about the same proportion of outcomes from the stop. Generally, the age group that was pulled over most were drivers aged 21-40, and the racial group that was pulled over most were Hispanics. Additionally, most stops were made in the Downtown district in Hartford, with other high-density traffic stop areas being in the Blue Hills district and the South Meadows district. For time, we also found that there was a clear difference in the number of traffic stops throughout the morning, afternoon, and evening. There was also an element of seasonality in our data, which had an effect on the stops being made throughout the years.

Although this report provides a detailed look on police stops in Hartford there were a few questions our group was not able to answer. The results we found were only for Hartford, Connecticut, so we were wondering which findings could be generalizable to other locations. For example, there may be seasonality found in Connecticut where there are cold winters, but the same result might not be found in Florida, which has a warmer climate. This dataset was also only from 2013 to 2016 and there may be different trends found within more recent data. Due to policing changes as a result of police protests in 2020, there may also be more variables recorded by officers in order to increase accountability.

Works Cited

Reinhart, Alex. “Traffic Stops in Connecticut.” CMUS&DS Data Repository - Traffic Stops in Connecticut,Carnegie Mellon University, 9 Feb. 2019, cmustatistics.github.io/data-repository/crime-and-justice/connecticut-stops.h tml.

“The Stanford Open Policing Project.” Openpolicing.Stanford.Edu, The Stanford Open Policing Project, 2023, openpolicing.stanford.edu/findings/.

Kilbane, Grace A. “Unemployment Insurance Program Letter No. 08-01.” Unemployment Insurance Benefit Payments, Employment & Training Administration (ETA) - U.S. Department of Labor, U.S Department of Labor, 3 Sept. 1991, oui.doleta.gov/dmstree/uipl/uipl2k1/uipl_0801.htm.