Aaron Gong, Ethan Wu, Kan Sun, and Steven Han

5/4/2022

Introduction

Our dataset, obtained from Kaggle, contains information on car accidents across 49 states of the USA. The data were collected from February 2016 to December 2021 and cover approximately 3 million car accidents described by 47 variables. In our report, we are interested in analyzing the spatial distribution of car accidents, and how different road conditions and weather conditions are related to the severity of the accidents.

To relate the severity of car accidents to their location, road conditions, and weather conditions, we use the following variables in our analysis:

We have also added the following variable to improve our analysis:

Since we are interested in 7 quantitative weather condition variables, we first perform some exploratory data analysis on them. From the histograms below, we can observe that most values of precipitation, wind speed, visibility, and pressure are concentrated at 0, 0, 10, and 30, respectively. Humidity appears to follow a triangular distribution, with its minimum at 0, its maximum at 100, and its peak at 90. Wind chill and temperature both appear to be approximately normally distributed with mean 75.
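The observation that a variable is concentrated at a single value can be checked numerically. The following is a minimal sketch, not the code used in this report: the helper `modal_share` and the sample readings are hypothetical, and missing records are assumed to be represented as `None`.

```python
from collections import Counter


def modal_share(values):
    """Return (most common value, fraction of non-missing observations equal to it)."""
    # Assumption: missing observations are encoded as None and are ignored.
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values")
    value, count = Counter(observed).most_common(1)[0]
    return value, count / len(observed)


# Made-up precipitation readings for illustration only.
precipitation = [0.0, 0.0, 0.0, 0.02, None, 0.0, 0.1, 0.0]
mode, share = modal_share(precipitation)
print(f"mode={mode}, share={share:.2f}")
```

A share close to 1 would support the reading that a histogram is dominated by one spike, as described for precipitation and wind speed.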

Research Questions

1. What is the Spatial Distribution of Car Accidents and their Severity?

2. Is Severity Related to the Road Condition?

3. How are Weather Conditions Related to Severity?

What is the Spatial Distribution of Car Accidents and their Severity?

In this section, we want to represent each accident as a point on the map, so we transform our data into so-called point pattern data. The original data only provide a range of latitude and longitude for each accident (given by Start_Lat, End_Lat, Start_Lng, and End_Lng). We therefore compute Center_Lat and Center_Lng, the center coordinates of each accident, and use them to plot the accident on the US map.
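One way to obtain such a center coordinate is to average the start and end coordinates. The sketch below assumes this midpoint definition; the function name `center_coordinate`, the fallback to the start point when the end point is missing, and the sample record are assumptions for illustration and are not taken from the report.

```python
def center_coordinate(start_lat, start_lng, end_lat=None, end_lng=None):
    """Return (Center_Lat, Center_Lng) as the midpoint of the start and end points."""
    # Assumption: if the end point is missing, fall back to the start point.
    if end_lat is None or end_lng is None:
        return start_lat, start_lng
    # A plain average is adequate for short road segments within the US;
    # it would not be appropriate for segments crossing the antimeridian.
    return (start_lat + end_lat) / 2, (start_lng + end_lng) / 2


# Hypothetical record using the column names from the dataset.
record = {"Start_Lat": 34.05, "Start_Lng": -118.25, "End_Lat": 34.07, "End_Lng": -118.21}
center_lat, center_lng = center_coordinate(
    record["Start_Lat"], record["Start_Lng"], record["End_Lat"], record["End_Lng"]
)
print(center_lat, center_lng)
```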

1. Accidents in the Top 20 Metropolitan Areas

Which cities in the US have the most car accidents? We first identify the 20 metropolitan areas with the largest number of car accidents. We then create a stacked bar plot displaying the conditional distribution of accident Severity for each metropolitan area.

The graph above shows the 20 metropolitan areas with the most accidents. In particular, the top 3 cities are Miami, Los Angeles, and Orlando, two of which are in Florida. The number of accidents in Miami is far higher than in any other city. In terms of severity, most of the accidents have a severity level of 2. Other severity levels make up a negligible proportion in the majority of the cities, with the exception of Dallas, Houston, and Atlanta. Level 3 (more severe) accidents make up a substantial proportion in Dallas and Houston, both of which are in Texas. In Atlanta, GA, Level 3 and Level 4 accidents both make up a noticeable proportion.
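The numbers behind a stacked bar plot of this kind are the accident count per city and, within each city, the share of each severity level. The sketch below shows one way to tabulate them; it is not the report's own code, the helper `severity_by_city` is hypothetical, and the records are made up.

```python
from collections import Counter, defaultdict


def severity_by_city(records, top_n=20):
    """Return [(city, total, {severity: proportion})] for the top_n cities by count."""
    totals = Counter(city for city, _ in records)
    by_city = defaultdict(Counter)
    for city, severity in records:
        by_city[city][severity] += 1
    table = []
    for city, total in totals.most_common(top_n):
        # Conditional distribution of severity given the city.
        proportions = {
            level: count / total for level, count in sorted(by_city[city].items())
        }
        table.append((city, total, proportions))
    return table


# Made-up (city, severity) pairs for illustration only.
records = [
    ("Miami", 2), ("Miami", 2), ("Miami", 2), ("Miami", 3),
    ("Dallas", 2), ("Dallas", 3),
    ("Orlando", 2),
]
table = severity_by_city(records, top_n=2)
for city, total, proportions in table:
    print(city, total, proportions)
```

Each row corresponds to one bar: the total gives its height, and the proportions give the relative size of its stacked segments.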

2. Accident Density Plot

We then create a graph that shows the density of car accidents across the US.

As shown in the graph, most car accidents happen in the metropolitan areas that we identified above. In particular, accidents are clustered in Los Angeles (Southern California), the Bay Area (Northern California), Miami and Orlando (Florida), and New York, Philadelphia, and Charlotte (East Coast), as well as in other cities including Dallas, Houston, Portland, Chicago, and Minneapolis. The number of car accidents in smaller cities and rural areas is small compared to these metropolitan areas.
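A rough numerical counterpart to a density plot is to count accidents per cell of a latitude/longitude grid. The sketch below uses this grid-count (2D histogram) approximation, which may differ from the density estimate used for the graph in this report; the helper `grid_counts`, the 1-degree cell size, and the sample points are assumptions.

```python
import math
from collections import Counter


def grid_counts(points, cell_deg=1.0):
    """Count points per grid cell; a cell is keyed by its floored (lat, lng) index."""
    counts = Counter()
    for lat, lng in points:
        cell = (math.floor(lat / cell_deg), math.floor(lng / cell_deg))
        counts[cell] += 1
    return counts


# Made-up (Center_Lat, Center_Lng) points near Los Angeles and Miami.
points = [
    (34.05, -118.25), (34.10, -118.30), (34.90, -118.90),
    (25.76, -80.19), (25.80, -80.20),
]
counts = grid_counts(points)
densest_cell, n = counts.most_common(1)[0]
print(densest_cell, n)
```

Cells with the highest counts play the role of the clusters visible in the density plot; a smaller `cell_deg` gives a finer but noisier picture.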

3. Distribution of Severity

How severe are the accidents in different parts of the US? In other words, how is the Severity variable of accidents distributed across the US? We make a plot of all accidents on the map, colored by their severity.

We can see from the plot that the majority of accidents happen near the West and East Coasts, where most of the population resides. A small proportion of accidents happen in the middle part of the US. Interestingly, the pattern of accident points follows the main highways in the US. In terms of severity, the graph shows that there are generally more severe accidents (Level 3 or 4), and especially more Level 4 accidents, in the East than in the West, where most accidents are Level 2. It is also worth noting that even in the middle part of the US, where relatively few accidents happen, some states do have very severe accidents. Examples are Colorado and Texas, where most accidents are Level 3 or Level 4.

Conclusion

Through our analysis of the relationship between the severity of car accidents and the location, road condition, and weather condition, we present the following findings:

Based on our analysis, we can only identify correlations between the severity of car accidents and their location, road conditions, and weather conditions. It remains difficult to establish any causal relationship among these variables, primarily because the dataset is purely observational.

Future Work

Overall, our analysis could be improved with more complete data. First, when performing our analysis, we noticed that mild (Level 2 severity) accidents dominate the dataset. Specifically, about 90% of the recorded accidents are characterized as mild, which creates some difficulties for our analysis. Second, more data were collected in recent years, which may be explained by advances in traffic cameras and traffic sensors. This suggests that accidents in recent years are more likely to be recorded, which introduces collection bias into the dataset. With this in mind, we refrain from performing any analysis with respect to the year. For example, to analyze whether advances in driving assistance technology help reduce the number of accidents, we would first need to overcome this collection bias.