Introduction
Our dataset is obtained from Kaggle and contains information on car accidents across 49 states of the USA. The data were collected from February 2016 to December 2021 and cover approximately 3 million car accidents, described by 47 variables.
In our report, we analyze the spatial distribution of car accidents and how the severity of accidents is related to their location, road conditions, and weather conditions. Specifically, we will be using the following variables in our analysis:
Severity
: the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic. We will refer to the severity levels as: 1-Minor, 2-Mild, 3-Serious, and 4-Extreme.
Start_Lat
: the latitude of the start point, in GPS coordinates.
Start_Lng
: the longitude of the start point, in GPS coordinates.
End_Lat
: the latitude of the end point, in GPS coordinates.
End_Lng
: the longitude of the end point, in GPS coordinates.
City
: the city in the address field.
Traffic_Signal
: whether a traffic signal is present nearby (true or false).
Precipitation(in)
: the precipitation amount in inches, if there is any.
Wind_Speed(mph)
: the wind speed (in miles per hour).
Visibility(mi)
: the visibility (in miles).
Pressure(in)
: the air pressure (in inches).
Humidity(%)
: the humidity (in percentage).
Wind_Chill(F)
: the wind chill (in Fahrenheit).
Temperature(F)
: the temperature (in Fahrenheit).
Weather_Condition
: the weather condition (rain, snow, thunderstorm, fog, etc.).
We have also added the following variables to improve our analysis:
Severity_Chr
: a character label for the severity of the accident, constructed from Severity
, with 1 mapped to “Minor”, 2 to “Mild”, 3 to “Serious”, and 4 to “Extreme”.
Center_Lat
: the latitude of the center of the accident, which is obtained by taking the average of Start_Lat
and End_Lat
.
Center_Lng
: the longitude of the center of the accident, which is obtained by taking the average of Start_Lng
and End_Lng
.
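The three derived variables above can be sketched as follows. This is a minimal Python illustration, not the code used in the report (whose output comes from R); the function name `add_derived_fields` and the example record are hypothetical, and only the column names are taken from the text.

```python
# Sketch only: the helper name and the example record are hypothetical;
# the column names follow the variable descriptions in the text.
SEVERITY_LABELS = {1: "Minor", 2: "Mild", 3: "Serious", 4: "Extreme"}


def add_derived_fields(record):
    """Return a copy of `record` with Severity_Chr, Center_Lat, Center_Lng added."""
    out = dict(record)
    out["Severity_Chr"] = SEVERITY_LABELS[record["Severity"]]
    # The center is the simple average of the start and end coordinates.
    out["Center_Lat"] = (record["Start_Lat"] + record["End_Lat"]) / 2
    out["Center_Lng"] = (record["Start_Lng"] + record["End_Lng"]) / 2
    return out


# Invented accident record, for illustration only.
example = {
    "Severity": 3,
    "Start_Lat": 40.0, "End_Lat": 40.5,
    "Start_Lng": -80.0, "End_Lng": -80.5,
}
derived = add_derived_fields(example)
print(derived["Severity_Chr"], derived["Center_Lat"], derived["Center_Lng"])
# prints: Serious 40.25 -80.25
```

Averaging the two endpoints is a reasonable approximation when the start and end points are close together, which is assumed here.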
Since there are 7 quantitative weather variables that we are interested in, we first perform some exploratory data analysis on them. From the histograms below, we can observe that most values of precipitation, wind speed, visibility, and pressure are concentrated near 0, 0, 10, and 30, respectively. Humidity appears to follow a roughly triangular distribution, with a minimum of 0, a maximum of 100, and a peak near 90. Wind chill and temperature both appear to be approximately normally distributed, with means near 75.
Research Questions
1. What is the Spatial Distribution of Car Accidents and their Severity?
2. Is Severity Related to the Road Condition?
3. How are Weather Conditions Related to Severity?
What is the Spatial Distribution of Car Accidents and their Severity?
In this section, we want to represent each accident as a point on the map, so we convert our data into point pattern data. The original data provide only the start and end coordinates of each accident (given by Start_Lat
, End_Lat
, Start_Lng
, and End_Lng
). We therefore compute Center_Lat
and Center_Lng
, the center coordinates of each accident, which we use to plot the accident on the US map.
1. Accidents in the Top 20 Cities
Which cities in the US have the most car accidents? We first plot the 20 cities with the most car accidents in a stacked bar plot, displaying the conditional distribution of Severity
of accidents for each city.
The graph above shows the 20 cities with the most accidents. The top 3 cities are Miami, Los Angeles, and Orlando, two of which are in Florida, and the number of accidents in Miami is far higher than in any other city. In terms of severity, most accidents have a severity level of 2. Other severity levels make up a negligible proportion in the majority of the cities, with the exception of Dallas, Houston, and Atlanta. Level 3 (Serious) accidents make up a noticeable proportion of accidents in Dallas and Houston, both of which are in Texas. In Atlanta, GA, both Level 3 and Level 4 accidents make up a noticeable proportion.
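The counts behind a stacked bar plot of this kind can be sketched in a few lines. The Python below is an illustration under stated assumptions: the function name `top_cities_by_severity` is hypothetical, and the toy records use placeholder city names rather than values from the dataset.

```python
from collections import Counter, defaultdict


def top_cities_by_severity(records, n):
    """Return [(city, {severity: count})] for the n cities with the most accidents."""
    totals = Counter(r["City"] for r in records)
    by_city = defaultdict(Counter)
    for r in records:
        by_city[r["City"]][r["Severity"]] += 1
    # most_common(n) orders cities by total accident count, largest first.
    return [(city, dict(by_city[city])) for city, _ in totals.most_common(n)]


# Invented toy records with placeholder city names (not real data).
toy_records = [
    {"City": "A", "Severity": 2},
    {"City": "A", "Severity": 2},
    {"City": "A", "Severity": 2},
    {"City": "A", "Severity": 3},
    {"City": "B", "Severity": 2},
    {"City": "B", "Severity": 3},
    {"City": "C", "Severity": 2},
]
top2 = top_cities_by_severity(toy_records, 2)
print(top2)
# prints: [('A', {2: 3, 3: 1}), ('B', {2: 1, 3: 1})]
```

Each inner dictionary corresponds to one bar, and its entries to the stacked segments of that bar.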
2. Accident Density Plot
We then create a graph that shows the density of car accidents across the US.
As shown in the graph, most car accidents happen in the metropolitan areas that we identified above. In particular, accidents are clustered in Los Angeles (Southern California), the Bay Area (Northern California), Miami and Orlando (Florida), New York and Philadelphia (the Northeast), and Charlotte (the Southeast), as well as some other cities including Dallas, Houston, Portland, Chicago, and Minneapolis. The number of car accidents in smaller cities and rural areas is small compared to these metropolitan areas.
3. Distribution of Severity
How severe are the accidents in different parts of the US? In other words, how is the Severity
variable of accidents distributed across the US? We make a plot of all accidents on the map, colored by their severity.
We can see from the plot that the majority of the accidents happen near the West and East Coasts, where most of the population resides. A small proportion of accidents happen in the middle part of the US. Interestingly, the pattern of accident points follows the main highways in the US. In terms of severity, the graph shows that there are generally more severe accidents (Level 3 or 4), and especially more Level 4 accidents, in the East than in the West, where most accidents are Level 2. It is also worth noting that even in the middle part of the US, where relatively few accidents happen, some states do have very severe accidents. Examples are Colorado and Texas, where most accidents are of Level 3 or Level 4.
Is Severity Related to the Road Condition?
As mentioned in the beginning, this dataset contains a number of indicators of road conditions near the accident location, such as the presence of a stop sign, a speed bump, or a railway. Conceivably, the presence or absence of these road features is predictive of the severity of an accident. In this section, however, we focus on the traffic signal. This choice makes intuitive sense because people typically slow down near a traffic light, or stop completely when it is red. At lower speeds, an accident tends to be less serious or may be avoided altogether.
According to the stacked bar plot, for the least severe accidents, nearly half of the sites have a traffic signal nearby. However, for accidents that have a greater impact on traffic, the majority take place where there is no traffic signal. This pattern is consistent across all severity levels above 1, which suggests that traffic signals might make a difference in preventing serious accidents. Despite this intuitive interpretation, we refrain from making an assertive causal statement, as other factors may be in play. For instance, most stretches of road are not near any traffic signal, so naturally, traffic signals are less likely to “witness” an accident.
That said, we can still perform some statistical inference on the relationship between accident severity and road condition (the traffic signal, in this case). This analysis can be done graphically and numerically. The mosaic plot shows the distribution of accidents across severity levels, conditioned on the presence of a traffic signal nearby. At first sight, we notice that mild accidents are dominant in this dataset whether or not a traffic signal is present. Furthermore, the visualization of Pearson residuals indicates that when a traffic signal is present, we observe a significantly smaller number of mild accidents, but a significantly larger number of minor, serious, and extreme accidents, than would be expected under independence. In the mosaic plot, the dark blue and dark red blocks suggest that the presence of traffic signals and the severity are not independent.
Finally, a formal statistical test is performed to measure the strength of evidence against the null hypothesis of independence. According to the output of the \(\chi^2\) test, the \(p\)-value is less than 2.2e-16, indicating that data like these would be extremely unlikely if accident severity and the presence of a traffic signal were independent. Although other road conditions are not investigated here, the plots and tests extend naturally to the other binary indicators. Since the traffic signal is itself one of the road condition variables, we can conclude that accident severity is associated with at least one road condition.
##
## Pearson's Chi-squared test
##
## data: table(df$Traffic_Signal, df$Severity)
## X-squared = 42375, df = 3, p-value < 2.2e-16
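For readers who want to see how the Pearson residuals and the \(\chi^2\) statistic are formed, the following Python sketch computes both from a contingency table. It is not the code behind the output above (which comes from R): the function name `chi_squared` is hypothetical and the 2 x 4 table of counts is invented. Each expected count is (row total × column total) / n, each Pearson residual is (observed − expected) / √expected, and the statistic is the sum of squared residuals.

```python
def chi_squared(table):
    """Return (statistic, degrees of freedom, Pearson residuals) for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            res = (observed - expected) / expected ** 0.5
            res_row.append(res)
            stat += res ** 2
        residuals.append(res_row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, residuals


# Invented counts: rows = signal absent / present, columns = severity 1 to 4.
toy_table = [
    [10, 60, 20, 10],
    [20, 50, 20, 10],
]
stat, dof, residuals = chi_squared(toy_table)
print(round(stat, 4), dof)
# prints: 4.2424 3
```

A 2 x 4 table has (2 − 1)(4 − 1) = 3 degrees of freedom, which matches the `df = 3` reported in the test output. Positive residuals mark cells with more accidents than independence would predict, and negative residuals mark cells with fewer.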
How are Weather Conditions Related to Severity?
This dataset also contains 9 variables documenting the weather conditions during the car accidents, such as temperature, humidity, visibility, and wind direction. Intuitively, weather can affect driver capabilities, vehicle performance (i.e., traction, stability, and maneuverability), pavement friction, roadway infrastructure, traffic flow, and operational decisions, thus increasing the risk of car accidents. In this section, we will explore the relationship between the 7 quantitative weather variables and the severity of car accidents.
We first fit a linear regression of the severity of car accidents on all 7 weather variables. In particular, we are interested in whether precipitation, wind speed, visibility, pressure, humidity, wind chill, or temperature can help us predict the severity of a car accident.
##
## Call:
## lm(formula = Severity ~ Precipitation.in. + Wind_Speed.mph. +
## Visibility.mi. + Pressure.in. + Humidity... + Wind_Chill.F. +
## Temperature.F., data = acc_df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.57373 -0.08327 -0.06917 -0.05590  1.98759
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        2.434e+00  7.155e-03 340.175   <2e-16 ***
## Precipitation.in.  8.896e-03  4.506e-03   1.974   0.0483 *
## Wind_Speed.mph.    1.375e-03  5.352e-05  25.696   <2e-16 ***
## Visibility.mi.     1.588e-03  1.086e-04  14.619   <2e-16 ***
## Pressure.in.      -1.475e-02  2.476e-04 -59.556   <2e-16 ***
## Humidity...        7.029e-04  1.346e-05  52.220   <2e-16 ***
## Wind_Chill.F.     -4.221e-03  1.301e-04 -32.437   <2e-16 ***
## Temperature.F.     4.218e-03  1.445e-04  29.192   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3829 on 2214520 degrees of freedom
## (630814 observations deleted due to missingness)
## Multiple R-squared: 0.005093, Adjusted R-squared: 0.00509
## F-statistic: 1620 on 7 and 2214520 DF, p-value: < 2.2e-16
From the model summary above, we can observe that the coefficients of all 7 predictors are statistically significant at the 5% level. Specifically, pressure and wind chill have negative coefficients, while precipitation, wind speed, visibility, humidity, and temperature have positive coefficients. However, we refrain from making any assertive statement that any of these weather variables is effective in helping us predict the severity of car accidents. Although all 7 coefficients are statistically significant, which is not surprising with more than 2 million observations, their values are all very small compared to the levels of severity, which are integers from 1 to 4, and the model explains only about 0.5% of the variance in severity (multiple R-squared of 0.005). Thus, some further analysis is still needed.
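To make the size of these coefficients concrete, the Python sketch below multiplies each reported estimate by a change in the corresponding predictor. The coefficients are copied from the model summary; the changes in `CHANGES` are hypothetical values chosen only for illustration, and the function name `implied_shift` is likewise an assumption of this sketch.

```python
# Estimates copied from the linear model summary in the report.
COEFS = {
    "Precipitation.in.": 8.896e-03,
    "Wind_Speed.mph.": 1.375e-03,
    "Visibility.mi.": 1.588e-03,
    "Pressure.in.": -1.475e-02,
    "Humidity...": 7.029e-04,
    "Wind_Chill.F.": -4.221e-03,
    "Temperature.F.": 4.218e-03,
}

# Hypothetical changes in each predictor (illustrative, not from the data).
CHANGES = {
    "Precipitation.in.": 1.0,   # one more inch of precipitation
    "Wind_Speed.mph.": 10.0,    # 10 mph more wind
    "Visibility.mi.": 5.0,      # 5 more miles of visibility
    "Pressure.in.": 1.0,        # 1 more inch of pressure
    "Humidity...": 50.0,        # 50 percentage points more humidity
    "Wind_Chill.F.": 20.0,      # 20 degrees F more wind chill
    "Temperature.F.": 20.0,     # 20 degrees F more temperature
}


def implied_shift(coefs, changes):
    """Predicted change in Severity for each predictor, holding the others fixed."""
    return {name: coefs[name] * changes[name] for name in coefs}


shifts = implied_shift(COEFS, CHANGES)
for name, shift in shifts.items():
    print(f"{name:20s} {shift:+.4f}")
```

Under these assumed changes, every implied shift is below 0.1 in absolute value on a severity scale that runs from 1 to 4. The wind chill and temperature estimates are also almost equal and opposite, so raising both by the same amount leaves the prediction essentially unchanged.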
In the next step, we construct a stacked bar chart of Severity
and Temperature(F)
. Specifically, we are interested in whether there is any relationship between temperature and the severity of accidents.
As shown in the stacked bar chart above, we first observe that the number of minor accidents is very small across all temperatures. For all levels of severity, the number of accidents appears to increase from 0 to 75 degrees Fahrenheit and gradually decline above 75 degrees Fahrenheit, so that the largest number of accidents occurs at approximately 75 degrees Fahrenheit. Although we cannot claim that temperature is the cause of the accidents, this is a noteworthy result, considering that the U.S. national annual average temperature (excluding Alaska and Hawaii) is 54.5 degrees Fahrenheit. One implication for drivers and insurance companies is that extra caution may be warranted when driving at temperatures of around 75 degrees Fahrenheit, although the peak may also reflect how much driving takes place at such temperatures.
In addition, we are also interested in whether the severity of car accidents is related to humidity. To investigate this, we create a conditional density plot to visualize the conditional distribution of Severity
given Humidity(%)
.
In the conditional density plot above, we can observe that, except for minor accidents, the number of accidents of all severities appears to increase as humidity rises from 0% to 75% and starts to decline above 75%. Based on this observation, humidity around 75% appears to be the most dangerous range. One possible, untested explanation is that drivers do not realize that the friction between the road and their tires has been reduced by the humidity; once the humidity reaches 75% or above, drivers may become aware of the danger, and the number of mild, serious, and extreme accidents starts to decline. Additionally, the number of minor accidents is approximately uniformly distributed across humidity levels, with fewer minor accidents at both ends. This suggests that the occurrence of minor accidents is less affected by humidity than that of the other three severity levels. As an implication, we may all need to drive more cautiously in high humidity to avoid more serious accidents.
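A conditional distribution of severity given humidity can be approximated coarsely by binning humidity and computing the share of each severity level within each bin. The Python sketch below illustrates that idea only; it is not how the plot above was produced. The function name `severity_share_by_bin`, the bin width, and the toy (humidity, severity) pairs are all assumptions.

```python
from collections import Counter, defaultdict


def severity_share_by_bin(records, bin_width):
    """Return {bin_start: {severity: share}} for (humidity, severity) pairs."""
    counts = defaultdict(Counter)
    for humidity, severity in records:
        # Assign each observation to the bin that contains its humidity value.
        bin_start = int(humidity // bin_width) * bin_width
        counts[bin_start][severity] += 1
    shares = {}
    for bin_start, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[bin_start] = {sev: k / total for sev, k in sorted(counter.items())}
    return shares


# Invented (humidity %, severity) pairs, for illustration only.
toy_pairs = [(10, 2), (20, 2), (30, 3), (60, 2), (70, 2), (80, 2), (90, 4)]
shares = severity_share_by_bin(toy_pairs, 50)
print(shares[50])
# prints: {2: 0.75, 4: 0.25}
```

Within each bin the shares sum to 1, so this view separates how the mix of severities changes with humidity from how many accidents occur at each humidity level.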
Furthermore, since this dataset also contains information describing the weather condition, which is stored in the variable Weather_Condition
as text, we want to further explore whether accidents are more likely to happen under any specific weather condition. With this in mind, we construct a word cloud using the Weather_Condition
variable.
In the word cloud above, we can observe that “Overcast” appears to be the biggest and boldest among all entries, and “Light Snow” and “Snow” also appear relatively big and bold compared to the others. This suggests that “Overcast”, “Light Snow”, and “Snow” are the three most common values appearing in Weather_Condition
, which implies that overcast, light snow, and snow are the three weather conditions most frequently recorded at the time of an accident. Frequency alone does not establish these conditions as causes, but the result suggests that drivers should take extra care when the weather is overcast or snowy.
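The sizes in a word cloud reflect how often each value occurs, so the same ranking can be checked with a plain frequency count. The Python sketch below is illustrative: the function name `top_conditions` is hypothetical, and the toy list of values is invented, including the blank and missing entries it filters out.

```python
from collections import Counter


def top_conditions(values, n):
    """Return the n most frequent non-empty weather condition labels with counts."""
    # Drop missing or blank entries and trim stray whitespace before counting.
    cleaned = [v.strip() for v in values if v and v.strip()]
    return Counter(cleaned).most_common(n)


# Invented Weather_Condition values, for illustration only.
toy_values = ["Overcast", "Light Snow", "Overcast", "", "Snow",
              "Overcast ", None, "Light Snow"]
top = top_conditions(toy_values, 2)
print(top)
# prints: [('Overcast', 3), ('Light Snow', 2)]
```

Counting whole values rather than individual words keeps multi-word labels such as “Light Snow” intact.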