The dataset contains countrywide car accident data from February 2016 to December 2019, covering 49 states of the United States, with about 3.0 million accident records. The variables we are most interested in are the time, location, severity, and weather conditions of each accident. All members of the team contributed to this project.
Perform exploratory data analysis on this dataset and generate insights about car accidents in the United States.
Take a closer look at which factors affect the severity of car accidents in the United States. The severity variable here indicates the impact of the accident on traffic delay, not how severe the damage to the vehicle was.
Before we begin, we examine our variables to decide which ones to analyze. We calculate the proportion of N/A values for each variable and drop those with an N/A proportion larger than 0.5. This step removes five variables: End_Lat, End_Lng, Number, Wind_Chill(F), and Precipitation(in).
## # A tibble: 5 x 2
## `Variables to ignore` `NA proportion`
## <chr> <dbl>
## 1 End_Lat 1
## 2 End_Lng 1
## 3 Number 0.645
## 4 Wind_Chill(F) 0.623
## 5 Precipitation(in) 0.672
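The filtering step described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the `accidents` data frame is a made-up toy table that reuses a few column names from the report, and the authors' actual code (which appears to use the tidyverse) is not shown in this document.

```r
# Toy stand-in for the accident table; column names follow the report,
# values are made up for illustration.
accidents <- data.frame(
  ID = c("A-1", "A-2", "A-3", "A-4"),
  End_Lat = c(NA, NA, NA, NA),
  Number = c(12, NA, NA, NA),
  Severity = c(2, 3, 2, 4)
)

# Share of missing values in each column
na_prop <- colMeans(is.na(accidents))

# Columns whose N/A proportion exceeds the 0.5 threshold
to_ignore <- names(na_prop)[na_prop > 0.5]

# Keep only the remaining columns
accidents_clean <- accidents[, !(names(accidents) %in% to_ignore), drop = FALSE]
```

Using `colMeans(is.na(...))` computes every column's N/A share in one pass, which keeps the threshold check to a single comparison.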
In addition, based on our domain knowledge, variables such as “ID”, “Source”, “Timezone”, and “Airport_Code” are unlikely to provide much insight into traffic accidents or to help predict the severity of an accident, so we can safely ignore them as well.
The first research question we investigate is the impact of weather conditions on the severity of car accidents. Common sense suggests that worse weather would lead to more severe accidents (that is, a greater impact on traffic). To answer this question, we look at the weather condition variable. It has 121 distinct values in total, many of which have very few records. To reduce the complexity of the variable, we look only at the 8 most frequent weather conditions, excluding N/A values.
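Selecting the most frequent categories can be sketched as below. The `weather` vector is a made-up toy sample, and `k` is set to 2 only so that the toy example stays readable (the report uses 8).

```r
# Made-up sample of weather labels, including missing values
weather <- c("Clear", "Clear", "Rain", NA, "Fog", "Clear", "Rain", "Snow", NA)

k <- 2  # the report uses 8; 2 keeps the toy example small

# table() ignores NA by default, so missing values are excluded from the ranking
counts <- sort(table(weather), decreasing = TRUE)
top_k <- names(counts)[seq_len(k)]

# Keep only the records whose condition is among the top k
kept <- weather[weather %in% top_k]
```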
In the graph below, we display the percentage share of the top 8 weather conditions in descending order, faceted by the 4 severity levels, in order to examine the impact of weather conditions on the severity of car accidents.
Here we look at the percentage distribution of weather conditions within each severity level. The order and proportions of the top 8 weather conditions show little to no difference across severity levels. We therefore do not have sufficient evidence to support our assumption that weather conditions have a large impact on severity.
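The within-severity percentages behind such a comparison can be computed with a row-normalized contingency table. The sketch below uses made-up toy data, and the column name `Weather_Condition` is an assumption about the dataset's naming. The toy values are chosen so that both severity levels have the same weather mix, which is the situation the paragraph above describes.

```r
# Made-up records; column names are assumed, not taken from the report's code
accidents <- data.frame(
  Severity = c(2, 2, 2, 2, 3, 3, 3, 3),
  Weather_Condition = c("Clear", "Clear", "Rain", "Clear",
                        "Clear", "Rain", "Clear", "Clear")
)

# Rows are severity levels, columns are weather conditions
counts <- table(accidents$Severity, accidents$Weather_Condition)

# margin = 1 normalizes each row, giving shares within each severity level
shares <- prop.table(counts, margin = 1)
```

If the rows of `shares` look alike, the weather mix does not differ much between severity levels.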
In order to further examine whether weather conditions have an impact on the severity of accidents, we specifically looked at visibility and wind speed, and the effect they have on severity.
It seems that high visibility is associated with a slightly lower proportion of severe accidents (severity levels 3 and 4), which means that accidents that occur under high visibility tend to be less severe. There does not appear to be a relationship between wind speed and severity.
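One way to compare the share of severe accidents across visibility levels is to bin visibility and average a severe/non-severe flag within each bin. In this sketch the data are made up, the column name `Visibility_mi` is hypothetical, and the 5-mile cut-off between "low" and "high" is an arbitrary assumption, not the authors' choice.

```r
# Made-up records; Visibility_mi is a hypothetical column name
accidents <- data.frame(
  Severity = c(2, 3, 4, 2, 2, 2, 3, 2),
  Visibility_mi = c(0.5, 1, 2, 10, 10, 8, 10, 9)
)

# Severity levels 3 and 4 count as "severe", as in the text
accidents$severe <- accidents$Severity >= 3

# Assumed cut-off: below 5 miles is "low" visibility, 5 miles or more is "high"
accidents$vis_group <- cut(accidents$Visibility_mi,
                           breaks = c(0, 5, Inf),
                           labels = c("low", "high"),
                           right = FALSE)

# The mean of a logical vector is the proportion of TRUE values
severe_share <- tapply(accidents$severe, accidents$vis_group, mean)
```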
According to the data, California has by far the highest number of accidents of all the states, with 663,204 accidents. Texas and Florida come next, while the remaining states have broadly similar numbers of accidents. This could suggest that California is more prone to vehicular accidents. However, a limitation of this conclusion is that accidents in other states may not be recorded as thoroughly as they are in California. We also need to check whether the number of accidents in states like California is driven by certain counties within the state. This will be discussed later on.
We also compare the 15 states with the most accidents against the 15 states with the fewest accidents in the bar plot above. The states with the fewest traffic accidents (North Dakota, South Dakota, and Montana) are expansive, sparsely populated states, whereas California and Texas are the most populous states in the United States.
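The ranking of states by accident count can be sketched as follows. The `state` vector is made up, and `n` is 2 rather than 15 to keep the toy example short.

```r
# Made-up state labels, one entry per accident record
state <- c("CA", "CA", "CA", "CA", "TX", "TX", "TX", "FL", "FL", "ND")

n <- 2  # the report compares the top and bottom 15

# Accident counts per state, largest first
by_state <- sort(table(state), decreasing = TRUE)

top_states <- names(by_state)[seq_len(n)]
bottom_states <- rev(names(by_state))[seq_len(n)]
```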
This leads to a natural research question: is the number of accidents correlated with the population of a state, and if so, with the population itself or with the population density? To answer this question, we bring in state-level population and area data from the state.x77 dataset in R's built-in datasets package. Note that state.x77 records population estimates from the 1970s (in thousands), so it is only a rough proxy for the populations during the study period.
##
## Call:
## lm(formula = n_accidents ~ Population, data = state_x77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -231558 -30058 -8877 13547 386796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5969.821 17841.169 0.335 0.739
## Population 12.758 2.885 4.422 5.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89950 on 47 degrees of freedom
## Multiple R-squared: 0.2938, Adjusted R-squared: 0.2788
## F-statistic: 19.55 on 1 and 47 DF, p-value: 5.755e-05
##
## Call:
## lm(formula = n_accidents ~ density, data = state_x77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69080 -53480 -24534 1536 601588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69432 18414 3.771 0.000455 ***
## density -57657 68850 -0.837 0.406590
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106200 on 47 degrees of freedom
## Multiple R-squared: 0.0147, Adjusted R-squared: -0.006262
## F-statistic: 0.7013 on 1 and 47 DF, p-value: 0.4066
First, we fit a simple linear regression that predicts the number of accidents from population (n_accidents ~ Population). In the first summary above, the p-value of the Population coefficient (5.76e-05) is well below the 0.05 significance level, so we have sufficient evidence to reject the null hypothesis that there is no linear relationship between the two variables. Thus, we find a positive association between the population of a state and its number of accidents, although population alone explains only about 29% of the variance (R-squared = 0.2938).
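The quantities discussed above can be read directly from a fitted model object. The sketch below fits the same kind of model on a small made-up data frame (the real `state_x77` table is not reproduced here) and extracts the slope, its t value, and its p-value. The t value is simply the estimate divided by its standard error, which matches the summary above: 12.758 / 2.885 ≈ 4.42.

```r
# Made-up state-level data for illustration only
states <- data.frame(
  Population = c(500, 1000, 2000, 4000, 8000, 12000),
  n_accidents = c(9000, 15000, 31000, 50000, 110000, 150000)
)

fit <- lm(n_accidents ~ Population, data = states)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
slope <- coefs["Population", "Estimate"]
t_value <- coefs["Population", "t value"]
p_value <- coefs["Population", "Pr(>|t|)"]

# Reject the null hypothesis of no linear relationship at the 0.05 level?
reject_null <- p_value < 0.05

# Share of variance in n_accidents explained by Population
r_squared <- summary(fit)$r.squared
```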
To investigate this question further, we test the relationship between the population density of a state and its number of accidents (the second linear regression summary above). The population density can be calculated from the state.x77 dataset by dividing population by area.
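A minimal sketch of the density calculation is shown below. `state.x77` ships with R as a matrix whose `Population` column is in thousands and whose `Area` column is in square miles, so the resulting density is in thousands of people per square mile. The lowercase `State` column is an assumption, added because the state names in this report's tables are lowercase.

```r
# state.x77 is a matrix with state names as row names
pop_area <- as.data.frame(state.x77[, c("Population", "Area")])

# Assumed join key: lowercase state names, as in the report's tables
pop_area$State <- tolower(rownames(state.x77))

# Thousands of people per square mile
pop_area$density <- pop_area$Population / pop_area$Area
```

Because `state.x77` covers the 50 states, the rows kept after joining with the accident counts depend on which states appear in both tables.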
Our initial assumption was that the population density of each state would affect the number of accidents, because common sense suggests that densely populated areas would have more traffic accidents. However, the data tell us that, at the state level, population density is not a significant predictor of the number of accidents: the p-value of the density coefficient is 0.41 and the R-squared is only 0.0147. One possible explanation is that density measured at the state level does not differ enough between states to explain the differences in accident counts.
## # A tibble: 49 x 2
## State Severity
## <chr> <dbl>
## 1 oklahoma 2.11
## 2 maine 2.13
## 3 north carolina 2.16
## 4 oregon 2.16
## 5 nebraska 2.17
## 6 new hampshire 2.22
## 7 louisiana 2.22
## 8 south carolina 2.23
## 9 texas 2.29
## 10 arizona 2.29
## # ... with 39 more rows
According to the data, South Dakota has the highest average accident severity, followed by Wyoming and North Dakota. (The table above is sorted in ascending order, so its first rows show the states with the lowest average severity.) This is interesting because these states are sparsely populated and have less traffic than other states. Since severity measures the impact an accident has on traffic delays, there could be an issue with the roads in these states that causes accidents to result in longer delays.
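Average severity per state can be computed with a grouped mean. The sketch below uses made-up records with placeholder state labels ("A", "B", "C") and sorts the result in ascending order, the same ordering as the table above.

```r
# Made-up records with placeholder state labels
accidents <- data.frame(
  State = c("A", "A", "B", "B", "B", "C"),
  Severity = c(3, 4, 2, 2, 3, 3)
)

# Mean severity within each state
avg <- aggregate(Severity ~ State, data = accidents, FUN = mean)

# Ascending order: lowest average severity first, highest last
avg <- avg[order(avg$Severity), ]

lowest <- as.character(avg$State[1])
highest <- as.character(avg$State[nrow(avg)])
```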
To see if the number of accidents is higher in certain seasons than others, we did a seasonal decomposition of the number of accidents each month.
Our seasonal decomposition shows a clear seasonal pattern in the number of accidents per month, with more accidents in the fall and winter. This could be because of worse driving conditions such as snow or lower temperatures. What is surprising is that there also seems to be an upward trend, meaning the number of accidents rises each year. We do not believe that the number of accidents would actually rise this rapidly from 2016 to 2019, so we attribute the upward trend to issues with the data rather than to a real increase.
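A decomposition of this kind can be sketched with `decompose()` from base R's stats package. This document does not say which function the authors used (`stl()` is another common choice), so treat this as one possible approach. The monthly counts below are synthetic: a linear upward trend plus a made-up seasonal effect that is higher from October to January.

```r
# Synthetic monthly counts for Feb 2016 to Dec 2019 (47 months); values are made up
n_months <- 47
calendar_month <- (seq_len(n_months) %% 12) + 1  # first month is 2 (February)

# Hypothetical seasonal effect: +20 in Oct-Jan, -10 otherwise (sums to zero over a year)
effect <- ifelse(calendar_month %in% c(10, 11, 12, 1), 20, -10)
counts <- 100 + 2 * seq_len(n_months) + effect

# Monthly time series starting in February 2016
monthly <- ts(counts, start = c(2016, 2), frequency = 12)

# Additive decomposition into trend, seasonal, and random components
parts <- decompose(monthly, type = "additive")

# Average seasonal component per calendar month (names "1" to "12")
seasonal_by_month <- tapply(as.numeric(parts$seasonal),
                            as.integer(cycle(monthly)), mean)

# The moving-average trend is NA at both ends of the series
trend_values <- parts$trend[!is.na(parts$trend)]
```

Here the seasonal component recovers the higher fall and winter months and the trend component rises steadily, which is the pattern the paragraph above reports for the real data.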
To see if the accidents are more severe in certain seasons, we also did a seasonal decomposition of the severity of accidents by month.
The seasonal pattern in the severity of accidents is not as clear as that in the number of accidents, but severity appears to be higher in winter and spring than in summer and fall. There is a downward trend overall, which could mean that traffic delays, which are what severity measures, are getting shorter.
In our analysis, we found that overall weather conditions did not have a clear impact on the severity of traffic accidents. When we analyzed weather in more depth, we noticed that low visibility, but not wind speed, is associated with more severe accidents. We also found that accidents are most common in the most populous states, such as California, Texas, and Florida, and that the number of accidents in a state is correlated with its population but not significantly with its population density. Finally, we concluded that there is a seasonal pattern in both the number and the severity of accidents.