Description of the Dataset

The data that our group is analyzing is a dataset of US accidents from 2016 to 2021. The dataset has 47 variables and 2,845,342 observations. The variables include severity of accident, timestamp, geographical coordinates (longitude and latitude), city, weather conditions, precipitation, presence of traffic signals, etc. Please note that we pruned the data by removing country (the whole dataset refers to the US), zip code (redundant with city), and street number (too many blank values in this column), along with 14 other variables. This reduced the dataset to 30 variables and made the relationships easier to examine.

The overarching question of this project is: under what conditions are car accidents most likely to occur, and what factors influence their severity? The relevant variables for this question are crash location and time, weather conditions, road environment, etc. We will explore three questions in particular. First, when do car crashes typically occur, and is severity linked to day or night? Second, where are car accidents most likely to occur? Finally, what factors of the environment affect how severe car crashes are?

Research Question 1: When are car crashes typically occurring?

To answer this question, we wanted to see if crashes were occurring more often at a specific time of day or day of the week. Thus, we created the following violin plot:

From this plot, we can see that the weekday pattern resembles a bimodal distribution, with many car crashes occurring between 7-8 AM and 3-6 PM. This can be attributed to the morning and afternoon rush hours. Moreover, the weekend distribution peaks earlier in the day than the weekday distribution, possibly because people are staying home from work. Furthermore, fewer accidents happen at night than during the day, which makes sense because most people are asleep at night. From this, we conclude that more accidents occur during the morning and afternoon rush hours on weekdays, and more occur during the day overall than at night.
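As a rough illustration of how hour of day and weekday/weekend status could be derived from the timestamp variable, here is a minimal Python sketch using only the standard library. The timestamp format, the function name `hour_and_day_type`, and the three sample values are assumptions made for this example; they are not rows from the dataset, and the report's own analysis may have been done differently.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps for illustration only (not taken from the dataset).
sample_timestamps = [
    "2019-03-04 07:45:00",  # a Monday morning
    "2019-03-04 16:10:00",  # a Monday afternoon
    "2019-03-09 13:30:00",  # a Saturday afternoon
]

def hour_and_day_type(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (hour of day, 'weekday' or 'weekend') for one timestamp string."""
    parsed = datetime.strptime(timestamp, fmt)
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    day_type = "weekend" if parsed.weekday() >= 5 else "weekday"
    return parsed.hour, day_type

# Count crashes per (hour, day type) pair; a violin or density plot of
# crash times summarizes counts like these.
counts = Counter(hour_and_day_type(ts) for ts in sample_timestamps)
```

Grouping by the `(hour, day type)` pair keeps the weekday and weekend distributions separate, which is what allows the two shapes to be compared.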

Next, we wanted to see if the season affected the number of accidents. We split the data into Autumn, Spring, Summer, and Winter to see if there was any difference. The graph below depicts this relationship. Note that because we had to filter by season, we used a random sample of only 10,000 observations; looping through more data points to group months into seasons would have taken several hours.
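Grouping months into seasons does not have to be slow: a lookup table makes the mapping a constant-time operation per observation, which may allow the full dataset to be used instead of a sample. The sketch below shows the idea in Python with the standard library. The month-to-season assignment (meteorological seasons for the Northern Hemisphere), the names `SEASON_BY_MONTH` and `season_of`, and the sample months are all assumptions for illustration, not the grouping the report actually used.

```python
from collections import Counter

# Assumed grouping: meteorological seasons, Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

def season_of(month):
    """Map a month number (1-12) to a season name with a single dict lookup."""
    return SEASON_BY_MONTH[month]

# Hypothetical month values for illustration only.
months = [1, 4, 7, 10, 12, 11]

# One pass over the data, one lookup per observation.
season_counts = Counter(season_of(m) for m in months)
```

Because each observation costs a single dictionary lookup, the work grows linearly with the number of rows.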

Clearly, more accidents occur during autumn and winter than during spring and summer. This could be due to snowy or icy roads during these seasons, which reduce drivers' control and lead to more accidents. The difference is large enough that we conclude there is a relationship between season and number of accidents.

Finally, we wanted to explore whether there was a link between the severity of car crashes and the time of day. We hypothesized that there would be more severe car crashes at night than during the day due to the lack of sunlight. The following graph will help us answer this question. Note that higher severity levels mean more severe car crashes.

We can see that our hypothesis may be correct, as the proportion of severity level 4 crashes is higher in the early morning and late night hours. During the day, the proportion of level 4 crashes is lower. Interestingly, however, the difference in proportions does not seem to be very large. We will now tackle our second question.
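The comparison above rests on the proportion of crashes at each hour that reach a given severity level, not on raw counts. A minimal Python sketch of that calculation follows. The function name `severity_share_by_hour` and the `(hour, severity)` records are invented for illustration and do not come from the dataset.

```python
from collections import defaultdict

# Hypothetical (hour of day, severity level) pairs for illustration only.
records = [
    (2, 4), (2, 2),
    (8, 2), (8, 2), (8, 3),
    (14, 2), (14, 2),
    (23, 4), (23, 2), (23, 2),
]

def severity_share_by_hour(records, level=4):
    """Return {hour: share of crashes at that hour with the given severity}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for hour, severity in records:
        totals[hour] += 1
        if severity == level:
            hits[hour] += 1
    # Dividing by the hourly total puts busy and quiet hours on the same scale.
    return {hour: hits[hour] / totals[hour] for hour in totals}

shares = severity_share_by_hour(records)
```

Using shares instead of counts matters here because far fewer crashes happen at night overall, so raw counts of level 4 crashes would understate the night-time hours.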

Research Question 2: Where are car accidents most likely to occur?

The first thing we did to explore this relationship was plot the locations of all the recorded car accidents on a map of the contiguous United States.

This graph shows two patterns, both of which will be examined further in the following figures. The first is that car accidents occur more often in population-dense areas: on the East and West coasts more so than in the Midwest, and near major cities. The second is that car accidents occur along and near major highways.

To check the first observation, we graphed the frequency of car accidents in each state, expecting to see a graph similar to one of the population of each state.

This graph supported our observation from the first plot, with the most populous (and densely populated) states being the darkest and the least populous states being the lightest. However, there are anomalies in this pattern. There is no data from New York (the fourth most populous state), and some states, particularly California and Minnesota, seem to have many more accidents than would be proportional to their population. We determined that this was because

To check the second observation, we created a word cloud of the human-written descriptions of the accidents to see which location factors were noted.

The most common word in these accident descriptions was “accident”, followed by “exit”. This supports the observation that many of these accidents nationwide occur on or near major US highways.
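A word cloud is a visual rendering of word frequencies, so the ranking described above can be checked by counting words directly. The sketch below shows one way to do this in Python with the standard library. The three descriptions, the small `STOPWORDS` set, and the function name `word_counts` are made up for illustration; the report's actual tokenization and stopword handling are not known.

```python
import re
from collections import Counter

# Hypothetical accident descriptions for illustration only.
descriptions = [
    "Accident on I-5 at Exit 12.",
    "Accident near Exit 4 on Route 9.",
    "Lane blocked due to accident.",
]

# Assumed stopword list; a real analysis would likely use a fuller one.
STOPWORDS = {"on", "at", "to", "near", "due"}

def word_counts(texts, stopwords=STOPWORDS):
    """Count lowercase alphabetic words, skipping stopwords and single letters."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if len(w) > 1 and w not in stopwords)
    return counts

top_words = word_counts(descriptions).most_common(2)
```

Lowercasing before counting ensures that “Exit” and “exit” are treated as the same word.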

Research Question 3: What factors of the environment affect how severe car crashes are?

We first wanted to consider what seems to be an obvious question: does precipitation affect how severe car crashes are? The following graph depicts the relationship between the amount of precipitation in inches and the severity of the crashes.

We can see that there does not appear to be a strong relationship between precipitation and crash severity. However, we note that no severity level 1 crashes occur when precipitation exceeds about 1.5 inches. We ran a regression analysis on the data and obtained the following output.

## 
## Call:
## lm(formula = Severity ~ `Precipitation(in)`, data = u.a.prec)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18111 -0.08131 -0.08131 -0.08131  1.91869 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.0813107  0.0002614 7963.46   <2e-16 ***
## `Precipitation(in)` 0.0688254  0.0030109   22.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3947 on 2295877 degrees of freedom
## Multiple R-squared:  0.0002275,  Adjusted R-squared:  0.0002271 
## F-statistic: 522.5 on 1 and 2295877 DF,  p-value: < 2.2e-16

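We can see that precipitation and severity level are related to a statistically significant degree, which supports our original hypothesis. We interpret the analysis to indicate that a one-inch increase in precipitation is associated with an average increase of about 0.069 in severity level. However, with an R-squared of about 0.0002, precipitation explains almost none of the variation in severity, which is consistent with the weak relationship seen in the graph.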
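To make the size of this effect concrete, the fitted line can be evaluated by hand. The Python sketch below copies the coefficient estimates and R-squared from the regression output above; the function name `predicted_severity` is introduced here only for illustration.

```python
# Estimates copied from the regression output above.
INTERCEPT = 2.0813107   # (Intercept)
SLOPE = 0.0688254       # `Precipitation(in)`
R_SQUARED = 0.0002275   # Multiple R-squared

def predicted_severity(precip_in):
    """Fitted severity level for a given precipitation amount in inches."""
    return INTERCEPT + SLOPE * precip_in

dry = predicted_severity(0.0)       # fitted severity with no precipitation
one_inch = predicted_severity(1.0)  # fitted severity with one inch

# Share of the variance in severity explained by precipitation, in percent.
percent_variance_explained = R_SQUARED * 100
```

Going from no precipitation to one inch moves the fitted severity from about 2.08 to about 2.15 on a scale of 1 to 4, and the model accounts for roughly 0.02% of the variance in severity. With more than two million observations, even an effect this small is statistically significant.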

Beyond precipitation, we also wanted to consider visibility, defined as the horizontal distance at which a person should be able to see and identify an object. We converted this variable into two groups using a cutoff of 10 miles: a visibility lower than this was considered low visibility, and a visibility higher than this was considered high visibility. The graph below depicts this relationship:
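The cutoff rule can be written as a small labeling function. The Python sketch below is illustrative only: the text does not say how a reading of exactly 10 miles or a missing reading was handled, so both choices here are assumptions, as are the function name `visibility_group` and the sample readings.

```python
VISIBILITY_CUTOFF_MILES = 10.0

def visibility_group(visibility_miles, cutoff=VISIBILITY_CUTOFF_MILES):
    """Label a visibility reading as 'low' or 'high'.

    Assumption: a reading exactly at the cutoff is grouped with 'high',
    and a missing reading (None) is left unclassified.
    """
    if visibility_miles is None:
        return None
    return "low" if visibility_miles < cutoff else "high"

# Hypothetical readings in miles for illustration only.
readings = [0.5, 3.0, 9.9, 10.0, 15.0, None]
labels = [visibility_group(v) for v in readings]
```

How boundary values are assigned can matter a great deal if many readings sit exactly at the cutoff, so it is worth stating the choice explicitly.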

It stands to reason that low visibility would lead to more severe car crashes. The data supports this, as the proportion of level 4 severity crashes under low visibility is approximately triple the proportion under high visibility. Because of this, we conclude that visibility is strongly associated with the severity of car crashes.

Summary of Conclusions

After exploring this dataset, we have arrived at a few conclusions we believe could be helpful in attempting to reduce the number of car crashes in the future.

First and foremost, accidents appear to be much more likely to happen near rush hours (roughly 7-8 AM and 3-6 PM) on weekdays. Additionally, regardless of the day, we tend to see more accidents during the day than at night.

Our intuition would have suggested that more crashes happen at night because of the lack of sunlight, but this is not what we found. The first graph suggests that most crashes occur between the hours of 5 am and 7 pm. One reason may be that more cars are on the road at these hours, which should correlate positively with the number of crashes.

Another relationship that we found revealing was the proportion of high-severity crashes in the early morning compared with late at night. Once again, we were surprised to find that severe crashes appeared more likely to happen during the early morning than late at night.

This dataset also revealed that the season of the year is a factor in predicting the number of crashes. The second graph in question 1 suggests that there is a relationship between season and number of accidents, as Winter and Autumn see more accidents.

One of the more intuitive findings of our research was that more densely populated areas tended to have a higher number of crashes. The maps exploring question 2 indicate that a majority of crashes happen on the coasts, primarily in major states such as California and New York. Some areas experienced high volumes of crashes but were not heavily populated. We deduced that these areas were still transportation hotspots and therefore experienced a high number of crashes.

The final finding from our dataset was that precipitation is related to the severity of a given crash. It is intuitive that more crashes would happen in poor weather conditions, but our exploration of question 3 also suggests that crashes are more severe in bad weather. If crashes are both more frequent and more severe in bad weather, that is reason to be concerned with tackling this issue.

Things to be Answered by Future Work

While we believe our conclusions are relevant and meaningful, there are still some pressing questions that remain unanswered.

One potentially important factor that we were unable to examine in this project is infrastructure funding. It would have been very interesting to see how government spending on infrastructure relates to the likelihood of car crashes. We assume that a lack of funding would correlate with things like faulty traffic signals and potholes, which would likely increase the overall likelihood of crashes. We would have needed additional data to answer this question.

The density of cars on the road at the time of a particular crash is something else we would like to explore. Do crashes almost exclusively happen when there are many cars on the road? Are sparsely populated areas likely to witness fewer crashes per car? These questions are unanswerable without data on the number of cars on the road in given areas at given times.

The final variable we all agreed would yield interesting results was the speed at which the car was moving prior to the crash. One interesting application of this variable would be to see if going faster is actually safer in any scenario. Are there instances in which slower cars tend to crash more than faster cars? Perhaps in very rural areas, the cars that crash the most would be the ones that are on the road the longest (i.e., the slowest). We would have needed additional data to answer this question.