Introduction to Dataset

Our dataset is obtained from Kaggle. It contains data about disasters that occurred in July 2021. The dataset consists of three files: details, fatalities, and locations. The details file contains the time, type, and other information needed to distinguish a disaster. The fatalities file contains information about the people involved in these disasters: the number of people killed or injured in a specific incident, where it occurred, and details about those people. The locations file contains the specific location of each disaster, such as the longitude and latitude where the event occurred. All three files share the two columns episode_id and event_id, which allowed us to merge them unambiguously. The details file also contains string columns; some of these strings were not parsed properly in R, so we deleted three columns that we did not use in this project to resolve the parsing issue. Here are the data we worked with:
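As an illustration of the merge step described in this section, the following is a minimal sketch in base R. The three small data frames, their rows, and every column other than episode_id and event_id are made up for the example, and the choice of a left join (all.x = TRUE) is an assumption rather than a record of what was actually run.

```r
# Hypothetical miniature versions of the three files (made-up rows).
details <- data.frame(
  episode_id = c(1, 1, 2),
  event_id   = c(10, 11, 20),
  event_type = c("Flash Flood", "Hail", "Tornado"),
  stringsAsFactors = FALSE
)
fatalities <- data.frame(
  episode_id    = c(1, 2),
  event_id      = c(10, 20),
  fatality_type = c("D", "I"),
  stringsAsFactors = FALSE
)
locations <- data.frame(
  episode_id = c(1, 1, 2),
  event_id   = c(10, 11, 20),
  lat        = c(35.1, 35.4, 41.0),
  lon        = c(-90.0, -89.8, -95.5)
)

keys <- c("episode_id", "event_id")

# Left joins (an assumption) keep every event in `details`,
# even those without a matching fatality record.
merged <- merge(details, locations, by = keys, all.x = TRUE)
merged <- merge(merged, fatalities, by = keys, all.x = TRUE)

# Sanity check: in this toy example the merge neither drops nor duplicates events.
stopifnot(nrow(merged) == nrow(details))
print(merged)
```

Joining on both keys at once avoids accidental matches between events that share an episode_id but differ in event_id.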

Research Questions

Since our dataset contains climate-related data, it is meaningful to investigate how natural disasters are affected by factors such as location, and how natural disasters affect the lives of people in the United States. More specifically, we have the following research questions:

General Distribution of Data

It is always important to examine our dataset for its broader characteristics. Therefore, we first use maps to observe the general distribution of types of natural disaster in the United States during July 2021.

Examining the map, we see that the most common disasters in the USA are Flash Flood, Thunderstorm Wind and Hail. Furthermore, more disasters took place on the East Coast than on the West Coast. The few disasters on the West Coast happened in CA, OR and WA, while almost every state on the East Coast suffered natural disasters in July. On the East Coast, disasters are also concentrated close to the ocean: the points are clearly denser along the coastline than inland.

Having examined the general distribution of disasters, we break down the dataset to investigate the research questions.

Research Question 1: Location and Scope of Thunderstorms

The first question that intrigues us is whether location impacts the scope, or magnitude, of a natural disaster. After investigating the dataset, we see that only disasters of type “marine”, “thunderstorm wind” and “marine thunderstorm wind” have magnitude data, because, among all types of disasters, “magnitude” is particularly important for the weather bureau in determining the type and impact of thunderstorms. As illustrated above, Thunderstorm Wind is one of the most widespread disasters in the US; it affects the entire country. Therefore, it is important to investigate whether and how the magnitude of Thunderstorm Wind depends on its location. If magnitude differs among areas of the US, we can raise awareness in the more vulnerable places. We first generate a map of Thunderstorm Wind to look for general patterns:

Observing the map, we notice some patterns. Coastline locations clearly tend to have more Thunderstorm Wind events, since the points are denser there. However, the magnitude there does not seem to be high: we used yellow and black for clearer contrast, and the color on the East Coast of the US is not very dark. As we move to locations with smaller longitude (farther west), there tend to be more occurrences of Thunderstorm Wind with high magnitude. With larger latitude (farther north), there also seem to be more occurrences with higher magnitude. One way to see that trend better would be a kriging algorithm. In this case, we draw a contour map over the map using a linear regression fitted to the data, and include a summary of the regression to determine whether the pattern is statistically significant.

## 
## Call:
## lm(formula = magnitude ~ lon * lat, data = sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.6586  -2.1559  -1.2235   0.9442  28.2802 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.976352   6.487499  -2.154   0.0313 *  
## lon          -0.644920   0.070091  -9.201  < 2e-16 ***
## lat           1.333617   0.161762   8.244 2.63e-16 ***
## lon:lat       0.012178   0.001747   6.971 3.99e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.832 on 2533 degrees of freedom
## Multiple R-squared:  0.1774, Adjusted R-squared:  0.1765 
## F-statistic: 182.1 on 3 and 2533 DF,  p-value: < 2.2e-16

Using linear regression, we can observe the effect of location on the magnitude of Thunderstorm Wind. In the graph, the bottom left tends to have a higher level, while the bottom right tends to have a lower level. This indicates that with smaller longitude, the magnitude tends to be higher. However, the graph does not show much about how latitude and magnitude are related, so we also fit a linear regression of magnitude on longitude, latitude, and their interaction. The overall model is significant (F-test p-value < 2.2e-16), and the longitude, latitude, and interaction terms are each individually significant. Thus, we reject the null hypothesis that there is no relationship between the variables, and conclude that magnitude is related to longitude and latitude. The multiple R-squared of 0.1774 indicates, however, that location explains only about 18% of the variation in magnitude.
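Because the model includes a lon:lat interaction, the effect of longitude depends on latitude and vice versa. The sketch below works this out from the coefficients reported in the summary. It assumes lon and lat are in decimal degrees with western longitudes negative; the helper names slope_lon and slope_lat and the example coordinates are introduced only for illustration.

```r
# Coefficients copied from the regression summary above.
b_lon     <- -0.644920
b_lat     <-  1.333617
b_lon_lat <-  0.012178

# With an interaction term, the slope in one coordinate depends on the other.
slope_lon <- function(lat) b_lon + b_lon_lat * lat  # change in magnitude per degree of longitude
slope_lat <- function(lon) b_lat + b_lon_lat * lon  # change in magnitude per degree of latitude

# Illustrative coordinates (assumed, roughly within the contiguous US).
print(slope_lon(c(30, 40)))
print(slope_lat(c(-80, -100)))

# Coordinates at which each slope changes sign.
lat_zero <- -b_lon / b_lon_lat
lon_zero <- -b_lat / b_lon_lat
print(c(lat_zero = lat_zero, lon_zero = lon_zero))
```

Under these coefficients the longitude slope stays negative for latitudes below roughly 53 degrees, so the fitted magnitude rises toward the west, and the latitude slope is positive for longitudes east of roughly -109.5 degrees, so the fitted magnitude rises toward the north over most of the country.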

Research Question 2: Post-Event Damage Costs of Tornadoes

After observing how longitude and latitude affect the magnitude of a specific type of natural disaster, it is now crucial to investigate the real impacts and costs that natural disasters cause. The variables damage_crops and damage_property in our “details_c” dataset record the value of crops and property damaged, in USD. As the dataset only provides damage data for Tornadoes (event_type=“Tornado”), we look specifically into the impacts of Tornadoes.

Variables that are essential to distinguish a Tornado are maximum path width, path length and Fujita Scale. If the damage caused by a Tornado can be estimated, or even predicted, from some of these factors, it becomes possible to take preventive measures and reduce actual damage.

The above scatterplot illustrates crop damage by opacity. Most data points (Tornadoes) have a path length below 4 and a maximum path width below 200, with Fujita Scale 0-1 (not very serious). Correspondingly, most Tornadoes caused $0K-1K of crop damage, indicating a relatively small cost. Several points in the top-left corner show very large crop damage (100K); however, we also observe relatively large damage near path length zero. At a glance, we do not see a strong correlation between path width and path length, or any factor that clearly affects crop damage. The situation is similar for property damage, where no obvious trend can be observed.

To deal with possible multicollinearity and to uncover trends and clusters, we carry out Principal Component Analysis on all of the above variables except Fujita Scale. From the four quantitative variables, we obtain 4 principal components:

## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.3359 1.0955 0.7485 0.6745
## Proportion of Variance 0.4462 0.3000 0.1401 0.1138
## Cumulative Proportion  0.4462 0.7462 0.8862 1.0000
##       PC1       PC2       PC3       PC4 
## 1.3359226 1.0954630 0.7485124 0.6745375
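The "Proportion of Variance" and "Cumulative Proportion" rows follow directly from the standard deviations printed above. The sketch below recomputes them; the helper n_pcs_for is a name introduced only for this example.

```r
# Standard deviations of the four PCs, copied from the output above.
sdev <- c(PC1 = 1.3359226, PC2 = 1.0954630, PC3 = 0.7485124, PC4 = 0.6745375)

# Each PC's variance is its squared standard deviation; dividing by the
# total variance gives the "Proportion of Variance" row.
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

print(round(rbind(prop_var, cum_var), 4))

# Smallest number of PCs whose cumulative proportion reaches a target.
n_pcs_for <- function(target) unname(which(cum_var >= target)[1])
print(n_pcs_for(0.95))
```

The squared standard deviations sum to about 4, which is what one would expect if the four variables were standardized before the PCA; this is an inference from the numbers, not something the output states.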

From the analysis result above, we see that the first principal component accounts for 44.62% of the variation in the dataset, the second accounts for 30.00%, and so on. If we want the PCs to explain 95% of the variation, we need all four PCs, since the first three together explain only 88.62%. The Elbow Plot shows an elbow at k=3, which suggests keeping the first three principal components to capture most of the variation. To better visualize the trend and possible clusters, we plot the first two PCs grouped by Fujita Scale:

The plot illustrates that (1) PC1 seems to grow larger as any of the 4 variables increases; (2) there seem to be clear clusters for Fujita Scale U (small purple area on the left side), 0 and 2, with low PC1 values associated with low Fujita Scale values (U and 0) and higher PC1 values associated with Fujita Scale 2. However, points with Fujita Scale 1 are more widespread, and they overlap with the Scale U, 0 and 2 clusters; (3) arrow directions show that damage to property is likely to be positively correlated with path width, and damage to crops is likely to be positively correlated with the path length of the Tornado. We now fit regression models to see if this is the case:

## 
## Call:
## lm(formula = damage_crops ~ tor_length, data = detail.subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.445  -2.179   0.527   1.264  85.006 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.6594     1.5690  -1.058 0.293754    
## tor_length    1.9523     0.5098   3.829 0.000272 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.75 on 72 degrees of freedom
## Multiple R-squared:  0.1692, Adjusted R-squared:  0.1577 
## F-statistic: 14.66 on 1 and 72 DF,  p-value: 0.0002717

## 
## Call:
## lm(formula = damage_property ~ tor_width, data = detail.subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2051.3  -204.2   -12.6    94.1  7147.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -170.2843   155.6301  -1.094    0.278    
## tor_width      3.8077     0.8008   4.755 9.91e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1021 on 72 degrees of freedom
## Multiple R-squared:  0.239,  Adjusted R-squared:  0.2284 
## F-statistic: 22.61 on 1 and 72 DF,  p-value: 9.906e-06

As shown above, both regression models inspired by the Principal Components plot have slopes that are significant at the 0.001 level (p-values of 0.00027 and 9.9e-06, both well below 0.05). Therefore, we have strong evidence that Tornado path width is positively associated with damage to property, and that Tornado path length is positively associated with damage to crops (also shown in the graph), which is consistent with what we observe from the plot of PCs. The R-squared values (0.17 and 0.24) show that each predictor explains only a modest share of the variation in damage. Due to the limitations of the data (a short timespan covering only Tornadoes in July, and a relatively small number of data points), the models may not be the best for estimating the damage of real disasters, or sufficient for predicting future damage caused by Tornadoes. Nevertheless, we do find models indicating significant relationships in this dataset, and we believe it would be valuable to predict the impact of Tornadoes on property and crops given their scale, so as to minimize loss.
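To make the two fitted lines concrete, the sketch below plugs values into the reported coefficients. The example tornado (path length 4, maximum path width 200) is hypothetical, the helper predict_line is introduced only for this illustration, and the results are in whatever units the damage columns use.

```r
# Intercepts and slopes copied from the two regression summaries above.
crop_fit     <- c(intercept = -1.6594,   slope = 1.9523)  # damage_crops ~ tor_length
property_fit <- c(intercept = -170.2843, slope = 3.8077)  # damage_property ~ tor_width

# Point prediction from a simple linear model: intercept + slope * x.
predict_line <- function(fit, x) unname(fit["intercept"] + fit["slope"] * x)

# Hypothetical tornado chosen for illustration.
crop_pred     <- predict_line(crop_fit, 4)
property_pred <- predict_line(property_fit, 200)
print(c(crop = crop_pred, property = property_pred))

# Both intercepts are negative, so very small tornadoes get a negative
# predicted damage, which is not physically meaningful.
small_crop_pred <- predict_line(crop_fit, 0.5)
print(small_crop_pred)
```

The negative prediction for a very short path is one more reason to treat these lines as descriptions of a trend rather than as damage estimators.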

It is indeed important to assess the cost of a type of natural disaster to property and crops, but it may be even more crucial to estimate the impact of natural disasters on human lives.

Research Question 3: Fatalities from Disasters

Our last research question is to identify the groups of people who should be especially careful and the locations that can be dangerous during natural disasters. We first examine the fatalities that occurred to identify the part of the population that should be warned, and then analyze the locations where those fatalities occurred. Proper measures can then be taken in the future to ensure people’s safety and lower the fatality rate.

By exploring the plot above, we observe that the recorded number of people who died from natural disasters is greater than the number of people injured. Also, more males than females were hurt during natural disasters. For the conditional distribution of age given sex, we observe that the majority of females hurt are young (less than 30 years old), while most males hurt are between 30 and 50 years old. We also see that children under 10 years old and men older than 75 died in natural disasters.

We want to further examine whether sex or age is associated with fatality type (injury or death), so we perform Pearson’s Chi-squared test of independence as follows:

## 
##  Pearson's Chi-squared test
## 
## data:  table(fatality$fatality_sex, fatality$fatality_type)
## X-squared = 3.4749, df = 2, p-value = 0.176
## 
##  Pearson's Chi-squared test
## 
## data:  table(fatality$fatality_age, fatality$fatality_type)
## X-squared = 26.833, df = 27, p-value = 0.4728

Since the p-values for both sex (0.176) and age (0.4728) against fatality type are greater than 0.05, we fail to reject the null hypothesis that the variables are independent. In other words, we find no evidence of an association between either sex or age and fatality type. This result suggests that when natural disasters strike, anyone may be injured or killed regardless of age or gender, so everyone should be careful, as human beings are equally vulnerable when confronting nature.
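The reported p-values can be recovered from the test statistics and degrees of freedom alone, which makes the decision rule explicit. The sketch below uses only the numbers printed in the test output; the data frame layout and column names are chosen for illustration.

```r
# Test statistics and degrees of freedom copied from the output above.
tests <- data.frame(
  comparison = c("sex vs fatality type", "age vs fatality type"),
  statistic  = c(3.4749, 26.833),
  df         = c(2, 27)
)

# The p-value is the upper-tail probability of the chi-squared distribution.
tests$p_value <- pchisq(tests$statistic, df = tests$df, lower.tail = FALSE)

# Decision at the conventional 0.05 level.
tests$reject_independence <- tests$p_value < 0.05
print(tests)
```

With 27 degrees of freedom the age table has many cells. If some expected counts are small, the chi-squared approximation may be rough, and chisq.test offers simulate.p.value = TRUE as an alternative; whether that applies here cannot be determined from the output shown.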

This pie chart represents the distribution of fatalities by where they occurred. We observe that most fatalities occur when people are in vehicles, in water, or in open areas outside. Additionally, the plot displays the conditional distribution of natural disasters given the location. For example, we see that people are most likely to die in vehicles if they encounter a dust storm. Also, people are most likely to die in water if there is a marine tropical storm, rip current, flash flood, or thunderstorm wind. When there is lightning, staying under a tree, golfing, or being in open areas outside can be lethal. When there is a flash flood or lightning, even staying at home can be dangerous. In general, people should stay alert both inside and outside: not only when golfing, boating, or resting under a tree, but also when staying in a car or at home.
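The two quantities described above, the share of fatalities at each location and the conditional distribution of disaster type given location, can be computed with base R tables. This is a minimal sketch on made-up records; the data frame, the column names location and event_type, and all rows are hypothetical.

```r
# Hypothetical fatality records (made-up rows, not the project data).
fatality <- data.frame(
  location   = c("Vehicle", "Vehicle", "In Water", "In Water", "In Water", "Outside"),
  event_type = c("Dust Storm", "Flash Flood", "Rip Current", "Rip Current",
                 "Flash Flood", "Lightning"),
  stringsAsFactors = FALSE
)

counts <- table(fatality$location, fatality$event_type)

# Share of fatalities at each location (the quantity a pie chart would show).
location_share <- prop.table(table(fatality$location))

# Conditional distribution of event type given location: each row sums to 1.
event_given_location <- prop.table(counts, margin = 1)

print(round(location_share, 2))
print(round(event_given_location, 2))
```

Using margin = 2 instead would give the distribution of location given disaster type, which answers a different question (where people die in a given kind of disaster).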

Conclusion

This project analyzes how various factors influence the scope and impact of different types of natural disasters, in the hope of helping people become more aware of the location and impact of natural disasters and, where possible, reduce losses from disasters by planning ahead. We start by examining the magnitude of one of the most widespread natural disasters, followed by an analysis of possible ways to estimate the cost of Tornadoes; finally, we examine the patterns of injuries and fatalities caused by disasters. Examining the three questions, we may conclude:

Limitations and Future Directions

There are indeed some limitations to our research. The dataset we use only includes data from July 2021, so we cannot draw broader conclusions or make predictions about climate in the United States. In addition, the dataset is small and has numerous missing values, which limits our research to specific event types rather than natural disasters in general. In the future, it may be possible to find a better dataset that covers a wider time span and is more complete. In that case, we may be able to make more general conclusions and predictions about natural disasters in the United States.