Introduction and Dataset

Earthquakes have been one of nature’s most deadly forces across time. Our goal is to analyze earthquake data from the past 20+ years in order to provide insights about earthquake trends around the globe and over time.

We use an earthquake dataset from Kaggle, available at https://www.kaggle.com/datasets/warcoder/earthquake-dataset, which records information on 782 earthquakes from 2001 to 2023. It includes 19 columns of variables, listed below with definitions taken directly from Kaggle:

title: Title name given to the earthquake

magnitude: The magnitude of the earthquake

date_time: Date and time

cdi: The maximum reported intensity for the event range

mmi: The maximum estimated instrumental intensity for the event

alert: The alert level - “green”, “yellow”, “orange”, and “red”

tsunami: “1” for events in oceanic regions and “0” otherwise

sig: A number describing how significant the event is. Larger numbers indicate a more significant event. This value is determined on a number of factors, including: magnitude, maximum MMI, felt reports, and estimated impact

net: The ID of a data contributor. Identifies the network considered to be the preferred source of information for this event.

nst: The total number of seismic stations used to determine earthquake location.

dmin: Horizontal distance from the epicenter to the nearest station

gap: The largest azimuthal gap between azimuthally adjacent stations (in degrees). In general, the smaller this number, the more reliable is the calculated horizontal position of the earthquake. Earthquake locations in which the azimuthal gap exceeds 180 degrees typically have large location and depth uncertainties

magType: The method or algorithm used to calculate the preferred magnitude for the event

depth: The depth where the earthquake begins to rupture

latitude / longitude: coordinate system by means of which the position or location of any place on Earth’s surface can be determined and described

location: Location within the country

continent: Continent of the earthquake hit country

country: Affected country
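
As a quick sanity check on the column list above, the following sketch reads the header row of a CSV file and reports which of the 19 expected columns are missing or unexpected. The column names are taken from the list above; the helper `check_columns` and the in-memory sample are hypothetical, and we assume the Kaggle file is a plain comma-separated file with a header row.

```python
import csv
import io

# Column names as listed above (latitude and longitude are separate columns).
EXPECTED_COLUMNS = [
    "title", "magnitude", "date_time", "cdi", "mmi", "alert", "tsunami",
    "sig", "net", "nst", "dmin", "gap", "magType", "depth",
    "latitude", "longitude", "location", "continent", "country",
]

def check_columns(handle):
    """Return (missing, unexpected) column names for an open CSV file."""
    header = next(csv.reader(handle))
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    unexpected = [c for c in header if c not in EXPECTED_COLUMNS]
    return missing, unexpected

# Tiny in-memory stand-in for the real file (hypothetical contents).
sample = io.StringIO("title,magnitude,date_time,extra\n")
missing, unexpected = check_columns(sample)
```

With the real file, `handle` would be the result of `open(...)` on the downloaded CSV, and both returned lists should be empty if the file matches the list above.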

Main Questions

Over the course of this report, we seek to explore three main questions about earthquake occurrences. First, we will investigate how earthquake frequency and magnitude vary globally, and whether the two peak in the same locations. Next, we will describe how the number of seismic stations used to determine earthquake locations varies with other earthquake characteristics. Finally, we will analyze how earthquakes have changed over time. By answering these questions, we hope to gain and provide a better understanding of earthquake trends.

Analysis

Question 1: Magnitude

Our first question aims to analyze earthquake frequency and magnitude to gauge any patterns. To do that, we begin by examining a histogram of earthquake magnitudes as an initial exploratory step.

Here we can see that the histogram of earthquake magnitudes is unimodal and skewed to the right, with the mode at roughly 6.75. Though we know that there have been lower-magnitude earthquakes in the past 20 years, it appears that this dataset only reports those of magnitude 6.5 or larger.
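
The two observations above (where the mode sits and where the data are cut off) can be checked numerically. The sketch below bins magnitudes into fixed-width intervals and reports the fullest bin and the smallest value. The bin width of 0.25 and the sample magnitudes are assumptions made for illustration; they are not the settings of our plot or values from the dataset.

```python
import math
from collections import Counter

def histogram_counts(values, start, width):
    """Count values per bin [start + k*width, start + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = math.floor((v - start) / width)
        counts[k] += 1
    # Key each bin by its left edge.
    return {round(start + k * width, 2): n for k, n in sorted(counts.items())}

# Made-up magnitudes for illustration only; not values from the dataset.
magnitudes = [6.5, 6.6, 6.7, 6.7, 6.8, 6.9, 7.0, 7.1, 7.3, 7.6, 8.2]

counts = histogram_counts(magnitudes, start=6.5, width=0.25)
modal_bin = max(counts, key=counts.get)   # left edge of the fullest bin
lowest = min(magnitudes)                  # apparent reporting cutoff
```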

To understand how magnitude varies across different parts of the world, we analyze violin plots of earthquake magnitude by location.

From the plot, the spread of magnitudes varies greatly among locations. For instance, magnitudes in Africa and Europe fall between 6.5 and 7.0, which is on the low end; however, the frequency of earthquakes in these locations is generally higher. Asia and North America have a wider range of magnitudes, from 6.5 to 7.8, and their interquartile ranges are also larger, indicating that higher-magnitude earthquakes occur relatively often there. South America has the widest range, from 6.5 to 8.7 (extremely intense earthquakes); however, its interquartile range is on the lower end, so these extreme earthquakes are not as common.
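
Readings taken from a violin plot can be cross-checked with a per-location summary table. The sketch below groups (location, magnitude) pairs and computes the count, minimum, maximum, and interquartile range for each group. The count measures frequency, while the interquartile range measures the spread of the middle half of the values, so the two are reported separately. The helper name and the sample rows are hypothetical, and the "inclusive" quartile method is one of several conventions.

```python
import statistics
from collections import defaultdict

def summarize_by_group(rows):
    """Map group -> dict with count, min, max and interquartile range."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    summary = {}
    for group, values in groups.items():
        # quantiles() needs at least two values per group.
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[group] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "iqr": q3 - q1,
        }
    return summary

# Made-up rows for illustration only; not values from the dataset.
rows = [
    ("Asia", 6.5), ("Asia", 7.0), ("Asia", 7.5), ("Asia", 8.0),
    ("Africa", 6.5), ("Africa", 6.6), ("Africa", 6.7),
]
summary = summarize_by_group(rows)
```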

To elaborate further on this question, we use a map plot to better understand magnitude levels and frequency in various locations.

The plot above indicates that coasts experience earthquakes that are generally higher in magnitude. The western coast of South America and several regions in Asia experience more high-magnitude earthquakes, while Africa and Australia experience earthquakes of lower magnitude. Earthquakes are also more concentrated along the coasts and appear to be more spread out in interior regions. This indicates that there could be a relationship between earthquake magnitude and effects from the ocean. In a future analysis, it could be interesting to overlay the tectonic plate boundaries on this map.

Following this potential association, we check more specifically on the relationship between earthquake magnitudes and tsunamis. To do this, we produce histograms of magnitudes of earthquakes faceted on whether or not there was a tsunami.

There are more earthquakes without a tsunami than with one, but both histograms have distributions that are relatively similar to the overall distribution. To determine whether there is a statistically significant difference, we also run a two-sample t-test comparing the mean magnitudes of the two groups, split on the tsunami indicator. With a t-score of -0.13 and a p-value of 0.8931, we fail to reject the null hypothesis that the two groups have equal mean magnitudes.
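
For readers who want to see how such a t-score is computed, the sketch below implements the test statistic and degrees of freedom for a two-sample t-test. It assumes Welch's unequal-variance form, which may differ from the exact variant behind the numbers reported above, and the sample values are made up. Turning the statistic into a p-value requires the t distribution's cumulative distribution function, which the Python standard library does not provide, so that step is left to a statistics package.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Welch's two-sample t-test."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    va = statistics.variance(sample_a) / na
    vb = statistics.variance(sample_b) / nb
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up magnitudes for illustration only; not values from the dataset.
no_tsunami = [6.5, 6.6, 6.8, 7.0, 7.1, 7.4]
tsunami = [6.5, 6.7, 6.9, 7.0, 7.3]
t_score, dof = welch_t(no_tsunami, tsunami)
```

A t-score close to zero, as in the result reported above, means the difference in group means is small relative to its standard error.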

Question 2: Seismic Stations

In our second question, we seek to understand which earthquake variables are associated with the number of seismic stations used to detect an earthquake. As an initial step, we plot the number of seismic stations for each earthquake event on a map of the world. This provides a broad overview of the variable, both in terms of its overall range as well as potential explanations behind its fluctuation (e.g. country preference or income-level, remoteness of the earthquake, etc).

We see that the number of stations has a large overall range, with some earthquakes seemingly detected with zero seismic stations and some detected with hundreds. There doesn’t seem to be a clear geographic pattern in the number of stations used, though there are quite a few zero-station observations in Alaska and off of Antarctica, and more many-station observations in Japan (perhaps indicating some sort of population-density relationship). The zero-station observations are curious: perhaps these earthquakes are so strong that seismologists don’t need formal stations to identify their origin, or they are so small/affect so few people that they don’t merit rigorous precision. A third possibility is simply that the data is imperfect and that the zeros are representative of missing records.

Next, we delve deeper and look at potential relationships with all other variables in a two-dimensional PCA plot. Here, we are seeking to identify how the variation in the dataset is distributed and which variables are associated with a different number of seismic stations. After running Principal Component Analysis, we display the resulting Scree Plot:

From the Scree Plot, we see that the variability in the data set is quite spread out across the first 5 dimensions, so we suggest an analysis that uses k = 5 components or fewer. This recommendation is also supported by the large elbow in the plot at the 5th dimension. Our analysis makes use of the first two dimensions and is visualized using a PCA Biplot of Dimension 1 vs Dimension 2, which account for 24.3% and 18.2% of the variation in the data set, respectively.
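
The percentages shown in a scree plot are each component's eigenvalue divided by the sum of all eigenvalues. The sketch below computes these proportions, their running total, and the smallest number of components needed to reach a chosen share of the variance. The eigenvalues and the threshold are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the values from our PCA.

```python
def explained_variance(eigenvalues):
    """Return (proportions, cumulative proportions) of total variance."""
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative

def components_needed(eigenvalues, threshold):
    """Smallest k whose first k components reach the variance threshold."""
    _, cumulative = explained_variance(eigenvalues)
    for k, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [4.0, 2.0, 1.0, 1.0]
proportions, cumulative = explained_variance(eigenvalues)
k = components_needed(eigenvalues, threshold=0.75)
```

By the same running-total logic, the two shares reported above (24.3% and 18.2%) mean that the first two dimensions together account for 42.5% of the variation.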

From the Principal Component Analysis Biplot, we are able to see some correlations between variables and how these relate to the groupings of the number of seismic stations. We first notice that the groupings tend to separate along the Dim 2 axis, because we see clusters at different heights but not really at different widths. Next, because latitude, mmi, magnitude, sig, and cdi all point in relatively similar directions, they appear to be positively correlated with each other; however, they do not seem to be correlated with any of the seismic station groupings, since they point to the side rather than up or down. Furthermore, longitude and nst appear to be positively correlated, and to cement that point, longitude clearly points toward groups of increasingly many seismic stations.

Lastly, we see interesting patterns in the locational data. Specifically, longitude is negatively correlated with gap, indicating that as longitude increases (that is, as we move toward the right side of the map above), the azimuthal gap between stations decreases. This is consistent with the larger number of stations used on the right side of the map. Furthermore, depth is negatively correlated with latitude, indicating that as latitude increases, the depth of the earthquake's rupture decreases.
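
Because a biplot of two dimensions shows only part of the variation, correlations read from arrow directions are approximate and are worth confirming on the raw columns. The sketch below computes the Pearson correlation coefficient between two numeric columns. The helper name and the sample values are hypothetical; the function assumes equal-length inputs that are not constant.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Division by zero occurs if either input is constant.
    return sxy / math.sqrt(sxx * syy)

# Made-up (longitude, nst) values for illustration only.
longitude = [-150.0, -70.0, 30.0, 100.0, 140.0]
nst = [0, 40, 120, 200, 310]
r = pearson_r(longitude, nst)
```

A value of `r` near +1 or -1 would support the direction read from the biplot, while a value near 0 would suggest the apparent relationship comes from the projection.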

Conclusion

To conclude, we found that magnitudes of earthquakes have remained similar over time, and that coastal areas seem to experience higher-magnitude earthquakes than inland areas. The number of stations used to determine earthquake locations seems positively correlated with longitude (i.e. higher on the East Asia side of the globe and lower near Alaska/Western North America), but isn't obviously associated with anything else. Over time, average earthquake magnitudes have remained largely stagnant while the number of seismic stations varies widely, only experiencing a period of zero-station consistency between 2015 and 2021. Earthquake devastation/alert level has not changed over time when controlling for magnitude, tsunami, and rupture depth. On the whole, this exploration has provided us with the opportunity to learn more about earthquake attributes and record-keeping, despite failing to uncover any particularly strong patterns.

Recommendations for Future Work

For future work, it would be interesting to explore earthquakes of lower magnitude as well. Because this dataset deals with earthquakes that have 6.5 magnitude or above, we miss out on analysis of the many low magnitude earthquakes. Recreating our graphs on earthquakes of lower magnitudes or a dataset that includes all magnitudes would yield interesting results.

Another avenue we could explore is how time of day affects earthquakes. Especially with tsunamis, time of day is strongly related to tidal tendencies, and perhaps putting more analysis into time of day would show additional information.

Lastly, the tsunami data is very coarse, coded only as 0 or 1 depending on proximity to the coast. If we found another dataset with tsunami information on the earthquakes referenced in this dataset, we could join the more detailed tsunami data with the earthquake data to conduct a more nuanced analysis of tsunami/earthquake relationships.