Introduction and Dataset
Earthquakes have been one of nature’s most deadly forces across time.
Our goal is to analyze earthquake data from the past 20+ years in order
to provide insights about earthquake trends around the globe and over
time.
We use an earthquake dataset from Kaggle, available at
https://www.kaggle.com/datasets/warcoder/earthquake-dataset. It records
information on 782 earthquakes from 2001 to 2023 and includes 19
variables (columns), listed below with definitions taken directly from
Kaggle:
title
: Title name given to the earthquake
magnitude
: The magnitude of the earthquake
date_time
: Date and time
cdi
: The maximum reported intensity for the event range
mmi
: The maximum estimated instrumental intensity for
the event
alert
: The alert level - “green”, “yellow”, “orange”,
and “red”
tsunami
: “1” for events in oceanic regions and “0”
otherwise
sig
: A number describing how significant the event is.
Larger numbers indicate a more significant event. This value is
determined on a number of factors, including: magnitude, maximum MMI,
felt reports, and estimated impact
net
: The ID of a data contributor. Identifies the
network considered to be the preferred source of information for this
event.
nst
: The total number of seismic stations used to
determine earthquake location.
dmin
: Horizontal distance from the epicenter to the
nearest station
gap
: The largest azimuthal gap between azimuthally
adjacent stations (in degrees). In general, the smaller this number, the
more reliable is the calculated horizontal position of the earthquake.
Earthquake locations in which the azimuthal gap exceeds 180 degrees
typically have large location and depth uncertainties
magType
: The method or algorithm used to calculate the
preferred magnitude for the event
depth
: The depth where the earthquake begins to
rupture
latitude / longitude
: coordinate system by means of
which the position or location of any place on Earth’s surface can be
determined and described
location
: Location within the country
continent
: Continent of the earthquake hit country
country
: Affected country
Main Questions
Over the course of this report, we explore three main questions about
earthquake occurrences. First, we investigate how earthquake frequency
and magnitude vary globally, and whether the two peak in the same
locations. Next, we describe how the number of seismic stations used to
determine earthquake locations varies with other earthquake
characteristics. Finally, we analyze how earthquakes have changed over
time. By answering these questions, we hope to gain and provide a better
understanding of earthquake trends.
Analysis
Question 1: Magnitude
Our first question aims to analyze earthquake frequency and magnitude
to gauge any patterns. To do that, we begin by examining a histogram of
earthquake magnitudes as an initial exploratory step.

Here we can see that the histogram of earthquake magnitudes is
unimodal and skewed to the right, with the mode at roughly 6.75. Though
we know that there have been lower magnitude earthquakes in the past 20
years, it appears that this dataset only reports those of magnitude 6.5
or larger.
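As a rough illustration of how a histogram's modal bin can be read off numerically, the sketch below bins magnitudes and reports the fullest bin. The bin width of 0.25, the starting edge of 6.5, and the sample values are assumptions made for illustration; they are not taken from the dataset or from the original plot.

```python
from collections import Counter

BIN_WIDTH = 25   # assumed bin width, in hundredths of a magnitude unit (0.25)
BIN_START = 650  # assumed left edge of the first bin, in hundredths (6.50)


def bin_magnitudes(magnitudes):
    """Count magnitudes per bin, keyed by each bin's lower edge."""
    counts = Counter()
    for m in magnitudes:
        hundredths = round(m * 100)  # integer arithmetic avoids float edge cases
        index = (hundredths - BIN_START) // BIN_WIDTH
        lower_edge = (BIN_START + index * BIN_WIDTH) / 100
        counts[lower_edge] += 1
    return counts


# Hypothetical magnitudes, for illustration only.
sample = [6.5, 6.6, 6.7, 6.8, 6.8, 6.9, 6.9, 7.0, 7.3, 7.6, 8.2]
counts = bin_magnitudes(sample)
modal_edge, modal_count = counts.most_common(1)[0]
print(f"modal bin starts at {modal_edge} with {modal_count} events")
```

With a right-skewed sample like this one, the fullest bin sits near the lower end of the range while a few large events stretch the tail.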
To understand how magnitude varies across locations in the world, we
analyze violin plots of earthquake magnitudes by continent.

From the plot, the spread of magnitudes varies greatly among
continents. For instance, Africa and Europe have magnitude values
between 6.5 and 7.0, which are at the lower end; however, the frequency
of earthquakes in these locations is generally higher. Asia and North
America have a wider range of magnitudes, from 6.5 to 7.8, and their
interquartile ranges are also larger, indicating high frequency levels.
South America has the widest range of magnitudes, from 6.5 to 8.7
(extremely intense earthquakes); however, its interquartile range is on
the lower end, so these earthquakes do not occur as commonly.
To elaborate further on this question, we use a map plot to better
understand magnitude levels and frequency in various locations.

The plot above indicates that coasts experience earthquakes that are
generally higher in magnitude. The western coast of South America and
several regions in Asia experience more high-magnitude earthquakes,
while Africa and Australia experience earthquakes of lower magnitude.
Earthquakes are also more concentrated along the coasts and appear to be
more spread out in interior regions. This suggests that there could be a
relationship between earthquake magnitude and effects from the ocean. In
a future analysis, it could be interesting to overlay tectonic plate
boundaries on this map.
Following this potential association, we check more specifically on
the relationship between earthquake magnitudes and tsunamis. To do this,
we produce histograms of magnitudes of earthquakes faceted on whether or
not there was a tsunami.

There are more observations that do not cause tsunamis, but both
histograms have distributions that are relatively similar to the overall
distribution. To determine whether there is a statistically significant
difference, we also run a two-sample t-test comparing the mean
magnitudes of the two groups, split on whether the quake causes a
tsunami. With a t-score of -0.13 and a p-value of 0.8931, we fail to
reject the null hypothesis that the means of the two groups are equal.
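The comparison of group means can be sketched as follows. We do not know whether the original test assumed equal variances, so this sketch uses Welch's unequal-variance form as an assumption. The p-value relies on a normal approximation to the t distribution, which is only reasonable for large samples, and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(group_a, group_b):
    """Return (t statistic, Welch degrees of freedom, approximate two-sided p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    # Normal approximation to the t distribution (an assumption of this sketch).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, dof, p_value


# Hypothetical magnitudes, for illustration only (not taken from the dataset).
no_tsunami = [6.5, 6.7, 6.8, 7.0, 7.1, 7.4]
tsunami = [6.6, 6.7, 6.9, 7.0, 7.2, 7.3]
t_stat, dof, p_value = welch_t_test(no_tsunami, tsunami)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, approximate p = {p_value:.3f}")
```

A t statistic close to zero, as in the result reported above, means the difference in group means is small relative to its standard error, which is why the p-value is large.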
Question 2: Seismic Stations
In our second question, we seek to understand which earthquake
variables are associated with the number of seismic stations used to
detect an earthquake. As an initial step, we plot the number of seismic
stations for each earthquake event on a map of the world. This provides
a broad overview of the variable, both in terms of its overall range as
well as potential explanations behind its fluctuation (e.g. country
preference or income-level, remoteness of the earthquake, etc).

We see that the number of stations has a large overall range, with
some earthquakes seemingly detected with zero seismic stations and some
detected with hundreds. There doesn’t seem to be a clear geographic
pattern in the number of stations used, though there are quite a few
zero-station observations in Alaska and off Antarctica, and more
many-station observations in Japan (perhaps indicating some sort of
population-density relationship). The zero-station observations are
curious: perhaps these earthquakes are so strong that seismologists
don’t need formal stations to identify their origin, or they are so
small, or affect so few people, that they don’t merit rigorous
precision. A third possibility is simply that the data are imperfect and
that the zeros represent missing records.
Next, we delve deeper and look at potential relationships with all
other variables in a two-dimensional PCA plot. Here, we seek to identify
how the variation in the dataset is distributed and which variables are
associated with differing numbers of seismic stations. After running
principal component analysis (PCA), we display the resulting scree
plot:

From the scree plot, we see that the variability in the data set is
spread across the first 5 dimensions, with a pronounced elbow at the 5th
dimension, so we suggest an analysis that uses k = 5 components or
fewer. Our analysis uses the first two dimensions and is visualized with
a PCA biplot of Dimension 1 vs Dimension 2, which account for 24.3% and
18.2% of the variation in the data set, respectively.
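To make the coverage of the two plotted dimensions explicit, the short sketch below accumulates the reported shares of variance. Only the two percentages stated above are used; the shares of the remaining dimensions are not given in the text and are not guessed here.

```python
from itertools import accumulate

# Shares of variance for Dimension 1 and Dimension 2, as reported above.
explained = [0.243, 0.182]

# Running total of explained variance as components are added.
cumulative = list(accumulate(explained))

for k, share in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {share:.1%} of total variance")
```

Together the two dimensions capture 42.5% of the variation, so more than half of the variability in the data is not visible in the biplot.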

From the PCA biplot, we can see some correlations between variables and
how these relate to the groupings by number of seismic stations. We
first notice that the groupings separate mainly along the Dim 2 axis: we
see clusters at different heights, but not really at different
horizontal positions. Next, because their arrows all point in relatively
similar directions, latitude, mmi, magnitude, sig, and cdi appear to be
positively correlated with each other, but they do not seem to be
correlated with any of the seismic station groupings, since these arrows
point to the side and not up or down. Furthermore, longitude and nst
appear to be positively correlated, and to cement that point, longitude
clearly points toward the groups with increasingly many seismic
stations.
Lastly, we see interesting patterns in the locational data.
Specifically, longitude is negatively correlated with gap, indicating
that as longitude increases (i.e. as we move toward the right side of
the above map), the azimuthal gap between stations decreases. This is
consistent with the map, where more stations are used on the right side.
Furthermore, depth is negatively correlated with latitude, indicating
that as latitude increases, the depth of the earthquake’s rupture
decreases.
Question 3: Time Trends
Our last question concerns how earthquakes have changed over time. We
were originally motivated to ask this question with regard to how
earthquakes affect an ever more densely populated world, but there could
be other drivers of change too, like climate change and a heating
atmosphere.
To begin our analysis, we plotted the average yearly magnitudes of
earthquakes to see overall how magnitudes were related to time, with a
loess smoothing line and a 95% confidence band.

From the graph, we see that the yearly average stays in a fairly
narrow window around 7. Although there are slight ups and downs, the
yearly averages fall between 6.85 and 7.05, even though the recorded
earthquakes range from 6.5 to 9.1. This shows that, in aggregate,
earthquake magnitudes have remained relatively similar over time.
We were also curious whether earthquake alert levels have changed over
time due to human factors (e.g. a denser population, more stringent
building codes, etc.) rather than due to magnitudes or other geological
changes. Earthquakes trigger a red, yellow, or green alert (in
decreasing order of severity) depending on their estimated fatalities
and economic losses. The levels are determined by an automated system
run by the USGS. We ran a logistic regression of the occurrence of a
“high alert” (i.e. a red or yellow alert, which together comprise about
15% of non-NA alerts in our data set) on the date, controlling for
potential predictors of the severity of the earthquake itself. In this
way, we hoped to isolate purely man-made changes in earthquake impacts
over time.
|               | High Alert        |
| Magnitude     | 1.583*** (0.298)  |
| Tsunami       | -0.916*** (0.282) |
| Rupture Depth | -0.006*** (0.002) |
| Date          | 0.0001 (0.0001)   |
| Observations  | 415               |
Note: ***: p<0.01; **: p<0.05; *: p<0.1
The regression output indicates that once we control for other
earthquake severity characteristics including magnitude, tsunami
indicator, and rupture depth, the date is not significantly associated
with the odds of a high alert. In other words, it seems that population
and infrastructure changes over the last twenty years do not have a
discernible relationship with the odds of a more devastating
earthquake.
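Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for the coefficients in the regression table above. It assumes the parenthesized values in that table are standard errors, which the table itself does not state, and it makes no assumption about the units in which the date is measured.

```python
import math

# (coefficient, parenthesized value) pairs copied from the regression table.
# Assumption: the parenthesized values are standard errors.
estimates = {
    "Magnitude": (1.583, 0.298),
    "Tsunami": (-0.916, 0.282),
    "Rupture Depth": (-0.006, 0.002),
    "Date": (0.0001, 0.0001),
}

odds_ratios = {}
wald_z = {}
for name, (coef, se) in estimates.items():
    odds_ratios[name] = math.exp(coef)  # multiplicative change in odds per unit
    wald_z[name] = coef / se            # Wald z statistic under the SE assumption
    print(f"{name}: odds ratio = {odds_ratios[name]:.3f}, z = {wald_z[name]:.2f}")
```

Under these assumptions, a one-unit increase in magnitude multiplies the odds of a high alert by roughly 4.9 with the other predictors held fixed, while the odds ratio for the date is essentially 1, in line with the conclusion that the date is not associated with the odds of a high alert.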
Following up on our previous question, we next wanted to determine
whether the number of seismic stations has changed across the years and
across continents. To explore this question visually, we first created a
new variable, year, which is simply the numeric year extracted from the
variable date_time. We then created a time series plot for each of the
continents in the dataset to see how the number of seismic stations
changed over time.
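The year extraction and per-continent averaging described above can be sketched as follows. The timestamp layout ("%d-%m-%Y %H:%M"), the helper name mean_nst_by_continent_year, and the sample rows are assumptions made for illustration; the actual format of date_time should be checked against the dataset before reuse.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y %H:%M"  # assumed layout of the date_time column

# Hypothetical rows, for illustration only (not taken from the dataset).
rows = [
    {"date_time": "14-03-2005 10:15", "continent": "Asia", "nst": 400},
    {"date_time": "02-09-2005 23:40", "continent": "Asia", "nst": 600},
    {"date_time": "19-07-2016 04:05", "continent": "Asia", "nst": 0},
    {"date_time": "27-01-2005 08:30", "continent": "South America", "nst": 350},
]


def mean_nst_by_continent_year(records, date_format=DATE_FORMAT):
    """Average the nst value for each (continent, year) pair."""
    totals = defaultdict(lambda: [0, 0])  # (continent, year) -> [sum, count]
    for rec in records:
        year = datetime.strptime(rec["date_time"], date_format).year
        bucket = totals[(rec["continent"], year)]
        bucket[0] += rec["nst"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


averages = mean_nst_by_continent_year(rows)
for (continent, year), avg in sorted(averages.items()):
    print(f"{continent} {year}: {avg:.1f} stations on average")
```

Each (continent, year) average then becomes one point on that continent's time series.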

From the plot, which shows the average number of seismic stations used
in each year, we see that between 2002 and 2013, with some exceptions,
most continents averaged between 300 and 700 stations. From 2014 to
2020, however, we notice an interesting result: every continent reports
an average of 0 seismic stations, and the average appears to rise from 0
again in 2021.
This finding could have several explanations. First, data may be
missing for this time period, as many of the earthquakes in the data set
are missing values for some variables. Second, it is possible that this
rapid decrease in seismic station use reflects improved technologies for
detecting earthquakes, a trend that could continue into the future, but
this seems quite unlikely given that the number dropped all the way to
zero.
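One way to probe the missing-data explanation is to compute, for each year, the share of earthquakes whose station count is exactly zero. The sketch below does this on hypothetical (year, nst) pairs; the helper name zero_share_by_year and the values are illustrative assumptions, not results from the dataset.

```python
from collections import Counter

# Hypothetical (year, nst) pairs, for illustration only.
records = [(2012, 410), (2012, 0), (2013, 520), (2015, 0), (2015, 0), (2021, 35)]


def zero_share_by_year(pairs):
    """Return the fraction of events per year whose nst value is zero."""
    totals = Counter(year for year, _ in pairs)
    zeros = Counter(year for year, nst in pairs if nst == 0)
    return {year: zeros[year] / totals[year] for year in sorted(totals)}


shares = zero_share_by_year(records)
for year, share in shares.items():
    print(f"{year}: {share:.0%} of events report zero stations")
```

If the share jumps to 100% for a block of consecutive years, the zeros are more plausibly placeholders for unrecorded values than genuine station counts.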
Recommendations for Future Work
For future work, it would be interesting to explore earthquakes of
lower magnitude as well. Because this dataset only covers earthquakes of
magnitude 6.5 or above, we miss out on the many low-magnitude
earthquakes. Recreating our graphs for lower-magnitude earthquakes, or
for a dataset that includes all magnitudes, could yield interesting
results.
Another avenue we could explore is how time of day affects
earthquakes. Especially with tsunamis, time of day is strongly related
to tidal tendencies, and perhaps putting more analysis into time of day
would show additional information.
Lastly, the tsunami data here are very coarse, coded only as 0 or 1
depending on proximity to the coast. If we found another dataset with
tsunami information on the earthquakes referenced in this dataset, we
could join the more detailed tsunami data with the earthquake data to
conduct a more nuanced analysis of tsunami/earthquake relationships.