Introduction

In an increasingly digital world, organizations rely on safeguards to maintain the privacy and integrity of their data. However, those defenses may not be enough to protect against more advanced threats, and data breaches can result. A data breach is defined as any security incident that results in unauthorized access to confidential information. In Washington state, any data breach that compromises the personal information of more than \(500\) Washington residents must be reported to the Attorney General's Office (AGO). The reported breaches are publicly available via Data Breach Notifications Affecting Washington Residents (https://data.wa.gov/Consumer-Protection/Data-Breach-Notifications-Affecting-Washington-Res/sb4j-ca4h), a dataset provided by the Washington State AGO Consumer Protection Division that consists of statistics derived from the mandatory notices. The dataset includes data from August 2015 to the present and is updated daily, but we will be using data as of April 17, 2023.

## [1] 833  24

##  [1] "DateAware"                         "DateSubmitted"                    
##  [3] "DataBreachCause"                   "DateStart"                        
##  [5] "DateEnd"                           "Name"                             
##  [7] "Id"                                "CyberattackType"                  
##  [9] "WashingtoniansAffected"            "IndustryType"                     
## [11] "BusinessType"                      "Year"                             
## [13] "YearText"                          "WashingtoniansAffectedRange"      
## [15] "BreachLifecycleRange"              "DaysToContainBreach"              
## [17] "DaysToIdentifyBreach"              "DaysBreachLifecycle"              
## [19] "DiscoveredInProgress"              "DaysOfExposure"                   
## [21] "DaysElapsedBetweenEndAndDiscovery" "EndedOnDayDiscovered"             
## [23] "DaysElapsedBeforeNotification"     "DaysOfExposureRange"

In the dataset, there are \(833\) rows, with each representing a data breach that occurred and was reported to the AGO, and \(24\) columns, denoting the notable categorical and quantitative variables. The variables are listed and defined as follows:

About the entity

- Id - The associated unique key
- Name - The name of the entity notifying the AGO
- IndustryType - The industry type of the entity (Business, Education, Finance, Government, Health, and Nonprofit or Charity)
- BusinessType - The subcategory of entities whose IndustryType is Business (Accessories, Biotech, Cleaning, Clothing, Construction, Consumable, Cosmetic, Cryptocurrency, Entertainment, Fitness, Home, Hospitality, Human Resources, Legal, Manufacturing, Professional Services, Real Estate, Retail, Shipping, Software, Telecommunications, Transportation, Web Services, other)

About the time of occurrence

- DateStart - The known or approximated start date of the data breach
- DateEnd - The known or approximated end date of the data breach
- DateAware - The date the entity became aware that a breach impacting Washington residents had occurred
- DateSubmitted - The date the entity submitted their notice to the AGO
- Year - The year in which a notice is submitted, with year being defined as starting on July \(24\) and ending on July \(23\) of the following year
- YearText - The Year variable in text

About the response

- DaysToIdentifyBreach - The total number of days it takes an entity to discover that a breach of consumer data has occurred after the breach has begun
- DaysToContainBreach - The total number of days it takes an entity to end the exposure of consumer data, after discovering the breach (if a breach ends before it is discovered, this column will be marked as \(0\))
- DaysBreachLifecycle - The lifecycle of a breach measured in days (the sum of DaysToIdentifyBreach and DaysToContainBreach)
- BreachLifecycleRange - DaysBreachLifecycle grouped into ranges
- DaysOfExposure - The total number of days that consumers' information was exposed by a breach
- DiscoveredInProgress - Whether or not an entity discovered a breach while it was still in progress
- DaysElapsedBetweenEndAndDiscovery - The total number of days that elapsed between the end of the exposure of consumers' data and the day the notifying entity actually learned of the breach
- EndedOnDayDiscovered - Whether or not the exposure of consumers' information ended on the same day that the notifying entity discovered the breach
- DaysElapsedBeforeNotification - The number of days that elapsed before the notifying entity submitted notice to the AGO
- DaysOfExposureRange - DaysOfExposure grouped into ranges

About the cause

- DataBreachCause - The cause of the breach (Cyberattack, Unauthorized Access, Theft or Mistake)
- CyberattackType - The subcategory of breaches caused by cyberattacks (Malware, Ransomware, Phishing, Skimmers, Other, Unknown)

About the effect

- WashingtoniansAffected - The known or approximated number of Washington residents whose information was affected by the data breach
- WashingtoniansAffectedRange - WashingtoniansAffected grouped into ranges
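To make the definitions of the response variables concrete, the following Python sketch derives them from the three date columns. The formulas are our reading of the definitions above rather than the AGO's documented method, and the function names are hypothetical. In particular, labeling a July \(24\) to July \(23\) reporting year by the calendar year in which it ends is an assumption.

```python
from datetime import date


def breach_durations(date_start, date_aware, date_end):
    """Derive the response variables from three dates (assumed formulas)."""
    days_to_identify = (date_aware - date_start).days
    # Per the definition above, containment is 0 when the breach
    # ended before it was discovered.
    days_to_contain = max(0, (date_end - date_aware).days)
    return {
        "DaysToIdentifyBreach": days_to_identify,
        "DaysToContainBreach": days_to_contain,
        "DaysBreachLifecycle": days_to_identify + days_to_contain,
    }


def reporting_year(date_submitted):
    """Assumption: a July 24 - July 23 year is labeled by the year it ends in."""
    if (date_submitted.month, date_submitted.day) >= (7, 24):
        return date_submitted.year + 1
    return date_submitted.year


# Hypothetical breach: began Jan 1, discovered Jan 11, contained Jan 15.
example = breach_durations(date(2022, 1, 1), date(2022, 1, 11), date(2022, 1, 15))
print(example)
print(reporting_year(date(2022, 7, 24)), reporting_year(date(2023, 4, 17)))
```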

In order to prevent further breaches of data, their characteristics must be understood. Throughout the report, we hope to answer the following questions:

- Are data breaches less easily detectable when started by certain causes?
- Are certain entities better at responding to data breaches than others? Have their responses improved over time?
- Are certain entities more susceptible or more heavily impacted by data breaches?
- Have data breaches been more common and/or more catastrophic in the last eight years?

Causes

In order to understand data breaches, the causes behind them must be acknowledged. Certain causes may lead to breaches that are harder to detect, which we examine by comparing breaches across the categories of DataBreachCause. We define detectability as the number of days required to identify the breach, which is recorded in the variable DaysToIdentifyBreach.

When DaysToIdentifyBreach is plotted against DataBreachCause in a side-by-side boxplot, it is evident that there are many outliers, mainly in the higher end, for all of the causes, and it is worth noting that these may skew the data. Nonetheless, it seems that data breaches resulting from thefts or mistakes (Theft or Mistake) are the most easily identified since it takes fewer days to identify them, as shown by the smaller median and the overall smaller range of the boxplot. Data breaches caused by cyberattacks (Cyberattack) seem to be skewed slightly to the left and seem to be the least identifiable, given that the median (second quartile) is higher than that of the other categories, denoting that it takes more days to identify them. In contrast, breaches from unauthorized access (Unauthorized Access) are skewed to the right and seem to be middling in terms of the time to identify (i.e., the median is greater than that of Theft or Mistake but less than that of Cyberattack).
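As a reminder of what the boxplot encodes, here is a minimal Python sketch of the quartiles and the common \(1.5 \times\) IQR outlier rule. The values are hypothetical, not taken from the dataset, and the quartile method used here may differ slightly from the hinges that a given plotting library draws.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Quartiles plus the points beyond the 1.5 * IQR fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical DaysToIdentifyBreach values for one cause.
days_to_identify = [0, 1, 2, 3, 5, 8, 13, 21, 400]
summary = boxplot_summary(days_to_identify)
print(summary)
```

A single very slow detection is flagged as an outlier while leaving the median untouched, which is why the medians are the more robust basis for comparing causes.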

From this, we suspect that some causes are less detectable than others. In particular, we suspect that cyberattacks are not as noticeable, while breaches initiated by thefts or mistakes are more distinguishable. However, as seen in our ANOVA test results below, the test returns a p-value greater than \(0.05\), so we do not reject the null hypothesis that the average number of days to identify the breach is the same among the causes. In other words, we do not have sufficient evidence that breaches caused by cyberattacks are less noticeable than others.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  DaysToIdentifyBreach by DataBreachCause
## Bartlett's K-squared = 10.336, df = 2, p-value = 0.005696

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  DaysToIdentifyBreach and DataBreachCause
## F = 1.6694, num df = 2.00, denom df = 120.11, p-value = 0.1927
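The header "not assuming equal variances" indicates a Welch-type one-way ANOVA. As a sketch of what that statistic measures, the following Python code computes Welch's \(F\) and its degrees of freedom from raw groups. The function name and the toy data are hypothetical. The p-value, which requires the \(F\) distribution, is deliberately left out.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, numerator df, denominator df) for Welch's one-way ANOVA.

    Each group needs at least two values and a nonzero sample variance.
    """
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Each group is weighted by n_i / s_i^2, so noisy groups count less.
    w = [n_i / variance(g) for n_i, g in zip(n, groups)]
    w_sum = sum(w)
    grand = sum(w_i * m_i for w_i, m_i in zip(w, m)) / w_sum
    between = sum(w_i * (m_i - grand) ** 2 for w_i, m_i in zip(w, m)) / (k - 1)
    tmp = sum((1 - w_i / w_sum) ** 2 / (n_i - 1) for w_i, n_i in zip(w, n))
    f_stat = between / (1 + 2 * (k - 2) * tmp / (k ** 2 - 1))
    df_denom = (k ** 2 - 1) / (3 * tmp)
    return f_stat, k - 1, df_denom


# Hypothetical days-to-identify values for three causes.
toy_groups = [
    [10, 40, 55, 90, 200],
    [5, 30, 45, 60, 75, 120],
    [1, 2, 8, 15, 30],
]
print(welch_anova(toy_groups))
```

The fractional denominator degrees of freedom in the output above (for example \(120.11\)) arise from this approximation, which is why they are not whole numbers.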

It is also worth focusing on data breaches that were identified after they ended. Breaches that were discovered only after completion can arguably be considered successful, since they went undetected for their entire duration. Hence, these may be particularly important in exploring the discreet nature of data breaches and their causes. The time to identify this type of completed breach is captured in the variable DaysElapsedBetweenEndAndDiscovery.

From the plot, it can be determined that data breaches based on thefts or mistakes are still the easiest to identify even for completed breaches. However, it seems that breaches resulting from unauthorized access take the longest to discover in this case (compared to the previous plot) due to the slightly higher median in the plot. There are also outliers toward the higher end in this plot that may influence the data.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  DaysElapsedBetweenEndAndDiscovery by DataBreachCause
## Bartlett's K-squared = 55.409, df = 2, p-value = 9.294e-13

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  DaysElapsedBetweenEndAndDiscovery and DataBreachCause
## F = 2.4959, num df = 2.000, denom df = 48.348, p-value = 0.09299

However, as was the case before, our statistical ANOVA test returns a p-value greater than \(0.05\), so we fail to reject the null hypothesis that the average number of days to identify the breach after it ended is the same among the causes. In other words, we do not have sufficient evidence that breaches caused by cyberattacks or unauthorized access are less easily detected than others after the breach has ended.

Ultimately, while we might visually observe that the distinct causes of data breaches are correlated with the length of identification (specifically, that Theft or Mistake breaches are more easily discovered and breaches by cyberattacks are the least easily identified), our statistical ANOVA tests caution against making claims that these differences are significant.

Response

We will next examine the responses to data breaches by the respective entities. An effective response will be defined as a breach that ended on the day discovered, which is the variable EndedOnDayDiscovered. To compare the responses, the variable IndustryType will be utilized to differentiate between the entities.

The mosaic plot shows that when we compare the industry type and if the breach ended on the day of discovery for all of the observations in our sample, the only industry that appears to have a proportion greater than expected is the health industry. That is, we can see that the health industry appeared to have had a better response to data breaches, as there were more breaches that ended on the day they were discovered than we would have expected, given the proportions in the other industries. Overall, from this visualization, we can conclude that compared to the other industries, the health industry was the best at responding to data breaches.
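The phrase "greater than expected" can be made concrete with the expected counts under independence and the corresponding Pearson residuals. We assume this is the comparison the mosaic plot's shading is based on. The table below is hypothetical and only illustrates the arithmetic.

```python
def expected_counts(table):
    """Expected cell counts if row and column variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_residuals(table):
    """(observed - expected) / sqrt(expected) for every cell."""
    expected = expected_counts(table)
    return [
        [(o - e) / e ** 0.5 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


# Hypothetical counts. Rows: Health, Business.
# Columns: ended on day discovered (yes, no).
toy_table = [[30, 20], [40, 110]]
print(expected_counts(toy_table))
print(pearson_residuals(toy_table))
```

A positive residual in the "yes" column means more same-day endings than independence would predict, which is the pattern described for the health industry.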

From the graph of regression lines for each of the industries, we can see that the government, health, and business industries exhibit a trend of decreasing data breach length across the years. This tells us that as time has passed, the lifespan of the data breaches has been shortened. On the other hand, the education and non-profit/charity industries have not had as much of a noticeable decrease in the length of their data breaches. The finance industry showed a weak decrease, essentially in between the slopes of the government, health, and business industries and the education and non-profit/charity industries. Given that a decrease in the lifespan of data breaches implies an improvement in responses to data breaches, we can conclude from this visualization that the government, health, and business industries have shown improvement in their responses to data breaches over time.
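The comparison of trends comes down to the sign and size of each industry's fitted slope. A minimal Python sketch of an ordinary least-squares slope per group follows. The data and the helper name are hypothetical, and the response is assumed to be a breach-length variable such as DaysBreachLifecycle.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical (year, breach length in days) pairs per industry.
toy_lifecycles = {
    "Government": [(2016, 100), (2017, 90), (2018, 80), (2019, 70)],
    "Education": [(2016, 50), (2017, 52), (2018, 49), (2019, 51)],
}
slopes = {
    industry: ols_slope([year for year, _ in pairs], [days for _, days in pairs])
    for industry, pairs in toy_lifecycles.items()
}
print(slopes)
```

A clearly negative slope corresponds to breaches getting shorter over time, while a slope near zero corresponds to no noticeable change.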

It is reassuring that data breach responses have improved in some industries. However, we have also identified industries whose responses are lacking, which is useful for deciding where security improvements are most needed. Given that the health industry seems to be the most responsive, it may be worthwhile to further examine what protections are in place for that particular industry.

Impact

To better understand the distribution of data breaches, we look at the number of data breaches by the type of industry that the notifying entity belongs to. This could be done by examining the distribution of the variable IndustryType, which records the type of industry that the notifying entity belongs to.

In the plot above, we see a bar plot demonstrating the distribution of industry types. We can see that the industry type with the greatest number of data breaches is Business, while Government accounts for the smallest number. Some natural questions following this observation are whether Business entities are more susceptible to data breaches and whether Government entities are much less susceptible. If we could assume that there are similar numbers of entities of each industry type in Washington, then we might be able to formally test these questions. However, we only have data about the number of notifying entities for each industry type, not the total number of entities of each industry type in Washington, so we could not answer these questions. In future work, we might be able to collect the data needed to answer them and discuss them in greater depth.

We want to see how impactful data breaches are given that there are many types of notifying entities. Therefore, we examine how the number of Washingtonians affected is associated with the industry types that the notifying entity belongs to. This can be done by discussing the two variables WashingtoniansAffected, which records the number of Washingtonians affected by data breaches, and IndustryType, which is discussed above.

We will first look at the mean WashingtoniansAffected for each of the industry types.

We can see that Government entities seem to have the greatest number of Washingtonians affected on average, while Business entities seem to have the smallest number of Washingtonians affected on average.

In the above histograms, we can see the distributions of log-transformed WashingtoniansAffected for different industry types. One observation is that the distributions of WashingtoniansAffected for all industry types are very similar, with all of them heavily skewed. We can see that the data breaches of Government entities have log-transformed WashingtoniansAffected less concentrated at the left (small values) than other industries, which may indicate that the variance of WashingtoniansAffected is not equal for all industry types. In addition, we can see that the histograms for the Government and Non-Profit/Charity industries have longer tails than those of other industries, which may indicate that data breaches in these industries are more likely to affect large numbers of Washingtonians.
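To illustrate why the counts are viewed on a log scale, the Python sketch below compares the mean, the median, and the mean of the log-transformed values for two sets of breach sizes. The numbers are hypothetical, and the use of base \(10\) is an assumption, since the report does not state which base was used.

```python
from math import log10
from statistics import mean, median


def summarize(counts):
    """Mean, median, and mean of log10 values for a list of breach sizes."""
    return {
        "mean": mean(counts),
        "median": median(counts),
        "mean_log10": mean(log10(c) for c in counts),
    }


# Hypothetical WashingtoniansAffected values; one very large breach
# dominates the Government mean.
toy_affected = {
    "Government": [600, 900, 1500, 250000],
    "Business": [550, 700, 1200, 3000],
}
summaries = {industry: summarize(c) for industry, c in toy_affected.items()}
print(summaries)
```

With heavily skewed counts, a single large breach can pull the group mean far above the median, while the log scale compresses that long tail enough for the shapes of the distributions to be compared.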

To formally test our observations, we'll first test whether WashingtoniansAffected has the same variance across different industry types using Bartlett's test.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  WashingtoniansAffected by IndustryType
## Bartlett's K-squared = 394.88, df = 5, p-value < 2.2e-16

We see that the p-value is very small and less than \(0.05\), so we reject the null hypothesis that the variance of WashingtoniansAffected is the same across all industry types and conclude that there exists at least one pair of industry types that have different variances of WashingtoniansAffected. Therefore, we should not assume that variances are equal when performing the following ANOVA test.

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  WashingtoniansAffected and IndustryType
## F = 0.99027, num df = 5.00, denom df = 177.55, p-value = 0.4252

From the results of the one-way ANOVA test for whether the mean of WashingtoniansAffected is the same across industry types, we can see that the p-value is \(0.4252\), which is greater than the alpha of \(0.05\), so we fail to reject the null hypothesis that the mean of WashingtoniansAffected is the same across industry types.

Data Breaches Over Time

To explore whether or not data breaches have been more common and/or catastrophic in the last eight years, we can observe the frequency of reported breaches and the number of affected Washingtonians in each year from \(2016\) to \(2023\).

The plot on the top shows the number of reported breaches in each year, colored by what caused the breach. We see that the distribution has one mode and is skewed left. More specifically, we see a spike in data breaches in \(2021\), most of them caused by cyberattacks. While this number decreased in \(2022\) and \(2023\), the observed frequency remains higher than in the years prior to \(2021\). In addition, we note that the \(2023\) bin only includes observations up until the present (April \(17\)), meaning it does not include a full year's worth of data, so we might actually expect the count to be higher by the end of the year.

Looking at the conditional distribution of DataBreachCause given the year, we see that cyberattacks have been the most common cause of breaches in every year since \(2016\), followed by Unauthorized Access, then Theft or Mistake. However, the proportion of data breaches caused by cyberattacks increased dramatically in \(2021\), going from approximately \(66\)% of the breaches in \(2020\) to approximately \(80\)% of the breaches in \(2021\). This may be because of the increased reliance on digital platforms and resources following the start of the COVID-\(19\) pandemic, which made many online systems the target of malicious attempts to access data.
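The conditional distribution referred to here is the share of each cause within a given year. A minimal Python sketch with a handful of hypothetical records (not the dataset's actual counts) shows the computation:

```python
from collections import Counter, defaultdict


def cause_share_by_year(records):
    """Proportion of each DataBreachCause within each Year."""
    counts = defaultdict(Counter)
    for year, cause in records:
        counts[year][cause] += 1
    return {
        year: {cause: n / sum(c.values()) for cause, n in c.items()}
        for year, c in counts.items()
    }


# Hypothetical (Year, DataBreachCause) pairs.
toy_records = [
    (2020, "Cyberattack"), (2020, "Cyberattack"), (2020, "Unauthorized Access"),
    (2021, "Cyberattack"), (2021, "Cyberattack"), (2021, "Cyberattack"),
    (2021, "Cyberattack"), (2021, "Theft or Mistake"),
]
shares = cause_share_by_year(toy_records)
print(shares)
```

Because each year's shares sum to one, a rising cyberattack share reflects a change in the mix of causes rather than in the raw number of breaches.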

The plot on the bottom shows the number of Washingtonians affected in each breach in each year and is also colored by the cause of the breach. We might be interested in seeing whether the impact of the data breaches is significantly different across the years, so we can do a formal ANOVA test.

First, we'll test whether WashingtoniansAffected has the same variance across different years using Bartlett's test.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  WashingtoniansAffected by Year
## Bartlett's K-squared = 617.54, df = 7, p-value < 2.2e-16

We see that the p-value is very small and less than \(0.05\), so we reject the null hypothesis that the variance of WashingtoniansAffected is the same across all Year values. In other words, there is at least one pair of years that differ in terms of their variance of WashingtoniansAffected.
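For reference, Bartlett's K-squared compares the pooled variance with the individual group variances on a log scale. The following Python sketch computes the statistic and its degrees of freedom. The function name and data are hypothetical, and the p-value, which comes from a chi-squared distribution with \(k - 1\) degrees of freedom, is omitted.

```python
from math import log
from statistics import variance


def bartlett_statistic(groups):
    """Return (K-squared, df) for Bartlett's test of equal variances.

    Each group needs at least two values and a nonzero sample variance.
    """
    k = len(groups)
    n = [len(g) for g in groups]
    s2 = [variance(g) for g in groups]
    total = sum(n)
    pooled = sum((n_i - 1) * v for n_i, v in zip(n, s2)) / (total - k)
    numerator = (total - k) * log(pooled) - sum(
        (n_i - 1) * log(v) for n_i, v in zip(n, s2)
    )
    correction = 1 + (sum(1 / (n_i - 1) for n_i in n) - 1 / (total - k)) / (3 * (k - 1))
    return numerator / correction, k - 1


# Hypothetical groups: the second has a much larger spread than the first.
toy_years = [[500, 600, 700, 800], [500, 5000, 50000, 500000]]
print(bartlett_statistic(toy_years))
```

The statistic is near zero when all group variances agree and grows as they diverge, so a few very large breaches in some years are enough to produce a large value.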

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  WashingtoniansAffected and Year
## F = 2.0487, num df = 7.00, denom df = 206.94, p-value = 0.05062

From this one-way ANOVA test, we see that the p-value is quite small but is just above the alpha value of \(0.05\). Therefore, we do not reject the null hypothesis that the mean of WashingtoniansAffected is the same across all years.

We make the above conclusion with caution since the p-value is almost exactly equal to our significance value of \(0.05\). However, in the case that there is a significant difference, we might also be interested in seeing whether the average impact of data breaches has increased throughout the years. To do this, we will create a filtered dataset that excludes data from \(2023\) (since we do not yet have a full year of data) and rows where the number of Washingtonians Affected is not reported, and then we can create the graph below:

As seen from our tests and the smoothed linear regression line of the average of WashingtoniansAffected across years, the average number of people affected did not increase.
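The filtering step described above can be sketched in Python as follows. The row layout (one dictionary per breach, with `None` for an unreported count) and the function name are assumptions made for illustration, and the rows themselves are hypothetical.

```python
from statistics import mean


def yearly_mean_affected(rows, exclude_year=2023):
    """Mean WashingtoniansAffected per Year, skipping the partial year
    and rows where the count is not reported."""
    by_year = {}
    for row in rows:
        if row["Year"] == exclude_year or row["WashingtoniansAffected"] is None:
            continue
        by_year.setdefault(row["Year"], []).append(row["WashingtoniansAffected"])
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Hypothetical rows.
toy_rows = [
    {"Year": 2021, "WashingtoniansAffected": 500},
    {"Year": 2021, "WashingtoniansAffected": 1500},
    {"Year": 2022, "WashingtoniansAffected": 750},
    {"Year": 2022, "WashingtoniansAffected": None},
    {"Year": 2023, "WashingtoniansAffected": 9000},
]
yearly_means = yearly_mean_affected(toy_rows)
print(yearly_means)
```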

Overall, while the frequency of data breaches seems to be higher in recent years, the breaches have not been notably more catastrophic in terms of how many people they affected.

Conclusions and Words to Future Researchers

In this study, we explored a dataset about data breaches that have occurred and affected more than \(500\) residents in Washington. With the interest of investigating the impact, detectability, and resolvability of data breaches given factors such as the cause of the data breach and the industry of the affected entity, we produced a set of statistical visualizations and tests.

Firstly, in assessing the overall situation, we found a spike in reported data breaches in 2021, which we suspect was caused by the increased reliance on digital data tools and storage during the COVID-19 pandemic. This hypothesis is supported by the observation that the spike was driven by a large increase in cyberattacks. Although there was a spike in data breaches, their impact (measured by the number of Washington residents affected) did not vary significantly across the years. In terms of detectability, data breaches that were caused by Theft or Mistake were quicker to be detected than breaches caused by Cyberattack or Unauthorized Access regardless of whether or not the breach had already ended (though the difference we observed is not statistically significant based on our ANOVA tests). In addition, breaches that occurred to entities within the business industry overall affected many more Washington residents compared to other entities, but this may be because there are more business entities in the first place. In fact, when we look at the distribution of Washingtonians affected conditional on the industry, we find that there does not appear to be any significant difference in the impact across industries, which is consistent with the result of our statistical test. Lastly, we found that Health entities were particularly effective at responding to data breaches after discovery, and Government and Business entities have also improved in the last few years. Therefore, it may be useful to study what safeguards they have in place.

This information can potentially help Washington state officials and entities identify where funding or efforts could go to reduce the frequency and impact of data breaches. For instance, one takeaway from our analysis is to focus more on how cyberattacks can be prevented, seeing that this cause of data breaches has become more common in recent years. It may also be useful to identify how cyberattacks could be detected more easily if they do occur, since in our plots they were typically not detected as quickly as breaches caused by Theft or Mistake, although this difference was not statistically significant.

Although our dataset provided a lot of quantitative data about the nature, cause, and effect of data breaches in the last few years, further recommendations about this problem could be supported with qualitative data about the specific circumstances under which the breaches occurred and how exactly they were resolved. Therefore, some potential questions that might be interesting to explore in the future would be if certain solutions have been more or less effective at detecting and resolving data breach incidents. Similar questions include if data breaches were less frequent or catastrophic among entities with higher security scores for their data management systems, or what kind of data was exposed in the breaches. Lastly, we would like to note that our dataset was limited to data breaches that occurred in the state of Washington, so performing similar analyses on a dataset of breaches spanning many regions across the United States might yield more insights. Answers to these questions and limitations could give us further insight as to what can be done or what we need to watch out for going forward to maintain safety and security online.

Investigating Data Breaches in Washington State

36-315: Statistical Graphics and Visualization