According to the American Stroke Association, stroke is the fifth leading cause of death and a leading cause of disability in the United States. A stroke occurs when the blood supply to part of the brain is blocked or when a blood vessel in the brain ruptures, and it can cause long-term complications such as brain damage, paralysis, or, in the worst case, death. While the majority of strokes happen in people who are 65 or older, strokes can occur at any age and be life-threatening. Underlying medical conditions such as heart disease, high blood pressure, or diabetes can increase one’s risk of having a stroke; therefore, understanding risk is crucial for treatment and prevention. This dataset can be used for analyses such as predicting the likelihood that a patient will experience a stroke based on a variety of demographic, socioeconomic, and health-related variables. The following report seeks to explore and explain some of these relationships.
Our dataset [https://www.kaggle.com/datasets/zzettrkalpakbal/full-filled-brain-stroke-dataset?resource=download] is from Kaggle and contains demographic and health information for patients who have and have not experienced a stroke. The dataset has 4981 observations and can be used to predict the likelihood of a patient experiencing a stroke. For our report, we will investigate the relationships between the demographic and health variables and the stroke variable. In total, this dataset has 11 variables (10 predictors, 1 response); a brief description of the variables is provided below:
With this dataset, we are interested in answering the following 3 research questions:
To begin, we will conduct some univariate data analysis on our predictor variables to gain a better understanding of how our data is distributed.
The above eight pie charts show the percentages for eight categorical variables: patient’s gender, marital status, residence type, occupation, heart disease, hypertension, smoking status, and stroke. This dataset has slightly more female patients, married patients, and patients working private jobs. For each of the health variables, most patients do not have the condition in question. The most common smoking status is never smoked, followed by unknown, formerly smoked, and lastly, smokes.
To motivate our analysis, we start by examining whether there are any similarities or differences between individuals who experienced a stroke and those who did not. To do so, we subset our dataset to include only the quantitative variables, standardized the data, and then generated the dendrogram below.
Since the dataset consists of two groups (stroke and non-stroke), we cut the dendrogram into two branches to see whether the clustering in R would naturally place the individuals who experienced a stroke in one branch and those who did not in the other. To supplement the dendrogram, we colored the leaf labels so that red represents stroke and black represents no stroke. The red leaf labels do not all fall under one branch, nor do the black leaf labels all fall under the other, so the ages, average glucose levels, and body mass indexes of stroke and non-stroke patients appear to overlap rather than form two clearly distinct groups. Therefore, we now turn to other visualizations and statistical analyses to examine the relationships between specific predictor variables and the stroke variable.
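The clustering step described above can be sketched roughly as follows. This is an illustration in Python rather than the R code used for the report. The six patients and their labels are made-up values, not rows from the dataset, and complete linkage is an assumption, since the linkage method is not stated in the report.

```python
import math
from statistics import mean, stdev

# Hypothetical (age, avg_glucose_level, bmi) rows; NOT taken from the dataset.
patients = [
    (25, 85.0, 22.0), (30, 90.0, 24.0), (28, 80.0, 23.0),
    (70, 210.0, 31.0), (75, 220.0, 33.0), (68, 200.0, 30.0),
]
stroke = [0, 0, 1, 1, 1, 0]  # hypothetical labels: 1 = stroke, 0 = no stroke


def standardize(rows):
    """Center each column at 0 and scale it to sample standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def two_clusters(points):
    """Agglomerative clustering with complete linkage (assumed), stopped at 2 clusters."""
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 2:
        # Find the pair of clusters with the smallest linkage distance.
        _, i, j = min(
            (gap(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


clusters = two_clusters(standardize(patients))

# Cross-tabulate cluster membership against the stroke label.
for number, members in enumerate(clusters, start=1):
    n_stroke = sum(stroke[m] for m in members)
    print(f"cluster {number}: {n_stroke} stroke, {len(members) - n_stroke} no stroke")
```

With these made-up labels, each of the two clusters contains both stroke and non-stroke patients, which is the same kind of mixing that the red and black leaf labels show in the dendrogram.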
From the above scatter plot, we see that the relationship between average glucose level and BMI is somewhat positive both for those who did and for those who did not experience a stroke. Though the trend lines are very similar, those who experienced a stroke tended to have higher average glucose levels than those who did not. In addition, those who experienced a stroke seem to be older on average than those who did not.
A t-test comparing the mean glucose levels of individuals who did and did not have a stroke produces a very low p-value (t = 6.95), so we conclude that there is a statistically significant difference in average glucose levels between these groups.
Testing the same hypotheses with BMI also produces a very low p-value (t = 4.77); hence, we conclude that both the average glucose level and the average BMI differ between those who did and did not experience a stroke.
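As a rough illustration of how a two-sample t statistic like those above is computed, the sketch below uses Welch's unequal-variance form. That choice is an assumption, since the report does not say which variant of the t-test was run, and the glucose values are made up rather than taken from the dataset.

```python
import math
from statistics import mean, variance

# Hypothetical average glucose levels; NOT taken from the dataset.
glucose_stroke = [130.0, 145.0, 160.0, 175.0, 190.0]
glucose_no_stroke = [90.0, 100.0, 110.0, 120.0, 130.0]


def welch_t(a, b):
    """Return the Welch t statistic and its approximate degrees of freedom."""
    se2_a = variance(a) / len(a)  # squared standard error of mean(a)
    se2_b = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


t_stat, df = welch_t(glucose_stroke, glucose_no_stroke)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The p-value is then the tail probability of a t distribution with `df` degrees of freedom at `t_stat`. The standard library has no t distribution, so that last step would need a statistics package.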
Up to this point, we have largely considered the demographic variables within our dataset. Through our final research question, we are interested in seeing whether the presence of certain health conditions is related to experiencing a stroke. We begin by analyzing whether stroke is related to heart disease, using a mosaic plot shaded by Pearson residuals.
From the above mosaic plot, we first notice that the cell corresponding to no stroke and no heart disease is not colored red or blue, meaning its observed count is close to the count we would expect if stroke and heart disease were unrelated. There is also an unusually high number of individuals who experienced a stroke and have heart disease. We also notice that the number of patients who had a stroke and no heart disease, as well as the number of patients who did not have a stroke but have heart disease, are lower than expected. It seems reasonable to believe that having heart disease influences the likelihood of experiencing a stroke. These findings make sense, as heart disease such as coronary artery disease, in which plaque builds up in the arteries and can block blood flow to the brain, can result in a stroke.
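The Pearson residual behind each cell's shading compares the observed count with the count expected if the two variables were independent. Below is a minimal sketch; the 2x2 counts are made up for illustration and are not the counts from the dataset.

```python
import math

# Hypothetical counts; NOT taken from the dataset.
# Rows: no heart disease, heart disease. Columns: no stroke, stroke.
counts = [
    [90, 10],
    [60, 40],
]


def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for row, row_total in zip(table, row_totals):
        out = []
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n  # count expected under independence
            out.append((observed - expected) / math.sqrt(expected))
        residuals.append(out)
    return residuals


residuals = pearson_residuals(counts)
for row in residuals:
    print([round(r, 2) for r in row])
```

A positive residual means more observations than expected in that cell and a negative residual means fewer. The squared residuals sum to the chi-squared statistic for the table, so cells with large absolute residuals are the ones that contribute most to any departure from independence.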
Next, we are interested in seeing whether hypertension (a condition characterized by high blood pressure) is related to experiencing a stroke. To analyze this relationship, we divided individuals into those who have hypertension and those who do not and made another mosaic plot shaded by Pearson residuals.
From this mosaic plot, we observe that the cell corresponding to both stroke and hypertension is shaded dark blue, which means that more patients than expected experienced a stroke and also have hypertension. Additionally, fewer patients than expected had a stroke without being diagnosed with hypertension. These findings make sense, as high blood pressure can cause blood clots to form, which can lead to a stroke, so it is not unusual for patients who have experienced a stroke to also have hypertension.
We now wrap up our analyses with one final plot that visualizes the distances between individuals, colored by the stroke variable, to see whether individuals within the stroke and non-stroke groups lie close to one another. To do so, we selected the quantitative variables age, average glucose level, and BMI, computed a Euclidean distance matrix, and set k = 2 for multidimensional scaling (MDS) so that the two-dimensional plot shows which points are close together in the original three-dimensional space.
Using multidimensional scaling, we can see that the groups somewhat separate themselves from each other. While there is a much greater number of individuals who did not experience a stroke, many of those who did seem to cluster together in the bottom left area of the graph, while the other larger cluster seems to consist overwhelmingly of those who did not experience a stroke.
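For readers unfamiliar with the technique, the sketch below shows one way a k = 2 embedding can be computed from Euclidean distances. It assumes classical (metric) MDS and uses NumPy, neither of which is confirmed by the report. The function name `classical_mds` and the patient rows are hypothetical.

```python
import numpy as np

# Hypothetical (age, avg_glucose_level, bmi) rows; NOT taken from the dataset.
X = np.array([
    [25, 85.0, 22.0], [30, 90.0, 24.0], [28, 80.0, 23.0],
    [70, 210.0, 31.0], [75, 220.0, 33.0], [68, 200.0, 30.0],
])


def classical_mds(points, k=2):
    """Embed the rows of `points` in k dimensions from their Euclidean distances."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)            # squared Euclidean distance matrix
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ d2 @ J                    # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]         # indices of the k largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))


coords = classical_mds(X, k=2)
print(coords.shape)  # (6, 2): one 2-D point per patient
```

Plotting the two columns of `coords` against each other, colored by the stroke label, would give a plot of the kind described above. Points that end up close together in that plot are close in the original three-variable space, up to whatever variation the third dimension carried.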
This brain stroke prediction data includes demographic and health variables of individuals who did and did not experience a brain stroke. Using a variety of visualizations such as dendrograms, MDS plots, bar charts, box plots, scatter plots, and mosaic plots, we determined that there is a relationship between several of the predictor variables and the likelihood of experiencing a stroke. Namely, older individuals, those diagnosed with hypertension, and those diagnosed with heart disease are more likely to belong to the stroke group. Additionally, those who experienced a stroke tended to have higher average glucose levels and BMIs than those who did not. On the other hand, we found that residence type did not correlate strongly with whether an individual experienced a stroke.
It is worth emphasizing that this dataset has a disproportionately large number of non-stroke individuals. Therefore, we recommend that future analyses be conducted on data containing more observations of people who have experienced a stroke. It would also be interesting to look into the severity of brain stroke, as we were limited by our dataset in this regard. Additionally, conducting further analyses with other health variables such as cholesterol may be of interest. Overall, the brain stroke dataset provided us with valuable information on the demographics and health conditions of 4981 individuals and allowed us to explore and answer three research questions that were of interest to us.