36-315 Final Project

Cynthia Huang, Yizhi Zhang, Yitian Hu, Juien Yang

Data Description

The “healthcare-dataset-stroke-data” is a stroke-prediction dataset from Kaggle that contains 5110 observations (rows) with 12 attributes (columns). Each observation corresponds to one patient, and the attributes describe each patient’s health status. The categorical variables are id, gender, hypertension (yes/no), heart disease (yes/no), marital status, work type (children, government job, never worked, private, self-employed), residence type (urban, rural), smoking status (formerly smoked, never smoked, smokes, unknown), and stroke history. The quantitative variables are age, BMI, and average glucose level.

Data Preprocessing

Data cleansing

Since the data includes N/A values and unknown values, we performed the following data-cleansing steps (a code sketch follows the list):

  1. Remove rows that contain “N/A” values (201 of the 5110 rows removed, leaving 4909).

  2. Remove the one row whose gender is “Other” (leaving 4908 rows).

  3. Remove the id column, which carries no information for classification.
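A minimal sketch of these steps in R, assuming the raw file is named healthcare-dataset-stroke-data.csv and that the missing BMI values appear as the literal string “N/A”, as in the Kaggle file:

```r
# Read the raw file; the literal string "N/A" is parsed as a real NA
stroke_data <- read.csv("healthcare-dataset-stroke-data.csv",
                        na.strings = "N/A", stringsAsFactors = FALSE)

# 1. Drop the 201 rows with missing values (5110 -> 4909)
stroke_data <- na.omit(stroke_data)

# 2. Drop the single row with gender "Other" (4909 -> 4908)
stroke_data <- stroke_data[stroke_data$gender != "Other", ]

# 3. Drop the id column, which carries no information for classification
stroke_data$id <- NULL
```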

Encoding Categorical Values

We convert the categorical variables gender, ever_married, work_type, Residence_type, and smoking_status to numeric codes.
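One simple way to do this is label encoding via factor(); below is a sketch under the assumption that the cleaned data frame is named stroke_data:

```r
# Label-encode each categorical column as consecutive integers
cat_cols <- c("gender", "ever_married", "work_type",
              "Residence_type", "smoking_status")
stroke_data[cat_cols] <- lapply(stroke_data[cat_cols],
                                function(x) as.numeric(factor(x)))
```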

Data Sampling for Imbalanced Classification

There are 4699 patients with stroke value equal to 0 and 209 patients with stroke value equal to 1. The ratio of observations in ‘stroke’ is about 1:20, which is highly imbalanced. To resolve this problem, we undersample the majority class (stroke value equal to 0) to achieve a balanced class distribution.
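A sketch of this undersampling step under the same naming assumptions; the seed value is arbitrary:

```r
set.seed(315)  # arbitrary seed, for reproducibility

minority <- stroke_data[stroke_data$stroke == 1, ]  # 209 rows
majority <- stroke_data[stroke_data$stroke == 0, ]  # 4699 rows

# Draw a random sample of the majority class equal in size to the minority
majority_sampled <- majority[sample(nrow(majority), nrow(minority)), ]
balanced <- rbind(minority, majority_sampled)       # 418 rows, 1:1 classes
```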

Research Questions

What attributes are associated with stroke?

Can we reduce the linear dependence among the variables and build a good regression model for predicting whether a patient has had a stroke?

Will classification techniques be more effective in predicting whether a patient has had a stroke?

Research Question 1: What attributes are associated with stroke?

Graph 1

We start answering this question by looking at the correlations among all variables, including each variable’s correlation with stroke.
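A sketch of how such a correlation plot can be drawn with the corrplot package, assuming the balanced, fully numeric data frame is named balanced:

```r
library(corrplot)

# Pairwise correlations of all (now numeric) variables, including stroke
corr_mat <- cor(balanced)
corrplot(corr_mat, method = "color", type = "lower", tl.col = "black")
```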

From the correlation plot, we find that age and stroke are strongly positively correlated. The variables avg_glucose_level, ever_married, heart_disease, and hypertension are also correlated with stroke, whereas Residence_type, bmi, and gender appear to be essentially uncorrelated with stroke. Since this dataset is meant for classification, it is important to check the correlation between stroke and the other variables; this information tells us which variables to use to predict stroke.

Graph 2

Since we concluded from the above plot that the quantitative variables age and glucose level are correlated with stroke, we can further examine the relationship between stroke and these two variables.
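A sketch of one way to draw this plot with ggplot2, assuming the Kaggle column name avg_glucose_level and stroke coded 0/1:

```r
library(ggplot2)

# Scatterplot of average glucose level against age, colored by stroke
ggplot(balanced, aes(x = age, y = avg_glucose_level,
                     color = factor(stroke))) +
  geom_point(alpha = 0.6) +
  labs(x = "Age", y = "Average glucose level", color = "Stroke")
```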

From the plot we can see two clusters: the one at the bottom appears to consist almost entirely of people without stroke, while the one at the top shows little difference between people with and without stroke. We conclude that most people who have not had a stroke have a lower average glucose level, across all ages.

Graph 3

Next, we take a look at the relationship between a categorical variable, smoking status, and stroke with a mosaic plot.
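A sketch using the vcd package, whose mosaic() can shade cells by Pearson residuals; we assume smoking_status still carries its original labels at this point:

```r
library(vcd)

# Mosaic plot of smoking status vs. stroke, shaded by Pearson residuals
mosaic(~ smoking_status + stroke,
       data = transform(balanced, stroke = factor(stroke)),
       shade = TRUE)
```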

We see that the (1,2) cell is colored blue by its Pearson residual, indicating that more people who formerly smoked have had a stroke than we would expect under the null hypothesis that smoking status and stroke are independent. Meanwhile, the (4,2) cell is colored red, indicating that fewer people with unknown smoking status have had a stroke than we would expect under the null hypothesis. From this, we conclude that smoking status and stroke appear to be dependent.

Graph 4

Next, we use bar plots to investigate whether the categorical variables have different distributions among the population with stroke and the population without stroke. Here we look at the stacked bar plot of hypertension conditional on gender, faceted by whether one has had a stroke.

Comparing the panels for people with and without stroke, we can see that the proportion of people with hypertension is much higher among people who have had a stroke. This suggests that hypertension could be a very predictive variable for stroke. We test this hypothesis as follows:
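The report does not show how the test’s input was constructed; one plausible sketch, assuming x is a two-way table of counts by stroke group, is:

```r
# Hypothetical construction of the 2x2 input table:
# rows = stroke group, columns = hypertension status
x <- table(balanced$stroke, balanced$hypertension)
prop.test(x)
```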

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  x
## X-squared = 24.32, df = 1, p-value = 8.159e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.2075321 0.4493100
## sample estimates:
##    prop 1    prop 2 
## 0.5684211 0.2400000

Since the p-value is 8.159e-07, we reject the null hypothesis and conclude that the distribution of hypertension is significantly different between people with and without stroke. Thus, we will include hypertension when classifying stroke.

Moreover, from the stacked plot, we can see that the numbers of females and males are almost the same among people with stroke and among people without stroke. Thus, we can guess that there might not be a significant difference in the distribution of gender between the two groups. We test this hypothesis as follows:

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  x
## X-squared = 4.5857e-31, df = 1, p-value = 1
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1048642  0.1163149
## sample estimates:
##    prop 1    prop 2 
## 0.5023697 0.4966443

Since the p-value is 1, we clearly do not have enough evidence to reject the null hypothesis, so we conclude that the distribution of gender is not significantly different between people with and without stroke. Thus, we might not want to include gender when classifying stroke.

Research Question 2: Can we reduce the linear dependence among the variables and build a good regression model for predicting whether a patient has had a stroke?

According to the CDC, someone in the United States dies of a stroke every 4 minutes. Sobering facts such as this motivate us to take advantage of our dataset and build models that can effectively predict whether a patient has had a stroke.

From the previous analysis, we found out that gender, residence_type, and work_type are not likely to be very predictive. Thus, we start exploring different ways to build a model for predicting stroke without these three variables.

Graph 5

Linear regression is the first model that we wish to try. However, since we still have a total of seven variables, too many predictors might cause overfitting, and correlated variables such as age and ever_married might cause multicollinearity. Thus, we first perform PCA and see whether we can use the principal components as predictors in linear regression.

After performing PCA, the graph below shows the cumulative variation explained by the first n principal components.
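A sketch of the PCA and the cumulative-variance plot, assuming predictors holds the seven explanatory variables from the training data:

```r
# PCA on the predictors, standardized since they are on different scales
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first n components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_var, type = "b",
     xlab = "Number of principal components",
     ylab = "Cumulative proportion of variance explained")
```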

From this plot, we can see that the first 5 principal components explain more than 80 percent of the variation in the data. This suggests that there is not a lot of correlation among the variables, so we cannot significantly reduce the dimension. Still, we decided to use these 5 principal components as predictors in a linear regression model for predicting stroke and see how it performs. Below is the summary of the linear regression model built using the first five principal components.
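A sketch of how the prinComps data frame and the model summarized below can be constructed, assuming pca is the prcomp fit above and train$stroke holds the response for the training sample (both names are assumptions):

```r
# Scores on the first five components, plus the response
prinComps <- data.frame(pca$x[, 1:5], stroke = train$stroke)

pc_model <- lm(stroke ~ PC1 + PC2 + PC3 + PC4 + PC5, data = prinComps)
summary(pc_model)
```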

## 
## Call:
## lm(formula = stroke ~ PC1 + PC2 + PC3 + PC4 + PC5, data = prinComps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9866 -0.3576  0.0034  0.3853  0.8748 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.50000    0.02239  22.330  < 2e-16 ***
## PC1         -0.18041    0.01638 -11.014  < 2e-16 ***
## PC2          0.08318    0.02104   3.954 9.29e-05 ***
## PC3          0.03831    0.02144   1.786   0.0749 .  
## PC4         -0.04816    0.02272  -2.120   0.0347 *  
## PC5          0.00336    0.02519   0.133   0.8940    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4248 on 354 degrees of freedom
## Multiple R-squared:  0.2901, Adjusted R-squared:   0.28 
## F-statistic: 28.93 on 5 and 354 DF,  p-value: < 2.2e-16

We then computed the training accuracy of this model, which turns out to be about 73%. Since test accuracy is typically lower than training accuracy, we expect it to fall below 73%, which will not be very satisfying.
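A sketch of that computation, classifying an observation as stroke when its fitted value exceeds 0.5:

```r
# Threshold the fitted values at 0.5 to obtain class predictions
pred_train <- as.numeric(fitted(pc_model) > 0.5)
mean(pred_train == prinComps$stroke)  # about 0.73
```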

Research Question 3: Will classification techniques be more effective in predicting whether a patient has had a stroke?

Since PCA did not significantly reduce the dimension of the input variables, and even the training accuracy of the linear regression model was not ideal, we suspect that a linear separator might not be enough to classify stroke. Therefore, we decided to experiment with classification.

Graph 6

We first wish to see whether our dataset forms clusters, since, if so, we can then confidently use classification techniques to draw decision boundaries. Therefore, we first make a dendrogram on the explanatory variables.

We used complete linkage and k = 10 clusters to plot the dendrogram, then colored the leaf labels red or blue according to whether the patient has had a stroke or not. From this dendrogram, we see that in most clusters either stroke or no stroke dominates, so most clusters align well with stroke in the dataset. For example, the three clusters colored pink, orange, and dark grey-green on the left are dominated by no stroke, stroke, and no stroke, respectively. However, the cluster on the right contains data from both stroke and no stroke with no clear domination, and the light blue cluster in the middle contains data mostly from stroke but some from no stroke as well. Therefore, we conclude that most clusters align well with stroke in the dataset, with some exceptions.
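A sketch of this dendrogram using hclust, with the dendextend package for leaf coloring; the name features (the scaled explanatory variables) and the use of dendextend are assumptions:

```r
library(dendextend)

# Complete-linkage hierarchical clustering on Euclidean distances
hc   <- hclust(dist(scale(features)), method = "complete")
dend <- as.dendrogram(hc)

# Color leaf labels by stroke status, in the dendrogram's leaf order
# (red = stroke, blue = no stroke)
labels_colors(dend) <- ifelse(
  balanced$stroke[order.dendrogram(dend)] == 1, "red", "blue")
plot(dend)

# Cut the tree into k = 10 clusters, as in the report
clusters <- cutree(hc, k = 10)
```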

Graph 7

Finally, we want to figure out whether we can use a classification tree to classify our data.

We take a random sample of the data, which results in 418 patients, and then create a training set and a test set.
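A sketch using the rpart package; the 80/20 split proportion is an assumption, chosen because it yields a test set of 84 patients, matching the confusion matrix totals below:

```r
library(rpart)
library(rpart.plot)

set.seed(315)  # arbitrary seed, for reproducibility

# Hypothetical 80/20 split of the 418 balanced observations
train_idx <- sample(nrow(balanced), size = round(0.8 * nrow(balanced)))
train <- balanced[train_idx, ]
test  <- balanced[-train_idx, ]

# Fit and draw the classification tree
tree <- rpart(factor(stroke) ~ ., data = train, method = "class")
rpart.plot(tree)
```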

The decision tree above shows the rules that the model generated to classify the data. We read it starting from the root node:

  • At the top is the overall probability of having a stroke: 50 percent of the patients did not have a stroke.
  • The root node asks whether age is less than 57. If it is, the probability of having a stroke is 18 percent.
  • If it is not, we go down to the root’s right child node (depth 2): 54 percent of the patients have age greater than or equal to 57, and their probability of having a stroke is 0.76.
  • We keep going down the tree in this way to understand how the features affect the likelihood of having a stroke.

We tried including different sets of variables and found that the classifier that includes work_type achieves higher accuracy. The statistical tests we used earlier only consider linear dependence, whereas the classification tree does not rely on linear relationships, so we include work_type here.

Graph 8

The above confusion matrix describes the performance of our decision tree classifier on the test data. The first column of the matrix considers patients without stroke: 36 patients were correctly classified as 0 (true negatives), while the other 10 were wrongly classified as 1 (false positives). The second column considers patients with stroke: 30 were correctly classified as 1 (true positives), while 8 were wrongly classified as 0 (false negatives). We can then compute the accuracy as \(accuracy = \frac{TP+TN}{TP+TN+FP+FN}\), which is 78.6 percent. Other rates that evaluate our model, such as recall and precision, are also presented in the plot. Therefore, our decision tree classifier achieves an accuracy of 78.6%.
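A sketch of how this confusion matrix and the accuracy can be computed on the test set:

```r
# Predicted classes on the held-out test set
pred <- predict(tree, newdata = test, type = "class")

# Rows: predicted class, columns: actual class
conf_mat <- table(Predicted = pred, Actual = test$stroke)
conf_mat

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
sum(diag(conf_mat)) / sum(conf_mat)  # about 0.786
```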

Conclusion

We first explored which variables are correlated with ‘stroke’ using the correlation plot, mosaic plots, and barplots. We found that ‘Residence_type’, ‘work_type’, and ‘gender’ are not very correlated with ‘stroke’, so we decided to exclude them when constructing models for classifying ‘stroke’.

For building the classifier, we first tried PCA with linear regression. Out of seven variables, we needed the first five principal components to explain a total of 80 percent of the variation in the data. Since we failed to explain most of the variation with only a few principal components, we conclude that our features either have non-linear relationships or a low degree of dependence. The training accuracy of the linear model built using the first five principal components is 73 percent, which was not satisfying.

Next, we used clustering techniques to see how well the data separate by stroke. We built a dendrogram using complete linkage with k = 10, and we colored the leaves by ‘stroke’. We found that most clusters align well with stroke in the dataset, with some exceptions: most clusters, like the three on the left, are dominated by either stroke or no stroke, while the cluster on the right shows no clear domination.

Finally, we used a decision tree to classify our data. The model generated a sequence of rules, such as whether age is less than 57, whether avg_glucose_level is less than 96, and whether bmi is less than 39; each leaf corresponds to a probability of having or not having a stroke. We then tested the performance of the decision tree by building the confusion matrix: comparing the predicted and actual results, the accuracy of our decision tree classifier is 78.6%.

Therefore, the decision tree seems to be a suitable model for predicting whether a patient has had a stroke.