Introduction

Diabetes is a complex metabolic disorder with health implications that are difficult to overcome. To gain more insight into the factors affecting the prevalence of diabetes, we explore the Diabetes Prediction Dataset. This dataset includes information on patient demographics, lifestyle, and clinical factors. Our objective is to examine the relationships between various clinical indicators and demographic attributes in order to identify patterns seen in individuals with and without diabetes. These patterns have practical applications, as they could help us identify those at risk of developing diabetes and ultimately enhance patient care. In the following sections, we address three primary research questions about the factors underlying diabetes prevalence:

Research Questions:

To what extent do gender and age correlate with the prevalence of diabetes in the given population?

How do clinical indicators, including hypertension, heart disease, BMI, blood glucose levels, and HbA1c levels, contribute to the prediction of diabetes?

To what extent does smoking history contribute to the overall risk of diabetes?

This report aims to provide valuable insights that can guide healthcare professionals and researchers in the ongoing battle against diabetes.

Summary of Data

The following are our variables of interest:

gender: Patient’s biological sex (Female, Male, Other)

age: Age of patient (range: 0 - 80)

hypertension: Presence of hypertension (elevated blood pressure) in patient (0 for No, 1 for Yes)

heart_disease: Presence of heart disease in patient (0 for No, 1 for Yes)

smoking_history: Patient’s smoking history (current, ever, former, never, not current)

bmi: Patient’s Body Mass Index

HbA1c_level: Patient’s Hemoglobin A1c level, a measure of average blood glucose over the past 2 - 3 months

blood_glucose_level: Patient’s blood glucose level

diabetes: Presence of diabetes in patient (0 for No, 1 for Yes)

RQ1: To what extent do gender and age correlate with the prevalence of diabetes in the given population?

We wanted to learn whether there are demographic patterns in diabetes prevalence, so we focus on the gender, age, and diabetes variables and the relationships between them. A bar plot shows the distribution of diabetes cases across genders, and a boxplot shows how age differs between patients with and without diabetes.

From the above plot, we can see that there are more female than male patients in this data set; however, the number of females with diabetes is roughly the same as the number of males with diabetes. This indicates that a higher proportion of males than females have diabetes. The main takeaway from this graph is that there could be a gender-related association with diabetes prevalence; this observation prompts further investigation into lifestyle patterns or genetic makeup that may be influencing this distinction. We should note that while some patients are labelled as Other, there were only a small number of them in the dataset.

Turning our attention to age, we will use a boxplot to assess how age correlates with diabetes prevalence.

The box plot indicates a higher median age for individuals with diabetes (age 62) compared to those without diabetes (age 40). Moreover, the interquartile range for individuals with diabetes appears to be shifted towards higher ages. This indicates that older individuals are more likely to have diabetes than younger individuals. This could be because of lifestyle differences, which would be useful to further investigate. This finding tells us screening and preventive measures may need to be tailored based on age groups, with a heightened focus on older individuals.

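Our EDA suggests that there are associations between gender and diabetes, and between age and diabetes. We test each of these with a chi-squared test of independence. For the age test, age is treated as a categorical variable; the reported 3 degrees of freedom are consistent with four age groups.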

For gender:

\(H_0\): There is no association between gender and diabetes status.

\(H_A\): There is an association between gender and diabetes status.

## 
##  Pearson's Chi-squared test
## 
## data:  gender_table
## X-squared = 143.22, df = 2, p-value < 2.2e-16

Since the p-value (<2.2e-16) is less than 0.05, we reject the null hypothesis. Hence, we have enough evidence to conclude that there is a significant association between gender and diabetes status.

In practical terms, there is a statistically significant difference in the proportions of diabetes status between different gender groups. Males tend to be at a higher risk of diabetes than females, which we saw in our EDA.
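The gender test above can be sketched in R as follows. This is a minimal illustration, not the code used in the report: the counts in `gender_table` are hypothetical and were chosen only so that the example runs. What carries over is the structure, a 3 x 2 contingency table whose test has (3 - 1)(2 - 1) = 2 degrees of freedom, matching the output above.

```r
# Hypothetical counts for illustration only; these are NOT the report's data.
gender_table <- matrix(
  c(540, 60,   # Female: no diabetes, diabetes
    340, 58,   # Male
     40,  5),  # Other
  nrow = 3, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male", "Other"),
                  diabetes = c("0", "1"))
)

# Row-wise proportions: share of each gender without / with diabetes
row_prop <- prop.table(gender_table, margin = 1)
print(round(row_prop, 3))

# Pearson chi-squared test of independence
gender_test <- chisq.test(gender_table)
print(gender_test)
```

Comparing row-wise proportions rather than raw counts is what allows groups of different sizes, such as the female and male patients here, to be compared fairly.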

For age:

\(H_0\): There is no association between age and diabetes status.

\(H_A\): There is an association between age and diabetes status.

## 
##  Pearson's Chi-squared test
## 
## data:  age_table
## X-squared = 6978.4, df = 3, p-value < 2.2e-16

Since the p-value (<2.2e-16) is less than 0.05, we reject the null hypothesis. Hence, we have enough evidence to conclude that there is a significant association between age and diabetes status.

In practical terms, there is a statistically significant difference in the proportions of diabetes status between different age groups. Older age groups tend to be at higher risk of diabetes, which we saw in our EDA.
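Because a chi-squared test needs categorical inputs, age has to be binned before it can be cross-tabulated with diabetes status. The sketch below shows one way this could be done. The data are simulated and the break points (0, 20, 40, 60, 80) are an assumption; the report does not state which age groups were used, only that the test had 3 degrees of freedom.

```r
set.seed(1)

# Simulated data for illustration only; not the report's data.
n <- 2000
age <- runif(n, 0, 80)
diabetes <- rbinom(n, 1, plogis(-4.7 + 0.05 * age))  # risk rises with age

# Assumed break points: four equal-width age groups
age_group <- cut(age, breaks = c(0, 20, 40, 60, 80), include.lowest = TRUE)

# 4 x 2 contingency table, so df = (4 - 1) * (2 - 1) = 3
age_table <- table(age_group, diabetes)
age_test <- chisq.test(age_table)
print(age_table)
print(age_test)
```

Different break points would give different test statistics, so the choice of bins is worth reporting alongside the result.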

RQ2: How do clinical indicators, including hypertension, heart disease, BMI, blood glucose levels, and HbA1c levels, contribute to the prediction of diabetes?

When it comes to medical conditions like diabetes, it is often the case that multiple factors collectively contribute to the risk of developing the condition. We chose to analyze multiple clinical factors together and examine how they collectively are related to diabetes in order to gain a more nuanced understanding of how an individual’s risk for diabetes is influenced by various clinical factors.

From the above heatmaps, it seems that blood_glucose_level and HbA1c_level are significant predictors of diabetes. For patients with the same bmi, diabetic patients seem to have noticeably higher blood glucose and HbA1c levels. Additionally, the heatmap of blood_glucose_level versus HbA1c_level suggests a strong relationship between a patient having diabetes and both of these levels being elevated, indicated by the higher intensity of red in this graph. The lack of outliers in this graph might also suggest a relationship between blood glucose level and HbA1c level.

The first set of scatterplots shows how blood_glucose_level and bmi are related, split by whether a patient has hypertension and heart disease, for both diabetic and non-diabetic patients, while the second shows this relationship for HbA1c_level and bmi. In both cases we see that diabetic patients show significantly higher blood_glucose_level and HbA1c_level than non-diabetic patients. There doesn’t seem to be a clear separation between individuals with and without diabetes based on bmi alone. However, the range of BMIs at which patients have diabetes seems to shrink among patients who have hypertension and heart disease. We also see an overlap between diabetic and non-diabetic patients who have similar blood glucose and HbA1c levels, the main difference being that diabetic patients with normal levels of these tend to have higher BMIs. This suggests that while bmi does not have as strong an association with diabetes as the other clinical factors, there is still some association.

From the above, we can see that the clinical variables with the highest correlation with diabetes are bmi, blood_glucose_level, and HbA1c_level. Because these three variables are all quantitative, we perform a principal component analysis (PCA) to examine them and their relationship with diabetes.

## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.1100 0.9669 0.9127
## Proportion of Variance 0.4107 0.3116 0.2777
## Cumulative Proportion  0.4107 0.7223 1.0000

From the above summary of the PCA, we can see that 41.07% of the variance is explained by the first principal component and 31.16% of the variance is explained by the second principal component. We plot the three principal components below.
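The proportions quoted above follow directly from the standard deviations in the summary: each component's variance is its standard deviation squared, divided by the total variance. The short check below uses only the values printed in the summary table. The variable names are ours, and the remark about standardization is an inference from the numbers rather than something the report states.

```r
# Standard deviations copied from the PCA summary above
sdev <- c(PC1 = 1.1100, PC2 = 0.9669, PC3 = 0.9127)

variance <- sdev^2                    # variance of each component
prop_var <- variance / sum(variance)  # proportion of variance
cum_prop <- cumsum(prop_var)          # cumulative proportion

print(round(prop_var, 4))
print(round(cum_prop, 4))

# The total variance is about 3, which is what one would expect if the
# three input variables were standardized before the PCA (an inference).
print(sum(variance))
```

The first two components together account for about 72% of the variance, as the cumulative row of the summary shows.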

From the above, we can see that diabetes is heavily associated with PC1, with PC2 and PC3 seeming less correlated with diabetes than the first principal component. Because the principal components are not directly interpretable by themselves, we create the following biplot:

From this, we can see that diabetes is associated with an increased blood_glucose_level and HbA1c_level, whereas bmi does not appear to be heavily associated with having or not having diabetes. We can further see this in the linear combinations of the variables that each principal component represents, shown below. In particular, PC1, the principal component we noted earlier as appearing to be correlated with diabetes, consists largely of the blood_glucose_level and HbA1c_level variables.

##                           PC1        PC2         PC3
## bmi                 0.4689410  0.8819229 -0.04802481
## blood_glucose_level 0.6293765 -0.2955181  0.71871714
## HbA1c_level         0.6196609 -0.3672616 -0.69364205
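One way to read the loading matrix above is through squared loadings. Each column is a unit-length vector, so the squared entries in a column sum to 1 and can be read as the share each variable contributes to that component. The sketch below uses only the numbers printed above; the object names are ours.

```r
# Loadings copied from the output above
loadings <- matrix(
  c(0.4689410,  0.8819229, -0.04802481,
    0.6293765, -0.2955181,  0.71871714,
    0.6196609, -0.3672616, -0.69364205),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("bmi", "blood_glucose_level", "HbA1c_level"),
                  c("PC1", "PC2", "PC3"))
)

# Columns should be orthonormal: t(L) %*% L is close to the identity matrix
gram <- crossprod(loadings)
print(round(gram, 4))

# Squared loadings: share of each variable within each component
share <- loadings^2
print(round(share, 3))

# Combined share of blood glucose and HbA1c in PC1
glucose_hba1c_pc1 <- sum(share[c("blood_glucose_level", "HbA1c_level"), "PC1"])
print(round(glucose_hba1c_pc1, 3))
```

On this reading, blood_glucose_level and HbA1c_level together make up roughly 78% of PC1, while bmi makes up roughly 78% of PC2, which is consistent with the interpretation given above.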

RQ3: To what extent does smoking history contribute to the overall risk of diabetes?

Lastly, we wanted to explore the relationship between conscious lifestyle factors and diabetic status. For this we chose smoking history, as it is sometimes considered a risk factor for diabetes. In addition, we want to explore the relationship between smoking and age, to get an idea of how the potential effect of smoking across one’s life, and the duration of smoking, affects diabetic status.

We begin by filtering our data. Of our 100,000 observations, only 64,184 respondents answered the question about their smoking history.
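A filtering step of this kind could look like the sketch below. The toy data frame and the `"No Info"` label for unanswered rows are assumptions made for illustration; the report does not say how missing smoking histories are coded. The name `smokers.data` is taken from the test output later in this section.

```r
# Toy data for illustration only; the "No Info" label is an assumption.
patients <- data.frame(
  smoking_history = c("never", "No Info", "former", "current",
                      "No Info", "ever", "not current"),
  diabetes = c(0, 0, 1, 0, 1, 0, 0)
)

missing_label <- "No Info"  # assumed placeholder; adjust to the actual coding

# Keep only respondents with a recorded smoking history
smokers.data <- subset(patients, smoking_history != missing_label)

# Re-create the factor so the dropped category does not linger as a level
smokers.data$smoking_history <- factor(smokers.data$smoking_history)

# Share of rows retained
print(nrow(smokers.data) / nrow(patients))
```

In the report, the same ratio is 64,184 / 100,000, so about 64% of the observations are retained for this research question.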

Next, let’s do some EDA to visualize the relationship between smoking_history and diabetes.

The above bar plot shows the distribution of the respondents’ smoking history and the proportion of each smoking_history group that is diabetic. A majority of our subjects have never smoked, and it appears that the groups who have smoked contain a higher proportion of diabetics, although in every group the large majority are not diabetic. Of those who have smoked, current and former were the most common responses. To better quantify the difference in these proportions, let’s look at a mosaic plot colored by Pearson residuals to determine whether any smoking histories are over- or under-represented among diabetics compared to the rest of the data.

From the plot we can see that former smokers have the largest proportion of diabetics. While there does not appear to be a large difference between the different histories, it appears that people who have smoked at some point in time include a higher proportion of diabetics. The strongest over- and under-representation in the mosaic plot appears among former smokers and people who have never smoked. They show opposite results: people who have never smoked are under-represented among diabetics, with a Pearson residual less than -4, and former smokers are over-represented, with a Pearson residual greater than 4. This indicates there is a relationship between smoking and diabetic status. To assess the statistical significance of this relationship, let’s use a chi-squared test for independence.

Chi-Sq Test:

\(H_0\): Smoking history and diabetic status are independent.

\(H_A\): There is a dependency between diabetic status and smoking history.

## 
##  Pearson's Chi-squared test
## 
## data:  table(smokers.data$smoking_history, smokers.data$diabetes)
## X-squared = 430.91, df = 4, p-value < 2.2e-16

The test produced a p-value less than \(2.2 \times 10^{-16}\), which is below our chosen significance level of \(\alpha = 0.05\). Thus, we have evidence to reject the null hypothesis and conclude that there is a significant relationship between diabetic status and smoking history.
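The Pearson residuals used to color the mosaic plot and the chi-squared statistic are closely linked: each residual is (observed - expected) / sqrt(expected), and the test statistic is the sum of their squares. The sketch below illustrates this with hypothetical counts, which are not the report's data.

```r
# Hypothetical counts for illustration only; NOT the report's data.
# Rows: smoking history, columns: diabetes status (0 = no, 1 = yes)
tab <- matrix(
  c( 820,  90,
     330,  45,
     760, 160,
    3100, 330,
     560,  70),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    smoking_history = c("current", "ever", "former", "never", "not current"),
    diabetes = c("0", "1"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residuals computed by hand
pearson <- (tab - expected) / sqrt(expected)

# chisq.test() reports the same residuals and their sum of squares
smoke_test <- chisq.test(tab)
print(round(smoke_test$residuals, 2))
print(c(by_hand = sum(pearson^2), chisq_test = unname(smoke_test$statistic)))
```

A cell with a large positive residual is over-represented relative to independence and a large negative residual is under-represented, which is how the mosaic plot shading is read.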

Given that former smokers in particular had a distinctly different proportion of diabetics than the rest of the data, we can use the age variable to explore whether when in life, and perhaps for how long, a person smoked relates to diabetic status. Let’s look at the two most common positive smoking histories, which are also the only ones that indicate a habit, current and former, distributed by age.

From this we can see that the age distribution of former smokers is centered around a higher mean than that of current smokers. Although we don’t have data on how long they smoked or when in their life they smoked, age and type of smoking_history still appear to interact in predicting diabetic status. To explore this relationship, we build a logistic regression model to test how well the two variables, age and smoking_history, along with their interaction, predict diabetic status.

## 
## Call:
## glm(formula = diabetes ~ smoking_history + age + age * smoking_history, 
##     family = "binomial", data = smokers.data)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -4.7309598  0.1322991 -35.760  < 2e-16 ***
## smoking_historyever             0.3120535  0.2361082   1.322 0.186283    
## smoking_historyformer           0.5750601  0.1874260   3.068 0.002154 ** 
## smoking_historynever           -0.2525157  0.1488058  -1.697 0.089706 .  
## smoking_historynot current      0.4227864  0.1979183   2.136 0.032666 *  
## age                             0.0521902  0.0023817  21.913  < 2e-16 ***
## smoking_historyever:age        -0.0074938  0.0040282  -1.860 0.062836 .  
## smoking_historyformer:age      -0.0098017  0.0031303  -3.131 0.001741 ** 
## smoking_historynever:age        0.0003631  0.0026269   0.138 0.890059    
## smoking_historynot current:age -0.0117212  0.0033586  -3.490 0.000483 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 44422  on 64183  degrees of freedom
## Residual deviance: 39633  on 64174  degrees of freedom
## AIC: 39653
## 
## Number of Fisher Scoring iterations: 6

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: diabetes
## 
## Terms added sequentially (first to last)
## 
## 
##                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                64183      44422              
## smoking_history      4    390.2     64179      44032 < 2.2e-16 ***
## age                  1   4361.7     64178      39670 < 2.2e-16 ***
## smoking_history:age  4     36.7     64174      39633 2.076e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
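The model call shown in the output above writes out the main effects and the interaction separately. In R's formula syntax, `smoking_history * age` expands to the same three terms. The sketch below checks this on simulated data; the data frame `sim` and its coefficients are made up for illustration and are not the report's data.

```r
set.seed(42)

# Simulated data for illustration only; not the report's data.
n <- 3000
levels_sh <- c("current", "ever", "former", "never", "not current")
sim <- data.frame(
  smoking_history = factor(sample(levels_sh, n, replace = TRUE),
                           levels = levels_sh),
  age = round(runif(n, 18, 80))
)
sim$diabetes <- rbinom(n, 1, plogis(-4.5 + 0.05 * sim$age))

# Formula as written in the report's output
fit_long <- glm(diabetes ~ smoking_history + age + age * smoking_history,
                family = "binomial", data = sim)

# Equivalent shorthand: main effects plus interaction
fit_short <- glm(diabetes ~ smoking_history * age,
                 family = "binomial", data = sim)

# Both formulas should yield the same coefficients
print(all.equal(coef(fit_long), coef(fit_short)))

# Sequential analysis of deviance, as in the table above
print(anova(fit_short, test = "Chisq"))
```

With five smoking categories the model has ten coefficients. The first factor level, `current`, serves as the reference group, which appears to be why the report's coefficient table has no row for it.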

For the significance of the model, we use a chi-squared test for logistic regression:

\(H_0\): The coefficients of the predictors of the model are all equal to zero.

\(H_A\): At least one of the coefficients of the model is non-zero.
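The test statistic for these hypotheses can be recovered from the deviances printed in the model summary: the drop from the null deviance to the residual deviance is compared against a chi-squared distribution whose degrees of freedom equal the number of added coefficients. The sketch below uses only numbers from the output above. Treating this as the test the report refers to is our reading, since the report does not print the statistic itself.

```r
# Values copied from the glm summary above
null_dev  <- 44422; null_df  <- 64183
resid_dev <- 39633; resid_df <- 64174

lr_stat <- null_dev - resid_dev  # drop in deviance: 4789
lr_df   <- null_df - resid_df    # number of non-intercept coefficients: 9

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))

# Cross-check: the sequential deviances in the analysis of deviance table
# add up to the same drop, up to rounding in the printed output
seq_dev <- c(smoking_history = 390.2, age = 4361.7, interaction = 36.7)
print(sum(seq_dev))
```

The p-value is far below 0.05, so small that it is numerically indistinguishable from zero.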

From this test we reject the null hypothesis and conclude that our model is statistically significant, since the p-value is below \(\alpha = 0.05\). From the results of our regression model, we can see that most of the predictors were significant, meaning that both age and smoking_history are significant predictors of diabetic status, as is the interaction between them. However, their coefficients, which are on the log-odds scale, are small in magnitude (all below 1 in absolute value), so each has a statistically significant but small effect on the odds that someone is diabetic. While several of the interaction terms were significant, the lowest p-values in the analysis of deviance table (meaning the most significant predictors) and the largest coefficients were for age and smoking_history on their own. The interaction between age and smoking_history therefore does not produce a stronger predictor, and in general we have identified a significant but minimally impactful relationship between age, smoking history, and the odds that a person is diabetic.

This can be seen in the plot of the sorted predicted probabilities against the actual diabetic status of our samples. The predicted probabilities are confined to a small window of roughly 0% to 30%, so the diabetic and non-diabetic samples are not well separated. This helps visualize the fact that while our model and relationships were found to be statistically significant, they are not particularly useful for making distinct and accurate predictions about diabetes based on smoking history and age.
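The coefficients in the regression output are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. The sketch below does this for the age slope, using only coefficients printed in the output above. That `current` is the reference group is an inference from the fact that it is the only smoking category without its own row.

```r
# Coefficients copied from the glm summary above
b_age        <- 0.0521902   # age slope in the reference group (current)
b_former_age <- -0.0098017  # smoking_historyformer:age interaction

# Odds ratio per additional year of age, by smoking history
or_age_current <- exp(b_age)                 # about 1.054
or_age_former  <- exp(b_age + b_former_age)  # about 1.043

print(round(c(current = or_age_current, former = or_age_former), 3))
```

On this reading, each additional year of age multiplies the odds of diabetes by about 1.054 for current smokers and about 1.043 for former smokers, so the negative interaction term slightly flattens the age effect for former smokers.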

Conclusion

In this exploration of the Diabetes Prediction Dataset, we were able to uncover relationships between demographic attributes, clinical indicators, and the prevalence of diabetes. One takeaway is that males exhibited a higher risk of diabetes than females, potentially due to lifestyle and genetic factors. Age showed a significant association with diabetes, with older individuals being at higher risk. In terms of clinical indicators, blood glucose and HbA1c levels emerged as strong indicators, emphasizing their crucial role in diabetes prediction. We also found that former smokers showed a higher proportion of diabetics, indicating a potential link between smoking habits and diabetes risk. Further analysis showed that the interaction of age with smoking history adds little to the prediction of diabetic status, although individually both remain significant predictors. The findings presented in this report have practical implications for healthcare professionals. Targeted screening and preventive measures, especially for older individuals, would help, as they are more at risk. Blood glucose and HbA1c levels provide valuable insight into which health aspects to target for treatment.

Further Inquiry

While we were able to find significant relationships between diabetic status and the various demographic, clinical, and lifestyle factors in our data set, there are important factors absent from our data that could be valuable in understanding how to better predict diabetes. Geographic data on where each subject lived, how far they were from the nearest available fresh produce, and their communities’ socioeconomic status could be valuable in determining their health and diabetic status, and are worth studying and adding to the models we have created here. In addition, further demographic information such as race, education level, or marital status could provide valuable context on elements that impact personal health. Studying those additional demographic variables, particularly against our clinical variables, could reveal insightful relationships about health care and social organization. Lastly, for our smoking variable, having more temporal information, such as how long a subject smoked or when in life they started, could help determine which particular pattern of smoking has a significant relationship with diabetic status. In general, more rigorous models on more comprehensive datasets could lead to the construction of powerful predictors of diabetic status that could serve to warn or help people even without seeing a doctor.