Diabetes is a complex metabolic disorder with health implications that are difficult to overcome. In the pursuit of more insights into the factors affecting the prevalence of diabetes, we will explore the Diabetes Prediction Dataset. This dataset includes information on patient demographics, lifestyle, and clinical factors. Our objective is to delve into the relationships between various clinical indicators and demographic attributes to identify patterns seen in individuals with and without diabetes. These patterns have practical applications as they could help us identify those at risk of developing diabetes. Ultimately, this exploration is useful in enhancing patient care. In the following sections, we address three primary research questions that investigate underlying diabetes prevalence:
Research Questions:
To what extent do gender and age correlate with the prevalence of diabetes in the given population?
How do clinical indicators, including hypertension, heart disease, BMI, blood glucose levels, and HbA1c levels, contribute to the prediction of diabetes?
To what extent does smoking history contribute to the overall risk of diabetes?
This report aims to provide valuable insights that can guide healthcare professionals and researchers in the ongoing battle against diabetes.
Summary of Data
The following are our variables of interest:
gender: Patient’s biological sex (Female, Male, Other)
age: Age of patient (range: 0 - 80)
hypertension: Presence of hypertension (elevated blood pressure) in patient (0 for No, 1 for Yes)
heart_disease: Presence of heart disease in patient (0 for No, 1 for Yes)
smoking_history: Patient’s smoking history (current, ever, former, never, not current)
bmi: Patient’s Body Mass Index
HbA1c_level: Patient’s Hemoglobin A1c level over the past 2 - 3 months
blood_glucose_level: Patient’s blood glucose level
diabetes: Presence of diabetes in patient (0 for No, 1 for Yes)
We want to learn whether there are any demographic patterns in diabetes prevalence, so we are interested in the gender, age, and diabetes variables and in any relationships between them. We use a bar plot to show the distribution of diabetes cases among the different genders, and a boxplot to show how age relates to diabetes status.
From the above plot, we can see that there are more female than male patients in this data set; however, the number of females with diabetes is roughly the same as the number of males with diabetes. This indicates that a higher proportion of males than females have diabetes. The main takeaway from this graph is that there could be a gender-related association with diabetes prevalence; this observation prompts further investigation into lifestyle patterns or genetic makeup that may be influencing this distinction. We should note that while some patients are labelled as Other, there were very few of them in the dataset.
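The distinction between equal counts and unequal proportions can be made explicit with a row-wise proportion table. The following is a minimal R sketch on made-up toy data; the data frame `toy` and its values are illustrative assumptions, not the report's data.

```r
# Toy data standing in for the real dataset (values are illustrative only).
toy <- data.frame(
  gender   = c("Female", "Female", "Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male"),
  diabetes = c(0, 0, 0, 0, 1, 1,
               0, 0, 1, 1)
)

# Counts of diabetes status within each gender.
counts <- table(toy$gender, toy$diabetes)

# Row-wise proportions: each row sums to 1, so the "1" column is the
# share of that gender with diabetes.
props <- prop.table(counts, margin = 1)
print(round(props, 3))
```

In this toy example both genders have the same number of diabetic patients, but the smaller male group has the higher proportion, which is the same pattern described for the bar plot.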
Turning our attention to age, we will use a boxplot to assess how age correlates with diabetes prevalence.
The box plot indicates a higher median age for individuals with diabetes (age 62) compared to those without diabetes (age 40). Moreover, the interquartile range for individuals with diabetes appears to be shifted towards higher ages. This indicates that older individuals are more likely to have diabetes than younger individuals. This could be because of lifestyle differences, which would be useful to further investigate. This finding tells us screening and preventive measures may need to be tailored based on age groups, with a heightened focus on older individuals.
Our EDA suggests that there are associations between gender and diabetes, and between age and diabetes. We test each of these using a chi-squared test of independence.
For gender:
\(H_0\): There is no association between gender and diabetes status.
\(H_A\): There is a significant association between gender and diabetes status.
##
## Pearson's Chi-squared test
##
## data: gender_table
## X-squared = 143.22, df = 2, p-value < 2.2e-16
Since the p-value (<2.2e-16) is less than 0.05, we reject the null hypothesis. Hence, we have enough evidence to conclude that there is a significant association between gender and diabetes status.
In practical terms, there is a statistically significant difference in the proportion of patients with diabetes between gender groups. In this dataset a higher proportion of males than females have diabetes, which is consistent with what we saw in our EDA.
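For reference, the statistic reported in the gender test is the standard Pearson chi-squared statistic. Writing \(O_{ij}\) and \(E_{ij}\) for the observed and expected counts in row \(i\) and column \(j\) of the contingency table (this notation is ours, not taken from the analysis code), it is:

\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
df = (r - 1)(c - 1).
\]

With \(r = 3\) gender categories and \(c = 2\) diabetes categories, \(df = (3 - 1)(2 - 1) = 2\), which matches the degrees of freedom in the test output.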
For age:
\(H_0\): There is no association between age and diabetes status.
\(H_A\): There is a significant association between age and diabetes status.
##
## Pearson's Chi-squared test
##
## data: age_table
## X-squared = 6978.4, df = 3, p-value < 2.2e-16
Since the p-value (<2.2e-16) is less than 0.05, we reject the null hypothesis. Hence, we have enough evidence to conclude that there is a significant association between age and diabetes status.
In practical terms, there is a statistically significant difference in the proportions of diabetes status between different age groups. Older age groups tend to be at higher risk of diabetes, which we saw in our EDA.
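The age test reports df = 3, which for a two-level diabetes variable implies that age was grouped into four categories, since df = (rows - 1)(columns - 1). The cut points are not stated, so the sketch below uses hypothetical breaks and simulated data purely to show the mechanics of binning a numeric variable before a chi-squared test.

```r
set.seed(1)
n <- 2000

# Simulated ages and an outcome whose probability rises with age
# (illustrative only; not the report's data).
age <- runif(n, min = 0, max = 80)
diabetes <- rbinom(n, size = 1, prob = plogis(-4.5 + 0.05 * age))

# Hypothetical cut points giving four age groups.
age_group <- cut(age, breaks = c(0, 20, 40, 60, 80), include.lowest = TRUE)

# Contingency table of age group by diabetes status, then the test.
age_table <- table(age_group, diabetes)
result <- chisq.test(age_table)
print(result)
```

With four age groups and two outcome levels, the test in this sketch also has 3 degrees of freedom.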
When it comes to medical conditions like diabetes, it is often the case that multiple factors collectively contribute to the risk of developing the condition. We chose to analyze multiple clinical factors together and examine how they collectively are related to diabetes in order to gain a more nuanced understanding of how an individual’s risk for diabetes is influenced by various clinical factors.
From the above heatmaps, it seems that blood_glucose_level and HbA1c_level are strong predictors of diabetes. For patients with the same bmi, diabetic patients seem to have noticeably higher blood glucose and HbA1c levels. Additionally, the heatmap of blood_glucose_level versus HbA1c_level seems to suggest a strong relationship between a patient having diabetes and both of these levels being elevated, indicated by the higher intensity of red in this graph. The lack of outliers in this graph might also suggest a relationship between blood glucose level and HbA1c level themselves.
The first set of scatterplots shows how blood_glucose_level and bmi are related, split by whether a patient has hypertension and heart disease, for both diabetic and non-diabetic patients, while the second shows this relationship for HbA1c_level and bmi. In both cases we see that diabetic patients show noticeably higher blood_glucose_level and HbA1c_level than non-diabetic patients. There doesn’t seem to be a clear separation between individuals with and without diabetes based on bmi alone. However, the range of BMIs at which patients have diabetes seems to shrink among patients who have hypertension and heart disease. We also see an overlap between diabetic and non-diabetic patients who have similar blood glucose and HbA1c levels, with the main difference being that diabetic patients at normal levels of these tend to have higher BMIs. This suggests that while bmi does not have as strong an association with diabetes as the other clinical factors, there is still some association.
From the above, we can see that the clinical variables with the highest correlation with diabetes are bmi, blood_glucose_level, and HbA1c_level. Because these three variables are all quantitative, we perform principal component analysis (PCA) to examine them and their relationship with diabetes.
## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.1100 0.9669 0.9127
## Proportion of Variance 0.4107 0.3116 0.2777
## Cumulative Proportion  0.4107 0.7223 1.0000
From the above summary of the PCA, we can see that 41.07% of the variance is explained by the first principal component and 31.16% of the variance is explained by the second principal component. We plot the three principal components below.
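The proportions in the summary follow directly from the reported standard deviations: each component's variance is its standard deviation squared, divided by the total variance. A minimal R sketch of that arithmetic, using only the numbers printed in the summary (the variable names are ours):

```r
# Standard deviations copied from the PCA summary output.
sdev <- c(PC1 = 1.1100, PC2 = 0.9669, PC3 = 0.9127)

# Variance of each component and its share of the total.
variance <- sdev^2
prop_var <- variance / sum(variance)
cum_prop <- cumsum(prop_var)

print(round(rbind(prop_var, cum_prop), 4))
```

The variances sum to approximately 3, which is what one would expect if the three variables were standardized before the PCA; that is an inference from the output, not something stated in the analysis.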
From the above, we can see that diabetes is heavily associated with PC1, with PC2 and PC3 seeming less correlated with diabetes than the first principal component. Because the principal components are not directly interpretable by themselves, we create the following biplot:
From this, we can see that diabetes is associated with an increased blood_glucose_level and HbA1c_level, whereas bmi does not appear to be heavily associated with having or not having diabetes. We can further see this in the linear combination of the variables that each principal component represents, shown below. In particular, we note that PC1, the principal component we noted earlier as appearing to be correlated with diabetes, loads most heavily on the blood_glucose_level and HbA1c_level variables.
##                           PC1        PC2         PC3
## bmi                 0.4689410  0.8819229 -0.04802481
## blood_glucose_level 0.6293765 -0.2955181  0.71871714
## HbA1c_level         0.6196609 -0.3672616 -0.69364205
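Written out, the first column of the loadings matrix gives PC1 as a weighted sum of the three variables. Here \(z\) denotes each variable after centering (and, we assume, scaling); the \(z\) notation is ours:

\[
\text{PC1} \approx 0.469\, z_{\text{bmi}} + 0.629\, z_{\text{blood\_glucose\_level}} + 0.620\, z_{\text{HbA1c\_level}}
\]

The squared loadings sum to one, \(0.469^2 + 0.629^2 + 0.620^2 \approx 0.220 + 0.396 + 0.384 = 1\), so blood_glucose_level and HbA1c_level together account for roughly 78% of the squared weight of PC1, with bmi contributing the remaining 22%.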
Lastly, we wanted to explore the relationship between conscious lifestyle factors and diabetic status. For this we chose smoking history, as it is sometimes considered a risk factor for diabetes. In addition, we want to explore the relationship between smoking and age to get an idea of how the potential effect of smoking across one’s life, and the duration of smoking, affects diabetic status.
We begin by filtering our data. Of our 100,000 observations, only 64,184 respondents answered the question about their smoking history.
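A filtering step of this kind might look like the following R sketch. It assumes that missing answers are stored under a placeholder label, here the hypothetical string "No Info"; the toy data frame is illustrative, and only the name `smokers.data` is taken from the test output later in this report.

```r
# Toy data; "No Info" is an assumed placeholder for unanswered responses.
toy <- data.frame(
  smoking_history = c("never", "No Info", "former", "current", "No Info", "ever"),
  diabetes        = c(0, 0, 1, 0, 1, 0)
)

# Keep only respondents who answered the smoking question.
smokers.data <- subset(toy, smoking_history != "No Info")

# Re-create the factor so the dropped label no longer appears as a level.
smokers.data$smoking_history <- factor(smokers.data$smoking_history)

print(table(smokers.data$smoking_history))
```

Dropping the unused level matters for later steps, since an empty category would otherwise add a row of zeros to any contingency table.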
Next, let’s do some EDA to visualize the relationship between smoking_history and diabetes.
The above bar plot shows the distribution of the respondents’ smoking history and the proportion of each smoking_history group that is diabetic. A majority of our subjects have never smoked, and the groups who have smoked appear to contain a higher proportion of diabetics, although in every group a large majority are not diabetic. Of those who have smoked, current and former were the most common responses. To better quantify the difference in these proportions, let’s look at a mosaic plot colored by Pearson residuals to determine whether any types of smokers or non-smokers are overrepresented compared to the rest of the data.
From the plot we can see that former smokers have the largest proportion of diabetics. While there does not appear to be a large difference between the different histories, it appears that people who have smoked at some point represent a higher proportion of diabetics. The strongest over- and under-representation in the mosaic plot lies in former smokers and people who have never smoked, with opposite results: people who have never smoked are under-represented among diabetics with a Pearson residual less than -4, and former smokers are over-represented with a Pearson residual greater than 4. This indicates there is a relationship between smoking and diabetic status. To determine the statistical significance of this relationship, let’s use a chi-squared test for independence.
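The Pearson residuals that color a mosaic plot can also be computed directly as (observed - expected) / sqrt(expected) for each cell. The R sketch below does this on a made-up 3 x 2 table (the counts are illustrative, not the report's) and checks the result against the residuals returned by `chisq.test`.

```r
# Illustrative counts: rows are smoking histories, columns are diabetes status.
tab <- matrix(c(90, 10,
                70, 30,
                85, 15),
              nrow = 3, byrow = TRUE,
              dimnames = list(smoking_history = c("never", "former", "current"),
                              diabetes = c("0", "1")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residual for each cell.
manual <- (tab - expected) / sqrt(expected)

# chisq.test() returns the same quantity in its residuals component.
test <- chisq.test(tab)
print(round(test$residuals, 2))
```

A positive residual in the diabetic column means that group contains more diabetics than independence would predict, and a negative residual means fewer.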
Chi-Sq Test:
\(H_0\): Smoking history and diabetic status are independent.
\(H_A\): Smoking history and diabetic status are not independent.
##
## Pearson's Chi-squared test
##
## data: table(smokers.data$smoking_history, smokers.data$diabetes)
## X-squared = 430.91, df = 4, p-value < 2.2e-16
The test produced a p-value less than \(2.2 \times 10^{-16}\), which is below the chosen significance level for our test, \(\alpha = 0.05\). Thus, we reject the null hypothesis and conclude that there is a significant relationship between diabetic status and smoking history.
Given that former smokers in particular had a distinctly different proportion of diabetics than the rest of the data, we can use the age variable to explore when in life, and perhaps for how long, a smoking history affects diabetic status. Let’s look at the two most prominent positive smoking histories, current and former, which are also the only ones that indicate a habit, distributed by age.
From this we can see that the age distribution of former smokers is centered around a higher mean than that of current smokers. Although we don’t have data on how long they smoked or when in their life they smoked, age still appears to interact with the type of smoking_history, and both may affect the prediction of diabetic status. To explore this relationship, we will build a logistic regression model to test the two variables, age and smoking_history, and their ability to predict diabetic status.
##
## Call:
## glm(formula = diabetes ~ smoking_history + age + age * smoking_history,
## family = "binomial", data = smokers.data)
##
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)                    -4.7309598  0.1322991 -35.760  < 2e-16 ***
## smoking_historyever             0.3120535  0.2361082   1.322 0.186283
## smoking_historyformer           0.5750601  0.1874260   3.068 0.002154 **
## smoking_historynever           -0.2525157  0.1488058  -1.697 0.089706 .
## smoking_historynot current      0.4227864  0.1979183   2.136 0.032666 *
## age                             0.0521902  0.0023817  21.913  < 2e-16 ***
## smoking_historyever:age        -0.0074938  0.0040282  -1.860 0.062836 .
## smoking_historyformer:age      -0.0098017  0.0031303  -3.131 0.001741 **
## smoking_historynever:age        0.0003631  0.0026269   0.138 0.890059
## smoking_historynot current:age -0.0117212  0.0033586  -3.490 0.000483 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 44422 on 64183 degrees of freedom
## Residual deviance: 39633 on 64174 degrees of freedom
## AIC: 39653
##
## Number of Fisher Scoring iterations: 6
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: diabetes
##
## Terms added sequentially (first to last)
##
##
##                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                                64183      44422
## smoking_history      4    390.2     64179      44032 < 2.2e-16 ***
## age                  1   4361.7     64178      39670 < 2.2e-16 ***
## smoking_history:age  4     36.7     64174      39633 2.076e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the significance of the model, we use a chi-squared test for logistic regression:
\(H_0\): The coefficients of the predictors of the model are all equal to zero.
\(H_A\): At least one of the coefficients of the model is non-zero.
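The overall test stated in these hypotheses can be reproduced by hand from the null and residual deviances reported in the model summary. The following is a minimal R sketch, assuming the usual likelihood-ratio (deviance) form of the test; the variable names are ours and the numbers are copied from the output.

```r
# Deviances and degrees of freedom copied from the glm summary.
null_dev  <- 44422
null_df   <- 64183
resid_dev <- 39633
resid_df  <- 64174

# The drop in deviance is compared to a chi-squared distribution whose
# degrees of freedom equal the number of coefficients added to the null model.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

The drop in deviance also equals, up to rounding, the sum of the three sequential deviances in the analysis of deviance table (390.2 + 4361.7 + 36.7).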
From this test we can reject the null hypothesis and conclude that our model is statistically significant, since the p-value is lower than \(\alpha = 0.05\). From the results of our regression model, we can see that several of the predictors were significant, and the analysis of deviance table shows that smoking_history, age, and the interaction between them each significantly reduce the deviance. However, their coefficients, which are expressed on the log-odds scale, are small in magnitude (each is below 1 in absolute value), meaning each has a statistically significant but small per-unit effect on the odds that someone is diabetic. While several of the interactions were significant, the lowest p-values (meaning the most significant predictors) and the largest coefficients were for age and smoking_history on their own. So a stronger predictor is not produced from the interaction between age and smoking_history, and in general we have identified a significant but minimally impactful relationship of age and smoking history with the odds that a person is diabetic.
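To read the coefficients on a more familiar scale, they can be exponentiated into odds ratios. The sketch below uses estimates copied from the model summary; it assumes that "current" is the reference smoking category, which is inferred from its absence among the coefficient names. The variable names are ours.

```r
# Estimates copied from the coefficient table (log-odds scale).
b_age        <- 0.0521902   # age slope for the reference group (assumed: current)
b_former_age <- -0.0098017  # adjustment to the age slope for former smokers

# Odds ratio for one additional year of age, reference group.
or_current_year <- exp(b_age)

# Odds ratio for ten additional years of age, reference group.
or_current_decade <- exp(10 * b_age)

# Odds ratio per year for former smokers: main effect plus interaction.
or_former_year <- exp(b_age + b_former_age)

print(round(c(current_per_year   = or_current_year,
              current_per_decade = or_current_decade,
              former_per_year    = or_former_year), 3))
```

An odds ratio close to 1 per year corresponds to a small per-unit effect, although the effect compounds across larger age differences.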
This can be seen visually in the plot of the sorted predicted probabilities against the actual diabetic status of our samples. Because the model shifts the predicted probability of diabetes only modestly, our samples are not well separated, and the predictions are contained in the small window of 0% to 30%. This helps visualize the fact that while our model and relationships are statistically significant, they are not particularly meaningful in their ability to make distinct and accurate predictions about diabetes based on smoking history and age.
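A plot of this kind could be produced along the following lines. This is a sketch on simulated data, not the report's code: the data frame `sim`, its three smoking categories, and the simulation coefficients are all assumptions chosen only so that the example runs on its own.

```r
set.seed(42)
n <- 1000

# Simulated predictors and outcome (illustrative only).
sim <- data.frame(
  age = runif(n, min = 18, max = 80),
  smoking_history = factor(sample(c("current", "former", "never"), n,
                                  replace = TRUE))
)
sim$diabetes <- rbinom(n, size = 1, prob = plogis(-4.7 + 0.05 * sim$age))

# Same model form as in the report: main effects plus interaction.
fit <- glm(diabetes ~ smoking_history * age, family = "binomial", data = sim)

# Predicted probabilities, sorted from lowest to highest.
sim$predicted <- predict(fit, type = "response")
sorted <- sim[order(sim$predicted), ]

# Sorted predictions, colored by the observed outcome.
plot(seq_len(n), sorted$predicted,
     col = ifelse(sorted$diabetes == 1, "red", "grey50"),
     xlab = "Rank of predicted probability",
     ylab = "Predicted probability of diabetes")
```

If the model separated the classes well, the diabetic points would cluster at the right-hand end with predicted probabilities near 1; a curve that stays low with the two colors intermixed indicates weak discrimination.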
In this exploration of the Diabetes Prediction Dataset, we were able to uncover relationships between demographic attributes, clinical indicators, and the prevalence of diabetes. One takeaway is that a higher proportion of males than females had diabetes, potentially due to lifestyle and genetic factors. Age demonstrated a significant association with diabetes, with older individuals being at higher risk. In terms of clinical indicators, blood glucose and HbA1c levels emerged as strong indicators, emphasizing their crucial role in diabetes prediction. We also found that former smokers showed a higher proportion of diabetics, indicating a potential link between smoking habits and diabetes risk. Further analysis demonstrated that the interaction of age with smoking history does not particularly improve the prediction of diabetic status, although individually they are still significant predictors. The findings presented in this report have practical implications for healthcare professionals. Targeted screening and preventive measures would help, especially for older individuals, as they are more at risk. Blood glucose and HbA1c levels provide valuable insight into which health aspects to target for treatment.
While we were able to find significant relationships between diabetic status and the various demographic, clinical, and lifestyle factors in our data set, important factors are absent from those we studied that could be valuable in understanding how to better predict diabetes. Geographic data on where each subject lived, how far they were from the nearest available fresh produce, and their communities’ socioeconomic status could be valuable in determining their health and diabetic status, and are worth studying and adding to the models we have created here. In addition, further demographic information such as race, education level, or marital status could provide valuable context on elements that impact personal health. Studying those additional demographic variables, particularly against our clinical variables, could provide insight into health care and social organization. Lastly, for our smoking variable, having more time information, such as how long a subject smoked or when in life they started, could help determine what particular pattern of smoking has a significant relationship with diabetic status. In general, more rigorous models on more comprehensive datasets could lead to the construction of powerful predictors of diabetic status that could serve to warn or help people even without seeing a doctor.