Body Mass Index (BMI) has long been the standard health metric, but its accuracy in predicting health outcomes has been challenged. BMI does not distinguish between muscle and fat mass, nor does it account for demographic differences; as a result, a person with a lot of muscle mass may have the same BMI as a person with obesity. Given these limitations, this paper investigates whether BMI truly correlates with critical lifestyle factors such as sleep patterns and physical activity levels. We also investigate the role of stress and its potential association with BMI across various occupations, which may help us uncover nuanced interactions that challenge traditional health narratives.
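This study is motivated by two pressing questions:
1. Is BMI a good indicator of how physically fit an individual is?
2. Does stress level affect BMI category, and does this effect change with an individual’s occupation?
The results of the first two questions serve as a foundation for a third question:
1. Do sleep disorders affect steps and sleep duration?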
By addressing these questions, this research will shed light on the limitations of BMI as a health indicator and explore alternative perspectives for understanding the interplay between lifestyle factors, stress, and health.
To answer these questions, we use the Sleep Health and Lifestyle Dataset. This is a Kaggle dataset with 374 rows and 13 columns: a person ID plus 12 variables pertaining to sleep and health habits among males and females. It is a synthetic dataset intended for analyzing sleep patterns, lifestyle factors, cardiovascular health, and sleep disorders. The dataset contains qualitative variables such as gender, occupation, BMI category, blood pressure, and sleep disorder, as well as quantitative variables such as age, sleep duration, quality of sleep, physical activity level, stress level, heart rate, and daily steps.
For our first research question, whether BMI is a good indicator of how physically fit an individual is, we use BMI category together with daily steps and physical activity level as measures of physical fitness, and we compare these between males and females. For our second question, whether stress level has an effect on BMI category and whether this changes with an individual’s job, we use BMI category, stress level, and occupation. Lastly, for our third research question, whether sleep disorders affect daily steps and sleep duration, we use the daily steps, sleep duration, sleep disorder, and gender variables, in particular to observe whether the influence differs between males and females.
Exploring these variables through EDA and statistical tests can tell us whether BMI is a good measure of an individual’s health and, if it is not, whether sleep disorders also have an impact.
To address our first research question, we first perform some EDA to see how the different variables interact with one another and how BMI category relates to them across genders. This helps us form initial assumptions about the usefulness of BMI category as an indicator of physical fitness before we perform statistical tests and use a principal component analysis to see how strongly the observations cluster.
To explore the association between the two metrics of physical fitness, we first performed some exploratory data analysis on daily steps and physical activity level by looking at their correlation as well as a linear regression plot by gender. The first plot is a scatterplot with a roughly linear trend in which higher levels of physical activity correspond to a larger number of daily steps. It makes intuitive sense that people who take more steps tend to be more physically active. In the top right square of this plot, the correlation between the two variables is 0.773, a strong positive correlation that aligns with the scatterplot and with the intuition we had going into this analysis.
The second plot shows two regression lines, one for males and one for females. It agrees with the first plot in that there is a strong positive correlation between daily steps and physical activity level for both males and females (with slightly different slopes). For the rest of the analysis, this suggests that physical activity level and daily steps, within the context of this dataset, may follow similar patterns when related to other variables.
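As a rough illustration of the two quantities discussed above, the sketch below computes an overall Pearson correlation and a separate regression slope for each gender. It runs on simulated stand-in data, so the column names (`activity`, `steps`, `gender`) and all values are assumptions rather than the report's actual data.

```r
# Simulated stand-in data: column names and values are hypothetical,
# not taken from the Sleep Health and Lifestyle Dataset.
set.seed(1)
n <- 200
gender <- rep(c("Female", "Male"), each = n / 2)
activity <- runif(n, min = 30, max = 90)            # physical activity level
steps <- 3000 + 60 * activity + rnorm(n, sd = 800)  # daily steps
fitness <- data.frame(gender, activity, steps)

# Overall Pearson correlation between the two fitness measures.
overall_r <- cor(fitness$activity, fitness$steps)

# Fit a separate simple linear regression for each gender and keep the slopes.
slopes <- sapply(split(fitness, fitness$gender), function(d) {
  coef(lm(steps ~ activity, data = d))[["activity"]]
})

print(overall_r)
print(slopes)
```

Comparing the two entries of `slopes` is one simple way to judge whether the lines for males and females differ in steepness.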
Before looking at the relationship between BMI and the two metrics of physical fitness, let’s explore the BMI variable itself through PCA. We use a PCA plot to see how the BMI categories are distributed across PC1 and PC2, and a scree plot to see how many PCs are useful in this dataset (about 4). The scree plot shows the percentage of variance explained by each principal component; this percentage drops close to 0 after the 4th principal component. In the PCA plot, which shows PC1 on the x axis and PC2 on the y axis with points colored by BMI category, observations with the same BMI classification tend to cluster in certain parts of the plot. For example, people in the normal BMI category form several groups scattered along both axes. People in the normal weight BMI category appear to be spread linearly across the PC1 and PC2 axes. For people in the overweight BMI category, most points are in the lower ranges of both PC1 and PC2.
This plot is helpful because it shows similarities and differences among BMI categories. The categories do not show extremely different trends in the PCA plot, other than the fact that Normal BMI observations are usually clustered together.
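The explained-variance percentages behind a scree plot can be computed directly from a PCA fit. The following is a minimal sketch using base R's `prcomp` on simulated numeric columns; the variable names and values are hypothetical, and the report's actual choice of input variables is not stated.

```r
# Simulated numeric variables (hypothetical names and values).
set.seed(2)
n <- 150
health <- data.frame(
  age            = rnorm(n, 42, 9),
  sleep_duration = rnorm(n, 7, 0.8),
  activity       = rnorm(n, 60, 20),
  stress         = sample(3:8, n, replace = TRUE),
  heart_rate     = rnorm(n, 70, 4),
  steps          = rnorm(n, 6800, 1600)
)

# Center and scale so that variables on large scales (such as steps)
# do not dominate the components.
pca <- prcomp(health, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC: the heights of a scree plot.
explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores on the first two PCs: the coordinates used in a PC1 by PC2 plot.
scores <- pca$x[, 1:2]

print(round(explained, 3))
```

Plotting `scores` colored by BMI category would give a figure of the kind described above.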
This violin plot shows the association between BMI (y axis) and daily steps which is displayed on the x axis. For the normal and normal weight categories, we see that the distribution of this data is mostly centered between 6,000-9,000 steps per day which is higher than the range of daily steps for people in the obese to overweight categories. For people in the obese category, daily steps seemed to be centered around 3,000-4,000 while people in the overweight category were centered closer to 5,000-7,000 steps.
If we look at this plot under the lens of gender, female daily steps seemed to have a much larger range in the overweight and normal weight category compared to males while males had a larger range for the normal category. The center of the distribution seems to be similar between males and females for the overweight and normal weight categories but extremely different among females for the normal category.
This pattern could be attributed to many factors, since health conditions, body types, lifestyles, and stress factors differ between women and men; for example, many women give birth to children, and there are occupation gaps between the genders.
Next, we observe violin plots that show the association between BMI category (y axis) and physical activity level (x axis, where higher is more physically active). Because the distributions were more difficult to compare here among the normal, normal weight, and obese categories, we added boxplots to the violins to show where the centers were. For these three categories the centers are almost identical, which may point to the idea that physical activity level does not differ greatly between BMI categories. The only category that looks noticeably different is the overweight category, where the center is much lower than in the other three categories.
While the distribution shapes are different, the centers are fairly similar among the different BMI categories. We can formally test this as well after observing the differences among genders.
The only major difference by gender in the centers of the distributions is among people in the overweight and normal categories but, as noted for daily steps, this could be attributed to a variety of factors. The ranges for females also differ from those for males, with some spreads being noticeably wider, likely due to other lifestyle factors that this dataset may not include.
##
## Pearson's Chi-squared test
##
## data: table_BMI_gender
## X-squared = 59.394, df = 3, p-value = 7.921e-13
We can formally test the differences in BMI among males and females by performing Pearson’s chi-squared test of independence, under the null hypothesis that BMI category is independent of gender, that is, that males and females have the same distribution across BMI categories. At the 5% significance level, the observed p value of 7.921e-13 is small enough to reject the null hypothesis and conclude that the distribution of BMI categories differs between males and females. Putting this together with the tests below, we can make more sense of the gender discrepancies we saw in previous plots.
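For readers who want to see the mechanics of this test, here is a small sketch that runs `chisq.test` on a contingency table and recomputes the statistic by hand. The counts are made up for illustration and are not the report's `table_BMI_gender`.

```r
# Hypothetical counts of BMI category by gender (not the report's data).
counts <- matrix(
  c(60, 20, 10, 100,    # Female
    130,  5, 10,  50),  # Male
  nrow = 4,
  dimnames = list(
    bmi    = c("Normal", "Normal Weight", "Obese", "Overweight"),
    gender = c("Female", "Male")
  )
)

# Pearson's chi-squared test of independence.
result <- chisq.test(counts)

# The same statistic by hand: sum over cells of (observed - expected)^2 / expected,
# where expected counts come from the row and column totals.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
manual_stat <- sum((counts - expected)^2 / expected)

print(result)
print(manual_stat)
```

With four BMI categories and two genders, the degrees of freedom are (4 - 1) x (2 - 1) = 3, which matches the `df = 3` in the output above.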
## Df Sum Sq Mean Sq F value Pr(>F)
## `BMI Category` 3 124480005 41493335 18.02 6.16e-11 ***
## Residuals 370 851903872 2302443
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the one-way ANOVA test of whether daily steps differ across BMI categories, the null hypothesis is that all BMI categories have the same mean number of daily steps. The observed p value of 6.16e-11 is approximately zero, so we reject the null hypothesis and conclude that the mean number of daily steps differs significantly between BMI categories.
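The F value and p value in the ANOVA table above follow directly from its sums of squares and degrees of freedom. As a quick check, the sketch below recomputes them from the printed numbers; only the variable names are introduced here for illustration.

```r
# Values copied from the ANOVA table for daily steps by BMI category.
ss_between <- 124480005   # Sum Sq for BMI Category
df_between <- 3
ss_within  <- 851903872   # Sum Sq for Residuals
df_within  <- 370

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# The F statistic is the ratio of between-group to within-group mean squares.
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(f_value)
print(p_value)
```

The same arithmetic applies to every ANOVA table in this report.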
## Df Sum Sq Mean Sq F value Pr(>F)
## `BMI Category` 3 1260 420 0.968 0.408
## Residuals 370 160593 434
We can do the same for physical activity level, using a one-way ANOVA test with the null hypothesis that all BMI categories have the same mean physical activity level. The observed p value of 0.408 is too high to reject the null. We do not have enough evidence to conclude that mean physical activity level differs between BMI categories.
This is an interesting outcome because our initial analysis showed that daily steps and physical activity likely behave similarly.
Ultimately, to address this research question, we see that there is a statistically significant relationship between BMI and daily steps but not between BMI and physical activity, even though the two metrics for physical fitness are highly correlated. This may be due in part to the significant association between gender and BMI that we saw in the chi-squared test. Based on these outcomes and the EDA that showed different distributions for the two measures of physical fitness, it seems that while BMI could be an indicator of physical fitness on some level, many other factors are at play, such as sleep metrics and occupation, which we explore further in this report.
To address the second question, we can use a heatmap to understand the relationship between mean BMI, stress level, and occupation.
## Rows: 374 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Occupation, BMI Category, Blood Pressure, Sleep Disorder
## dbl (8): Person ID, Age, Sleep Duration, Quality of Sleep, Physical Activity...
This heat map displays the relationship between mean BMI, stress level, and occupation. The x-axis represents the occupation type (example: accountant, doctor, etc), the y-axis represents the stress level in numeric values, and the color legend on the right of the graph represents the mean BMI. The red color represents higher BMI values while blue represents lower BMI values.
The map suggests that the relationship between stress level and BMI differs by occupation. For example, occupations like nurse or teacher display higher BMI (as indicated by the red values) even at moderate stress levels. Occupations like engineer and doctor show lower BMI (as indicated by the blue and purple colors) overall. One possible explanation is that nurses and teachers may have more emotional labor or irregular schedules, based on scheduled cases or substitute teaching, which could contribute to higher BMI regardless of stress level. It is also possible that engineers have a more structured routine that helps mitigate the impact of stress on their BMI. More analysis is needed to establish whether there is a real relationship.
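The cell values of a heat map like this one are group means. Below is a minimal sketch of how such a stress-by-occupation table of mean BMI could be built in base R. The data are simulated, the column names are hypothetical, and the 1 to 4 ordinal coding of BMI category is an assumption about how the categories were made numeric.

```r
# Simulated stand-in data (hypothetical names and values).
set.seed(3)
n <- 120
d <- data.frame(
  occupation   = sample(c("Doctor", "Engineer", "Nurse", "Teacher"), n, replace = TRUE),
  stress       = sample(3:8, n, replace = TRUE),
  bmi_category = sample(c("Normal", "Normal Weight", "Overweight", "Obese"), n, replace = TRUE)
)

# Assumed ordinal coding of BMI category as 1 to 4.
bmi_levels <- c("Normal", "Normal Weight", "Overweight", "Obese")
d$bmi_code <- as.integer(factor(d$bmi_category, levels = bmi_levels))

# Mean BMI code for each stress level (rows) and occupation (columns).
# Combinations with no observations come out as NA.
mean_bmi <- tapply(d$bmi_code, list(stress = d$stress, occupation = d$occupation), mean)

print(round(mean_bmi, 2))
```

Each entry of `mean_bmi` corresponds to one tile of the heat map, and `NA` entries correspond to empty tiles.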
To see whether there is an association between stress level and BMI, we start with some EDA. The box plot displays the distribution of BMI across stress levels. The x-axis represents the stress level (ranging from 3 to 8), and the y-axis represents the BMI values (BMI categories coded numerically from 1 to 4). Each box displays the interquartile range (IQR) of BMI at a specific stress level, with the median shown as a bold black horizontal line and outliers shown as dots.
The graph suggests that BMI does vary with stress level. The lower stress levels, 3 and 4, have a much wider range of BMI values than the moderate stress levels, 5 and 6, which have more concentrated distributions and lower BMI values. The highest stress levels, 7 and 8, have higher BMI, which may indicate a non-linear relationship between stress and BMI. In general, the box plot suggests that BMI differs across stress levels, though not in a simple increasing pattern.
## Df Sum Sq Mean Sq F value Pr(>F)
## Stress.Level 5 203.4 40.67 26.7 <2e-16 ***
## Residuals 368 560.7 1.52
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We conducted an ANOVA test for differences in BMI across stress levels. The results yielded a p value of <2e-16, which is well below the 0.05 significance level. We therefore reject the null hypothesis and conclude that the differences in mean BMI across stress levels are statistically significant. The F value (26.7) is also relatively large, indicating that the variability between groups (stress levels) is large relative to the variability within groups (residuals). In further studies, we would need to determine which stress levels differ significantly from one another in average BMI.
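One common way to follow up a significant ANOVA is Tukey's honest significant difference procedure, which compares every pair of groups while adjusting for multiple comparisons. The report does not run this test; the sketch below only illustrates what it could look like, using simulated data with hypothetical group means.

```r
# Simulated stand-in data: 20 people at each stress level from 3 to 8,
# with hypothetical mean BMI codes per level.
set.seed(4)
stress <- factor(rep(3:8, each = 20))
group_means <- c(2.0, 2.1, 1.5, 1.6, 2.8, 3.0)
bmi_code <- rnorm(120, mean = rep(group_means, each = 20), sd = 0.6)

# One-way ANOVA, then Tukey's HSD on the fitted model.
fit <- aov(bmi_code ~ stress)
posthoc <- TukeyHSD(fit)

# One row per pair of stress levels: difference in means,
# confidence bounds, and adjusted p value.
pairs <- posthoc$stress

# Show only the pairs that differ at the 5% level.
print(pairs[pairs[, "p adj"] < 0.05, , drop = FALSE])
```

With six stress levels there are 15 pairwise comparisons, so the adjustment for multiple testing matters here.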
Now, to see if occupations influence sleep duration, we will once again start with some EDA. The bar chart displays the mean sleep duration in hours across different occupations. The x-axis represents the occupation while the y-axis represents the mean sleep duration measured in hours. The different colors between bars emphasize the different occupations.
Occupations such as engineer and lawyer seem to have slightly higher average sleep duration, while people who work in sales have lower averages. This is not unexpected, as engineers may have a better work-life balance than people who work in sales, which may be more demanding. Overall, the means are fairly similar across occupations. There appears to be a slight variation in mean sleep duration, but further tests are needed to see whether the differences are statistically significant.
We conducted an ANOVA test for differences in sleep duration among occupations. The results yielded a p value of <2e-16, which is well below the 0.05 significance level. We therefore reject the null hypothesis and conclude that the differences in mean sleep duration across occupations are statistically significant. The F value (20.63) is also relatively large, indicating that the variability between groups (occupations) is large relative to the variability within each occupation group. As seen in the bar chart, engineers and lawyers have longer sleep duration than people who work in sales. Further analysis would be needed to show which occupations differ significantly from one another.
After answering our first two questions, we know that there is a statistically significant relationship between BMI and daily steps but not between BMI and physical activity. We also know that there are statistically significant differences in mean sleep duration across occupations and in BMI across stress levels. We can now move on to asking whether other factors affect daily steps and sleep duration.
To address this research question, we first perform some EDA to see how the different variables interact with one another and to form initial assumptions about whether sleep disorders are also health indicators.
Based on these box plots, we can observe how steps and sleep vary across sleep disorders. In the first graph, which shows daily steps by sleep disorder, participants with insomnia have the lowest and narrowest range of daily steps: there is very little variation, the median is close to 6,000 steps per day, and outliers exist below 4,000 steps. Individuals without any sleep disorder show a higher median (around 7,500 steps) and a more balanced distribution, with some variability but fewer outliers than the insomnia group. Finally, participants with sleep apnea have the widest range of daily steps, with the upper whisker extending to 10,000 steps, and the highest median of the three groups, indicating a tendency toward greater physical activity. Based on this box plot, sleep apnea may be associated with higher levels of physical activity than insomnia, possibly due to differences in energy levels, fitness, or motivation, while people with insomnia seem to take fewer steps.
The second boxplot shows the relationship between sleep disorders and hours of sleep. Individuals with insomnia consistently report shorter sleep durations, with very tight clustering around the median (6.5 hours). Outliers exist below 6 hours, showing that some participants get very little sleep. People without any sleep disorder sleep for about 7–8 hours on average, as indicated by the wider box and higher median, and their sleep durations are more evenly distributed than those of the insomnia group. The sleep apnea group shows the widest variability in sleep durations: while the median is around 8 hours, participants report both very low and very high sleep durations. Overall, insomnia has the most consistent and lowest sleep durations, while sleep apnea participants show a more complex pattern, potentially due to fragmented sleep or daytime napping compensating for nighttime disruptions.
From both of the graphs, we can see that individuals with insomnia consistently show the least physical activity and sleep duration, likely reflecting the impact of the disorder on energy and daily functioning. On the other hand, individuals with sleep apnea, while facing potential sleep disruptions, show more variability in both sleep duration and physical activity levels.
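The medians, spreads, and outliers read off the box plots above can also be tabulated numerically. The following sketch uses base R's `boxplot.stats` on simulated daily steps; the group means and spreads are hypothetical and only loosely mimic the pattern described.

```r
# Simulated daily steps for three groups (hypothetical values).
set.seed(6)
disorder <- rep(c("Insomnia", "None", "Sleep Apnea"), each = 40)
steps <- rnorm(
  120,
  mean = rep(c(6000, 7000, 7500), each = 40),
  sd   = rep(c(600, 1200, 1800), each = 40)
)

# boxplot.stats returns the five numbers drawn in a box plot
# (lower whisker, lower hinge, median, upper hinge, upper whisker)
# plus any points flagged as outliers.
summaries <- lapply(split(steps, disorder), boxplot.stats)

medians    <- sapply(summaries, function(s) s$stats[3])
box_widths <- sapply(summaries, function(s) s$stats[4] - s$stats[2])
n_outliers <- sapply(summaries, function(s) length(s$out))

print(data.frame(median = medians, box_width = box_widths, outliers = n_outliers))
```

The `box_width` column is the distance between the hinges, which is close to the interquartile range for samples of this size.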
To further understand the relationship between these three variables, we can create an MDS plot and a PCA to transform our data into a lower dimensional space and visualize the similarity or dissimilarity of data points.
This is a PCA biplot, which visualizes the distribution of data along two principal components. Dim1 explains 52% of the total variance, and Dim2 explains 48%. The data points are grouped into three categories based on their colors and shapes, corresponding to Insomnia (blue circles), None (yellow triangles), and Sleep Apnea (gray squares). There appears to be some separation among these groups, particularly along Dim1 and Dim2. None (yellow triangles) is distributed across a wide range, especially in the positive region of Dim1. Insomnia (blue circles) is concentrated more in the lower left of the graph, suggesting a distinct clustering pattern. Sleep Apnea (gray squares) is sparsely scattered, often overlapping with None. The clustering suggests that sleep disorders like insomnia and sleep apnea might exhibit unique patterns in their activity or sleep data compared to individuals without these disorders.
The MDS plot titled “Clustering Sleep and Lifestyle by Disorder” visually represents participants grouped based on similarities in their sleep and lifestyle characteristics, with three disorder categories: Insomnia (red), None (green), and Sleep Apnea (blue). The Sleep Apnea group appears somewhat separated on the left-hand side of the MDS Dimension 1 axis, with several data points clustering together. This suggests that individuals with Sleep Apnea may share more similar sleep and lifestyle patterns compared to the other groups. The None group is more dispersed, suggesting a broader range of variability in sleep and lifestyle characteristics. The Insomnia group also shows a fair amount of dispersion but tends to cluster more on the positive side of MDS Dimension 1. There is some overlap between the None and Insomnia groups, indicating that individuals in these categories may have similar sleep or activity patterns in certain cases.
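An MDS embedding of this kind can be sketched with classical multidimensional scaling via base R's `cmdscale`. The example below uses simulated values with hypothetical column names; which variables and distance measure the report actually used is not stated, so standardized Euclidean distance is an assumption.

```r
# Simulated stand-in data (hypothetical names and values).
set.seed(5)
n <- 90
disorder <- rep(c("Insomnia", "None", "Sleep Apnea"), each = 30)
sleep <- data.frame(
  sleep_duration = rnorm(n, mean = rep(c(6.5, 7.4, 7.0), each = 30), sd = 0.5),
  steps          = rnorm(n, mean = rep(c(6000, 7000, 7500), each = 30), sd = 1200)
)

# Standardize first so that steps (thousands) do not swamp sleep hours.
scaled <- scale(sleep)

# Pairwise Euclidean distances between participants.
distances <- dist(scaled)

# Classical MDS: find 2-D coordinates whose distances match the originals.
coords <- cmdscale(distances, k = 2)

# Largest mismatch between original and embedded distances.
max_error <- max(abs(as.vector(dist(coords)) - as.vector(distances)))
print(max_error)
```

Because the input here has only two variables and the distances are Euclidean, the two-dimensional embedding reproduces the distances up to rounding error. Classical MDS on Euclidean distances is closely related to PCA, which is why the two plots can show similar group structure.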
To formally test if there are any statistically significant differences, we did an ANOVA test because we want to compare the means of three groups to determine if there are statistically significant differences among them.
## Df Sum Sq Mean Sq F value Pr(>F)
## Sleep.Disorder 2 115047282 57523641 24.78 7.94e-11 ***
## Residuals 371 861336595 2321662
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first ANOVA test determines whether mean daily steps differ by sleep disorder. The null hypothesis states that there is no difference. Because the p value of 7.94e-11 is less than 0.05, we reject the null hypothesis: there is statistically significant evidence that daily steps differ across sleep disorders.
## Df Sum Sq Mean Sq F value Pr(>F)
## Sleep.Disorder 2 34.66 17.331 31.91 1.63e-13 ***
## Residuals 371 201.47 0.543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The second ANOVA test determines whether mean sleep duration differs by sleep disorder. Because the p value of 1.63e-13 is less than 0.05, we reject the null hypothesis: there is statistically significant evidence that sleep duration differs across sleep disorders.
A driving question for this data analysis was understanding how to create healthier lifestyles, specifically in relation to physical activity and sleep. Historically, BMI has been used to understand these topics, but this data suggests that sleep disorders are also significant indicators of reduced physical activity and sleep.
Our analysis shows that there is a statistically significant relationship between BMI and daily steps but not between BMI and physical activity. This result suggests that one’s BMI is associated with the number of daily steps one takes. One area of concern in our analysis is that BMI shows no such association with physical activity, even though walking is considered an active part of one’s day. There could be many contributing factors, such as BMI being a hindrance to participating in higher levels of physical activity, but further analysis would be needed to understand that relationship.
For our second question, we found statistically significant differences in mean sleep duration across occupations and in BMI across stress levels. With many high-stress jobs leaving people with tight schedules, it is not surprising that people in certain occupations sleep less than others. Tight schedules could also relate to higher levels of stress. It is interesting to see that BMI differs significantly across stress levels. This could mean that higher levels of stress lead to higher BMI. This is an interesting research topic to explore further in order to understand the relationship between stress and health.
Although our first two questions suggest that BMI is a relatively good health indicator, it was still interesting to understand other factors that affect physical activity and sleep duration. Specifically, sleep disorders were associated with significant differences in both, and people with insomnia in particular tended to sleep less. Sleep disorders can now be studied further as another factor that decreases the quality of our health.
Overall, this analysis helps us understand how we can improve our lifestyle factors, stress, and health. Although some factors such as occupation and existing sleep conditions cannot be controlled, it would be interesting to experiment with which physical activities could improve these aspects of our daily lives. Our results also suggest that BMI is related to stress, so it is important to further research ways to reduce stress in demanding work environments to keep ourselves healthy.