As CMU students, we all know the struggle of having heavy course loads, late nights, and the constant pressure to perform well academically, often coming at the expense of our sleep. While we’ve all heard how important sleep is, it’s hard not to wonder: does it really matter how consistent our sleep schedule is, or is it just about getting enough hours? What aspects of our sleep are most important, and how do these patterns vary across different demographics? And how do these habits ultimately influence our GPAs in the context of our coursework?
This research dives into these questions, exploring how our sleep patterns, total sleep time, GPA, course load, and more interact with one another. By analyzing these connections, we aim to uncover whether consistent sleep routines are more important than sheer sleep duration, and how course load might amplify or mitigate these effects. Additionally, by clustering students based on their sleep and academic behaviors, we can identify clear patterns, such as what habits set high performers apart from those who struggle to have a higher GPA.
Ultimately, this study will help with understanding how we, as students, can strike a better balance between our sleep, workload, and academic success. By analyzing these relationships, we hope to offer practical insights that help us prioritize both our health and sleep schedules and our academic goals.
The dataset contains information about the sleep activity, academic performance, demographics, and workload of university students who participated in the study. There are 634 observations in the dataset, one for each of the 634 participants. The study comprises 5 different cohorts, observing first-year students from Notre Dame, the University of Washington, and Carnegie Mellon University in different semesters. The first cohort contains students from Carnegie Mellon University in Spring 2018, the second contains students from the University of Washington in Spring 2018, the third contains students from the University of Washington in Spring 2019, the fourth contains students from Notre Dame in Spring 2016, and the fifth contains students from Carnegie Mellon University in Spring 2017. There are 15 total variables in the dataset: subject_id, study, cohort, demo_race, demo_gender, demo_firstgen, bedtime_mssd, TotalSleepTime, midpoint_sleep, frac_nights_with_data, daytime_sleep, cum_gpa, term_gpa, term_units, and Zterm_units_ZofZ. Five of the variables are categorical (study, cohort, demo_race, demo_gender, and demo_firstgen). Of the other ten, subject_id is a participant identifier and the remaining nine are quantitative.
All of the sleep and activity data of the first-year students were collected using Fitbit devices that detected when the students were sleeping. A sleep episode was counted only if it lasted at least 20 minutes and was separated from other episodes by at least 5 minutes. Hence, if a student was awake for less than 5 minutes and fell back asleep, the entire session was considered one sleep episode. The TotalSleepTime variable measures the length of a participant’s main sleep episode, defined as the longest sleep session between noon of one day and noon of the next. One constraint of this study is that wearing the Fitbit was left to the discretion of the students, so the amount of sleep and activity information collected depends on how frequently they wore it. To account for this limitation, the frac_nights_with_data variable records the fraction of nights for which data were captured for a given student.
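The episode rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's actual processing pipeline: the function name `main_sleep_episode`, the input format (asleep intervals in minutes since noon), and the order of operations (merge short awakenings first, then drop short episodes) are all our own choices.

```python
def main_sleep_episode(asleep_intervals, min_gap=5, min_duration=20):
    """Return the longest qualifying sleep episode as (start, end), or None.

    asleep_intervals: (start, end) pairs in minutes since noon, all inside
    one noon-to-noon window (hypothetical input format).
    """
    merged = []
    for start, end in sorted(asleep_intervals):
        # An awakening shorter than min_gap joins this interval to the previous one.
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Keep only episodes lasting at least min_duration minutes.
    episodes = [(s, e) for s, e in merged if e - s >= min_duration]
    if not episodes:
        return None
    # The main episode is the longest one in the window.
    return max(episodes, key=lambda ep: ep[1] - ep[0])


# Made-up night: a 15-minute doze, then sleep interrupted by a 3-minute awakening.
night = [(120, 135), (660, 900), (903, 1080)]
main = main_sleep_episode(night)
episode_length = main[1] - main[0]
print(main, episode_length)
```

In this made-up night, the 15-minute doze is dropped for being too short, and the 3-minute awakening does not split the overnight sleep into two episodes.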
Here are our three research questions:
1. Does cohort information have an impact on whether a subject has more traditional sleep patterns? How important are sleep patterns and the length of the main sleep episode in relation to academic performance?
2. Which attribute of a student’s sleep (total amount of time slept at night, the midpoint of sleep, or amount of time slept in the daytime) explains the most variability in the dataset, and how does that sleep measurement relate to academic performance (cum_gpa and term_gpa) across different gender and race groups?
3. How do bedtime variability, total sleep time, and course load interact with GPA, and which factor seems to matter most for academic success? What can the clustering patterns reveal about different student habits and behaviors, and how do these groups show the importance of balancing consistent sleep, workload, and strong academic outcomes?
The study consists of 5 different cohorts, so we want to examine whether the university or semester a participant belonged to influences that participant’s sleeping pattern. The two side-by-side violin plots depict the distributions of the main sleep episode length and the amount of daytime sleep (naps) of the first-year students in each study cohort. The first figure displays the average number of minutes a student slept during their main sleep episode on the x-axis and the cohort the students belong to on the y-axis, with each violin colored by cohort. Looking at the medians of the boxplots drawn with the violins, the Notre Dame cohort has the lowest median TotalSleepTime, followed by the two Carnegie Mellon cohorts, and the University of Washington cohorts have the highest. The kernel densities of the violins in the first plot convey that the Notre Dame and UW 2 cohorts have approximately normally distributed data, while the CMU and UW 1 cohorts have more skewed distributions, particularly the CMU 2 and UW 1 groups. Notably, the CMU 2 and UW 1 groups also show more bumps in their kernel densities than the smoother densities of the CMU 1 and Notre Dame groups.
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
After examining trends in TotalSleepTime among cohorts, we wanted to understand how these patterns compared to daytime_sleep, the variable representing the average sleep time outside the main sleep episode, such as naps. The second side-by-side violin plot shows the distributions of average daytime sleep among the different cohorts, also colored by cohort. Based on the boxplots of each cohort, the CMU 1 group has the lowest median daytime_sleep value, the CMU 2 and UW 2 groups have the next two lowest, and the Notre Dame and UW 1 groups have the second-highest and highest median values. The kernel densities of the violins are all smoother in this plot than in the violin plot of TotalSleepTime, indicating that the daytime_sleep distributions have fewer bumps and a more regular shape. They are not symmetric, however: the long right tails of all the violins show that every distribution of daytime_sleep is right-skewed, meaning each cohort includes students with unusually large amounts of daytime sleep who pull the mean above the median. Examining the two plots together, an interesting finding is that the two University of Washington cohorts, which had the highest median TotalSleepTime, also had generally higher daytime sleep than the CMU cohorts, while the Notre Dame cohort had more daytime sleep than most other cohorts despite having the lowest median main-episode sleep time.
## Warning: Removed 147 rows containing non-finite outside the scale range
## (`stat_density()`).
To further investigate the differences between the subjects in each cohort, we plotted the distributions of Zterm_units_ZofZ for all groups. This variable is a z-score of z-scores: a student’s academic workload is standardized within their cohort and then standardized again with all cohorts combined, giving a measure of relative academic workload. The conditional density plot above shows the density of students in each cohort at each academic workload level. A negative value of Zterm_units_ZofZ indicates a student had a below-average academic workload during the semester of the study, while a positive value indicates an above-average academic workload for the semester. The UW 1 cohort has a tri-modal or quad-modal distribution with its largest peak slightly above a z-score of 0, representing a slightly above-average academic load compared to the other subjects in the study. The other University of Washington cohort, UW 2, also has a multimodal distribution of Zterm_units_ZofZ, indicating that students within the cohort cluster around several distinct workload levels.
The CMU cohorts have more normally distributed densities of academic workload, centered at or just below a z-score of 0, with some curvature at the tail ends of both distributions. With all of the distributions overlaid upon one another and the alpha of the density curves set to 0.3, we can see that the students in the CMU cohorts may have slightly lower academic loads than the students in the UW cohorts, as the CMU distributions are centered to the left of the UW distributions, indicating lower median z-score values. It is important to highlight that academic workload was not measured for the Notre Dame students, as this cohort does not contain Zterm_units_ZofZ values. Therefore, we are unable to compare the academic load of Notre Dame students with that of students at the other schools in the study, and their cohort is not represented in the conditional density plot.
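To make the "z-score of z-scores" idea concrete, here is a minimal Python sketch on made-up unit counts. It assumes the population standard deviation is used at both steps and that the two-step procedure matches the verbal description above; the dataset's documentation would be needed to confirm how Zterm_units_ZofZ was actually computed. The cohort names and values are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Made-up term units; the two schools count units on different scales.
units_by_cohort = {
    "cohort_a": [36, 45, 48, 54],
    "cohort_b": [12, 15, 16, 17],
}

# Step 1: standardize within each cohort so the unit scales become comparable.
within = {cohort: zscores(units) for cohort, units in units_by_cohort.items()}

# Step 2: pool the within-cohort z-scores and standardize once more.
pooled = [z for cohort in within for z in within[cohort]]
z_of_z = zscores(pooled)

print([round(z, 2) for z in z_of_z])
```

Negative values mark students below their cohort's typical load and positive values mark students above it, regardless of how each school counts units. Under the population-SD assumption, the second pass changes almost nothing, because each cohort is already centered at 0 with unit spread.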
After comparing the distributions of the average amount of daytime sleep and total amount of sleep during the main sleep episode of every cohort, it is helpful to examine the relationship between the two variables themselves, as well as how these two sleep attributes interact with academic performance, reflected by the variable term_gpa, which is the subject’s final GPA earned for the term being studied.
The pairs plot includes scatterplots of each variable against the other two variables, density plots of the distributions of the variables, and the correlation coefficients of the pairs of variables. The daytime_sleep variable is skewed to the right and is negatively correlated with both TotalSleepTime and term_gpa, based on the scatterplots of the two pairs of variables. The correlation coefficients of -0.293 between daytime_sleep and TotalSleepTime and -0.153 between daytime_sleep and term_gpa indicate a weak inverse relationship between time spent napping during the day and both the length of the main sleep episode at night and the academic performance of a subject. Meanwhile, the correlation coefficient between TotalSleepTime and term_gpa is 0.202, a weak positive relationship: subjects who sleep more minutes at night generally earn somewhat higher GPAs for the term.
Looking at the density plots of TotalSleepTime and term_gpa, we see that TotalSleepTime is approximately normally distributed, while term_gpa has a left-skewed distribution, as the majority of subjects in the study have GPAs between 3.0 and 4.0. In summary, the pairs plot allows us to assess the relationships between the amount of sleep from daytime naps, the amount of sleep at night, and academic performance. It appears that students who sleep more during the day tend to sleep less at night and earn lower GPAs than students who sleep more during the main sleep episode and less throughout the day, although all three correlations are weak.
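For readers who want to see what the correlation coefficients in the pairs plot measure, here is a small Python sketch of the Pearson correlation on made-up values (the sleep minutes below are hypothetical, not taken from the dataset). It also checks a standard identity: in a simple linear regression, R-squared equals the squared correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up data: students who sleep longer at night nap less.
night_minutes = [380, 400, 420, 440, 460]
nap_minutes = [60, 45, 50, 20, 25]
r = pearson_r(night_minutes, nap_minutes)
print(round(r, 3))

# Squaring the reported correlation of 0.202 between TotalSleepTime and term_gpa
# gives the share of term_gpa variance a one-variable linear model can explain.
r_squared = 0.202 ** 2
print(round(r_squared, 3))
```

A correlation of 0.202 therefore corresponds to roughly 4% of the variance in term GPA, which is why we describe these relationships as weak.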
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.4269 1.1144 1.0126 0.8818 0.75387 0.59229
## Proportion of Variance 0.3393 0.2070 0.1709 0.1296 0.09472 0.05847
## Cumulative Proportion  0.3393 0.5463 0.7172 0.8468 0.94153 1.00000
##                              PC1        PC2        PC3         PC4         PC5         PC6
## daytime_sleep          0.3086703  0.1487732  0.6928268  0.52739944 -0.34861487 -0.05383074
## cum_gpa               -0.4934606  0.4992293  0.1434700 -0.06317500  0.07237972 -0.69097987
## term_gpa              -0.5221223  0.4416212  0.1038964 -0.01591543 -0.19861797  0.69416292
## midpoint_sleep         0.4128189  0.4041989 -0.2958280 -0.39984697 -0.64018947 -0.09470722
## frac_nights_with_data -0.2405845 -0.4837748  0.5026747 -0.59545117 -0.31330859 -0.05171841
## TotalSleepTime        -0.4003838 -0.3690428 -0.3852491  0.45073912 -0.57076696 -0.16168696
The PCA biplot above provides insight into the relationships between variables and observations in reduced dimensions. The x-axis (Dimension 1) explains 33.9% of the variation, while the y-axis (Dimension 2) accounts for 20.7%, together capturing 54.6% of the dataset’s variability. The midpoint_sleep and daytime_sleep vectors point in broadly the same direction, which suggests they are positively correlated; both load positively on Dimension 1 (0.41 and 0.31). Cumulative GPA (cum_gpa) and term GPA (term_gpa) are strongly correlated with each other (nearly identical direction), which makes sense because term GPA is factored into cumulative GPA. According to the loadings, the GPA variables contribute to both dimensions: they have the largest loadings in magnitude on Dimension 1 (-0.49 and -0.52) and also load positively on Dimension 2 (0.50 and 0.44). Interestingly, their near-perpendicularity to midpoint_sleep and daytime_sleep in the biplot suggests little linear association within these two dimensions. Because the biplot reflects only 54.6% of the variability, this should not be read as evidence that no relationship exists.
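Furthermore, the TotalSleepTime and frac_nights_with_data vectors load negatively on Dimension 2 (-0.37 and -0.48), opposite to the GPA variables on that axis, but they share the GPA variables’ negative loadings on Dimension 1 (-0.40 and -0.24). Their direction therefore differs from that of the GPA vectors without pointing opposite to them, so the biplot alone does not indicate an inverse relationship with GPA. This is consistent with the weak positive correlation (0.202) between TotalSleepTime and term_gpa seen in the pairs plot.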
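The directions discussed above can be checked numerically by computing the cosine of the angle between pairs of loading vectors in the PC1-PC2 plane. The Python sketch below uses the raw loadings copied from the PCA output above. This is an assumption about how the arrows are drawn: a biplot that rescales arrows by the component standard deviations would give somewhat different angles.

```python
from math import sqrt

# (PC1, PC2) loadings copied from the PCA output.
loadings = {
    "daytime_sleep": (0.3086703, 0.1487732),
    "cum_gpa": (-0.4934606, 0.4992293),
    "term_gpa": (-0.5221223, 0.4416212),
    "midpoint_sleep": (0.4128189, 0.4041989),
    "frac_nights_with_data": (-0.2405845, -0.4837748),
    "TotalSleepTime": (-0.4003838, -0.3690428),
}


def cosine(u, v):
    """Cosine of the angle between two 2-D vectors (1 = same direction, 0 = perpendicular)."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (sqrt(u[0] ** 2 + u[1] ** 2) * sqrt(v[0] ** 2 + v[1] ** 2))


gpa_pair = cosine(loadings["cum_gpa"], loadings["term_gpa"])
gpa_vs_midpoint = cosine(loadings["cum_gpa"], loadings["midpoint_sleep"])
gpa_vs_total_sleep = cosine(loadings["cum_gpa"], loadings["TotalSleepTime"])

print(round(gpa_pair, 3), round(gpa_vs_midpoint, 3), round(gpa_vs_total_sleep, 3))
```

A cosine near 1 means two arrows nearly coincide, a cosine near 0 means they are close to perpendicular, and a cosine near -1 would mean they point in opposite directions.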
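Among the three sleep attributes in our research question, midpoint_sleep has the largest loading in magnitude on the first principal component (0.413, compared with 0.400 for TotalSleepTime and 0.309 for daytime_sleep), although its margin over TotalSleepTime is small. On that basis, we selected midpoint_sleep for the remainder of the research question.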
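As a quick sanity check on the PCA summary, the proportion of variance explained by each component can be recomputed from the reported standard deviations: each component's variance is its standard deviation squared, divided by the total. The Python sketch below copies the numbers from the output above; the variable names are our own.

```python
# Standard deviations of the six principal components, copied from the PCA output.
sdev = [1.4269, 1.1144, 1.0126, 0.8818, 0.75387, 0.59229]

variances = [s ** 2 for s in sdev]
total_variance = sum(variances)  # close to 6, one unit per standardized variable
proportion = [v / total_variance for v in variances]
cumulative = [sum(proportion[: i + 1]) for i in range(len(proportion))]

print([round(p, 4) for p in proportion])
print([round(c, 4) for c in cumulative])

# PC1 loadings of the three sleep attributes, copied from the same output.
pc1_sleep_loadings = {
    "daytime_sleep": 0.3086703,
    "midpoint_sleep": 0.4128189,
    "TotalSleepTime": -0.4003838,
}
# Rank by magnitude, since the sign of a loading only reflects orientation.
top_sleep_variable = max(pc1_sleep_loadings, key=lambda name: abs(pc1_sleep_loadings[name]))
print(top_sleep_variable)
```

The total variance being close to 6 suggests the six variables were standardized before the PCA, since each standardized variable contributes one unit of variance.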
##
## Call:
## lm(formula = cum_gpa ~ midpoint_sleep, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.18481 -0.22353 0.09046 0.31592 0.70754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.924867 0.095216 41.221 < 2e-16 ***
## midpoint_sleep -0.001152 0.000235 -4.903 1.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4298 on 632 degrees of freedom
## Multiple R-squared: 0.03664, Adjusted R-squared: 0.03512
## F-statistic: 24.04 on 1 and 632 DF, p-value: 1.202e-06
##
## Call:
## lm(formula = term_gpa ~ midpoint_sleep, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9260 -0.2121 0.1095 0.3462 0.7506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9834443 0.1088321 36.602 < 2e-16 ***
## midpoint_sleep -0.0013390 0.0002686 -4.986 7.97e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4913 on 632 degrees of freedom
## Multiple R-squared: 0.03785, Adjusted R-squared: 0.03632
## F-statistic: 24.86 on 1 and 632 DF, p-value: 7.972e-07
The above scatterplot illustrates the relationship between midpoint_sleep (x-axis, measured in minutes after 11 PM) and GPA, with two regression lines: red for students’ term GPA and blue for their cumulative GPA. The negative slope of both regression lines suggests an inverse relationship between midpoint_sleep and GPA. Specifically, students with later sleep midpoints (e.g., closer to 700 minutes, or 10:40 AM) tend to have lower GPAs, while those with earlier sleep midpoints (e.g., closer to 300 minutes, or 4 AM) tend to have higher GPAs.
The shaded regions around the lines represent the confidence intervals, which show the uncertainty in the estimated relationships. Both lines show similar downward trends, which implies that the association between a later midpoint_sleep and GPA is similar for term GPA and cumulative GPA. However, the clustering of points near earlier midpoints (around 300–500 minutes) indicates that most students have midpoints in the early morning, and outliers with much later sleep midpoints might disproportionately influence this trend.
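We can also see in the regression summary that the slopes for both lines are significantly different from 0, given the small p-values (1.2e-06 for cum_gpa and 7.97e-07 for term_gpa). The regression summaries included below for TotalSleepTime and daytime_sleep also show slopes that differ significantly from 0, though mostly with weaker evidence: for cum_gpa the p-values are 0.0054 and 0.000299, and for term_gpa the daytime_sleep p-value is 0.00011. The exception is term_gpa regressed on TotalSleepTime, whose p-value (3.04e-07) is slightly smaller than that of midpoint_sleep. In every model the R-squared is at most about 0.04, so each sleep variable on its own explains only a small share of the variation in GPA.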
In summary, the plot highlights that later sleep schedules, on average, are associated with poorer academic performance. This may reflect the importance of earlier sleep for maintaining academic success. One possibility is that students with later sleep midpoints are skipping their morning classes, although this is just speculation.
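To make the midpoint_sleep scale and the fitted slope easier to interpret, the short Python sketch below converts "minutes after 11 PM" into clock times and evaluates the fitted line for cum_gpa at two midpoints. The coefficients are copied from the `lm(cum_gpa ~ midpoint_sleep)` summary above; the helper names are our own, and the predictions describe the fitted line only, not a causal effect.

```python
def clock_time(minutes_after_11pm):
    """Convert minutes after 11 PM into a 24-hour HH:MM string."""
    total = (23 * 60 + minutes_after_11pm) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"


# Coefficients copied from the lm(cum_gpa ~ midpoint_sleep) summary.
INTERCEPT = 3.924867
SLOPE = -0.001152  # change in cum_gpa per additional minute of midpoint_sleep


def predicted_cum_gpa(midpoint_minutes):
    """Value of the fitted regression line at a given sleep midpoint."""
    return INTERCEPT + SLOPE * midpoint_minutes


for minutes in (300, 500, 700):
    print(minutes, clock_time(minutes), round(predicted_cum_gpa(minutes), 2))

# Change in predicted cum_gpa for a sleep midpoint one hour later.
per_hour = SLOPE * 60
print(round(per_hour, 4))
```

On this fitted line, shifting the sleep midpoint one hour later corresponds to a drop of about 0.07 GPA points, so even the difference between a 4 AM and a 10:40 AM midpoint amounts to less than half a grade point.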
##
## Call:
## lm(formula = cum_gpa ~ daytime_sleep, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.29167 -0.22670 0.08518 0.30575 0.70362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5597544 0.0310900 114.498 < 2e-16 ***
## daytime_sleep -0.0022874 0.0006289 -3.637 0.000299 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4334 on 632 degrees of freedom
## Multiple R-squared: 0.0205, Adjusted R-squared: 0.01895
## F-statistic: 13.23 on 1 and 632 DF, p-value: 0.0002985
##
## Call:
## lm(formula = term_gpa ~ daytime_sleep, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.98148 -0.22793 0.09659 0.35140 0.87622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5647020 0.0355053 100.399 < 2e-16 ***
## daytime_sleep -0.0027962 0.0007183 -3.893 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.495 on 632 degrees of freedom
## Multiple R-squared: 0.02342, Adjusted R-squared: 0.02187
## F-statistic: 15.16 on 1 and 632 DF, p-value: 0.0001096
##
## Call:
## lm(formula = cum_gpa ~ TotalSleepTime, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28977 -0.23540 0.09437 0.31254 0.63302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0882676 0.1362555 22.665 <2e-16 ***
## TotalSleepTime 0.0009497 0.0003402 2.792 0.0054 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4352 on 632 degrees of freedom
## Multiple R-squared: 0.01218, Adjusted R-squared: 0.01062
## F-statistic: 7.794 on 1 and 632 DF, p-value: 0.0054
##
## Call:
## lm(formula = term_gpa ~ TotalSleepTime, data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.97642 -0.20934 0.09687 0.36328 0.76631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6610500 0.1535747 17.327 < 2e-16 ***
## TotalSleepTime 0.0019846 0.0003834 5.176 3.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4906 on 632 degrees of freedom
## Multiple R-squared: 0.04067, Adjusted R-squared: 0.03916
## F-statistic: 26.8 on 1 and 632 DF, p-value: 3.043e-07
The above scatterplot examines the relationship between Midpoint Sleep (x-axis, in minutes after 11 PM) and Term GPA while differentiating by gender (0 = Male, 1 = Female). Each gender group is represented by separate regression lines and colors (purple for males and yellow for females), with a small number of “NA” values also included.
The plot shows that, for males, there is a stronger negative association between midpoint sleep and GPA, as indicated by the steeper downward slope of their regression line. This suggests that later midpoints of sleep are more strongly associated with lower term GPAs for male students. For females, the regression line has a much flatter slope, indicating a weaker relationship between midpoint sleep and GPA. Female students’ GPAs remain relatively stable across the range of midpoint sleep values. Visually, this may be because fewer female students than male students have extremely late sleep midpoints, although the difference appears small.
Most data points are concentrated between 300 and 500 minutes (4 AM to 7:20 AM), suggesting this is the typical range for midpoint sleep. As mentioned before, the outliers with later sleep midpoints likely influence the trends. Overall, this graph highlights potential gender differences in how sleep timing relates to academic performance, with males showing greater sensitivity to late sleep patterns.
The scatterplot examines the relationship between Midpoint Sleep and Term GPA, with separate regression lines and colors for students categorized by race (0 = underrepresented, 1 = non-underrepresented). Underrepresented students are shown in green, and non-underrepresented students are shown in pink.
The negative slopes of the regression lines indicate that, for both groups, later midpoint sleep times (e.g., closer to 700 minutes, or 10:40 AM) are associated with lower GPAs. The regression line for underrepresented students has a very slightly steeper slope than the line for non-underrepresented students, which suggests that late sleep schedules may be marginally more strongly associated with lower academic performance for underrepresented students. However, the difference is far from major.
Additionally, the non-underrepresented group appears to have slightly higher GPAs on average, as indicated by the higher placement of their regression line relative to the underrepresented group. Overall, the graph highlights an inverse relationship between sleep timing and GPA across both groups, and suggests slight differences in the magnitude of the impact.
Three distinct groups stand out among the clusters: red, green, and blue. The red group is the most distinctive, representing students who might have unpredictable sleep schedules, with inconsistent bedtimes, lower GPAs, or unusual course loads that are either much higher or lower than average. In contrast, the blue group includes students who seem to have steady sleep routines, perform well academically, and manage their course loads effectively. The green group sits somewhere in the middle, with students who have somewhat consistent sleep patterns and average GPAs and workloads.
The height of the branches in the dendrogram tells us even more. The red group is clearly different: it does not merge with the other groups until much higher up in the tree, showing how distinct these students are in their habits and outcomes. Meanwhile, the green and blue groups merge at a much lower height, suggesting they have more in common with each other.
What this visualization really drives home is the connection between sleep, GPA, and course load. Students with erratic sleep schedules often face more academic challenges or take on unusual workloads, while those with consistent sleep habits seem to strike a better balance, achieving stronger academic results.
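To illustrate how merge heights in a dendrogram arise, here is a small Python sketch of agglomerative clustering on made-up standardized rows. It assumes Euclidean distance and complete linkage, which are our own choices for illustration; the report does not state which distance or linkage produced the dendrogram, and the six "students" below are hypothetical.

```python
from math import dist


def complete_linkage_merges(points):
    """Agglomerative clustering; returns (height, members) for each merge, in order."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist(points[i], points[j]) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Made-up standardized (bedtime variability, GPA, course load) rows.
students = [
    (-0.5, 0.6, 0.1), (-0.4, 0.7, 0.2),   # steady sleepers
    (0.0, 0.0, 0.0), (0.1, -0.1, 0.1),    # middle group
    (2.5, -1.8, 1.9), (2.7, -2.0, 2.1),   # erratic sleepers
]

merges = complete_linkage_merges(students)
for height, members in merges:
    print(round(height, 2), members)
```

In this toy example the steady and middle groups merge at a low height, while the erratic pair joins the rest only at the final, much taller merge, which mirrors the pattern described for the red group.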
This heat map shows the correlations between key variables: bedtime variability, cumulative GPA, total sleep time, and Z-scored course load. The intensity of the colors represents the strength and direction of the relationships, with red indicating a positive correlation and blue indicating a negative one. One of the most noticeable patterns is the strong positive correlation between Z-scored course load and cumulative GPA, suggesting that students taking heavier course loads tend to have higher GPAs. Bedtime variability, by contrast, is negatively correlated with total sleep time. One possible explanation is that a bedtime that varies significantly from night to night disrupts the circadian rhythm and keeps students from sleeping as long as they would like, whereas students with more consistent bedtimes (lower bedtime_mssd) are more likely to have a stable routine that leads to longer TotalSleepTime. Interestingly, however, bedtime_mssd shows only weak or negligible correlations with the other variables, which may suggest that sleep consistency matters less for GPA and course load than we expected.
This scatterplot shows the relationship between bedtime variability (measured as mssd) and cumulative GPA. Despite a wide range of bedtime variability, there appears to be little to no consistent relationship, as indicated by the flat trendline. This supports what we concluded from the previous graph and suggests that other factors in the dataset, such as total sleep time or workload, may play a more significant role in GPA outcomes.
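As a reminder of what this variability measure captures, here is a minimal Python sketch, assuming "mssd" stands for the mean of squared successive differences: the average squared change in bedtime from one night to the next. The bedtimes below are made up, and the units used for bedtime_mssd in the dataset may differ.

```python
def mssd(values):
    """Mean of squared successive differences of a sequence."""
    squared_diffs = [(b - a) ** 2 for a, b in zip(values, values[1:])]
    return sum(squared_diffs) / len(squared_diffs)


# Made-up bedtimes in minutes after 11 PM on five consecutive nights.
steady = [30, 35, 30, 35, 30]
erratic = [0, 120, 15, 150, 30]

print(mssd(steady), mssd(erratic))
```

Unlike a plain standard deviation, this measure depends on the order of the nights, so it rewards night-to-night regularity rather than a narrow overall range of bedtimes.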
This boxplot compares total sleep time (in minutes) across quartiles of
course load (Z-scored units). Each quartile groups students by their
relative workload, with Quartile 1 representing the lightest loads and
Quartile 4 the heaviest. The similar distributions across quartiles
suggest that students manage to maintain relatively consistent sleep
durations regardless of workload intensity.
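For illustration, the quartile grouping described above can be sketched in Python as follows. The cut points here come from `statistics.quantiles` with its default method, and the z-scored course loads are made up; the cut points used for the actual boxplot may have been computed slightly differently.

```python
from statistics import quantiles


def quartile_labels(values):
    """Label each value 1-4 according to the quartile of `values` it falls in."""
    q1, q2, q3 = quantiles(values, n=4)
    labels = []
    for v in values:
        if v <= q1:
            labels.append(1)
        elif v <= q2:
            labels.append(2)
        elif v <= q3:
            labels.append(3)
        else:
            labels.append(4)
    return labels


# Made-up z-scored course loads for eight students.
course_load_z = [-1.5, -1.0, -0.5, -0.2, 0.1, 0.4, 0.9, 1.6]
labels = quartile_labels(course_load_z)
print(labels)
```

Each quartile holds roughly a quarter of the students, so the groups compare relative workload levels rather than fixed unit counts.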
This violin plot illustrates the distribution of total sleep time (in minutes) by gender, where 0 represents males and 1 represents females. Female students tend to have a slightly broader distribution of sleep times, while male students show a more concentrated pattern. The visualization points to possible gender-based differences in the spread of sleep habits, which could be linked to variations in time management or academic pressures.
This faceted scatterplot compares the relationship between total sleep time and GPA for males (0) and females (1). Both genders exhibit a slight positive correlation between sleep time and GPA, though the trend is not strong. This plot suggests that getting adequate sleep may support academic performance and that this relationship is similar across genders.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(demo_gender) 1 2622 2622 1.018 0.313
## Residuals 629 1620068 2576
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(course_load_quartile) 3 11017 3672 1.437 0.231
## Residuals 483 1234703 2556
## Df Sum Sq Mean Sq F value Pr(>F)
## bedtime_mssd 1 0.0 0.00439 0.023 0.88
## Residuals 632 121.2 0.19177
The ANOVA results show that total sleep time doesn’t differ significantly by gender (p = 0.313) or across course load quartiles (p = 0.231). Similarly, GPA shows no significant linear association with bedtime variability (p = 0.88). This suggests that things like gender, workload, or bedtime consistency might not play as big a role in sleep and academic performance as we’d expect. It’s possible that other factors, like study habits or outside responsibilities, are making a bigger impact.
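The F values in the tables above are ratios of mean squares. As a minimal sketch, the Python code below computes a one-way ANOVA F statistic on made-up sleep times and then recomputes the reported F values from the Mean Sq columns. The group data are hypothetical, and the p-values are not reproduced here because the standard library has no F distribution.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up total sleep times (minutes) for two small groups.
group_a = [400, 410, 420]
group_b = [405, 415, 425]
f_stat, df_between, df_within = one_way_f([group_a, group_b])
print(f_stat, df_between, df_within)

# Recomputing the reported F values from the Mean Sq columns.
f_gender = 2622 / 2576
f_course_load = 3672 / 2556
print(round(f_gender, 3), round(f_course_load, 3))
```

An F value near 1 means the variation between groups is about as large as the variation within groups, which is why neither the gender nor the course load comparison reaches significance.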
The current dataset lacks detailed information on these external factors. Incorporating this kind of data would require additional surveys or access to external databases. We would probably have to collect a whole new dataset that accounts for these factors, because it would be difficult to track down the exact participants used in the original dataset and ask these questions.
This study focused on data from a single semester for each participant (although cumulative GPA was taken into account), which limited our ability to assess long-term trends.
This study was observational in nature, so we would need a controlled experiment to evaluate any causal effect of sleep habits on academic performance.