Whether in university or high school, students are constantly evaluated on their knowledge and learning abilities through a variety of formats such as tests or projects. These examinations can have crucial implications for students’ futures, so it is important to pinpoint what can help a student perform better on them. Throughout this study we use performance and demographic information to see whether we can uncover relationships among these variables that could give a student insight into what is affecting their performance, either positively or negatively. This can prove to be a crucial advantage for any student looking to improve their academic performance.
Below is a list of all the variables available in the dataset, along with a brief description of each:
Hours_Studied: Number of hours spent studying per week
Attendance: Percentage of classes attended
Parental_Involvement: Level of parental involvement in student’s education (Low, Medium, High)
Access_To_Resources: Availability of educational resources (Low, Medium, High)
Extracurricular_Activities: Participation in extracurricular activities (Yes, No)
Sleep_Hours: Average number of sleep hours per night
Previous_Scores: Scores from previous exams
Motivation_Level: Student’s level of motivation (Low, Medium, High)
Internet_Access: Availability of internet access (Yes, No)
Tutoring_Sessions: Number of tutoring sessions attended per month
Family_Income: Family income level (Low, Medium, High)
Teacher_Quality: Quality of the teachers (Low, Medium, High)
School_Type: Type of school attended (Public, Private)
Peer_Influence: Influence of peers on academic performance (Positive, Neutral, Negative)
Physical_Activity: Average number of hours of physical activity per week
Learning_Disabilities: Presence of learning disabilities (Yes, No)
Parental_Education_Level: Highest education level of parents (High School, College, Postgraduate)
Distance_From_Home: Distance from home to school (Near, Moderate, Far)
Gender: Gender of the student (Male, Female)
Exam_Score: Final exam score
Here we outline the three main questions we will attempt to answer throughout the analysis:
How are the quantitative variables related to each other, and can we use them to successfully predict exam score?
How do external factors and lifestyle choices influence a student’s academic performance?
What are the impacts of the level of parental involvement, and level of access to resources on exam scores? Does a student’s exam score also depend on whether they go to a public or private school?
In the following section, we will explore the relationship between five key quantitative variables (Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, Exam_Score) and see if we can identify one or multiple to use as predictors for exam score. In order to get a basic understanding of potential relationships we will use a correlogram to get a sense of correlations between the variables.
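As a rough illustration of what a correlogram summarizes, the sketch below computes a Pearson correlation matrix in base R. The data frame `sim` and the coefficients used to generate it are simulated stand-ins invented for this example, not the study data, so the printed values will not match the report.

```r
# Simulated stand-in data (hypothetical; NOT the study's student_data).
set.seed(1)
n <- 200
hours      <- rnorm(n, mean = 20, sd = 6)
attendance <- runif(n, min = 60, max = 100)
sim <- data.frame(
  Hours_Studied   = hours,
  Attendance      = attendance,
  Sleep_Hours     = rnorm(n, mean = 7, sd = 1.5),
  Previous_Scores = runif(n, min = 50, max = 100),
  Exam_Score      = 45 + 0.3 * hours + 0.2 * attendance + rnorm(n, sd = 2)
)

# Pairwise Pearson correlations: the numbers a correlogram colours in.
cor_mat <- cor(sim)
print(round(cor_mat, 2))
```

Each off-diagonal entry is the correlation between one pair of variables; a correlogram is a coloured display of this same matrix.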
From our correlogram, we see that for the most part the variables are not correlated with each other. The most notable relationships are those of Exam_Score with Hours_Studied and with Attendance. These correlations are fairly strong and match our intuition, suggesting a positive association between attending classes, studying, and exam performance. We also see a faint positive correlation between Previous_Scores and Exam_Score; this also aligns with our intuition, but the correlation is surprisingly weaker than one might expect. Now that we have a foundation for understanding the relationships between these variables, we will perform a more in-depth analysis to solidify our understanding and formally test these relationships.
Below we show the results of running PCA on our quantitative subset. Our aim is once again to explore the linear relationships among our quantitative variables and see how they relate to exam score.
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.3224 1.0183 1.0081 0.9777 0.49234
## Proportion of Variance 0.3497 0.2074 0.2032 0.1912 0.04848
## Cumulative Proportion  0.3497 0.5571 0.7603 0.9515 1.00000
Above we see a table summarizing the proportion of the variation in the data explained by each principal component, along with a scree (elbow) plot. From the table we see that the first principal component accounts for only about 35% of the total variation, with the second and third principal components each accounting for about 20% more. We use the scree plot to get a better sense of how many of these principal components are “effective”. We keep only the first 3 principal components because they fall just above the threshold set by our dotted line; the remaining two each account for less than 1/5th of the total variation in the data, which is why they aren’t deemed as effective.
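The sketch below shows one way such a summary and cut-off could be computed with `prcomp()`. It rests on two assumptions not confirmed by the report: that the variables were standardized before PCA, and that the dotted line sits at 1/p (here 1/5), the share each component would explain if all were equally important. The data frame `sim` is simulated and hypothetical.

```r
# Simulated stand-in with five numeric columns (hypothetical data).
set.seed(2)
n  <- 300
x1 <- rnorm(n)
sim <- data.frame(
  V1 = x1,
  V2 = 0.8 * x1 + rnorm(n, sd = 0.6),  # correlated with V1
  V3 = rnorm(n),
  V4 = rnorm(n),
  V5 = rnorm(n)
)

# PCA on standardized variables (assumption: the report scaled its data).
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Assumed rule of thumb: keep components explaining more than 1/p.
threshold <- 1 / ncol(sim)
keep <- which(prop_var > threshold)

# Loadings (variable-to-component coefficients) for the retained components.
loadings <- pca$rotation[, keep, drop = FALSE]

print(round(prop_var, 3))
print(keep)
print(round(loadings, 3))
```

With standardized variables the component variances sum to the number of variables, so a threshold of 1/p corresponds to a component variance of 1.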
Now, to further our understanding of the potential relationships between the variables, we include a table showing how each variable relates to each principal component (the loadings).
##                        PC1          PC2         PC3
## Hours_Studied    0.4199049  0.521529444  0.31559583
## Attendance       0.5409271 -0.585364782 -0.09181658
## Sleep_Hours     -0.0262893  0.045312269  0.84845672
## Previous_Scores  0.1659517  0.619089370 -0.41467370
## Exam_Score       0.7091168 -0.005500975  0.01165815
Following our scree plot, we look only at the first 3 principal components, as these are the ones that most effectively summarize the data. From the table, Exam_Score has a very high positive coefficient on PC1, which indicates that a higher value of this PC is associated with a high value of Exam_Score. Attendance and Hours_Studied also have high positive values on PC1, which backs up what we saw in our correlogram about these variables moving together with Exam_Score. Previous_Scores has a low positive value on PC1, hinting at some relationship with Exam_Score, but a minimal one. On PC2 we observe high positive coefficients for Hours_Studied and Previous_Scores, indicating some relationship between these two variables along the second principal component. Looking at the table as a whole, Sleep_Hours doesn’t seem to demonstrate any relationship with the other quantitative variables: its coefficients on PC1 and PC2 are close to zero, and it is large only on PC3.
We finalize our analysis of these variables with a PCA biplot, shown below.
From our biplot, we see results that back up what we’ve observed with our previous two methods. We observe a relatively strong positive relationship between Previous_Scores and Hours_Studied, which indicates that students who spend more time studying had higher scores on previous exams; this aligns with our intuition and with the PC table above. For Exam_Score, we see weaker positive relationships with Hours_Studied and Attendance, which suggests that students with higher attendance and more hours of study tend to score higher on exams. One interesting feature of this plot is the relationship between Sleep_Hours and Attendance: there is a weak negative relationship between these variables, which implies that students who sleep more show lower attendance percentages. One possible explanation is that students with higher sleep hours are sleeping through some of their classes in order to catch up on sleep.
Having a solid understanding of how these variables are related to each other and to Exam_Score, we will now formally test these relationships using linear regression models. We will fit 3 different models in an attempt to accurately and effectively predict Exam_Score using the two variables we identified as having the strongest relationship with Exam_Score.
Our first model predicts Exam_Score using Attendance as the sole covariate.
##
## Call:
## lm(formula = Exam_Score ~ Attendance, data = student_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.0443 -1.8062 -0.1766  1.5007 31.1726
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.578577   0.272628  189.19   <2e-16 ***
## Attendance   0.195769   0.003374   58.03   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.166 on 6605 degrees of freedom
## Multiple R-squared: 0.3376, Adjusted R-squared: 0.3375
## F-statistic: 3367 on 1 and 6605 DF, p-value: < 2.2e-16
Above we see a table summarizing the results of our first model, along with a graph visualizing the data with the least squares line shown in blue. From the table we see that our estimate for \(\beta_{Attendance}\) is significant at the \(\alpha\) = 0.05 level. We can interpret the coefficient \(\beta_{Attendance}\) as follows:
0.196 is the expected difference in Exam_Score between two students whose Attendance happens to differ by one percentage point.
While the coefficient is not very large, this model still suggests a positive relationship between Attendance and Exam_Score. The \(R^2\) value is about 0.34, which means that about 34% of the variability in the dependent variable (Exam_Score) can be explained by the independent variable (Attendance). This value being quite low indicates that there are other factors outside of Attendance that could be influencing Exam_Score; we will fit more models in an attempt to address this later on.
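To make the slope and \(R^2\) interpretations concrete, here is a minimal sketch on simulated data (the data frame `sim` and its generating coefficients are hypothetical, chosen only to resemble the setting). It checks that the slope equals the difference in predicted score between two students one attendance point apart, and that with a single covariate \(R^2\) equals the squared correlation.

```r
# Simulated stand-in data (hypothetical; not the study data).
set.seed(3)
n <- 500
attendance <- runif(n, min = 60, max = 100)
sim <- data.frame(
  Attendance = attendance,
  Exam_Score = 52 + 0.2 * attendance + rnorm(n, sd = 3)
)

fit   <- lm(Exam_Score ~ Attendance, data = sim)
slope <- unname(coef(fit)["Attendance"])
r2    <- summary(fit)$r.squared

# Two hypothetical students whose attendance differs by one percentage point.
pred      <- predict(fit, newdata = data.frame(Attendance = c(80, 81)))
pred_diff <- unname(diff(pred))

# Squared Pearson correlation between covariate and response.
r_squared_from_cor <- cor(sim$Attendance, sim$Exam_Score)^2

print(c(slope = slope, pred_diff = pred_diff))
print(c(r2 = r2, cor_squared = r_squared_from_cor))
```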
Our second model is the same as the first, except this time using Hours_Studied as the covariate.
##
## Call:
## lm(formula = Exam_Score ~ Hours_Studied, data = student_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -8.532 -2.243 -0.111  2.046 33.493
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   61.456984   0.149196  411.92   <2e-16 ***
## Hours_Studied  0.289291   0.007154   40.44   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.483 on 6605 degrees of freedom
## Multiple R-squared: 0.1984, Adjusted R-squared: 0.1983
## F-statistic: 1635 on 1 and 6605 DF, p-value: < 2.2e-16
From our summary table, we once again observe that our estimate for \(\beta_{Hours\_Studied}\) is significant at the \(\alpha\) = 0.05 level. The value of the coefficient is 0.289, again indicating a positive relationship between Hours_Studied and Exam_Score. The fitted line in our graph has a steeper slope than in the first model, but the two slopes are on different scales (points per hour of study versus points per percentage point of attendance), so the steeper slope does not by itself mean a stronger relationship. Our interpretation of our estimate of \(\beta_{Hours\_Studied}\) is as follows:
0.289 is the expected difference in Exam_Score between two students who happen to differ in Hours_Studied by one hour.
This relationship suggests that students who spend more hours studying tend to perform better on exams. This once again aligns with our intuition, but it’s still useful to formally test our intuition and see it backed up. The \(R^2\) value is lower than in our previous model at 0.198, which suggests that other factors account for much more of the variability in our dependent variable.
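Because each of the first two models has a single covariate, \(R^2\) is the squared Pearson correlation between that covariate and Exam_Score, which puts the two relationships on a common, unit-free scale. The following is a worked check using only the \(R^2\) values reported in the two summaries; the symbols \(r_{Attendance}\) and \(r_{Hours\_Studied}\) are our own notation, and we take the positive root because both slopes are positive:

\[
r_{Attendance} = \sqrt{0.3376} \approx 0.581, \qquad r_{Hours\_Studied} = \sqrt{0.1984} \approx 0.445
\]

On this scale Attendance is the more strongly related of the two covariates, even though its slope is the smaller one.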
In an attempt to find the best relationship between Exam_Score and our chosen quantitative variables, our next model will be fit using both Attendance and Hours_Studied as covariates. We hope that using both of these variables as covariates will help address the lower \(R^2\) values we observed in our first two models and account for more of the variation in Exam_Score.
##
## Call:
## lm(formula = Exam_Score ~ Hours_Studied * Attendance, data = student_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.0118 -1.3290 -0.1687  1.0450 31.5866
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)              4.693e+01  7.788e-01  60.253  < 2e-16 ***
## Hours_Studied            2.266e-01  3.742e-02   6.055 1.49e-09 ***
## Attendance               1.808e-01  9.624e-03  18.781  < 2e-16 ***
## Hours_Studied:Attendance 8.308e-04  4.628e-04   1.795   0.0727 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.635 on 6603 degrees of freedom
## Multiple R-squared: 0.5415, Adjusted R-squared: 0.5413
## F-statistic: 2599 on 3 and 6603 DF, p-value: < 2.2e-16
Note: We’ve included an interaction term between Hours_Studied and Attendance in order to see whether the relationship between Exam_Score and one of our covariates depends on the value of the other covariate.
From the summary table, we once again observe positive, significant estimates for the coefficients of Hours_Studied and Attendance. At about 0.23 and 0.18 respectively, they are somewhat smaller than the previous models’ values of about 0.29 and 0.20; note that with an interaction term in the model, each is the slope when the other covariate equals zero. Our \(R^2\) value has indeed increased to around 0.54, which backs up our expectation that using both covariates together would account for more of the variation in our dependent variable. Although the value has increased, it’s still not great: a little under half of the variation in Exam_Score remains unexplained by these two covariates.
Our interaction term is not significant at the \(\alpha\) = 0.05 level (p = 0.0727), telling us that we don’t have enough evidence in the data to suggest that the effect of one independent variable on the dependent variable changes depending on the level of the other independent variable.
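The sketch below illustrates, on simulated data, what the `*` in the model formula does and how the interaction coefficient is read. The data frame `sim` and the helper `slope_at()` are hypothetical names introduced for this example; the simulated data contain no true interaction.

```r
# Simulated stand-in data (hypothetical; not the study data).
set.seed(4)
n <- 400
sim <- data.frame(
  Hours_Studied = runif(n, min = 1, max = 40),
  Attendance    = runif(n, min = 60, max = 100)
)
sim$Exam_Score <- 47 + 0.25 * sim$Hours_Studied +
  0.2 * sim$Attendance + rnorm(n, sd = 2.5)

# `A * B` expands to A + B + A:B (both main effects plus the interaction).
fit_int <- lm(Exam_Score ~ Hours_Studied * Attendance, data = sim)
fit_add <- lm(Exam_Score ~ Hours_Studied + Attendance, data = sim)
print(names(coef(fit_int)))

# With an interaction, the slope of Hours_Studied depends on Attendance.
b <- coef(fit_int)
slope_at <- function(att) {
  unname(b["Hours_Studied"] + b["Hours_Studied:Attendance"] * att)
}
print(slope_at(80))

# F-test comparing the additive and interaction models.
cmp <- anova(fit_add, fit_int)
print(cmp)
```

Since the two models differ by a single term, the p-value of this F-test matches the t-test p-value for the interaction coefficient in `summary(fit_int)`.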
To conclude this section of our analysis, we will perform diagnostic tests on our best performing regression model from above (model_3) to see how well it fits the data by checking the necessary statistical assumptions.
To do this we will make two plots:
From the Residuals vs. Fitted plot we can assess the linearity and constant variance assumptions. The points show no sign of a pattern or curvature around the zero residual line, which tells us that the linearity assumption is upheld by our model. The spread of the points is roughly constant across the x-axis, which tells us that the constant variance assumption is upheld as well.
Looking at the QQ-plot, the sample quantiles line up very well with the theoretical quantiles up to a certain point, around 2.5 on the x-axis. After this there is a major divergence, indicating that the upper tail of the residual distribution departs from what a Gaussian would give. Overall, we would say that the Gaussian assumption holds reasonably well; further analysis could be conducted to identify the source of this divergence.
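For reference, the two diagnostic plots can be produced from any fitted `lm` object roughly as follows. This is a sketch on a simulated model (the variables `x` and `y` are hypothetical), not the report’s model_3; the plotting calls only run in an interactive session.

```r
# Simulated well-behaved linear model (hypothetical data).
set.seed(5)
n <- 300
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

res         <- resid(fit)
fitted_vals <- fitted(fit)

if (interactive()) {
  # Residuals vs. fitted: look for curvature (linearity) and funnels (variance).
  plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  # Normal QQ-plot: points near the line support the Gaussian assumption.
  qqnorm(res)
  qqline(res)
}

# Numeric companion to the QQ-plot: correlation between theoretical and
# sample quantiles (values near 1 mean the points hug a straight line).
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

By construction, least squares residuals average to zero and are uncorrelated with the fitted values, so what the first plot can reveal is curvature or changing spread rather than an overall trend.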
External factors and lifestyle choices play a critical role in shaping students’ academic outcomes. For this part of the project, we focus on analyzing how the following variables interact to affect exam scores: Extracurricular_Activities, Physical_Activity, Tutoring_Sessions, Parental_Involvement, and Peer_Influence. By leveraging statistical modeling, clustering, and visualizations, we aim to uncover patterns and relationships that help address this question.
An interaction plot visualizes the effect of physical activity hours on exam scores, split by whether or not students participate in extracurricular activities.
The interaction plot illustrates the relationship between Physical_Activity, Extracurricular_Activities, and Exam_Score, providing insights into how these external factors influence academic performance. The x-axis represents the levels of physical activity, while the y-axis shows the average exam scores. Two lines depict the students who participate in extracurricular activities (orange dashed line) and those who do not (blue solid line).
From the graph, it is evident that exam scores increase as physical activity levels rise, regardless of extracurricular participation. However, students who engage in extracurricular activities consistently outperform those who do not, as shown by the higher position of the orange line across all levels of physical activity. Moreover, the effect of physical activity on exam scores is more pronounced for students involved in extracurricular activities, as indicated by the steeper slope of the orange line compared to the blue line. This suggests that extracurricular involvement amplifies the positive impact of physical activity on academic outcomes.
These findings highlight the independent and combined benefits of physical activity and extracurricular activities. While physical activity alone has a positive influence, the addition of extracurricular engagement further boosts exam performance. This interaction is particularly relevant for designing balanced academic and extracurricular programs that maximize student success.
The elbow plot and clustering visualization explore the relationship between tutoring sessions and exam scores, providing insights into how students’ engagement in tutoring correlates with their academic performance. The elbow plot shows the total within-cluster sum of squares (WSS) as the number of clusters (k) increases. The “elbow” point at k = 3 suggests that three clusters are the most meaningful grouping for the data, balancing simplicity and explanatory power. The clustering plot reveals these three clusters, with the x-axis representing the number of tutoring sessions and the y-axis representing exam scores, while colors indicate cluster membership.
The clustering results identify three distinct student groups. Cluster 1 includes students who attend a high number of tutoring sessions (3–8) and achieve varied exam scores, ranging from low to moderate (60–80). Cluster 2 comprises students who attend few or no tutoring sessions (0–2) yet consistently achieve high scores (80–100), suggesting these students may benefit from strong intrinsic motivation or external support, such as parental involvement or effective self-study habits. However, Cluster 3 consists of students who also attend minimal tutoring sessions (0–2) but consistently achieve low scores (55–70), representing a group that may require targeted academic support or additional resources. Because low tutoring attendance appears in both the highest- and lowest-scoring clusters, these findings indicate that tutoring sessions alone do not uniformly enhance exam performance; other factors such as motivation, learning strategies, and external support likely play a role in academic success. Cluster 2 could be read as a hint that 1–2 tutoring sessions is a sweet spot for exam scores, but Cluster 3 shows that minimal tutoring is not sufficient on its own.
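A minimal sketch of the elbow procedure with `kmeans()` is given below. The data are simulated to contain three loose groups, and standardizing the two variables before clustering is our assumption; the report does not say whether its data were scaled.

```r
# Simulated stand-in with three loose groups (hypothetical data).
set.seed(6)
sim <- data.frame(
  Tutoring_Sessions = c(rpois(100, 1), rpois(100, 1), rpois(100, 5)),
  Exam_Score        = c(rnorm(100, 62, 3), rnorm(100, 88, 3), rnorm(100, 70, 5))
)

# Assumption: put both variables on the same scale before clustering.
scaled <- scale(sim)

# Total within-cluster sum of squares (WSS) for k = 1..6: the elbow curve.
wss <- sapply(1:6, function(k) {
  kmeans(scaled, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 1))

# Fit the chosen solution and inspect the cluster sizes.
km <- kmeans(scaled, centers = 3, nstart = 20)
print(table(km$cluster))
```

The elbow is the value of k after which adding another cluster reduces WSS only slightly; `nstart = 20` reruns the algorithm from several random starts to reduce the chance of a poor local solution.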
This heat map explores the relationship between Peer Influence, Parental Involvement, and Average Exam Scores, providing insights into how these external factors jointly impact academic performance. The x-axis represents the levels of peer influence (Negative, Neutral, Positive), while the y-axis represents the levels of parental involvement (Low, Medium, High). The color gradient, ranging from blue to red, indicates the average exam scores, with red representing higher scores and blue representing lower scores.
The heat map reveals several important patterns. Exam scores are highest when both parental involvement and peer influence are at their highest levels, as represented by the bright red tile in the bottom-right corner (High Parental Involvement and Positive Peer Influence). Conversely, the lowest exam scores occur when both factors are at their lowest levels, shown by the blue tile in the middle-left corner (Low Parental Involvement and Negative Peer Influence). This highlights a clear positive correlation between supportive environments (both parental and peer-based) and academic performance. Interestingly, even in cases where peer influence is positive, low levels of parental involvement result in only moderate scores, emphasizing the importance of parental engagement in students’ academic success.
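Each tile of such a heat map is a group mean. The sketch below computes those means with base R’s `aggregate()` on simulated data; the data frame `sim`, the level orderings, and the effect sizes are hypothetical and chosen only for illustration.

```r
# Simulated stand-in: every combination of the two factors (hypothetical data).
set.seed(9)
lev_peer <- c("Negative", "Neutral", "Positive")
lev_par  <- c("Low", "Medium", "High")
grid <- expand.grid(Peer_Influence = lev_peer, Parental_Involvement = lev_par)
sim  <- grid[rep(seq_len(nrow(grid)), each = 30), ]

# Fix the level order so plot axes run Negative..Positive and Low..High.
sim$Peer_Influence       <- factor(sim$Peer_Influence, levels = lev_peer)
sim$Parental_Involvement <- factor(sim$Parental_Involvement, levels = lev_par)
sim$Exam_Score <- 65 + as.integer(sim$Peer_Influence) +
  as.integer(sim$Parental_Involvement) + rnorm(nrow(sim), sd = 3)

# One mean exam score per (peer influence, parental involvement) cell.
cell_means <- aggregate(Exam_Score ~ Peer_Influence + Parental_Involvement,
                        data = sim, FUN = mean)
print(cell_means)
```

The resulting long table has one row per tile, which is the shape a tile-based heat map expects.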
Next, we aimed to explore the effect of a student’s background (the level of parental involvement, access to resources, and the type of school attended) on exam performance.
Dendrogram
We created a dendrogram with complete linkage based on the students’ exam scores. We divided the branch into 3 different clusters, as there are 3 distinct levels of parental involvement. By principle, branches of the same colour represent students that are more similar in terms of exam scores than branches of different colours. Among the three clusters, the red cluster is significantly smaller than the green and blue clusters, and the green cluster is observably larger than the blue cluster.
The cluster labels below the dendrogram were coloured by the level of parental involvement in the following way: low (purple), medium (orange), and high (green). All three clusters contain all 3 labels, with no label being more prominent in one cluster than in another.
Thus, this clustering gives no indication that the level of parental involvement is strongly related to students’ exam scores.
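The visual comparison of cluster colours and label colours can also be expressed as a cross-tabulation. Below is a sketch on simulated data (the data frame `sim` is hypothetical, and involvement is generated independently of the score) using complete-linkage `hclust()` on exam scores, cut into 3 clusters.

```r
# Simulated stand-in data (hypothetical; involvement independent of score).
set.seed(7)
n <- 150
sim <- data.frame(
  Exam_Score = round(rnorm(n, mean = 67, sd = 4)),
  Parental_Involvement = factor(
    sample(c("Low", "Medium", "High"), n, replace = TRUE),
    levels = c("Low", "Medium", "High")
  )
)

# Complete-linkage hierarchical clustering on exam scores only.
hc <- hclust(dist(sim$Exam_Score), method = "complete")

# Cut the tree into 3 clusters and compare with parental involvement.
cluster <- cutree(hc, k = 3)
tab <- table(cluster, sim$Parental_Involvement)
row_props <- prop.table(tab, margin = 1)

print(tab)
print(round(row_props, 2))
```

If the row proportions look similar across clusters, cluster membership carries little information about parental involvement, which is the pattern described for the dendrogram.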
Next, we wanted to examine a student’s study habits, specifically the hours studied, and whether these depend on their access to resources. We plotted violin plots with embedded box plots across the three levels of access to resources: high, low, and medium. A striking observation was how similar the median study hours were across all three groups, at about 20 hours. Each distribution shows a comparable spread from approximately 10 to 30 hours, with symmetrical patterns.
Looking at group-specific patterns, students with high access to resources show slightly more variability in their study hours, particularly in the lower quartile, while those with low access demonstrate a more concentrated distribution around the median with a more pronounced peak. All groups show notable outliers at both extremes, with some students studying close to 0 hours and others studying close to 40 hours.
The similarity in median study hours across resource levels suggests that access to resources may not be a strong determinant of how much time students dedicate to studying. The consistent patterns across all three groups indicate that study time might be more influenced by personal habits, motivation, or other individual factors rather than access to resources. This finding is particularly interesting as it suggests that while resource access may impact other aspects of academic performance, it doesn’t appear to significantly affect study time allocation.
Since access to resources didn’t seem to have a significant effect on hours studied, we wanted to see whether the same would apply for the effect of access to resources on exam performance. Moreover, we also wanted to test whether the type of school that a student goes to, and parental involvement, affect exam performance.
The density plots reveal varying relationships between exam scores and three factors: parental involvement, access to resources, and school type. The exam scores generally range from 60 to 80 points across all categories, with very few students scoring above 85 regardless of their circumstances. Parental involvement shows some expected patterns, with high involvement demonstrating a slightly right-shifted peak, while medium parental involvement displays a wider distribution, suggesting more variability in student outcomes, and low parental involvement shows a lower peak.
In terms of access to resources, there are subtle but meaningful differences in the distributions. Students with high access to resources tend to have a slightly right-shifted distribution, indicating slightly better exam performance, while those with low and medium resource access peak earlier, around the 68 mark. The overlap between these distributions suggests that while resource access plays a role in academic performance, its effect is not as significant as one might expect.
Our most interesting result came from the comparison of exam score distributions between the two types of school, where private and public school distributions are almost identical. Both school types show peaks around 67 points with nearly identical distribution shapes, suggesting that the type of school attended may have less influence on exam performance than other factors. This challenges common assumptions about private education advantages and indicates that other variables, such as parental involvement and resource access, might be more significant determinants of academic success.
##
## Welch Two Sample t-test
##
## data: Exam_Score by School_Type
## t = 0.72311, df = 3881.2, p-value = 0.4697
## alternative hypothesis: true difference in means between group Private and group Public is not equal to 0
## 95 percent confidence interval:
## -0.1279822 0.2775555
## sample estimates:
## mean in group Private  mean in group Public
##              67.28771              67.21292
To check our observations regarding the effect of school type on exam scores, we use the Welch two sample t-test to test whether mean exam scores differ between private and public school students. The results agree with what we observed in the density plot: there is no statistically significant difference in exam scores between private and public schools, as the p-value of 0.4697 is well above the significance level of 0.05, and the mean difference is only about 0.07 points. Thus, we find no evidence that school type meaningfully affects student exam performance.
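For readers who want to reproduce this kind of test, the sketch below runs a Welch two sample t-test on simulated data (the data frame `sim` and its group sizes are hypothetical). In R, `t.test()` uses the Welch version by default because `var.equal` defaults to `FALSE`.

```r
# Simulated stand-in: two groups drawn from the same distribution
# (hypothetical data, so no true difference in means).
set.seed(8)
sim <- data.frame(
  School_Type = rep(c("Private", "Public"), times = c(120, 280)),
  Exam_Score  = rnorm(400, mean = 67, sd = 4)
)

# Welch two sample t-test (unequal variances allowed).
tt <- t.test(Exam_Score ~ School_Type, data = sim)

mean_diff <- unname(tt$estimate[1] - tt$estimate[2])  # Private minus Public
print(tt$method)
print(mean_diff)
print(tt$conf.int)
print(tt$p.value)
```

For a two-sided test at the 5% level, the p-value exceeds 0.05 exactly when the 95% confidence interval for the difference in means contains zero, which is the situation in the report’s output.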
Throughout our analysis, in an attempt to understand which factors may be most important for a student’s academic success, we explored key relationships among the variables in the data and with our overall variable of interest, Exam_Score.
In our exploration of the quantitative relationships, we found significant correlations between the variables Attendance and Hours_Studied and exam performance. Our best-fitting model demonstrated that students who spend more time studying and have a higher attendance percentage are predicted to perform better on exams. These results are backed by our intuition, but it’s important to formally test these inferences; backing them up with statistically significant findings gives us a much better idea of how students should spend their time in order to perform well on exams. One surprising result from this section of the analysis was that Previous_Scores was only faintly correlated with Exam_Score, meaning that we couldn’t use it to effectively predict future exam performance. Our findings for this section can be useful for a student looking for the most important areas in which to spend their time.
In our analysis of how external factors and lifestyle choices may influence performance, we found many interesting results. We saw that physical activity paired with extracurricular activities was associated with better exam performance. An interesting relationship we discovered through cluster analysis was the one between the number of tutoring sessions and exam performance: the highest-scoring cluster consisted of students with few tutoring sessions, while students with many sessions had low to moderate scores. Further analysis could be conducted into that relationship to see what could be driving this finding; we hypothesize that the students with fewer sessions may have a higher intrinsic motivation to study and perform academically. We also saw that high levels of parental involvement and positive peer influence were associated with better exam performance. With our findings in this section we are able to recommend lifestyle choices to students, such as increased physical activity and involvement in extracurricular activities, that may help them perform well on exams, along with curating a positive peer and parental environment around them (to the best of their ability).
In our final section we explored whether certain factors relating to a student’s background, such as access to resources and type of school, had an effect on performance. In our exploration of these relationships, we didn’t uncover any drastic or significant results. We saw some hint that students with more access to resources may perform slightly better, but again the differences were small. One interesting result was that the type of school a student attended (Public vs. Private) had no detectable effect on performance; future analysis with more data from other regions could be done to determine whether this holds more generally. Our findings in this section suggest that a student’s economic background, as reflected in these factors, does not have a significant relationship with their performance. We find this to be a positive thing, as education is not something that should be limited by financial challenges.
Some limitations we ran into throughout the analysis came from the way the data were recorded. For some quantitative variables such as Sleep_Hours we were given only an average; we would have preferred a finer-grained measurement, for example in minutes, so that we could have more variability in the data. We also feel that exam score isn’t always the most effective measure of a student’s academic performance; we would have liked more metrics of success, such as GPA or SAT scores. With those variables we could have explored whether those metrics are related at all and how they are affected by other factors.