Motivation

Whether in university or high school, students are constantly evaluated on their knowledge and learning abilities through a variety of formats such as tests and projects. These examinations can have crucial implications for students' futures, so it is important to pinpoint what helps a student perform better on them. Throughout this study we will evaluate performance and demographic information to see whether we can uncover relationships among these variables that give a student insight into what is affecting their performance, positively or negatively. Such insight can be a crucial advantage for any student looking to improve their academic performance.

Overview

Below is a list of all the variables available in the dataset, along with a brief description of each:

  • Hours_Studied: Number of hours spent studying per week
  • Attendance: Percentage of classes attended
  • Parental_Involvement: Level of parental involvement in student’s education (Low, Medium, High)
  • Access_To_Resources: Availability of educational resources (Low, Medium, High)
  • Extracurricular_Activities: Participation in extracurricular activities (Yes, No)
  • Sleep_Hours: Average number of sleep hours per night
  • Previous_Scores: Scores from previous exams
  • Motivation_Level: Student’s level of motivation (Low, Medium, High)
  • Internet_Access: Availability of internet access (Yes, No)
  • Tutoring_Sessions: Number of tutoring sessions attended per month
  • Family_Income: Family income level (Low, Medium, High)
  • Teacher_Quality: Quality of the teachers (Low, Medium, High)
  • School_Type: Type of school attended (Public, Private)
  • Peer_Influence: Influence of peers on academic performance (Positive, Neutral, Negative)
  • Physical_Activity: Average number of hours of physical activity per week
  • Learning_Disabilities: Presence of learning disabilities (Yes, No)
  • Parental_Education_Level: Highest education level of parents (High School, College, Postgraduate)
  • Distance_From_Home: Distance from home to school (Near, Moderate, Far)
  • Gender: Gender of the student (Male, Female)
  • Exam_Score: Final exam score

Research Questions

Here we outline the three main questions we will attempt to answer throughout the analysis:

  • How are the quantitative variables related to each other, and can we use them to successfully predict exam score?

  • How do external factors and lifestyle choices influence a student’s academic performance?

  • What are the impacts of the level of parental involvement and the level of access to resources on exam scores? Does a student’s exam score also depend on whether they attend a public or private school?

Exploring Quantitative Relationships

In the following section, we explore the relationships between five key quantitative variables (Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, Exam_Score) and see whether one or more of them can be used as predictors of exam score. To get a basic understanding of potential relationships, we begin with a correlogram of the pairwise correlations between these variables.
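
Since the code itself is not echoed in this report, here is a minimal sketch of how such a correlogram could be produced, assuming the data frame is named student_data (the name that appears in the model output later on) and using the corrplot package as one possible choice:

library(corrplot)

# Subset the five quantitative variables of interest
quant_vars <- student_data[, c("Hours_Studied", "Attendance", "Sleep_Hours",
                               "Previous_Scores", "Exam_Score")]

# Pairwise Pearson correlations, ignoring rows with missing values
cor_mat <- cor(quant_vars, use = "complete.obs")

# Draw the correlogram
corrplot(cor_mat, method = "circle", type = "lower", tl.col = "black")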

Correlogram

From our correlogram, we see that the variables are for the most part uncorrelated with each other. The most notable relationships we observe are among Exam_Score, Hours_Studied, and Attendance, which show fairly strong correlations with one another; this matches our intuition and suggests that attending classes and studying have a positive impact on exam performance. We also see a faintly positive correlation between Previous_Scores and Exam_Score; this too aligns with intuition, although the correlation is surprisingly weak. Now that we have some foundational knowledge of the nature of these relationships, we will perform a more in-depth analysis to solidify our understanding and formally test them.

PCA and PCA Biplot

Below we show the results of running PCA on our quantitative subset. We are aiming to once again explore the linear relationships between our quantitative variables and see how they relate to exam score.
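
The output below could be produced with a call along the following lines; the object names quant_vars and student_pca are illustrative, but centering and scaling the variables before PCA is standard practice:

# Using the same quantitative subset as above
quant_vars <- student_data[, c("Hours_Studied", "Attendance", "Sleep_Hours",
                               "Previous_Scores", "Exam_Score")]

# Run PCA on the centered and scaled variables
student_pca <- prcomp(quant_vars, center = TRUE, scale. = TRUE)

summary(student_pca)                      # variance table shown below
screeplot(student_pca, type = "lines")    # scree/elbow plot
round(student_pca$rotation[, 1:3], 3)     # loadings on the first three PCs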

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.3224 1.0183 1.0081 0.9777 0.49234
## Proportion of Variance 0.3497 0.2074 0.2032 0.1912 0.04848
## Cumulative Proportion  0.3497 0.5571 0.7603 0.9515 1.00000

Above we see a table summarizing the proportion of variance explained by each principal component, along with a scree (elbow) plot. From the table, the first principal component accounts for only about 35% of the total variation in the data, with the second and third principal components each accounting for roughly 20% more. We visualize the scree plot to get a better sense of how many of these principal components are “effective.” It suggests we should use only the first three principal components to summarize the data, as they fall just above the threshold marked by the dotted line; the remaining two principal components together account for less than a quarter of the total variation, which is why they are not deemed as effective.

Now, to further our understanding of the potential relationships between the variables, we include a table of loadings showing how each variable relates to each of the first three principal components:

##                        PC1          PC2         PC3
## Hours_Studied    0.4199049  0.521529444  0.31559583
## Attendance       0.5409271 -0.585364782 -0.09181658
## Sleep_Hours     -0.0262893  0.045312269  0.84845672
## Previous_Scores  0.1659517  0.619089370 -0.41467370
## Exam_Score       0.7091168 -0.005500975  0.01165815

From our scree plot, we will only look at the first three principal components, as these are the ones that most effectively summarize the data. From the table, we observe a very high positive coefficient for Exam_Score on PC1, indicating that a higher value of this component is associated with a higher Exam_Score. Still on PC1, we also observe high positive values for Attendance and Hours_Studied, which backs up what we saw in our correlogram about these variables being correlated with one another; Previous_Scores has a low positive loading on PC1, hinting at a relationship with Exam_Score, but a minimal one. Looking at PC2, we observe high positive coefficients for Hours_Studied and Previous_Scores, indicating some relationship between these two variables along the second principal component. Taking the table as a whole, Sleep_Hours does not seem to demonstrate any relationship with the other quantitative variables.

We will finalize our analysis of these variables with a PCA biplot, shown below.
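
A simple way to produce such a biplot from the prcomp object is sketched below; the original may have used a ggplot-based alternative such as factoextra, so this is only illustrative:

# PCA biplot of the first two components (observations as points, variables as arrows)
biplot(student_pca, choices = 1:2, scale = 0, cex = 0.6)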

From our biplot, we see results that back up what we observed with our previous two methods. We observe a relatively strong positive relationship between Previous_Scores and Hours_Studied, indicating that students who spend more time studying tended to score higher on previous exams; this aligns with our intuition and with the loadings table above. With regard to Exam_Score, we see weaker positive relationships with Hours_Studied and Attendance, implying that a student with a higher attendance percentage and more hours of study is expected to perform better on exams. One interesting feature of the plot is the relationship between Sleep_Hours and Attendance: there is a weak negative relationship between these variables, implying that students who sleep more show lower attendance percentages. This makes sense if higher sleep hours mean students are sleeping through classes to catch up on rest.

Linear Regression Analysis

Now that we have a solid understanding of how these variables relate to each other and to Exam_Score, we will formally test these relationships using linear regression models. We fit three different models in an attempt to accurately and effectively predict Exam_Score using the two variables we identified as having the strongest relationship with it.

Model 1

Our first model predicts Exam_Score using Attendance as the sole covariate.

## 
## Call:
## lm(formula = Exam_Score ~ Attendance, data = student_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0443 -1.8062 -0.1766  1.5007 31.1726 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 51.578577   0.272628  189.19   <2e-16 ***
## Attendance   0.195769   0.003374   58.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.166 on 6605 degrees of freedom
## Multiple R-squared:  0.3376, Adjusted R-squared:  0.3375 
## F-statistic:  3367 on 1 and 6605 DF,  p-value: < 2.2e-16

Above we see a table summarizing the results of our first model, along with a graph visualizing the data with the least squares line shown in blue. From the table, we see that our estimate for \(\beta_{Attendance}\) is indeed significant at the \(\alpha\) = 0.05 level. We can interpret the coefficient \(\beta_{Attendance}\) as follows:

0.196 is the expected difference in Exam_Score between two students whose Attendance happens to differ by one percentage point.

While the coefficient is not very large, this model still suggests a positive relationship between Attendance and Exam_Score. If we look at the \(R^2\) value, we see it is about 0.34, meaning that roughly 34% of the variability in the dependent variable (Exam_Score) is explained by the independent variable (Attendance). This fairly low value indicates that factors other than Attendance are influencing Exam_Score; we will build further models later on in an attempt to address this.
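
For reference, here is a minimal sketch of how this model and its accompanying scatterplot could be produced; the object name model_1 is our own, and the blue least squares line matches the geom_smooth() call implied by the output above:

library(ggplot2)

model_1 <- lm(Exam_Score ~ Attendance, data = student_data)
summary(model_1)

# Scatterplot of the data with the fitted least squares line in blue
ggplot(student_data, aes(x = Attendance, y = Exam_Score)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE, colour = "blue")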

Model 2

Our second model is the same as the first, except this time using Hours_Studied as the covariate.

## 
## Call:
## lm(formula = Exam_Score ~ Hours_Studied, data = student_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.532 -2.243 -0.111  2.046 33.493 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   61.456984   0.149196  411.92   <2e-16 ***
## Hours_Studied  0.289291   0.007154   40.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.483 on 6605 degrees of freedom
## Multiple R-squared:  0.1984, Adjusted R-squared:  0.1983 
## F-statistic:  1635 on 1 and 6605 DF,  p-value: < 2.2e-16

From our summary table, we once again observe that our estimate for \(\beta_{Hours\_Studied}\) is significant at the \(\alpha\) = 0.05 level. The coefficient value is 0.289, indicating a stronger positive relationship between Hours_Studied and Exam_Score. This stronger relationship can also be seen in our graph, where the least squares line has a steeper slope than in the first model. Our interpretation of the estimate of \(\beta_{Hours\_Studied}\) is as follows:

0.289 is the expected difference in Exam_Score between two students whose Hours_Studied happens to differ by one hour.

This relationship suggests that students who spend more hours studying will perform better on exams. This once again aligns with our intuition, but it is still useful to formally test it and see it backed up. Looking at the \(R^2\) value, we see it is lower than in our previous model at 0.198, which suggests that other factors account for much more of the variability in the dependent variable.

Model 3

In an attempt to find the best relationship between Exam_Score and our chosen quantitative variables, our next model will be fit using both Attendance and Hours_Studied as covariates. We hope that using both of these variables as covariates will help address the lower \(R^2\) values we observed in our first two models and account for more of the variation in Exam_Score.

## 
## Call:
## lm(formula = Exam_Score ~ Hours_Studied * Attendance, data = student_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0118 -1.3290 -0.1687  1.0450 31.5866 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.693e+01  7.788e-01  60.253  < 2e-16 ***
## Hours_Studied            2.266e-01  3.742e-02   6.055 1.49e-09 ***
## Attendance               1.808e-01  9.624e-03  18.781  < 2e-16 ***
## Hours_Studied:Attendance 8.308e-04  4.628e-04   1.795   0.0727 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.635 on 6603 degrees of freedom
## Multiple R-squared:  0.5415, Adjusted R-squared:  0.5413 
## F-statistic:  2599 on 3 and 6603 DF,  p-value: < 2.2e-16

Note: we have included an interaction term between Hours_Studied and Attendance in order to see whether the relationship between Exam_Score and one covariate depends on the value of the other covariate.

From the summary table, we once again observe positive, significant estimated coefficients for Hours_Studied and Attendance (0.227 and 0.181), similar to the single-covariate models’ values of around 0.29 and 0.20 respectively. One interesting thing we note is that the \(R^2\) value has indeed increased to around 0.54, which backs up our earlier claim that including both covariates would account for more of the variation in the dependent variable. Although the value has increased, it is still not great: a little under half of the variation in Exam_Score remains explained by other factors.

Our interaction term is not significant at the \(\alpha\) = 0.05 level (p = 0.073), telling us that we do not have enough evidence in the data to suggest that the effect of one covariate on Exam_Score changes depending on the level of the other.

Regression Diagnostics

To conclude this section of our analysis, we will perform diagnostic tests on our best performing regression model from above (model_3) to see how well it fits the data by checking the necessary statistical assumptions.

To do this we will make two plots (a sketch of both follows the list):

  1. Residuals vs. Fits plot
  • This plot will assess the linearity and constant variance assumptions of our linear model
  2. Normal QQ-plot
  • This plot will assess the Gaussian error assumption of our linear model
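
A minimal sketch of both plots, assuming the interaction model above is stored as model_3 (the name referenced in this section):

par(mfrow = c(1, 2))

# 1. Residuals vs. fitted values: checks linearity and constant variance
plot(fitted(model_3), resid(model_3),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fits")
abline(h = 0, lty = 2)

# 2. Normal QQ-plot of the residuals: checks the Gaussian error assumption
qqnorm(resid(model_3))
qqline(resid(model_3))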

Diagnostic Plots

From the Residuals vs. Fits plot we can assess the linearity and constant variance assumptions. We observe an even spread of points around the residual = 0 line, showing that the linearity assumption holds for our model. We also observe a constant spread of points across the x-axis with no signs of a pattern or curvature, telling us that the constant variance assumption holds as well.

Looking at the QQ-plot, we see that the sample quantiles line up very well with the theoretical quantiles up to a certain point, around 2.5 on the x-axis. Beyond this, there is major divergence, indicating that the distribution of the errors departs from normality in the upper tail. Overall, we would say the Gaussian assumption holds reasonably well; further analysis could be conducted to identify the source of this divergence.

External Factors and Student Performance

External factors and lifestyle choices play a critical role in shaping students’ academic outcomes. For this part of the project, we focus on analyzing how the interplay of the following variables impacts exam scores: Extracurricular_Activities, Physical_Activity, Tutoring_Sessions, Parental_Involvement, and Peer_Influence. By leveraging statistical modeling, clustering, and visualizations, we aim to uncover patterns and relationships that help address this question.

Interaction Plot

An interaction plot visualizes the effect of physical activity hours on exam scores, modulated by whether or not the students participate in extracurricular activities.
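
A sketch of how this interaction plot could be built, under the assumption that the plotted values are group means of Exam_Score; the colours and line types follow the description below:

library(dplyr)
library(ggplot2)

# Average exam score for each combination of activity hours and participation
interaction_means <- student_data %>%
  group_by(Physical_Activity, Extracurricular_Activities) %>%
  summarise(mean_score = mean(Exam_Score), .groups = "drop")

ggplot(interaction_means,
       aes(x = Physical_Activity, y = mean_score,
           colour = Extracurricular_Activities,
           linetype = Extracurricular_Activities)) +
  geom_line() +
  geom_point() +
  scale_colour_manual(values = c("No" = "steelblue", "Yes" = "darkorange")) +
  scale_linetype_manual(values = c("No" = "solid", "Yes" = "dashed")) +
  labs(x = "Hours of physical activity per week", y = "Average exam score")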

The interaction plot illustrates the relationship between Physical_Activity, Extracurricular_Activities, and Exam_Score, providing insights into how these external factors influence academic performance. The x-axis represents the levels of physical activity, while the y-axis shows the average exam scores. Two lines depict the students who participate in extracurricular activities (orange dashed line) and those who do not (blue solid line).

From the graph, it is evident that exam scores increase as physical activity levels rise, regardless of extracurricular participation. However, students who engage in extracurricular activities consistently outperform those who do not, as shown by the higher position of the orange line across all levels of physical activity. Moreover, the effect of physical activity on exam scores is more pronounced for students involved in extracurricular activities, as indicated by the steeper slope of the orange line compared to the blue line. This suggests that extracurricular involvement amplifies the positive impact of physical activity on academic outcomes.

These findings highlight the independent and combined benefits of physical activity and extracurricular activities. While physical activity alone has a positive influence, the addition of extracurricular engagement further boosts exam performance. This interaction is particularly relevant for designing balanced academic and extracurricular programs that maximize student success.

Elbow plot and cluster visualization

The elbow plot and clustering visualization explore the relationship between tutoring sessions and exam scores, providing insights into how students’ engagement in tutoring correlates with their academic performance. The elbow plot shows the total within-cluster sum of squares (WSS) as the number of clusters (k) increases. The “elbow” point at k = 3 suggests that three clusters are the most meaningful grouping for the data, balancing simplicity and explanatory power. The clustering plot reveals these three clusters, with the x-axis representing the number of tutoring sessions and the y-axis representing exam scores, while colors indicate cluster membership.
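
A sketch of the k-means workflow behind these plots; standardizing the two variables before clustering and the choice of nstart are our own assumptions:

# Standardize the two variables used for clustering
cluster_data <- scale(student_data[, c("Tutoring_Sessions", "Exam_Score")])

# Elbow plot: total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(cluster_data, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")

# Final solution with k = 3, then attach cluster labels for plotting
km <- kmeans(cluster_data, centers = 3, nstart = 25)
student_data$cluster <- factor(km$cluster)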

The clustering results identify three distinct student groups. Cluster 1 includes students who attend a high number of tutoring sessions (3–8) and achieve varied exam scores, ranging from low to moderate (60–80). Cluster 2 comprises students who attend few or no tutoring sessions (0–2) yet consistently achieve high scores (80–100), suggesting these students may benefit from strong intrinsic motivation or external support, such as parental involvement or effective self-study habits. It may also suggest that 1–2 sessions of tutoring is the ideal sweet spot for maximizing exam scores. Cluster 3, however, highlights students who attend minimal tutoring sessions (0–2) and consistently achieve low scores (55–70), representing a group that may require targeted academic support or additional resources. These findings indicate that tutoring sessions alone do not uniformly enhance exam performance; other factors such as motivation, learning strategies, and external support likely play a role in academic success. Still, they suggest that a minimal number of external tutoring sessions may be beneficial for exam performance.

Heat Map


This heat map explores the relationship between Peer Influence, Parental Involvement, and Average Exam Scores, providing insights into how these external factors jointly impact academic performance. The x-axis represents the levels of peer influence (Negative, Neutral, Positive), while the y-axis represents the levels of parental involvement (Low, Medium, High). The color gradient, ranging from blue to red, indicates the average exam scores, with red representing higher scores and blue representing lower scores.
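
A sketch of how this heat map could be generated: the grouping step computes the average exam score for each combination of the two factors, as described above, while the exact colour scale is an assumption on our part.

library(dplyr)
library(ggplot2)

# Average exam score per combination of peer influence and parental involvement
score_summary <- student_data %>%
  group_by(Peer_Influence, Parental_Involvement) %>%
  summarise(mean_score = mean(Exam_Score), .groups = "drop")

ggplot(score_summary,
       aes(x = Peer_Influence, y = Parental_Involvement, fill = mean_score)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(fill = "Average exam score")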

The heat map reveals several important patterns. Exam scores are highest when both parental involvement and peer influence are at their highest levels, as represented by the bright red tile in the bottom-right corner (High Parental Involvement and Positive Peer Influence). Conversely, the lowest exam scores occur when both factors are at their lowest levels, shown by the blue tile in the middle-left corner (Low Parental Involvement and Negative Peer Influence). This highlights a clear positive correlation between supportive environments (both parental and peer-based) and academic performance. Interestingly, even in cases where peer influence is positive, low levels of parental involvement result in only moderate scores, emphasizing the importance of parental engagement in students’ academic success.

Student Background and Performance

Next, we aimed to explore the effect of a student’s background (which includes the level of parental involvement, access to resources, and the type of school the student attended) on their exam performance.

Dendrogram

We created a dendrogram with complete linkage based on the students’ exam scores and cut the tree into 3 clusters, as there are 3 distinct levels of parental involvement. In principle, branches of the same colour represent students whose exam scores are more similar to each other than to those in branches of different colours. Among the three clusters, the red cluster is noticeably smaller than the green and blue clusters, and the green cluster is observably larger than the blue cluster.
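
A sketch of the clustering behind this dendrogram, assuming Euclidean distances on Exam_Score and complete linkage; colouring the leaf labels by parental involvement (as described below) would typically use an add-on package such as dendextend and is omitted here:

# Hierarchical clustering of students by exam score, complete linkage
dist_scores <- dist(student_data$Exam_Score)
hc_complete <- hclust(dist_scores, method = "complete")

plot(hc_complete, labels = FALSE)
rect.hclust(hc_complete, k = 3)   # outline the three clusters

# Cross-tabulate cluster membership against parental involvement
clusters <- cutree(hc_complete, k = 3)
table(clusters, student_data$Parental_Involvement)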

The dendrogram cluster labels below the dendrogram were coloured by the level of parental involvement in the following way: low (purple), medium (orange), and high (green). It can be observed that all clusters contain all three labels, with no label being noticeably more prominent in one cluster than another.

Thus, based on this clustering, the level of parental involvement does not appear to have a strong effect on students’ exam performance.

Violin Plot

Next, we wanted to examine the relationship between a student’s study habits, specifically the hours studied, and whether this depended on their access to resources, so we plotted a violin plot with embedded box plots across the three levels of resource access: high, medium, and low. A striking observation is how similar the median study hours are across all three groups, at about 20 hours per week. Each distribution shows a comparable spread from approximately 10 to 30 hours, with symmetrical patterns.
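
A sketch of this violin plot with embedded box plots; the column name Access_To_Resources follows the variable list above, and styling details are assumptions:

library(ggplot2)

ggplot(student_data, aes(x = Access_To_Resources, y = Hours_Studied)) +
  geom_violin(fill = "lightblue") +
  geom_boxplot(width = 0.1, outlier.size = 0.8) +
  labs(x = "Access to resources", y = "Hours studied per week")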

Looking at group-specific patterns, students with high access to resources show slightly more variability in their study hours, particularly in the lower quartile, while those with low access demonstrate a more concentrated distribution around the median with a more pronounced peak. All groups show notable outliers at both extremes, with some students studying close to 0 hours and others studying close to 40 hours.

The similarity in median study hours across resource levels suggests that access to resources may not be a strong determinant of how much time students dedicate to studying. The consistent patterns across all three groups indicate that study time might be more influenced by personal habits, motivation, or other individual factors rather than access to resources. This finding is particularly interesting as it suggests that while resource access may impact other aspects of academic performance, it doesn’t appear to significantly affect study time allocation.

Density Plots

Since access to resources didn’t seem to have a significant effect on hours studied, we wanted to see whether the same would apply for the effect of access to resources on exam performance. Moreover, we also wanted to test whether the type of school that a student goes to, and parental involvement, affect exam performance.
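
A sketch of one of these density plots (exam score by school type); the plots for parental involvement and access to resources follow the same pattern with a different fill variable:

library(ggplot2)

ggplot(student_data, aes(x = Exam_Score, fill = School_Type)) +
  geom_density(alpha = 0.4) +
  labs(x = "Exam score", fill = "School type")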

The density plots reveal varying relationships between exam scores and three factors: parental involvement, access to resources, and school type. Exam scores generally range from 60 to 80 points across all categories, with very few students scoring above 85 regardless of their circumstances. Parental involvement shows some expected patterns: high involvement produces a slightly right-shifted peak, medium involvement displays a wider distribution, suggesting more variability in student outcomes, and low involvement shows a lower peak.

In terms of access to resources, there are subtle but meaningful differences in the distributions. Students with high access to resources tend to have a slightly right-shifted distribution, indicating slightly better exam performance, while those with low and medium resource access peak earlier, around the 68-point mark. The overlap between these distributions suggests that while resource access plays a role in academic performance, it is not as significant as one might expect.

Our most interesting result came from the comparison of exam score distributions between the two types of school, where private and public school distributions are almost identical. Both school types show peaks around 67 points with nearly identical distribution shapes, suggesting that the type of school attended may have less influence on exam performance than other factors. This challenges common assumptions about private education advantages and indicates that other variables, such as parental involvement and resource access, might be more significant determinants of academic success.
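
The output below comes from a Welch two-sample t-test of exam scores by school type; a minimal version of the call is:

# Welch correction is the default (var.equal = FALSE)
t.test(Exam_Score ~ School_Type, data = student_data)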

## 
##  Welch Two Sample t-test
## 
## data:  Exam_Score by School_Type
## t = 0.72311, df = 3881.2, p-value = 0.4697
## alternative hypothesis: true difference in means between group Private and group Public is not equal to 0
## 95 percent confidence interval:
##  -0.1279822  0.2775555
## sample estimates:
## mean in group Private  mean in group Public 
##              67.28771              67.21292

To confirm our observations regarding the effect of school type on exam scores, we use the Welch two-sample t-test of exam scores between the two school types. The results confirm what we observed in the density plot: there is no statistically significant difference in exam scores between private and public schools, as the p-value of 0.4697 is well above the 0.05 significance level and the mean difference is only about 0.07 points. Thus, we conclude that school type does not meaningfully impact student exam performance.

Conclusion

Throughout our analysis, in an attempt to understand which factors may be most important for a student’s academic success, we explored key relationships among the variables in the data and between them and our overall variable of interest, Exam_Score.

In our exploration of the quantitative relationships, we found notable correlations between Attendance and Hours_Studied and exam performance; our best-fitting model showed that students who spend more time studying and have a higher attendance percentage are predicted to perform better on exams. These results are backed by our intuition, but it is important to formally test such inferences, and backing them up with statistically significant findings gives us a much better idea of how students should spend their time in order to perform well on exams. One surprising result from this section was that Previous_Scores was only weakly correlated with Exam_Score, meaning we could not use it to effectively predict future exam performance. Our findings in this section can be useful for a student deciding where to spend their time.

In our analysis of how external factors and lifestyle choices may influence performance, we found many interesting results. Physical activity paired with extracurricular activities was shown to have a positive impact on exam performance. An interesting relationship we discovered through cluster analysis was between the number of tutoring sessions and exam performance: students who had more tutoring sessions tended to perform worse than those with fewer sessions. Further analysis could be conducted to see what is driving this finding; we hypothesize that the students with fewer sessions may have higher intrinsic motivation to study and perform academically. We also saw that positive levels of parental involvement and peer influence had a beneficial impact on exam performance. With our findings in this section, we can recommend lifestyle choices, such as increased physical activity and involvement in extracurricular activities, that may help students perform well on exams, along with curating a positive peer and parental environment around them (to the best of their ability).

In our final section, we explored whether certain factors relating to a student’s background, such as access to resources and type of school, had an effect on performance. We did not uncover any drastic or significant results; we saw some hint that students with more access to resources may perform slightly better, but these differences were not dramatic. One interesting result was that performance did not differ based on the type of school a student attended (public vs. private). This result is very interesting, and future analysis with data from other regions could determine whether it holds more generally. Our findings in this section allow us to say that a student’s economic background does not have a significant relationship with their performance, which we find to be a positive thing, as education should not be limited by financial challenges.

One limitation we ran into throughout the analysis was the way the data were presented: for some quantitative variables, such as Sleep_Hours, we were given only an average, and we would have preferred a more granular measurement (for example, in minutes) to give more variability in the data. We also feel that exam score is not always the most effective metric of a student’s academic performance; we would have liked additional measures of success such as GPA or SAT scores. With those variables, we could have explored whether these metrics are related at all and how they are affected by other factors.