Introduction

The data we explore in this research project describe student achievement at two Portuguese secondary schools. Compiled from school reports and questionnaires, the data contain 395 observations of 33 variables, where each observation is a student.

Regarding student performance, we have grades in both the Portuguese language class and the math class. For both schools, we have grades for the first, second, and third periods.

Our goal for this research project is to use this dataset to better understand the impact of students’ behaviors and support systems on their grades. To do this, we explore two student behaviors: attending school and consuming alcohol. We also explore two support systems: the students’ parents (their jobs and education) and the amount of educational support that the family and the school provide to different students. We compare each of these variables to the student’s absences, failures, or grades. These comparisons show how the variables correlate with the performance indicators and help us understand some of the factors that contribute to student performance. We do that by asking the following questions:

Question 1

We start by looking at the impact of students’ decisions on their performance. One of the questions we want to answer is whether students with higher workday alcohol consumption have a higher number of school absences.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  absences by Dalc
## Bartlett's K-squared = 53.081, df = 4, p-value = 8.194e-11
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  absences and Dalc
## F = 3.7754, num df = 4.000, denom df = 53.178, p-value = 0.008936
## 
##  Welch Two Sample t-test
## 
## data:  absences by highDalc
## t = 2.3115, df = 35.188, p-value = 0.02678
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2897818 4.4638289
## sample estimates:
## mean in group highAlc  mean in group lowAlc 
##              5.911765              3.534959

It seems that school absences increase the more often a student drinks alcohol on workdays. We first want to see whether the average number of absences varies across levels of workday alcohol consumption. We perform a Bartlett test and a one-way test. The Bartlett test returns a p-value of \(8.194 \times 10^{-11}\). With this p-value, there is sufficient evidence at \(\alpha = 0.05\) to reject the null hypothesis that the variance of absences is the same for every level of workday alcohol consumption. Because the variances differ, we use a one-way test that does not assume equal variances. It returns a p-value of \(0.008936\), so there is sufficient evidence at \(\alpha = 0.05\) to reject the null hypothesis that the mean number of absences is the same for every level of workday alcohol consumption.

Next, we ask whether students who drink large amounts of alcohol during the work week have more absences on average than students who do not. We classify a student as having high workday alcohol consumption if they recorded a value of \(4\) or higher, and as having low workday alcohol consumption otherwise, that is, a value of \(3\) or lower. The average number of absences is \(5.911765\) for students in the high consumption category and \(3.534959\) for students in the low consumption category.

To see whether this difference is significant, that is, different from \(0\), we perform a Welch two-sample t-test on these groups and get a p-value of \(0.02678\). At \(\alpha = 0.05\), there is sufficient evidence to reject the null hypothesis that the difference in means is \(0\). It follows that higher alcohol consumption during the work week is associated with more absences.
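The grouping and the Welch statistic described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code that produced the output above; the function names and the toy records are hypothetical, and the p-value is left out because it needs a t-distribution routine that the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def split_by_dalc(records, threshold=4):
    """Split absences into high (Dalc >= threshold) and low consumption groups."""
    high = [absences for dalc, absences in records if dalc >= threshold]
    low = [absences for dalc, absences in records if dalc < threshold]
    return high, low


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance, n - 1 denominator).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up (Dalc, absences) pairs for illustration only; this is not the study data.
toy_records = [(1, 2), (1, 0), (2, 4), (3, 6), (1, 3), (4, 10), (5, 4), (4, 7)]
high, low = split_by_dalc(toy_records)
t_stat, df = welch_t(high, low)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The non-integer degrees of freedom in the output above (35.188) come from the same kind of approximation: the Welch test does not pool the two group variances.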

Question 2

We also wanted to learn whether more absences are associated with lower final grades, so we examine the variables absences and G3 using a contour plot. This contour plot is informative because it visualizes the joint distribution of absences and final grades of Portuguese high school students along with the corresponding 2D kernel density estimate.

## [1] -0.09137906

From the contour plot, we can infer that the mode of the joint distribution of absences and final grades of Portuguese high school students is located at ~1 absence and ~12.5 out of 20 on the grade scale, as indicated by the inner contour lines, which represent the “peaks” of the 2D kernel density estimate. It is also worth noting that the contour lines do not enclose the data points beyond ~12 absences. The joint distribution is very tightly clustered in the approximate range of ~0 to ~2 absences and ~11 to ~14 out of 20 on the grade scale, fairly tightly clustered in the approximate range of ~2 to ~4 absences and ~10 to ~11 & ~14 to ~15 out of 20, and less tightly clustered in the approximate range of ~4 to ~5 absences and ~9 to ~10 & ~15 to ~16 out of 20. Beyond these ranges, the joint distribution is relatively sparse.

Overall, there appears to be a weak negative association between absences and final grades of Portuguese high school students, as corroborated by a low negative Pearson’s correlation coefficient of -0.09137906. Although the relationship is inverse overall, two notable outliers stand out: a student with a final grade of 16 out of 20 despite 30 absences and a student with a final grade of 14 despite 32 absences.
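For reference, Pearson’s correlation coefficient is the covariance of the two variables scaled by the product of their standard deviations. The sketch below is a minimal Python illustration with a hypothetical `pearson_r` helper and made-up values; it is not the code used to obtain the coefficient reported above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant (zero variance).
    return sxy / sqrt(sxx * syy)


# Made-up absences and final grades for illustration only; not the study data.
toy_absences = [0, 2, 4, 6, 30]
toy_grades = [12, 13, 11, 10, 16]
r = pearson_r(toy_absences, toy_grades)
print(f"r = {r:.3f}")
```

A coefficient near 0, such as the one reported above, means the linear relationship is weak, even when its sign is negative.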

This graph is a scatterplot of students’ period 1 and period 2 grades. The size of each point reflects the number of absences the student had: the larger the point, the more absences. A line of best fit is also plotted. We hypothesized that students’ grades would be consistent from period to period, which is supported by the positively sloped fitted line. We also hypothesized that the number of absences would decrease as grades increased, which would mean that more absences are associated with lower grades and fewer absences with higher grades. Given this hypothesis, we expected the points to get smaller as grades increase. However, this is not the case, which suggests little to no association between absences and grades.

Question 3

We also wanted to better understand the relationship between a student’s final grade and their mother’s education and job, as well as the relationship between a student’s final grade and their father’s education and job. To do this, we plotted the variables G3, Medu, and Mjob, as well as G3, Fedu, and Fjob, using faceted histograms. These faceted histograms are informative because they display the distributions of the final grade of Portuguese high school students conditional on mother’s education and mother’s job, and conditional on father’s education and father’s job.

In the student performance dataset, there appear to be no students whose mothers work in health with no education, no students whose mothers work in services with no education, no students whose mothers work as teachers with no education, no students whose mothers work as teachers with primary education (4th grade), and no students whose mothers work as teachers with 5th to 9th grade education.

Overall, as the level of mother’s education increases, the distribution of the final grade of Portuguese high school students tends to become more left-skewed and its center (the mode) shifts to the right, implying that students whose mothers have a higher level of education tend to have higher final grades, holding mother’s job constant. Likewise, holding the level of mother’s education constant, students whose mothers work in health, in services, or as teachers tend to have higher final grades than those whose mothers work in other jobs or at home. However, the association between final grade and mother’s job, holding mother’s education constant, appears to be weaker than the association between final grade and mother’s education, holding mother’s job constant.

## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  G3 by I(Medu:Mjob)
## Fligner-Killeen:med chi-squared = 14.851, df = 19, p-value = 0.732
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## I(Medu:Mjob)  19    562  29.603   3.003 2.06e-05 ***
## Residuals    629   6201   9.858                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis of the Fligner-Killeen test of homogeneity of variances is that the variance of final grade is equal across all mother’s education - mother’s job interaction groups (samples). The alternative hypothesis is that the variance of final grade is not equal across all of these groups, i.e., it differs for at least one mother’s education - mother’s job interaction group. The test statistic is 14.851 with 19 degrees of freedom, and the p-value is 0.732. Since the p-value is greater than 0.05, we fail to reject the null hypothesis at \(\alpha = 5\%\): there is not sufficient evidence that the variance of final grade differs across the mother’s education - mother’s job interaction groups, so the equal-variance assumption of the analysis of variance is reasonable here.

The null hypothesis of the analysis of variance test, fitted on the combined mother’s education - mother’s job groups, is that the mean of final grade is equal across all mother’s education - mother’s job interaction groups (samples). The alternative hypothesis is that the mean of final grade is not equal across all of these groups, i.e., it differs for at least one mother’s education - mother’s job interaction group. The F-value is 3.003 with 19 and 629 degrees of freedom, and the p-value is 2.06e-05. Since the p-value is less than 0.05, we reject the null hypothesis in favor of the alternative hypothesis at \(\alpha = 5\%\): there is sufficient evidence that the mean of final grade differs for at least one mother’s education - mother’s job interaction group.
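As a quick check on the analysis of variance table above, the F-value is the ratio of the between-group mean square to the residual mean square, both taken directly from the table:

\[ F = \frac{MS_{\text{groups}}}{MS_{\text{residuals}}} = \frac{29.603}{9.858} \approx 3.003 \]

The degrees of freedom are also consistent with the plots: 5 education levels and 5 job categories give 25 possible combinations, 5 of which contain no students, leaving 20 groups and \(20 - 1 = 19\) degrees of freedom for the groups. The 629 residual degrees of freedom then correspond to \(629 + 20 = 649\) observations in this model.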

In the student performance dataset, there appear to be no students whose fathers work in health with no education, no students whose fathers work in health with primary education (4th grade), no students whose fathers work in services with no education, no students whose fathers work as teachers with no education, and no students whose fathers work as teachers with 5th to 9th grade education. There appear to be fewer students whose fathers work at home, in health, or as teachers than students whose mothers work at home, in health, or as teachers. In contrast, there appear to be almost equal numbers of students whose fathers work in other jobs or services and students whose mothers work in other jobs or services. Furthermore, there appear to be almost equal numbers of students whose fathers have no education and students whose mothers have no education. There appear to be more students whose fathers have received primary education (4th grade) or 5th to 9th grade education than students whose mothers have received primary education (4th grade) or 5th to 9th grade education. In contrast, there appear to be fewer students whose fathers have received secondary education or higher education than students whose mothers have received secondary education or higher education.

Overall, as the level of father’s education increases, the distribution of the final grade of Portuguese high school students tends to become slightly more left-skewed (except for students whose fathers have secondary education) and its center (the mode) shifts slightly to the right, implying that students whose fathers have a higher level of education tend to have slightly higher final grades, holding father’s job constant. However, the association between final grade and father’s education, holding father’s job constant, appears to be weaker than the association between final grade and mother’s education, holding mother’s job constant. In other words, the positive relationship between final grade and parents’ education, holding parents’ job constant, appears to be weaker for fathers and stronger for mothers. Likewise, holding the level of father’s education constant, students whose fathers work in health, in services, or as teachers tend to have higher final grades than those whose fathers work in other jobs or at home. However, the association between final grade and father’s job, holding father’s education constant, appears to be almost as strong as, or only slightly weaker than, the association between final grade and father’s education, holding father’s job constant.

## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  G3 by I(Fedu:Fjob)
## Fligner-Killeen:med chi-squared = 30.255, df = 19, p-value = 0.04864
##               Df Sum Sq Mean Sq F value  Pr(>F)   
## I(Fedu:Fjob)  19    428   22.54   2.238 0.00194 **
## Residuals    629   6335   10.07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis of the Fligner-Killeen test of homogeneity of variances is that the variance of final grade is equal across all father’s education - father’s job interaction groups (samples). The alternative hypothesis is that the variance of final grade is not equal across all of these groups, i.e., it differs for at least one father’s education - father’s job interaction group. The test statistic is 30.255 with 19 degrees of freedom, and the p-value is 0.04864. Since the p-value is less than 0.05, we reject the null hypothesis in favor of the alternative hypothesis at \(\alpha = 5\%\): there is sufficient evidence that the variance of final grade differs for at least one father’s education - father’s job interaction group. The p-value is only marginally below 0.05, but because the analysis of variance assumes equal variances, its result for fathers should be read with some caution.

The null hypothesis of the analysis of variance test, fitted on the combined father’s education - father’s job groups, is that the mean of final grade is equal across all father’s education - father’s job interaction groups (samples). The alternative hypothesis is that the mean of final grade is not equal across all of these groups, i.e., it differs for at least one father’s education - father’s job interaction group. The F-value is 2.238 with 19 and 629 degrees of freedom, and the p-value is 0.00194. Since the p-value is less than 0.05, we reject the null hypothesis in favor of the alternative hypothesis at \(\alpha = 5\%\): there is sufficient evidence that the mean of final grade differs for at least one father’s education - father’s job interaction group.

Question 4

After looking at how parents’ education and occupation relate to final grade, we wanted to explore other support systems. In particular, we wanted to see which students received the most additional educational support, and whether the support each student received depended on support from their school and/or their family.

To analyze this research question, we made a faceted stacked bar chart. This plot shows the proportion of students who receive familial educational support given the number of classes they have failed in the past. It is faceted on whether or not students also receive educational support from the school. This visualization is informative because it shows that, among students who have failed 2 or more classes, the proportion receiving familial educational support is much greater for those who also receive school support than for those who do not. This provides insight into which students have access to multiple academic resources, and raises the question: Does a student having support from the school push families to also provide support to the student?
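The proportions behind a chart like this can be tabulated before plotting. The sketch below is a minimal Python illustration, not the code behind the figure; the field names `schoolsup`, `failures`, and `famsup` are assumed names for the school support, past failures, and family support variables, and the records are made up.

```python
from collections import Counter


def famsup_share(records):
    """Map (schoolsup, failures) to the proportion of students with family support."""
    totals = Counter()
    with_famsup = Counter()
    for schoolsup, failures, famsup in records:
        key = (schoolsup, failures)
        totals[key] += 1
        if famsup == "yes":
            with_famsup[key] += 1
    # Each share is the height of the "yes" segment of one stacked bar.
    return {key: with_famsup[key] / totals[key] for key in totals}


# Made-up (schoolsup, failures, famsup) records for illustration only.
toy_records = [
    ("yes", 2, "yes"), ("yes", 2, "yes"), ("yes", 2, "no"),
    ("no", 2, "yes"), ("no", 2, "no"),
    ("no", 0, "yes"),
]
shares = famsup_share(toy_records)
for (schoolsup, failures), share in sorted(shares.items()):
    print(f"schoolsup={schoolsup}, failures={failures}: {share:.2f}")
```

Comparing the shares for the same number of failures across the two `schoolsup` values is the comparison the faceted chart makes visually.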

Conclusion

From our studies, we were able to draw a few different conclusions. First, we saw that students with high workday alcohol consumption had significantly more absences on average. We also saw that there appears to be a weak negative association between absences and final grades of Portuguese high school students, as corroborated by a low negative Pearson’s correlation coefficient of -0.09. Although the relationship between absences and final grades is inverse, it appears to be very weak. Next, we found that, holding parents’ jobs constant, the more educated the parents were, the higher the students’ grades, and this association is stronger for mothers than for fathers. Holding education constant, we see that students with parents in health, services, or teaching professions had higher final grades, on average, than students whose parents work at home or in other types of jobs. Finally, we saw that, among students who have failed 2 or more classes, the proportion receiving familial educational support is much higher for those who also receive educational support from their school than for those who do not.

We saw a lot of potential for future studies using this dataset. To further research themes from question 1, we could check whether the association between workday alcohol consumption and absences differs by sex, or whether the same association holds for weekend alcohol consumption. To further explore what drives grades beyond absences, it may be worth exploring the correlation between student study time and grades, as there may be a direct relationship between the amount of time that a student studies and their academic performance. Finding a correlation between these two variables would help show whether extra studying is associated with better grades. To study parents’ education further, we could explore the relationships between the final grade and mother’s education, holding mother’s job constant, and between the final grade and father’s education, holding father’s job constant, to check whether they still hold (i.e., remain relatively constant or follow the same trend) when the dataset is subsetted by the student’s guardian (‘mother’, ‘father’ or ‘other’) or by the parents’ cohabitation status (‘T’ - living together or ‘A’ - apart). Lastly, to better understand family support, we could examine whether a student having support from the school pushes families to also provide support to the student. Answering these questions seems like the next logical step for this project, and would allow for a more robust understanding of this dataset and the relationships among its variables.