Introduction

The data set we are working with is from Kaggle, and contains information on students in a Portuguese language class from two public high schools in the Alentejo region of Portugal. The data was collected using school reports and questionnaires, and contains information on 33 variables for 649 students. Each row of the data set corresponds to a different student, and each column reports the value of a certain variable for that student. These variables include demographic data (age, sex, parents’ industry of employment, etc.), social data (relationship status, amount of free time, alcohol consumption, etc.), well-being data (quality of family relationships, Internet access, health status, etc.), and school data (number of absences, classes failed, extracurricular activities, etc.). The data consists of around 60% female students and around 40% male students. The students range in age from 15 to 22 years old. We are interested in studying any relationships that exist between a student’s weekend alcohol consumption and the demographic, social, well-being, and school variables. Weekend alcohol consumption is recorded as a categorical variable with levels from one to five, where one represents very low consumption and five represents very high consumption.

As can be seen in the above graph, most of the students’ ages are between 15 and 18, and there looks to be some medium-to-heavy weekend alcohol consumption among all age groups. We will explore this alcohol consumption further through our three research questions:

  1. How does frequency of alcohol consumption relate to student performance?
  2. Do features of the mother’s side or father’s side of the family affect a student’s alcohol consumption more?
  3. How does overall well-being relate to alcohol consumption?

Research Question #1:

How does frequency of alcohol consumption relate to student performance?

Alcohol use during adolescence may be a catalyst for academic underachievement, based on several cross-sectional (El Ansari and Stock 2010) and longitudinal studies (Meda et al. 2017). Heavy drinking in particular among adolescents and young adults has been linked to lower grades and increased absences (Latvala et al. 2014).

However, the nature of the underlying relationship between academic achievement and alcohol use is still unclear. It has been proposed that common risk factors such as home instability, mental health, and poor familial support could predispose adolescents to both alcohol usage and academic underachievement (Wang and Fredricks 2014). In this section we will specifically look at students’ final grades (a numeric value on a 0 to 20 scale) and whether or not they passed the Portuguese language class.

From the side-by-side boxplots of final grade by weekend and weekday alcohol consumption, we can see a small but monotonic decrease in the median final grade associated with higher alcohol consumption. For weekend consumption, however, only the higher levels appear to be associated with a lower median final grade: the grade point median decreases from 12 to 11 at the moderately high (4) and very high (5) consumption levels.

On the other hand, the boxplots for weekday alcohol consumption display an association across a wider range of alcohol consumption levels. We observe a decrease in the grade point median from 12 to 11 for moderately low (2) through moderately high (4) consumption levels, and a decrease to a grade point median of 10 for very high (5) consumption levels.

Next we look at students’ alcohol consumption by their course outcome.

Using the threshold defined in Cortez and Silva (2008), we created a binary course outcome variable indicating a “passing” status in the course for students with final grades greater than or equal to 10, and a “failing” status otherwise. The combined violin and box plots show the conditional distribution of alcohol consumption given the course outcome. The distribution of weekend alcohol consumption has a larger spread than that of weekday alcohol consumption, as fewer students consume higher levels of alcohol during the week. Nonetheless, the distributions of both weekend and weekday alcohol consumption appear to be more heavily weighted towards the extremes for students who failed their course than for students who passed. This disparity is especially pronounced for weekend alcohol consumption.
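The pass/fail coding described above can be sketched as follows. This is a minimal illustration rather than the code used in the analysis; the function name and the example grades are hypothetical, and only the cutoff of 10 comes from the text.

```python
PASS_THRESHOLD = 10  # passing cutoff on the final grade, as stated in the text


def course_outcome(final_grade: int) -> str:
    """Label a final grade as 'Passed' (>= 10) or 'Failed' (< 10)."""
    return "Passed" if final_grade >= PASS_THRESHOLD else "Failed"


# Hypothetical final grades, for illustration only (not taken from the data set).
example_grades = [0, 9, 10, 11, 20]
outcomes = [course_outcome(g) for g in example_grades]
print(outcomes)
```

Note that a grade of exactly 10 counts as passing under this rule.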

To test the hypothesis that weekday and weekend alcohol consumption is lower for students who achieved a passing grade, we perform one-tailed Welch two-sample t-tests for the true difference in means at the 5% significance level.

Welch Two Sample t-test: Weekend Alcohol Consumption by Course Outcome

Test statistic   df      P value       Alternative hypothesis   Mean in group Failed   Mean in group Passed
2.824            131.8   0.002739 **   greater                  2.63                   2.217

Welch Two Sample t-test: Weekday Alcohol Consumption by Course Outcome

Test statistic   df      P value       Alternative hypothesis   Mean in group Failed   Mean in group Passed
2.595            119.9   0.005326 **   greater                  1.77                   1.454

Given our small p-values, we reject, for both weekend (\(p=\) 0.0027) and weekday (\(p=\) 0.0053) alcohol consumption, the null hypothesis that mean consumption is no higher among failing students than among passing students. The data therefore support the alternative that failing students consume more alcohol on average.
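For readers who want to see how the Welch test statistic and degrees of freedom are formed, here is a minimal sketch using only the Python standard library. The two samples are hypothetical consumption levels, not the study data, and the helper name `welch_t` is an illustrative choice.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's two-sample t-test of mean(x) - mean(y)."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical consumption levels (1-5), NOT the study data.
failed = [1, 2, 3, 4, 5, 3]
passed = [1, 1, 2, 2, 3, 1, 2]

t_stat, df = welch_t(failed, passed)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

With the alternative hypothesis "greater", the one-sided p-value is the upper-tail probability of a t distribution with `df` degrees of freedom at `t_stat`. The standard library has no t distribution, so in practice one would likely use a statistics package for that step (for example, `scipy.stats.ttest_ind` with `equal_var=False` and `alternative="greater"`).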

Research Question #2:

Do features of the mother’s side or father’s side of the family affect a student’s alcohol consumption more?

Our second area of interest is how different qualities and characteristics of a student’s mother and father affect the student’s alcohol consumption. One interesting characteristic to analyze is each parent’s job. How do the jobs of a student’s parents relate to the student’s level of drinking, and specifically, is there a difference depending on which parent has a given job?

Above is a graph showing the mean level of weekend drinking by students, given the job category of their mother and father. There are some interesting observations one can make. For example, if a student’s father works at home and their mother works in health, their drinking level is on average very low. However, if the reverse is true, so that their mother works at home and their father works in health, the average drinking level is much higher. Another interesting observation is that, looking at the Father teacher column and the Mother teacher row, the average drinking level given Father teacher appears to be lower than the average given Mother teacher. There are a few such jobs where there appears to be a difference in drinking levels depending on whether it is the mother or the father with the job, so we will explore that difference below.
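Each cell in such a graph is a group mean: the average weekend drinking level over all students who share a given pair of parental job categories. A minimal sketch of that computation is shown here, with hypothetical records and job labels chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (mother's job, father's job, weekend drinking level 1-5).
records = [
    ("health", "at_home", 1),
    ("health", "at_home", 2),
    ("at_home", "health", 3),
    ("at_home", "health", 4),
    ("teacher", "teacher", 2),
]

# Accumulate the sum and count of drinking levels per (mother, father) job pair.
totals = defaultdict(lambda: [0, 0])
for mother_job, father_job, level in records:
    totals[(mother_job, father_job)][0] += level
    totals[(mother_job, father_job)][1] += 1

# Mean drinking level for each job pair that appears in the records.
cell_means = {pair: s / n for pair, (s, n) in totals.items()}
for pair, m in cell_means.items():
    print(pair, m)
```

Cells built from only a handful of students are noisy, which is worth keeping in mind when reading individual cells of the graph.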

The above graph compares the weekend drinking level for a specific job, split by whether it is the student’s mother or father who has that job. First, we observe that for each job, regardless of which parent holds it, higher drinking levels correspond to fewer students. There also isn’t much data for parents who work in health, so it is hard to draw strong conclusions for that group. More interestingly, for parents who are teachers or who work at home, the reduction in the count of students as drinking level increases seems to differ by parent. Specifically, if it is the father with that job, the reduction seems to be larger. The hypothesis, then, is that the mean drinking level of a student is lower if their father has one of those two jobs than if their mother does.

To properly test that hypothesis, we perform two Welch t-tests below:

## 
##  Welch Two Sample t-test
## 
## data:  father.teacher$Walc and mother.teacher$Walc
## t = -1.2131, df = 77.016, p-value = 0.2288
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6848122  0.1662936
## sample estimates:
## mean of x mean of y 
##  2.000000  2.259259
## 
##  Welch Two Sample t-test
## 
## data:  father.health$Walc and mother.health$Walc
## t = -2.1046, df = 83.776, p-value = 0.03832
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.99947014 -0.02830763
## sample estimates:
## mean of x mean of y 
##  1.833333  2.347222

Since we are making two comparisons, we correct the p-values by multiplying each by 2 (a Bonferroni correction). For the health test, this gives a corrected p-value of 0.4576, so we fail to reject the null hypothesis that the average student’s drinking level is the same whether it is their father or their mother who works in health.

For the teacher test, we get a corrected p-value of 0.07664, which is close to, but above, the conventional 0.05 threshold. The difference is therefore not statistically significant at the 5% level, although our data weakly suggest that students whose father is a teacher drink less on average than students whose mother is a teacher.
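The correction applied to the two unadjusted p-values in the output above can be sketched as follows. The helper name `bonferroni` is illustrative, and the cap at 1 is the usual convention rather than something stated in the text.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Unadjusted p-values from the two Welch t-test outputs.
raw = [0.2288, 0.03832]
adjusted = bonferroni(raw)
print([round(p, 5) for p in adjusted])
```

Comparing each adjusted p-value with 0.05 is equivalent to comparing each unadjusted p-value with 0.05 / 2 = 0.025.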

Research Question #3:

How does overall well-being relate to alcohol consumption?

For our third research question, we are interested in exploring the relationships between different measures of well-being, as well as their correlation with weekend alcohol consumption. The variables we aim to focus on are current health status, family relationships, romantic relationship, participation in activities, amount of free time, and frequency of time spent with friends.

Since the correlations of activities, romantic relationship, and family relationships with weekend alcohol consumption are low, we examine below a pairs plot of a subset of the well-being quantities:

The plot above shows a highly significant correlation of 0.389 between frequency of outings and weekend alcohol consumption, as well as notable correlations of both health and free time with alcohol consumption. It is also important to note the relatively high correlation of 0.346 between frequency of outings and amount of free time, which makes intuitive sense for high school students.
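As a reminder of what these coefficients measure, here is a minimal sketch of the Pearson correlation computed by hand. It assumes the ordinal 1 to 5 levels are treated as numeric scores; the variable names and the data are hypothetical, not taken from the data set.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical 1-5 scores for six students (illustration only).
going_out = [1, 2, 3, 4, 5, 3]
weekend_alcohol = [1, 1, 3, 3, 5, 2]

r = pearson_r(going_out, weekend_alcohol)
print(f"r = {r:.3f}")
```

Because the variables are ordinal, a rank-based measure such as Spearman's correlation would be a reasonable alternative.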

Building upon this association, we want to explore the linear relationship between well-being and weekend alcohol consumption.

Regressing weekend alcohol consumption on the well-being quantities, we find three significant predictors: family relationship quality, current health status, and frequency of going out with friends. These predictors have regression coefficients of -0.19, 0.12, and 0.45, respectively, indicating that predicted consumption decreases with higher quality of family relationships, while it increases with better health and higher frequency of outings, holding the other predictors fixed.
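To make the interpretation of these coefficients concrete, the sketch below computes the difference in predicted weekend consumption between two students who differ only in the listed predictors. The dictionary keys and the helper name are hypothetical labels; only the three coefficient values come from the text, and the intercept and remaining coefficients are not needed for a difference.

```python
# Coefficients reported in the text; the key names are illustrative labels.
coefficients = {
    "family_relationship": -0.19,
    "health": 0.12,
    "going_out": 0.45,
}


def predicted_change(deltas):
    """Change in predicted consumption for the given changes in predictors,
    holding every other predictor fixed."""
    return sum(coefficients[name] * delta for name, delta in deltas.items())


# Example: going out two levels more often, with family relationships one level better.
change = predicted_change({"going_out": 2, "family_relationship": 1})
print(f"{change:+.2f}")
```

Since the response is itself an ordinal 1 to 5 level, these changes are best read as shifts in an average score rather than as exact consumption amounts.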

While the important variables determined from multiple regression indicate some linearity, we proceed with principal component analysis to determine how combinations of well-being variables describe the variation of these students’ data.

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6
## Standard deviation     1.2205 1.0457 1.0027 0.9548 0.9309 0.7957
## Proportion of Variance 0.2483 0.1822 0.1676 0.1520 0.1444 0.1055
## Cumulative Proportion  0.2483 0.4305 0.5981 0.7501 0.8945 1.0000
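The proportions in the summary above follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. A minimal sketch that reproduces the rows up to rounding, with variable names chosen for illustration:

```python
from itertools import accumulate

# Standard deviations of PC1..PC6, copied from the summary output.
sdev = [1.2205, 1.0457, 1.0027, 0.9548, 0.9309, 0.7957]

variances = [s ** 2 for s in sdev]
total = sum(variances)  # close to 6 for six standardized variables

proportion = [v / total for v in variances]
cumulative = list(accumulate(proportion))

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion = {p:.4f}, cumulative = {c:.4f}")
```

Small discrepancies in the last decimal place are expected, because the standard deviations in the output are themselves rounded.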

Shown below is a scree plot for the principal components of the six variables related to well-being:

As indicated by the dotted red line in this plot, the first three dimensions explain almost 60% of the variation of these data. Since the first two components together represent 43%, we want to visualize their linear relationships even further, as displayed in the following biplot.

Note that the arrows associated with amount of free time and frequency of outings are long and point to the left, indicating that as either of these two quantities increases, the first principal component tends to decrease substantially. On the other hand, as the variable representing romantic relationships increases, the second principal component shows a noticeable decrease in value. It also appears that quality of family relationships and health status have similar trends with respect to the first two principal components, while overall health and participation in activities share almost no correlation with each other.

The six measures of well-being which we explored share relationships of varying strength; in particular, frequency of outings and amount of free time are relatively highly correlated with one another, and the quality of family relationships follows similar trends to health status. With regard to weekend alcohol consumption, we conclude that it decreases with higher quality of family relationships, while it increases with better health status and higher frequency of outings.

Conclusion

Overall, we found a number of different results. When examining the relationship between alcohol consumption and student performance, we found that students who consume more alcohol during the weekdays have slightly lower median grades than students who consume little to no alcohol during the weekdays. Weekend alcohol consumption shows a weaker association with final grades, visible only at the highest consumption levels, although mean weekend and weekday consumption were both significantly higher among students who failed the course. In terms of the relationship between alcohol consumption and the different parental sides of a student’s family, our data suggest that students whose father is a teacher drink less than students whose mother is a teacher, though this difference did not reach significance at the 5% level after correction. We found no significant differences for the other professions examined. Finally, when studying the relationship between overall well-being of students and alcohol consumption, we found that alcohol consumption decreases with higher quality of family relationships, while it increases with better health status and higher frequency of outings.

There are a number of questions related to this data that could be studied in the future. One is how family income relates to the variables in the study, and whether it has a relationship with alcohol consumption. We were not able to explore this question as there was no data on family income. Studying this could further explain some of the results in our study, because family income is a potential confounding variable that may be correlated with both our predictor variables and alcohol consumption. Future work could also use more complex prediction models that we have not yet learned, such as decision trees, to try to predict students’ alcohol consumption levels based on various categorical variables in the data set. This would be interesting because schools might be able to use such results to identify students who need more support.

References

Cortez, Paulo, and Alice Maria Gonçalves Silva. 2008. “Using Data Mining to Predict Secondary School Student Performance.”
El Ansari, Walid, and Christiane Stock. 2010. “Is the Health and Wellbeing of University Students Associated with Their Academic Performance? Cross Sectional Findings from the United Kingdom.” International Journal of Environmental Research and Public Health 7 (2): 509–27.
Latvala, Antti, Richard J Rose, Lea Pulkkinen, Danielle M Dick, Tellervo Korhonen, and Jaakko Kaprio. 2014. “Drinking, Smoking, and Educational Achievement: Cross-Lagged Associations from Adolescence to Adulthood.” Drug and Alcohol Dependence 137: 106–13.
Meda, Shashwath A, Ralitza V Gueorguieva, Brian Pittman, Rivkah R Rosen, Farah Aslanzadeh, Howard Tennen, Samantha Leen, et al. 2017. “Longitudinal Influence of Alcohol and Marijuana Use on Academic Performance in College Students.” PLoS One 12 (3): e0172213.
Wang, Ming-Te, and Jennifer A Fredricks. 2014. “The Reciprocal Links Between School Engagement, Youth Problem Behaviors, and School Dropout During Adolescence.” Child Development 85 (2): 722–37.