Introduction

We are examining the factors related to alcohol consumption for high school students in Portugal. We will be using the Student Alcohol Consumption dataset from https://www.kaggle.com/uciml/student-alcohol-consumption. It has 395 rows and 33 columns. Each observation corresponds to one secondary school student from the schools Gabriel Pereira or Mousinho da Silveira. Our response variables are weekday alcohol consumption (Dalc) and weekend alcohol consumption (Walc).

Variables
school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less than or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
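traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)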
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - in a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
These grades are related to the course subject, Math or Portuguese:
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)

We want to look into three different research questions:
1. What are the main variables associated with alcohol consumption?
2. What is the distribution of alcohol consumption conditioned on different romantic relationship status and gender?
3. What is the relationship between absences and alcohol consumption, conditioned on quality of family relationships?

Research Question 1: What are the main variables associated with alcohol consumption?

1. Variables associated with Weekend Alcohol Consumption

To answer this question, we first fit a regression model to identify the most statistically significant variables. Not surprisingly, weekday alcohol consumption (Dalc) is highly correlated with weekend alcohol consumption (Walc). Next, we used boxplots to depict the relationship between these covariates and alcohol consumption. Starting with Walc:

We can see that Walc is higher for male students and for students whose father’s job falls in the ‘other’ or ‘services’ categories. It is also negatively associated with quality of family relationships (famrel) and positively associated with going out with friends (goout) and with absences.
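The regression screening step described above can be sketched as follows. This is a minimal illustration, not the model we actually fit: it assumes an ordinary least squares fit with only two covariates, the function names `solve` and `ols_t_statistics` are hypothetical, and the eight toy rows are made up rather than taken from the dataset. A covariate with a large absolute t statistic would be a candidate for a follow-up boxplot.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_t_statistics(rows, y):
    """Fit y ~ intercept + covariates and return (coefficients, t statistics)."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n, p = len(X), len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    beta = solve(xtx, xty)
    residuals = [y[k] - sum(b * x for b, x in zip(beta, X[k])) for k in range(n)]
    sigma2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    t_stats = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        inv_col = solve(xtx, unit)  # column j of (X'X)^-1
        t_stats.append(beta[j] / math.sqrt(sigma2 * inv_col[j]))
    return beta, t_stats

# Made-up toy values (NOT from the dataset): Dalc level and a 0/1 nursery flag.
dalc = [1, 2, 3, 4, 1, 2, 3, 4]
nursery = [0, 0, 0, 0, 1, 1, 1, 1]
walc = [2, 2, 3, 5, 1, 3, 4, 4]

beta, t_stats = ols_t_statistics(list(zip(dalc, nursery)), walc)
for name, b, t in zip(["intercept", "Dalc", "nursery"], beta, t_stats):
    print(f"{name:10s} coef={b:6.3f} t={t:6.3f}")
```

In this toy example Dalc has a large t statistic while the nursery flag has one near zero, which is the kind of contrast used to decide which covariates to plot.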

2. Variables associated with Weekday Alcohol Consumption

Next, moving on to Dalc:

Dalc is associated with mother’s education (Medu) and the reason for choosing the school (reason), and it is positively correlated with free time after school (freetime).
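The numbers behind a grouped boxplot like the ones above can be computed directly. The sketch below is illustrative only: `boxplot_summary` is a hypothetical helper, the records are made-up toy values rather than rows from the dataset, and the quartiles use the "inclusive" convention of Python's `statistics.quantiles`, which may differ slightly from the convention a plotting library uses.

```python
import statistics
from collections import defaultdict

def boxplot_summary(records, group_key, value_key):
    """Five-number summary of value_key within each level of group_key."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    summary = {}
    for level, values in sorted(groups.items()):
        # Each group needs at least two observations for quartiles.
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[level] = {
            "min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values),
        }
    return summary

# Made-up toy records (NOT from the dataset): free time level and Dalc level.
records = [
    {"freetime": 2, "Dalc": 1}, {"freetime": 2, "Dalc": 1},
    {"freetime": 2, "Dalc": 1}, {"freetime": 2, "Dalc": 2},
    {"freetime": 4, "Dalc": 1}, {"freetime": 4, "Dalc": 2},
    {"freetime": 4, "Dalc": 3}, {"freetime": 4, "Dalc": 4},
]

summary = boxplot_summary(records, "freetime", "Dalc")
for level, stats in summary.items():
    print(level, stats)
```

Comparing medians and quartiles across levels gives a quick numeric check on what the boxplots show visually.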

3. Dendrogram of Dissimilarity Matrix

We wanted to see whether our variables could cluster the students into the various alcohol consumption groups, so we created two dendrograms from the same clustering, one colored by weekend alcohol consumption and the other colored by weekday alcohol consumption. We first computed a dissimilarity matrix so that our categorical variables could be included in the clustering. Complete linkage tended to give the best results, so those are the results shown here.

The labels are associated with levels of alcohol consumption from 1 (low) to 5 (high). The dendrogram seems able to separate the students who drink more from those who drink less. For weekday consumption, there are pockets of students who drink more (purple, green, blue), while for weekend consumption there are clearer clusters of lighter drinkers (pink and red) versus heavier drinkers.
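A dissimilarity matrix for mixed categorical and numeric variables, followed by complete-linkage clustering, can be sketched as below. This is an assumption-laden illustration: the report does not name the dissimilarity measure, so the sketch uses a Gower-style average (range-scaled absolute difference for numeric variables, simple mismatch for categorical ones). The helper names and the four toy students are hypothetical and not taken from the dataset.

```python
def gower_dissimilarity(a, b, numeric_ranges):
    """Average per-variable dissimilarity between two records (dicts).

    Numeric variables contribute |a - b| / range; categorical variables
    contribute 0 when equal and 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

def complete_linkage(dist, n_clusters):
    """Agglomerative clustering; cluster distance is the LARGEST pairwise distance."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Made-up toy students (NOT from the dataset).
students = [
    {"sex": "F", "famrel": 5, "goout": 1},
    {"sex": "F", "famrel": 4, "goout": 2},
    {"sex": "M", "famrel": 2, "goout": 5},
    {"sex": "M", "famrel": 1, "goout": 4},
]
ranges = {"famrel": 4, "goout": 4}  # both variables run from 1 to 5

dist = [[gower_dissimilarity(a, b, ranges) for b in students] for a in students]
clusters = complete_linkage(dist, n_clusters=2)
print(clusters)
```

Stopping at a fixed number of clusters corresponds to cutting the dendrogram at one height; the full dendrogram records every merge along the way.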

4. Dendrogram of Quantitative Variables

We then repeated the analysis using only our quantitative variables (age, travel time, study time, failures, and absences), scaling them and clustering on the scaled values rather than on the dissimilarity matrix so that we could compare the results. We again used complete linkage and colored by alcohol consumption level.

These dendrograms using only the quantitative variables lead to similar conclusions to the other dendrograms, but they distinguish the levels of weekend alcohol consumption less clearly. This suggests that the categorical variables give us some additional insight into student alcohol consumption levels.
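Scaling matters here because absences can run into the dozens while study time only runs from 1 to 4, so unscaled Euclidean distances would be dominated by absences. The sketch below shows one common choice, z-scores with the sample standard deviation, followed by a Euclidean distance matrix. Whether this matches the exact scaling used for the dendrograms above is an assumption, the helper names are hypothetical, and the toy rows are made up.

```python
import math
import statistics

def zscore_columns(rows):
    """Center each column to mean 0 and scale it to sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Made-up toy rows (NOT from the dataset): (age, studytime, absences).
rows = [
    (15, 1, 0),
    (16, 2, 4),
    (17, 3, 30),
    (18, 4, 2),
]

scaled = zscore_columns(rows)
dist = [[euclidean(u, v) for v in scaled] for u in scaled]
for line in dist:
    print([round(d, 2) for d in line])
```

The resulting distance matrix can be passed to any hierarchical clustering routine in place of the mixed-variable dissimilarity matrix.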

Research Question 2: What is the distribution of alcohol consumption conditioned on different romantic relationship status and gender?

5. Distribution of Alcohol Consumption by Relationship Status for Weekday and Weekend

We wanted to learn about the distribution of alcohol consumption conditioned on different romantic relationship status and gender, which suggests we should examine alcohol consumption of single students versus students in a relationship. We created barplots comparing both weekday and weekend alcohol consumption levels of these two groups.

Based on the above graph, students in a relationship and students not in a relationship appear to drink very similar amounts on both weekdays and weekends. Additionally, students of either relationship status seem to drink more on weekends, with more students falling into the higher consumption categories on the weekend than on weekdays. Relationship status therefore does not appear to be an important factor in alcohol consumption.
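Because the two relationship groups can differ in size, comparing them is easier with within-group proportions than with raw counts. The sketch below computes the proportion of students at each consumption level conditional on relationship status. The function name `conditional_proportions` is hypothetical and the records are made-up toy values, not rows from the dataset.

```python
from collections import Counter, defaultdict

def conditional_proportions(records, group_key, level_key, levels=range(1, 6)):
    """Proportion of records at each level, computed within each group."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[group_key]][r[level_key]] += 1
    result = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        # Levels with no observations get a proportion of 0.
        result[group] = {lvl: counter[lvl] / total for lvl in levels}
    return result

# Made-up toy records (NOT from the dataset).
records = [
    {"romantic": "yes", "Walc": 1}, {"romantic": "yes", "Walc": 2},
    {"romantic": "yes", "Walc": 2}, {"romantic": "yes", "Walc": 5},
    {"romantic": "no", "Walc": 1}, {"romantic": "no", "Walc": 1},
    {"romantic": "no", "Walc": 3}, {"romantic": "no", "Walc": 4},
]

props = conditional_proportions(records, "romantic", "Walc")
for group, dist in props.items():
    print(group, dist)
```

Running the same function with `level_key="Dalc"` would give the weekday counterpart for a side-by-side comparison.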

6. Distribution of Alcohol Consumption by Gender and Relationship Status

Since this dataset covers secondary school students, dating could be linked to alcohol consumption, as both may be affected by similar things such as student experimentation and parental strictness. We also expected relationships and drinking behavior to differ by gender, so we wanted to examine that as well. We created a series of stacked barplots, one for each combination of relationship status and time of week (weekend vs. weekday), colored by gender.

Far fewer females than males fall into the higher alcohol consumption categories, whereas males are spread more evenly across the consumption levels, which may reflect physical differences between males and females of the same age. In general, students who are single consume similar amounts of alcohol to those who are in a relationship, for both genders. This indicates that gender may be an important factor in alcohol consumption, but relationship status is not necessarily one.
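One compact way to summarize the stacked barplots is the share of students at a high consumption level within each combination of relationship status and gender. The sketch below is illustrative only: the cutoff of level 4 or above for "high" is our assumption, the function name `high_consumption_share` is hypothetical, and the records are made-up toy values rather than rows from the dataset.

```python
from collections import defaultdict

def high_consumption_share(records, level_key, threshold=4):
    """Share of records with level >= threshold in each (romantic, sex) cell."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for r in records:
        cell = (r["romantic"], r["sex"])
        totals[cell] += 1
        if r[level_key] >= threshold:
            high[cell] += 1
    return {cell: high[cell] / totals[cell] for cell in totals}

# Made-up toy records (NOT from the dataset).
records = [
    {"romantic": "no", "sex": "F", "Walc": 1},
    {"romantic": "no", "sex": "F", "Walc": 2},
    {"romantic": "no", "sex": "M", "Walc": 4},
    {"romantic": "no", "sex": "M", "Walc": 2},
    {"romantic": "yes", "sex": "F", "Walc": 1},
    {"romantic": "yes", "sex": "F", "Walc": 3},
    {"romantic": "yes", "sex": "M", "Walc": 5},
    {"romantic": "yes", "sex": "M", "Walc": 1},
]

shares = high_consumption_share(records, "Walc")
for cell, share in sorted(shares.items()):
    print(cell, share)
```

If the shares differ mainly along the gender dimension and hardly along the relationship dimension, that matches the pattern described above.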

Research Question 3: What is the relationship between absences and alcohol consumption, conditioned on quality of family relationships?

We also wanted to take a deeper look at the relationship between alcohol consumption and absences, conditioned on quality of family relationships. To do so, we created several faceted scatterplots.

For weekday alcohol consumption, most observations fall at the lower end of the 1-5 scale, whereas weekend alcohol consumption is spread more evenly across levels 1-5. We do not see any meaningful relationship between absences and alcohol consumption, even when we condition on quality of family relationships. Other variables may give a more discernible relationship.
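A numeric companion to the faceted scatterplots is the correlation between absences and alcohol consumption computed separately within each family relationship level. The sketch below uses the Pearson correlation as a simple stand-in; since consumption is recorded on an ordinal 1-5 scale, a rank correlation might be a better choice. The helper names are hypothetical, the minimum of three observations per facet is an arbitrary assumption, and the records are made-up toy values.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation, or None when either variable has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def correlation_by_facet(records, facet_key, x_key, y_key, min_size=3):
    """Correlation of x and y within each facet; None for facets that are too small."""
    groups = defaultdict(list)
    for r in records:
        groups[r[facet_key]].append((r[x_key], r[y_key]))
    result = {}
    for facet, pairs in sorted(groups.items()):
        if len(pairs) < min_size:
            result[facet] = None
            continue
        xs, ys = zip(*pairs)
        result[facet] = pearson(xs, ys)
    return result

# Made-up toy records (NOT from the dataset): (famrel, absences, Walc).
toy = [
    (1, 0, 2), (1, 6, 3),
    (3, 0, 2), (3, 2, 1), (3, 4, 1), (3, 6, 2),
    (5, 0, 1), (5, 2, 2), (5, 4, 3),
]
records = [{"famrel": f, "absences": a, "Walc": w} for f, a, w in toy]

by_facet = correlation_by_facet(records, "famrel", "absences", "Walc")
for facet, r in by_facet.items():
    print(facet, r)
```

Correlations near zero in every facet would be consistent with the lack of a visible pattern in the scatterplots.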

Conclusion

From our dendrograms and boxplots, it seems that both our quantitative and categorical variables give insight into student alcohol consumption. Based on our analysis of the individual variables we thought would be related to alcohol consumption, consumption does vary by gender, but not noticeably by romantic relationship status, and we found no meaningful relationship with absences even after conditioning on quality of family relationships. In other words, apart from gender, the variables we initially expected to matter showed little relationship with alcohol consumption. We would like to continue exploring other variables in this dataset that may be associated with alcohol consumption among these students in Portugal.