Introduction

For many students, the choice of major is, after the choice of where to attend college, the most important decision of their adolescent lives. Not only does it shape the next four years, it also strongly influences what the next 40 years of a person’s working life can entail. From the expected earnings for a given major to the marginal benefit of a graduate degree over an undergraduate degree, we explore these questions and more to determine whether such relationships exist, how strong they are, and, of course, whether we all made the right choice. With the working world changing faster than ever, in part thanks to the role of data science in the digital revolution, we hope to use the skills acquired here to shed some light on what it takes to make it in our modern world.

Dataset Description

The dataset we work with comes from FiveThirtyEight’s article on a similar subject, “The Economic Guide To Picking A College Major,” which compiled data mainly from the US Census Bureau for 2010-2012. It consists of three main data files: all-ages.csv, recent-grads.csv, and grad-students.csv. These files have 174, 175, and 174 rows respectively, with the following variables.

  • Rank: Rank of the major by median earnings
  • Major_code: The code according to FOD1P
  • Major: Description of the major
  • Major_category: Category the major falls in (STEM, humanities, etc.)
  • Total: Total number of people with the major, which helps us gauge popularity
  • Sample_size: The sample size of full-time, year-round workers used for earnings
  • Men: Male graduates
  • Women: Female graduates
  • ShareWomen: The proportion of women in the major
  • Employed: Number employed
  • Full_time: Those employed 35 hours or more a week
  • Part_time: Those employed fewer than 35 hours a week
  • Full_time_year_round: Those employed full time for at least 50 weeks
  • Unemployed: Number unemployed
  • Unemployment_rate: Unemployed/(Unemployed + Employed) for a major
  • Median: Median earnings of full-time, year-round workers
  • P25th: 25th percentile of earnings
  • P75th: 75th percentile of earnings
  • College_jobs: Number of people with a job requiring a college degree
  • Non_college_jobs: Number of people with a job not requiring a college degree
  • Low_wage_jobs: Number in low-wage service jobs

With these we will be examining the following three questions:

  • Which major categories are similar to each other?

Here we want to determine which majors are similar to each other beyond the standard categorizations of STEM, humanities, etc., and in what sense they are similar. We do so with principal component analysis (PCA).

  • How are the median and the IQR of the wage distributed, and how do they differ across different majors?

Here we want to gain more insight into the distribution of earnings by major, which lets us see whether some typically lower-paying majors still offer opportunities for higher pay. Our main variables of interest are P25th, Median, and P75th.

  • How important is a college degree for each major category?

We explore this question by examining the unemployment rates of recent graduates and, among those who are employed, how many hold jobs that require a college degree at all. Our variables of interest are College_jobs, Non_college_jobs, and Employed.

Question 1: Which major categories are similar to each other? (PCA)

To begin with, we want to learn about the general similarities among majors, which suggests examining the quantitative variables in the dataset using Principal Component Analysis (PCA). Before running the analysis, we first check whether we want all the quantitative variables and, if not, decide which ones to use. For example, given the variable Total, the variables Employed, Unemployed, and Unemployment_rate carry largely the same information, so we keep only one of them. The variables we use are Total, ShareWomen, Unemployment_rate, Median, Part_time_rate (= Part_time/Employed), Non_college_jobs_rate (= Non_college_jobs/Employed), and Low_wage_jobs_rate (= Low_wage_jobs/Employed). After computing the principal components, we use PC1, which explains 41.5% of the variance, and PC2, which explains 15.9%, to make a scatter plot, as shown below. To make the plot more legible, we made another scatter plot showing only majors in major categories that include more than 10 majors.
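The variable construction and decomposition described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for this report: the five rows are made-up numbers, `derive_features`, `standardize`, and `first_component` are hypothetical helper names, and power iteration stands in for a library PCA routine so that the example needs only the Python standard library.

```python
import math

# Made-up rows for illustration only; column names follow recent-grads.csv.
rows = [
    {"Total": 2300, "ShareWomen": 0.12, "Unemployment_rate": 0.020, "Median": 110000,
     "Employed": 2000, "Part_time": 270, "Non_college_jobs": 360, "Low_wage_jobs": 190},
    {"Total": 90000, "ShareWomen": 0.23, "Unemployment_rate": 0.060, "Median": 60000,
     "Employed": 76000, "Part_time": 13000, "Non_college_jobs": 12000, "Low_wage_jobs": 3000},
    {"Total": 393000, "ShareWomen": 0.77, "Unemployment_rate": 0.080, "Median": 31500,
     "Employed": 307000, "Part_time": 115000, "Non_college_jobs": 142000, "Low_wage_jobs": 48000},
    {"Total": 209000, "ShareWomen": 0.90, "Unemployment_rate": 0.045, "Median": 48000,
     "Employed": 180000, "Part_time": 40000, "Non_college_jobs": 21000, "Low_wage_jobs": 5600},
    {"Total": 140000, "ShareWomen": 0.73, "Unemployment_rate": 0.095, "Median": 31000,
     "Employed": 105000, "Part_time": 32000, "Non_college_jobs": 53000, "Low_wage_jobs": 16000},
]

def derive_features(r):
    """Return the seven PCA inputs, converting job counts to rates per employed graduate."""
    employed = r["Employed"]
    return [
        r["Total"], r["ShareWomen"], r["Unemployment_rate"], r["Median"],
        r["Part_time"] / employed,
        r["Non_college_jobs"] / employed,
        r["Low_wage_jobs"] / employed,
    ]

def standardize(X):
    """Center each column and scale it to unit sample standard deviation."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]

def first_component(Z, iters=500):
    """Approximate PC1 and its share of variance by power iteration on the correlation matrix."""
    n, p = len(Z), len(Z[0])
    corr = [[sum(z[i] * z[j] for z in Z) / (n - 1) for j in range(p)] for i in range(p)]
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * corr[i][j] * v[j] for i in range(p) for j in range(p))
    return v, eigenvalue / p  # the eigenvalues of a correlation matrix sum to p

Z = standardize([derive_features(r) for r in rows])
pc1, share = first_component(Z)
print([round(x, 2) for x in pc1], round(share, 3))
```

The variables are standardized first because they live on very different scales (Total and Median are in the thousands, the rates are between 0 and 1); without scaling, the largest-valued columns would dominate the components. In practice a library routine, such as R's prcomp with scaling enabled, returns all components at once.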

From the scatter plot, we can see that Engineering majors are on the left and Humanities majors are on the right, with other major categories such as Business and Education in between. This result fits our intuition about major categories. We now know which major categories are similar to or different from each other, but how do we interpret PC1 and PC2? To gain insight, we make a biplot with arrows showing how the variables relate to the PCs.

From the biplot, we find that the median and the proportion of women are negatively correlated: majors with low median earnings (Humanities, etc.) tend to have higher proportions of women than majors with high median earnings (Engineering, etc.). We also find that the unemployment rate appears to be uncorrelated with the median. In other words, majoring in a subject with high median earnings does not appear to make it easier to find employment.
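A direct way to check an association read off a biplot is to compute the Pearson correlation between the two variables. The sketch below does this for Median and ShareWomen; the values are made up for illustration (they are not taken from the dataset), and `pearson` is a hypothetical helper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values: one entry per major.
median = [110000, 60000, 31500, 48000, 31000]
share_women = [0.12, 0.23, 0.77, 0.90, 0.73]

r = pearson(median, share_women)
print(round(r, 2))  # a negative value indicates the two move in opposite directions
```

In a biplot, arrows pointing in roughly opposite directions correspond to a negative correlation like this one, while arrows at right angles suggest little linear association.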

Question 2: How are the median and the IQR of the wage distributed, and how do they differ across different majors?

In this part, we explore the center and spread of the wages of graduates across major types. We use the median wage to represent the center of the distribution and the IQR to represent its spread. These statistics give a general idea of what a graduate’s wage will look like, and being aware of them can help students choose their majors. We first examine the joint distribution of the median and IQR of wages using a heat map.

From the heat map above, we can see that for most majors the median wage is between $15000 and $75000 and the IQR is between $10000 and $60000, so both vary considerably across majors. The densest region is centered at a median of about $50000 and an IQR of about $35000, which describes the center and spread of wages for graduates of most majors. With this information, a student entering college without a declared major can estimate that their wage after graduation is likely to be between $15000 and $85000, provided they find a job. We also notice that the distribution slopes upward, indicating that majors with higher median wages tend to have a higher IQR. One plausible explanation is that majors with higher median wages usually require more advanced skills and lead to more varied jobs, resulting in a wider spread of wages.
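The IQR is not a column in the dataset. Assuming it is derived from the quartile columns, it is simply P75th minus P25th for each major. A minimal sketch with two made-up rows (the major names and values are placeholders, not real entries):

```python
# Made-up rows; the column names follow the variable list in the dataset description.
majors = [
    {"Major": "MAJOR A", "P25th": 50000, "Median": 65000, "P75th": 90000},
    {"Major": "MAJOR B", "P25th": 25000, "Median": 32000, "P75th": 40000},
]

for m in majors:
    # Interquartile range: width of the middle 50% of earnings for this major.
    m["IQR"] = m["P75th"] - m["P25th"]

print([(m["Major"], m["Median"], m["IQR"]) for m in majors])
```

Each major then contributes one (Median, IQR) point, and the heat map shows how densely those points fall in each region.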

We then examine whether the median and IQR of wages differ across major types and, if so, how. For this part we group the majors into 12 categories, which are shown in the legend of the graphs below.

We first run a Chi-squared test on the means and standard deviations of the median-wage distributions of the different major types to check whether the distribution of median wage differs across them. Below is the output of this test:

## 
##  Chi-squared test for given probabilities
## 
## data:  Median_marginal$mean
## X-squared = 25251, df = 15, p-value < 2.2e-16
## 
##  Chi-squared test for given probabilities
## 
## data:  na.omit(Median_marginal$sd)
## X-squared = 34418, df = 14, p-value < 2.2e-16

Since both tests have a p-value less than 0.05, we reject the null hypothesis that the mean and standard deviation of the median wage are the same across major types, which suggests that the distribution of median wage differs across major types.
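For reference, when R's chisq.test is given a single numeric vector, it tests the values against equal expected proportions, so the statistic is the sum of (x_i - mean)^2 / mean with k - 1 degrees of freedom. The sketch below reproduces that arithmetic on made-up group means (not the values behind the output above). `chisq_uniform` is a hypothetical helper, and the p-value is omitted because it would require a chi-squared distribution function outside the standard library.

```python
def chisq_uniform(values):
    """Goodness-of-fit statistic against equal expected values, with its degrees of freedom."""
    k = len(values)
    expected = sum(values) / k  # under the null, every group has the same value
    stat = sum((v - expected) ** 2 / expected for v in values)
    return stat, k - 1

# Made-up group means, in dollars.
group_means = [40000, 50000, 60000]
stat, df = chisq_uniform(group_means)
print(stat, df)
```

Because the inputs here are dollar amounts rather than counts, the statistic scales with the unit of measurement: the same means expressed in thousands of dollars give a statistic 1000 times smaller. This is worth keeping in mind when reading the magnitude of X-squared.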

After confirming this, we make a side-by-side boxplot to explore how the distribution of median wage differs across different major types.

From the boxplot above, we observe that the distribution of the median wage varies considerably across major types. Graduates in Engineering and Computer & Mathematics have markedly higher median wages than other major types, as the medians of these distributions are both over $62500; those who graduate with a major in Psychology & Social Work or Arts have markedly lower median wages, as the highest values in these distributions are below $50000. This information can help students choose a major if they have expectations for their wages after graduation.
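The quantities a side-by-side boxplot draws for each group can be computed directly. The sketch below groups made-up median wages by major category and computes a five-number summary for each. The records are illustrative only, and the "inclusive" quantile method is an assumption, since plotting libraries differ in how they interpolate quartiles.

```python
from collections import defaultdict
from statistics import quantiles

# Made-up (Major_category, Median) pairs, one per major.
records = [
    ("Engineering", 60000), ("Engineering", 70000), ("Engineering", 75000),
    ("Arts", 29000), ("Arts", 32000), ("Arts", 40000),
]

by_category = defaultdict(list)
for category, med in records:
    by_category[category].append(med)

summary = {}
for category, values in by_category.items():
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    summary[category] = {"min": min(values), "q1": q1, "median": q2,
                         "q3": q3, "max": max(values)}

for category, stats in summary.items():
    print(category, stats)
```

Comparing the "median" and "max" entries across categories is the numerical counterpart of comparing box centers and upper whiskers in the plot.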

We then examine whether the distribution of the IQR of wages differs across major types. As above, we first run a Chi-squared test on the means and standard deviations of the IQR distributions:

## 
##  Chi-squared test for given probabilities
## 
## data:  IQR_marginal$mean
## X-squared = 30806, df = 15, p-value < 2.2e-16
## 
##  Chi-squared test for given probabilities
## 
## data:  na.omit(IQR_marginal$sd)
## X-squared = 23679, df = 14, p-value < 2.2e-16

Both p-values are less than 0.05, so we reject the null hypothesis and conclude that not all major types have the same distribution of IQR of wages.

After confirming this, we make a side-by-side boxplot to see how the distribution of IQR of wages differs across different major types.

From the graph above, we observe that the IQR of wages tends to be higher for students majoring in Physical Sciences and Engineering, as the medians of these distributions are all above $50000. This indicates that students who graduate in these major types tend to face more wage uncertainty when they find a job. We also notice that the distribution of the IQR of wages for graduates in Education has a much narrower range than that of any other major type, which means that those graduates see less variation in their wages.

Based on the information discussed above, students can decide whether a major is suitable for them. For example, a student who wants higher wages and does not mind uncertainty may find Engineering or Computer & Mathematics a good choice; a student who is averse to uncertainty and does not require a high wage might consider majoring in Education.

Question 3: How important is a college degree for each major category?

Here we want to examine the relationship between unemployment rate and major category. We hope to understand the role a major’s category can play in a student’s expected employment. We would expect the majors with higher expected earnings to also have higher rates of employment and thus lower unemployment. Additionally, we would like to see what portion of the jobs obtained by graduates of different majors require a college degree, allowing us to see whether the cost of the degree is truly worth it for obtaining these jobs. Again, we would expect the degrees that lead to higher wages to also have a higher percentage of jobs that require a degree. We test these expectations with the following graphs.

The density plot shows the distribution of unemployment rates of majors within each major category. Several major categories appear to be approximately normally distributed, including arts, biology & life sciences, health, and humanities & liberal arts. Another notable feature is the tall peaks in communications & journalism and social sciences around the 5-10% range. Overall, the unemployment rate is generally below 15%, with only a few majors slightly higher, suggesting that recent graduates face similar unemployment rates regardless of major.
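The unemployment rate plotted here follows the definition given in the variable list, Unemployed / (Unemployed + Employed). A small sketch of that formula, with a hypothetical helper name and made-up counts:

```python
def unemployment_rate(employed, unemployed):
    """Share of the labor force (employed + unemployed) that is unemployed."""
    labor_force = employed + unemployed
    # Guard against majors with no one in the labor force.
    return unemployed / labor_force if labor_force else float("nan")

# Made-up counts for a single major.
rate = unemployment_rate(employed=9000, unemployed=1000)
print(f"{rate:.1%}")
```

Note that the denominator excludes graduates who are neither working nor looking for work, so the rate is not a share of Total.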

This stacked bar plot shows, for each major category, the proportion of employed recent graduates with jobs requiring a college degree (in red) and the proportion with jobs not requiring a college degree (in blue). The proportions vary widely across major categories. For some, the proportion of jobs that do not require a college degree is around 75% (e.g. business, law & public policy), while for others it is around 25% (e.g. education, engineering, health). The plot suggests that STEM-related major categories have a lower proportion of recent-graduate jobs that do not require a college degree than non-STEM categories, but further analysis would be needed to confirm this.
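The proportions behind a stacked bar plot of this kind can be computed by summing the job counts within each major category. The sketch below assumes the bars are normalized by College_jobs + Non_college_jobs. That denominator is an assumption, and the rows are made up for illustration.

```python
from collections import defaultdict

# Made-up rows, one per major; column names follow recent-grads.csv.
rows = [
    {"Major_category": "Engineering", "College_jobs": 800, "Non_college_jobs": 200},
    {"Major_category": "Engineering", "College_jobs": 700, "Non_college_jobs": 300},
    {"Major_category": "Business", "College_jobs": 300, "Non_college_jobs": 700},
]

# Accumulate [college, non_college] counts per category.
totals = defaultdict(lambda: [0, 0])
for r in rows:
    totals[r["Major_category"]][0] += r["College_jobs"]
    totals[r["Major_category"]][1] += r["Non_college_jobs"]

# Share of jobs in each category that do not require a college degree.
non_college_share = {
    category: non_college / (college + non_college)
    for category, (college, non_college) in totals.items()
}
print(non_college_share)
```

Summing counts before dividing weights each major by its size; averaging the per-major proportions instead would give small majors the same influence as large ones.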

Conclusion

Across our three questions, we saw that acquiring a degree is not the only thing that matters: the major one selects can be equally important in determining expected career earnings. Many expect the STEM fields to have the highest expected earnings, and our analysis supported this, with STEM outperforming every other group as a category. We also saw that many of the sciences lag behind other fields within STEM itself, with Engineering, Computer Science, and Mathematics remaining king in terms of expected earnings. The humanities and arts continued to lag, especially in earnings, with far more graduates landing jobs that are considered “low-wage” or that do not require a college degree at all. However, even within some lower-paying majors, the distribution of earnings still leaves room for a high-earning future, despite what the median alone may suggest. All of these findings point toward a consistent trend: though college may be the general path to a high-paying future for many, the most important factor remains the job field at the end. None of this accounts for personal interest, passion for a subject, or enjoyment of the content, but if you see college as a path to a wealthier future, a degree may only matter insofar as it is in an in-demand field with high-paying, often exclusive jobs.

Future Work

Our research primarily focused on the relationship between earnings and majors and the resulting distributions of earnings, but in the future we see value in focusing more on individual majors to understand why they may or may not outperform their counterparts. It is well known that the STEM fields generally earn the most, especially compared to the humanities, but with more data and further work, could we determine how humanities fields can remain competitive on the job market? By further examining the types of jobs with higher career earnings, such as software engineering or data science, we could carry the lessons learned over to fields such as English or History and see how the higher earners in these fields fare in the long run. This would require more granular data and more specialized research, but understanding the ever-growing gap between those with technical skills and those without could offer insights into how to not only slow the growth of this gap but, hopefully, reverse it. Furthermore, with female enrollment and graduation rates eclipsing those of their male counterparts, further research could examine why male students so often graduate into higher-earning sectors. Much work has already been done to shrink this gender gap, but further research is still clearly needed.