1. Introduction

1.1 Motivation

To help students make informed decisions about their careers, it is important to understand which attributes of a college are associated with future earnings. Our team therefore chose to investigate the characteristics of a college that may have an impact on students’ future earnings.

The institutional-level dataset is adapted from the College Scorecard website of the US Department of Education. The data files at the institutional level span the academic years 1996-97 to 2021-22 and contain comprehensive information on thousands of colleges and universities across the US. The dataset includes details on institutional characteristics, such as costs, financial aid received by students, enrollment, and median salaries after graduation.

1.2 Data Description

The dataset includes 6,543 rows, each representing an American college or university. There are 3,232 columns, which contain a wide range of information for each institution. (Link to the dataset: https://collegescorecard.ed.gov/data/)

Since there are so many variables, we chose the following groups of candidate factors that may affect earnings:

Finance-Related Variables: tuition costs of the institutions, student debt, financial aid, etc.
Selectivity-Related Variables: SAT scores of the admitted students, admission rates, etc.
Locations (States) of the colleges

Therefore, our primary variables of interest are as follows:

MD_EARN_WNE_P10: Median earnings (dollars) of students working and not enrolled, 10 years after entry.
TUITFTE: Net tuition revenue per full-time-equivalent student, used here as a measure of the institution’s tuition costs.
DEBT_MDN: The median loan debt accumulated at the institution by all student borrowers of federal loans.
PCTPELL: Percent of undergraduates receiving federal Pell grants.
PCTFLOAN: Percent of undergraduates receiving federal loans.
ADM_RATE: Admission rate of the institution.
SAT_AVG: Average SAT scores of students.
CONTROL: Factor variable that indicates the type of school. 1: public school, 2: private nonprofit school, 3: private for-profit school.
STABBR: Two-letter abbreviation of states and territories of the United States
UGDS: Number of degree-seeking undergraduate students enrolled in the Fall

1.3 Primary Research Questions

We have the following research questions:

Would finance-related variables, such as tuition of the colleges and student debt, significantly impact the earnings of graduates? Specifically speaking, how do percentage of federal Pell grant given, percentage of federal student loan given, net tuition revenue, median debt, and institutional characteristics such as school types, interact to influence the median earnings of graduates 10 years post-enrollment?
Would selectivity-related variables, such as SAT scores of the admitted students and admission rates, significantly impact the earnings of graduates? How do pre-enrollment academic performance metrics, specifically SAT scores, and institutional characteristics, such as school types and admission rates, interact and influence the median earnings of graduates 10 years post-enrollment?
Would the state-level locations of the colleges significantly impact the earnings of graduates?

For the sake of consistency, we will use the variable MD_EARN_WNE_P10 as our measure of graduate earnings. It indicates the median earnings (dollars) of students working and not enrolled, 10 years after their entry.

2. Preliminary Data Analysis

2.1 Exploratory Data Analysis (EDA)

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(readr)
library(ggthemes)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(dendextend)
school = read.csv("/Users/haoyuangui/Downloads/Most-Recent-Cohorts-Institution 2.csv")
school <- school[!is.na(school$MD_EARN_WNE_P10), ]
school$MD_EARN_WNE_P10 = as.numeric(school$MD_EARN_WNE_P10)
school$PCTPELL = as.numeric(school$PCTPELL)
school$PCTFLOAN = as.numeric(school$PCTFLOAN)
school$TUITFTE = as.numeric(school$TUITFTE)
school$DEBT_MDN = as.numeric(school$DEBT_MDN)
school$CONTROL <- factor(school$CONTROL, levels = c(1, 2, 3), labels = c("Public", "Private Nonprofit", "Private For-Profit"))
college <- read_csv("/Users/haoyuangui/Downloads/Most-Recent-Cohorts-Institution 2.csv")

college$CONTROL <- factor(college$CONTROL, 
                          levels = c(1, 2, 3),
                          labels = c("Public", "Private Nonprofit", "Private For-Profit"))

college <- college %>% 
  mutate(ADM_RATE = as.numeric(ADM_RATE),
         SAT_AVG = as.numeric(SAT_AVG)) %>%
  filter(is.finite(ADM_RATE), is.finite(SAT_AVG))

college <- college[!is.na(college$MD_EARN_WNE_P10), ]
college$MD_EARN_WNE_P10 <- as.numeric(college$MD_EARN_WNE_P10)

ggplot(school, aes(x = MD_EARN_WNE_P10)) +
  geom_histogram(binwidth = 5000, fill = "skyblue", color = "white", aes(y = after_stat(density))) +
  labs(title = "Histogram of Median Earning",
       x = "Median Earning",
       y = "Density") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 5))

The median earnings variable (MD_EARN_WNE_P10) is roughly bell-shaped with a slight right skew and a mean of around 38,000. There is no sign that a transformation is needed. Since we intend to model the relationship between median earnings and several variables, we can use scatter plots to see the overall trends.
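The visual impression of a slight right skew can be backed by a number. The following is a minimal sketch, assuming simulated earnings rather than the real data; `sample_skewness` and `earnings_sim` are hypothetical names introduced only for illustration.

```r
# Sample skewness (third standardized moment); positive values indicate right skew.
sample_skewness <- function(x) {
  x <- x[is.finite(x)]                 # drop NA / Inf before computing moments
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

# Simulated, right-skewed stand-in for MD_EARN_WNE_P10 (NOT the real data).
set.seed(1)
earnings_sim <- rlnorm(1000, meanlog = log(36000), sdlog = 0.3)

# A mean above the median together with a positive skewness suggests right skew.
c(mean     = mean(earnings_sim),
  median   = median(earnings_sim),
  skewness = sample_skewness(earnings_sim))
```

On the real data, the same function could be applied to school$MD_EARN_WNE_P10 after the numeric conversion.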

We can also look at the data by differentiating the school types.

ggplot(college, aes(x = MD_EARN_WNE_P10, fill = CONTROL)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Median Earnings by School Type",
       x = "Median Earnings (10 years after enrollment)",
       y = "Density",
       fill = "School Type") +
  scale_fill_manual(values = c("Public" = "maroon4", 
                                "Private Nonprofit" = "palegreen4", 
                                "Private For-Profit" = "blue3")) +
  theme_minimal()

We can take a closer look at how earnings vary with school type through a density plot. Note that this plot uses the college data frame, which keeps only institutions that report both an admission rate and an average SAT score. From the plot, it is evident that there are both differences and similarities across the distributions for each school type. The distributions overlap substantially, indicating that there’s a common earnings range where graduates from all types of institutions are likely to fall. However, there’s a noticeable difference in the spread and skewness of these distributions, suggesting variability in the economic outcomes for graduates from these institutions. The tails of the distributions extend towards higher earnings, especially for Private For-Profit institutions, indicating that some institutions have median earnings well above the typical value, which is particularly notable for stakeholders interested in the potential for high post-graduation earnings.

The plot also suggests that the distribution for Private For-Profit schools may have a slightly higher mode than Public and Private Nonprofit schools, which could be interpreted as a higher concentration of graduates from Private For-Profit schools in the higher earnings bracket. The wider bases for each distribution indicate a broader range of earnings among graduates, with the implication that there’s less consistency in earnings within each school type. This variability and the presence of higher earners among Private For-Profit graduates might be factors to explore in further detail, especially when considering the impact of educational institution types on long-term earnings. Such insights can be valuable for students making educational choices, for policymakers focusing on education and employment, and for institutions working towards improving graduate outcomes.

df <- read.csv("/Users/haoyuangui/Downloads/Most-Recent-Cohorts-Institution 2.csv")
df$ADM_RATE <- as.numeric(df$ADM_RATE)
df$SAT_AVG <- as.numeric(df$SAT_AVG)
df$TUITFTE <- as.numeric(df$TUITFTE)
df$PCTFLOAN <- as.numeric(df$PCTFLOAN)
df$PCTPELL <- as.numeric(df$PCTPELL)
df$DEBT_MDN <- as.numeric(df$DEBT_MDN)
df$MD_EARN_WNE_P10 <- as.numeric(df$MD_EARN_WNE_P10)

2.2 Principal Component Analysis (PCA)

To gain an initial insight into how various quantitative factors - such as admission rates, SAT scores, tuition costs, and financial aid - might correlate and collectively relate to future median earnings, we apply principal component analysis (PCA). As seen above, we have six key quantitative factors to consider, excluding the future earnings variable, and it’s difficult to visualize all these variables simultaneously. PCA is an effective tool for dimensionality reduction: it transforms our set of variables into a smaller, more manageable one while retaining most of the information, which makes the data easier to understand without significant loss of detail. To determine how many principal components to use, we create an elbow plot, or scree plot, as below.

library(dplyr)
library(tidyr)
library(factoextra)
df_quant <- df %>% 
  select(c(ADM_RATE, SAT_AVG, TUITFTE, PCTFLOAN, PCTPELL, DEBT_MDN)) %>% 
  drop_na(ADM_RATE, SAT_AVG, TUITFTE, PCTFLOAN, PCTPELL, DEBT_MDN)

pca <- prcomp(df_quant,
              center = TRUE, scale. = TRUE)

fviz_eig(pca, addlabels = TRUE) + 
  geom_hline(yintercept = 100 * (1 / ncol(df_quant)),
             linetype = "dashed", color = "red")

There are 6 quantitative variables in the dataset, and thus there are 6 principal components. The elbow plot suggests that the first principal component accounts for almost half of the variation in the dataset (47.3%), while the second accounts for about 24.4% of the variation. After the first two principal components, the proportion of explained variation drops substantially and starts to become flat. Additionally, only the first two components are above the horizontal line (at 1 divided by the number of variables). Therefore, it’s reasonable for us to choose \(k = 2\) and plot the first two principal components only.
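As a quick check of the figures quoted above (a worked calculation from the reported percentages, not new output), the reference line and the share of variation retained by the first two components are:

\[
\frac{100\%}{6} \approx 16.7\% \quad \text{(reference line)}, \qquad 47.3\% + 24.4\% = 71.7\% \quad \text{(PC1 and PC2 combined)}.
\]

So choosing \(k = 2\) keeps a little over 70% of the total variation in the six standardized variables.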

However, the principal components by themselves aren’t directly interpretable. To explore in what ways they are related to the original variables in the data, we will create a biplot of the first two principal components. We will also color the data points by their earning category based on the median earnings ten years after enrollment. This allows us to examine how future earnings are associated with the original quantitative variables in the data. The biplot is shown below:

df_quant <- df %>% 
  select(c(ADM_RATE, SAT_AVG, TUITFTE, PCTFLOAN, PCTPELL, DEBT_MDN, MD_EARN_WNE_P10)) %>% 
  drop_na(ADM_RATE, SAT_AVG, TUITFTE, PCTFLOAN, PCTPELL, DEBT_MDN)

percentiles <- quantile(df_quant$MD_EARN_WNE_P10, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

df_quant <- df_quant %>%
  mutate(earning_category = case_when(
    is.na(MD_EARN_WNE_P10) ~ NA_character_,
    MD_EARN_WNE_P10 <= percentiles[1] ~ '1',
    MD_EARN_WNE_P10 <= percentiles[2] ~ '2',
    MD_EARN_WNE_P10 <= percentiles[3] ~ '3',
    TRUE ~ '4')) %>%
  mutate(earning_category = factor(earning_category, levels = c('NA', '1', '2', '3', '4'),
                                   labels = c('NA', '0-25%', '25-50%', '50-75%', '75-100%')))

pc_matrix <- pca$x
df_quant <- df_quant %>%
  mutate(pc1 = pc_matrix[,1],
         pc2 = pc_matrix[,2])

fviz_pca_biplot(pca, 
                label = "var", 
                alpha.ind = .5,
                repel = TRUE,
                habillage = df_quant$earning_category, 
                pointshape = 19)

Based on the biplot above, we can make the following conclusions:

Since two vectors whose angle is less than 90 degrees are positively correlated, it appears that SAT averages and tuition costs are positively correlated. Admission rates, percents of undergraduates receiving Pell Grants, and percents receiving federal student loans are positively correlated. The median debt is positively correlated with both tuition costs and percents of students receiving federal student loans.
Higher SAT scores and higher tuition costs are associated with higher future earnings.
Higher admission rates and higher percents of undergraduates receiving Pell Grants as well as federal student loans are associated with lower future earnings.
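Because a biplot only shows two components, the angles between arrows are an approximation of the underlying correlations. A pairwise correlation matrix is a simple cross-check. The sketch below uses simulated data with made-up relationships (the data frame `sim` and all numbers in it are assumptions for illustration, not values from the College Scorecard).

```r
# Simulated stand-ins for three of the variables (NOT the real data).
set.seed(11)
n   <- 300
sat <- rnorm(n, mean = 1100, sd = 120)
sim <- data.frame(
  SAT_AVG = sat,
  # Tuition constructed to rise with SAT scores
  TUITFTE = 12000 + 40 * (sat - 1100) + rnorm(n, sd = 3000),
  # Pell share constructed to fall with SAT scores, clipped to [0, 1]
  PCTPELL = pmin(pmax(0.4 - 0.001 * (sat - 1100) + rnorm(n, sd = 0.05), 0), 1)
)

# Pairwise Pearson correlations; the signs should match the arrow angles in a biplot.
cor_mat <- cor(sim, use = "complete.obs")
round(cor_mat, 2)
```

With the real data, the same call on the six quantitative columns would show whether the positive and negative associations read off the biplot hold in the full correlation matrix.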

We will now take a deeper look at our research questions.

3. Finance-Related Factors

3.1 Research Question 1

How do finance-related factors, specifically the percentage of students receiving federal Pell grants, the percentage receiving federal student loans, net tuition revenue, median debt, and institutional characteristics such as school type, interact to influence the median earnings of graduates 10 years post-enrollment?

3.2 Scatterplot of Tuition and Earning by College Type

scatter_plot <- ggplot(school, aes(x = TUITFTE, y = MD_EARN_WNE_P10, color = CONTROL)) +
  geom_point() +
  labs(title = "Scatter Plot of Median Earning and Tuition",
       x = "Tuition",
       y = "Median Earning",
       color = "Type of school")+
  scale_x_continuous(expand = c(0, 0), limits = c(0, 30000)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(group = CONTROL, color = CONTROL))
scatter_plot

The scatterplot shows a positive linear relationship between median earnings and tuition for all three school types. Private for-profit schools have the flattest slope, meaning that their median earnings rise the least as tuition increases; they combine comparatively high tuition with low median earnings. Public schools, on the other hand, are concentrated at low tuition and comparatively high median earnings. Private nonprofit schools fall between the other two types, with medium tuition and medium earnings. This comparison can help college applicants choose a suitable school according to their current budget and their expected income after graduation.

3.3 Dendrogram of Several Finance-Related Factors

subset_data <- school[, c("CONTROL", "PCTPELL", "PCTFLOAN", "TUITFTE", "DEBT_MDN", "MD_EARN_WNE_P10")]
subset_data <- na.omit(subset_data)
dist_matrix <- dist(subset_data)
hc <- hclust(dist_matrix, method = "ward.D2")
dend <- as.dendrogram(hc)
dend_color <- set(dend, "branches_k_color", k = 3)
control_color = ifelse(subset_data$CONTROL == "Public", "red",
                    ifelse(subset_data$CONTROL == "Private Nonprofit", "green", "blue"))
dend_color <- set(dend_color, "labels_colors", 
                       order_value = TRUE, control_color)

plot(dend_color,main = "Dendrogram of Finance-Related Variables")

The hierarchical tree diagram offers a visual representation of the relationships among schools based on key variables: school type (public, private nonprofit, private for-profit), the fraction of undergraduates receiving federal Pell grants, the fraction receiving federal student loans, net tuition revenue, median debt, and median earnings. The leaves are colored by the CONTROL variable to distinguish between school types, and the branches are colored into 3 groups to look for similarities between school types.

We see from the diagram that, although a small portion of each color is scattered rather than clustered, the majority of schools that belong to the same CONTROL group are clustered together. Moreover, most of the public schools fall under the same branches as the private for-profit schools: blue and red are closely tied together under the red branch. This suggests that public and private for-profit schools are more similar to each other in terms of MD_EARN_WNE_P10, CONTROL, PCTPELL, PCTFLOAN, TUITFTE, and DEBT_MDN than either is to private nonprofit schools, which is useful for students, teachers, or the media when comparing schools.
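One way to put numbers on how well the branches line up with school type is to cut the tree into three clusters and cross-tabulate them against CONTROL. The following is a minimal sketch on simulated data with only two numeric variables; the data frame `sim` and its values are assumptions, and standardizing with scale() is a choice made in this sketch (the dendrogram above is built from the raw values).

```r
# Simulated stand-in data (NOT the real data): 20 schools of each type.
set.seed(3)
sim <- data.frame(
  CONTROL = factor(rep(c("Public", "Private Nonprofit", "Private For-Profit"), each = 20)),
  PCTPELL = c(rnorm(20, 0.35, 0.05), rnorm(20, 0.30, 0.05), rnorm(20, 0.60, 0.05)),
  TUITFTE = c(rnorm(20, 6000, 1000), rnorm(20, 15000, 2000), rnorm(20, 11000, 1500))
)

# Standardize so that no single variable dominates the Euclidean distances.
num <- scale(sim[, c("PCTPELL", "TUITFTE")])

# Ward clustering, then cut the tree into three groups.
hc_sim   <- hclust(dist(num), method = "ward.D2")
clusters <- cutree(hc_sim, k = 3)

# Rows: cluster labels; columns: school types. A near-diagonal pattern
# (after reordering) means the clusters recover the school types.
tab <- table(cluster = clusters, type = sim$CONTROL)
tab
```

Applied to the real subset_data, such a table would show how many schools of each type fall into each of the three colored branches.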

3.4 Regression Analysis

model <- lm(MD_EARN_WNE_P10 ~ CONTROL + PCTPELL + PCTFLOAN + TUITFTE + DEBT_MDN, school)
summary(model)

## 
## Call:
## lm(formula = MD_EARN_WNE_P10 ~ CONTROL + PCTPELL + PCTFLOAN + 
##     TUITFTE + DEBT_MDN, data = school)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54133  -5419  -1348   3812  64949 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.732e+04  6.202e+02  60.168   <2e-16 ***
## CONTROLPrivate Nonprofit  -6.309e+03  5.268e+02 -11.977   <2e-16 ***
## CONTROLPrivate For-Profit -1.294e+04  5.403e+02 -23.954   <2e-16 ***
## PCTPELL                   -2.323e+04  1.240e+03 -18.736   <2e-16 ***
## PCTFLOAN                   2.717e+03  1.077e+03   2.523   0.0117 *  
## TUITFTE                    7.403e-01  2.912e-02  25.420   <2e-16 ***
## DEBT_MDN                   1.015e+00  4.498e-02  22.561   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10310 on 4271 degrees of freedom
##   (2265 observations deleted due to missingness)
## Multiple R-squared:  0.6033, Adjusted R-squared:  0.6027 
## F-statistic:  1082 on 6 and 4271 DF,  p-value: < 2.2e-16

Intercept: The intercept is estimated to be 37,320. This is the predicted value of MD_EARN_WNE_P10 when all predictor variables (“CONTROL”, “PCTPELL”, “PCTFLOAN”, “TUITFTE”, and “DEBT_MDN”) are zero.

CONTROL (Private Nonprofit and Private For-Profit): These coefficients represent the difference in MD_EARN_WNE_P10 between the specified private category and the baseline category (Public), holding other variables constant. For Private Nonprofit schools, the predicted earnings are estimated to be 6,309 dollars lower than for Public schools. For Private For-Profit schools, the predicted earnings are estimated to be 12,940 dollars lower than for Public schools. Both coefficients have p-values below 0.05, which indicates that school type is significantly associated with median earnings.

PCTPELL: The coefficient for PCTPELL is -23,230. Because PCTPELL is recorded as a fraction between 0 and 1, a one-percentage-point increase in PCTPELL is associated with a decrease in MD_EARN_WNE_P10 of about 232 dollars, holding other variables constant. Since the p-value is less than 0.05, we can reasonably say that there is a negative relationship between the fraction of undergraduates who received a federal Pell grant and earnings 10 years after enrollment.

PCTFLOAN: The coefficient for PCTFLOAN is 2,717. PCTFLOAN is also a fraction, so a one-percentage-point increase in PCTFLOAN is associated with an increase in MD_EARN_WNE_P10 of about 27 dollars, holding other variables constant. The p-value (0.0117) is less than 0.05, indicating a positive relationship between the fraction of undergraduates who received a federal student loan and earnings 10 years after enrollment, although the evidence is weaker than for the other predictors.

TUITFTE: The coefficient for TUITFTE is 0.7403. This means that for each one-dollar increase in TUITFTE, the model predicts an increase in MD_EARN_WNE_P10 of about 0.74 dollars, holding other variables constant. Since the p-value is less than 0.05, we conclude that there is a positive relationship between net tuition revenue and student earnings 10 years after enrollment.

DEBT_MDN: The coefficient for DEBT_MDN is 1.015. This means that for each one-dollar increase in DEBT_MDN, the model predicts an increase in MD_EARN_WNE_P10 of about 1.02 dollars, holding other variables constant. Since the p-value is less than 0.05, we conclude that there is a positive relationship between median debt and earnings 10 years after enrollment.

Overall, this model suggests that all predictors are significant when predicting earnings. In particular, PCTPELL is negatively related to MD_EARN_WNE_P10, while PCTFLOAN, TUITFTE, and DEBT_MDN are positively related to MD_EARN_WNE_P10. Lastly, CONTROL is related to MD_EARN_WNE_P10 such that private schools have lower predicted earnings than public schools.
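To make the coefficient interpretation concrete, the sketch below plugs the rounded estimates printed in the summary output into the fitted equation. The helper `predict_earnings` and the example inputs are hypothetical and serve only to illustrate the arithmetic; with the fitted model object available, predict(model, newdata = ...) would be the usual route.

```r
# Coefficients copied from the summary(model) output above (rounded as printed).
coefs <- c(intercept = 37320, nonprofit = -6309, forprofit = -12940,
           pctpell = -23230, pctfloan = 2717, tuitfte = 0.7403, debt_mdn = 1.015)

# PCTPELL and PCTFLOAN are fractions in [0, 1], so a one-percentage-point
# change corresponds to a change of 0.01 in the predictor.
per_point <- coefs[c("pctpell", "pctfloan")] * 0.01
per_point

# Predicted median earnings for a hypothetical school (illustrative inputs).
predict_earnings <- function(control = "Public", pctpell, pctfloan, tuitfte, debt_mdn) {
  shift <- switch(control,
                  "Public"             = 0,                      # baseline category
                  "Private Nonprofit"  = coefs[["nonprofit"]],
                  "Private For-Profit" = coefs[["forprofit"]],
                  stop("unknown school type"))
  coefs[["intercept"]] + shift +
    coefs[["pctpell"]]  * pctpell  + coefs[["pctfloan"]] * pctfloan +
    coefs[["tuitfte"]]  * tuitfte  + coefs[["debt_mdn"]] * debt_mdn
}

# Same inputs, different school types: the gap equals the CONTROL coefficient.
predict_earnings("Public",             pctpell = 0.30, pctfloan = 0.40,
                 tuitfte = 10000, debt_mdn = 12000)
predict_earnings("Private For-Profit", pctpell = 0.30, pctfloan = 0.40,
                 tuitfte = 10000, debt_mdn = 12000)
```

Holding the other inputs fixed, the two predictions differ by exactly the Private For-Profit coefficient, which is what "holding other variables constant" means in the interpretations above.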

4. Selectivity of the College — SAT Scores, Admission Rates

4.1 Research Question 2

How do pre-enrollment academic performance metrics, specifically SAT scores, and institutional characteristics, such as school types and admission rates, interact to influence the median earnings of graduates 10 years post-enrollment?

4.2 Exploratory Data Analysis

The variables we chose are:

SAT_AVG: Average SAT scores.
CONTROL: 1: public school, 2: private nonprofit school, 3: private for-profit school.
ADM_RATE: Admission rate.

ggplot(college, aes(x = ADM_RATE, y = SAT_AVG, color = CONTROL)) + 
  geom_point(size = 1.2, alpha = 0.8) +  
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, aes(group = CONTROL, color = CONTROL)) +  
  labs(title = "Admission Rate vs Average SAT Scores by School Type",
       x = "Admission Rate (%)",  
       y = "Average SAT Scores",
       color = "School Type") +
  theme_bw() +  
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10)) +
  scale_color_manual(values = c("Public" = "maroon4", 
                                "Private Nonprofit" = "palegreen4", 
                                "Private For-Profit" = "blue3")) +
  scale_x_reverse(labels = scales::percent_format(accuracy = 1),  
                  breaks = scales::pretty_breaks(n = 5)) +  
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5))

The scatter plot displays a collection of points that represent different institutions, with the average SAT scores plotted against the admission rates. Each point is colored according to the type of institution: public, private nonprofit, or private for-profit. Generally, institutions with lower admission rates (more selective) have higher average SAT scores, which could indicate a competitive applicant pool. The slopes of the lines also provide insights: the slope for private institutions is steeper than for public ones, which might suggest that average SAT scores change more strongly with admission rates at private institutions than at public institutions.

4.3 Regression Analysis

college <- college[!is.na(college$MD_EARN_WNE_P10), ]
college$MD_EARN_WNE_P10 <- as.numeric(college$MD_EARN_WNE_P10)
model <- lm(MD_EARN_WNE_P10 ~ SAT_AVG + CONTROL + ADM_RATE, data = college)
summary(model)

## 
## Call:
## lm(formula = MD_EARN_WNE_P10 ~ SAT_AVG + CONTROL + ADM_RATE, 
##     data = college)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30104  -6099   -547   4294  72063 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -34927.902   4004.569  -8.722   <2e-16 ***
## SAT_AVG                       78.305      2.729  28.699   <2e-16 ***
## CONTROLPrivate Nonprofit    1117.697    664.167   1.683   0.0927 .  
## CONTROLPrivate For-Profit   8379.758   5719.212   1.465   0.1432    
## ADM_RATE                   -2094.078   1744.742  -1.200   0.2303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9859 on 1006 degrees of freedom
## Multiple R-squared:  0.5476, Adjusted R-squared:  0.5458 
## F-statistic: 304.4 on 4 and 1006 DF,  p-value: < 2.2e-16

Intercept: The intercept is estimated to be -34,927.902. This is the predicted value of MD_EARN_WNE_P10 when all predictor variables (SAT_AVG, CONTROL, and ADM_RATE) are zero.

SAT_AVG: The coefficient for SAT_AVG is 78.305. This means that for each one-unit increase in SAT_AVG, the model predicts an increase in MD_EARN_WNE_P10 of 78.305 units, holding other variables constant. Since SAT_AVG is a standardized test score, this coefficient suggests a positive relationship between higher SAT scores and higher earnings 10 years after enrollment.

CONTROL (Private Nonprofit and Private For-Profit): These coefficients represent the difference in MD_EARN_WNE_P10 between the reference category (Public) and the specified category. For Private Nonprofit schools, the predicted earnings are estimated to be 1,117.697 dollars higher than for Public schools. For Private For-Profit schools, the predicted earnings are estimated to be 8,379.758 dollars higher than for Public schools. However, these differences are not statistically significant, as their p-values are greater than 0.05.

ADM_RATE: The coefficient for ADM_RATE is -2,094.078. This suggests that for each one-unit increase in ADM_RATE (a higher admission rate, indicating lower selectivity), the model predicts a decrease in MD_EARN_WNE_P10 of 2,094.078 dollars. Since ADM_RATE is a fraction between 0 and 1, a one-unit increase spans the whole range from admitting no applicants to admitting all of them. However, the p-value for ADM_RATE is relatively high (0.2303), indicating that this variable is not statistically significant in predicting MD_EARN_WNE_P10.

Overall, the model suggests that SAT_AVG is a significant predictor of MD_EARN_WNE_P10, with higher SAT scores associated with higher earnings. However, the variables CONTROL (school type) and ADM_RATE (admission rate) do not appear to be statistically significant in predicting earnings 10 years after enrollment in this model.
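To put the coefficients on a more practical scale (a worked calculation from the estimates reported above, with the step sizes of 100 SAT points and 10 percentage points chosen only for illustration):

\[
100 \times 78.305 = 7{,}830.5 \text{ dollars}, \qquad 0.10 \times (-2{,}094.078) \approx -209.4 \text{ dollars}.
\]

That is, holding school type and admission rate constant, a 100-point difference in average SAT score corresponds to a predicted difference of about 7,800 dollars in median earnings, whereas a 10-percentage-point difference in admission rate corresponds to only about 200 dollars, and the latter is not statistically distinguishable from zero.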

# Scatterplot of MD_EARN_WNE_P10 vs. SAT_AVG
ggplot(college, aes(x = SAT_AVG, y = MD_EARN_WNE_P10, color = CONTROL)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(title = "Median Earnings vs. SAT Scores",
       x = "Average SAT Scores",
       y = "Median Earnings (10 years after enrollment)",
       color = "School Type") +
  scale_color_manual(values = c("Public" = "maroon4", 
                                "Private Nonprofit" = "palegreen4", 
                                "Private For-Profit" = "blue3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 12)) +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 8))

For the only significant predictor of median earnings, the average SAT scores, the linear regression model demonstrates a clear positive correlation: higher SAT scores are associated with higher earnings 10 years post-enrollment. This underscores the SAT’s predictive validity regarding future financial outcomes within the scope of the variables analyzed. However, it’s important to note that while SAT scores may serve as a strong indicator, they are not the sole determinant of future earnings, as many other unmodeled factors also play a critical role.

4.4 MDS Analysis

mds_data <- college %>% 
  select(CONTROL, ADM_RATE, SAT_AVG, MD_EARN_WNE_P10) %>%
  na.omit()

# Distances use the numeric variables only
dist_matrix <- dist(select(mds_data, -CONTROL), method = "euclidean")
mds_result <- cmdscale(dist_matrix)
mds_df <- as.data.frame(mds_result)
names(mds_df) <- c("Dim1", "Dim2")

# Attach school type from the same rows used to compute the distances
mds_df$CONTROL <- mds_data$CONTROL

ggplot(mds_df, aes(x = Dim1, y = Dim2, color = CONTROL)) +
  geom_point(size = 1, alpha = 0.8) +  
  geom_density_2d(color = "black", linewidth = 0.5) +
  scale_color_manual(values = c("Public" = "blue", 
                                "Private Nonprofit" = "brown", 
                                "Private For-Profit" = "gold")) + 
  labs(title = "Institutional Similarity Landscape Based on Multivariate Data",
       subtitle = "Dimension 1 vs Dimension 2",
       x = "Dimension 1",
       y = "Dimension 2",
       color = "School Type") +
  theme_minimal(base_size = 12) +  
  theme(legend.position = "bottom")

In an investigation centered on the factors that influence post-graduation income, the MDS plot visually encapsulates the interplay between selectivity—represented by SAT scores and admission rates—and subsequent earnings. The dense clustering at the center for Public and Private Nonprofit schools suggests a correlation where selectivity parameters could be strong predictors of income. In contrast, the spread of Private For-Profit institutions across the plot implies a more complex relationship, where selectivity may be less indicative of earnings potential, perhaps overshadowed by program diversity, student services, or other institutional characteristics.

Further, the MDS dimensions potentially align with key aspects of selectivity and income influence; Dimension 1 could mirror the selectivity continuum, where lower values correspond to less selective schools, and higher values to more selective ones, impacting earnings in a predictable pattern. Dimension 2 might reflect additional factors affecting income, such as the quality of education, alumni networks, or geographical location. Outlying points, particularly among the Private For-Profit category, highlight the existence of exceptions where selectivity does not dictate income, suggesting that for certain institutions, unique elements might elevate graduate earnings regardless of their admissions selectivity.
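When reading the MDS dimensions, it helps to remember that the distance matrix above is computed on the raw variables, whose scales differ by several orders of magnitude (admission rates between 0 and 1, SAT scores in the hundreds to thousands, earnings in the tens of thousands). The sketch below, on simulated data with assumed relationships, compares classical MDS on raw and on standardized variables; `sim`, `mds_raw`, and `mds_std` are hypothetical names.

```r
# Simulated stand-in data (NOT the real data).
set.seed(42)
n    <- 200
sat  <- rnorm(n, mean = 1100, sd = 120)
adm  <- pmin(pmax(rnorm(n, mean = 0.7, sd = 0.2), 0.05), 1)   # clipped to (0, 1]
earn <- 40000 + 70 * (sat - 1100) + rnorm(n, sd = 8000)
sim  <- data.frame(ADM_RATE = adm, SAT_AVG = sat, MD_EARN_WNE_P10 = earn)

# Classical MDS on raw values versus standardized values (mean 0, sd 1).
mds_raw <- cmdscale(dist(sim))
mds_std <- cmdscale(dist(scale(sim)))

# How closely does the first dimension track earnings alone?
cor_raw <- abs(cor(mds_raw[, 1], sim$MD_EARN_WNE_P10))
cor_std <- abs(cor(mds_std[, 1], sim$MD_EARN_WNE_P10))
c(raw = cor_raw, standardized = cor_std)
```

In the raw version, the first dimension is essentially the earnings variable, because earnings contribute almost all of the Euclidean distance. Standardizing first gives admission rate and SAT scores comparable weight, which may be closer to the selectivity interpretation discussed above.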

4.5 Conclusion

In the context of predicting median earnings with selectivity, average SAT scores have a statistically significant positive correlation with earnings. This suggests that institutions with higher SAT scores tend to have graduates with higher median earnings, highlighting the SAT’s predictive value regarding financial outcomes after education. However, the type of control of the institution (public, private nonprofit, or private for-profit) and admission rates, while included in the model, do not show a statistically significant predictive value for earnings.

The scatter plots reinforce this conclusion, visually demonstrating the positive relationship between SAT scores and median earnings, with a wide variation in earnings at similar SAT score levels, suggesting other factors also play a role. The MDS plot with contour lines further indicates that institutions can be grouped based on similarities in multiple dimensions, likely beyond just SAT scores and earnings. These visualizations and statistical analyses provide insights into the complex landscape of higher education outcomes, emphasizing the importance of academic performance as captured by SAT scores while also hinting at a multitude of other factors influencing post-graduation earnings that are not captured in this specific model.

5. Regional Factors (States)

5.1 Research Question 3

Would the state-level locations of the colleges affect their graduates’ earnings 10 years after enrollment?

5.2 Statebins Map

To examine the influence of college locations on graduates’ earnings, we generate a state-level map illustrating the distribution of earnings based on the geographical locations of universities. This spatial analysis offers insights into potential regional factors impacting graduates’ economic outcomes. The explanatory variable is the state (STABBR) in which the college is located, and the response variable is the students’ earnings 10 years after their college enrollment.

library(tidyverse)
library(maps)
library(ggplot2)
library(dplyr)
library(statebins)
path <- "/Users/haoyuangui/Downloads/Most-Recent-Cohorts-Institution 2.csv"
content <- read.csv(path)
state_borders <- map_data("state")

average_earnings_by_state <- content %>%
  select(INSTNM, UGDS, STABBR, MD_EARN_WNE_P10) %>%

  mutate(
    MD_EARN_WNE_P10 = as.numeric(MD_EARN_WNE_P10),
    UGDS = as.numeric(UGDS),
    MD_EARN_WNE_P10 = ifelse(is.na(MD_EARN_WNE_P10), 0, MD_EARN_WNE_P10),
    UGDS = ifelse(is.na(UGDS), 0, UGDS)
  ) %>%
  group_by(STABBR) %>%
  summarize(
    Earning_in_USD = weighted.mean(MD_EARN_WNE_P10,UGDS)
    #sd_avg_earning = sd(MD_EARN_WNE_P10)
  )

statebins(state_data = average_earnings_by_state, state_col = "STABBR",
          value_col = "Earning_in_USD",
          ggplot2_scale_function = viridis::scale_fill_viridis, n=1) +
  labs(title = "Spatial Distribution of Annual Average College Graduate Earning\nby State 10 Years Post College Enrollment") +
  theme_statebins("right")

The graph illustrates that, a decade after enrollment, students who attended colleges in Washington, DC have the highest annual average earnings, above 60,000 USD, closely followed by those in Massachusetts. Graduates who went to college in Puerto Rico have the lowest average earnings, below 30,000 USD. The statebins graph does not contain the following territories, so they are not investigated in this graph: American Samoa, Federated States of Micronesia, Guam, Marshall Islands, Northern Mariana Islands, and Palau.

To further demonstrate the difference between the earnings of graduates who attended schools in different states, we conduct an ANOVA test on the earnings of graduates across states.
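For reference, the state-level value plotted in the map is an enrollment-weighted mean of the institution-level medians, as computed by weighted.mean(MD_EARN_WNE_P10, UGDS) in the code above. In the notation below, \(\bar{E}_s\) and the index \(i\) are introduced here only for illustration:

\[
\bar{E}_s = \frac{\sum_{i \in s} \text{UGDS}_i \times \text{MD\_EARN\_WNE\_P10}_i}{\sum_{i \in s} \text{UGDS}_i},
\]

where the sums run over all institutions \(i\) located in state \(s\). Since the code replaces missing earnings with 0 before averaging, an institution with non-zero enrollment but no reported earnings still contributes to the denominator, which pulls \(\bar{E}_s\) downward for that state.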

anova_data <- content %>%
  select(INSTNM, UGDS, STABBR, MD_EARN_WNE_P10) %>%
  mutate(
    MD_EARN_WNE_P10 = as.numeric(MD_EARN_WNE_P10),
    UGDS = as.numeric(UGDS),
    MD_EARN_WNE_P10 = ifelse(is.na(MD_EARN_WNE_P10), 0, MD_EARN_WNE_P10),
    UGDS = ifelse(is.na(UGDS), 0, UGDS)
  ) 
summary(aov(MD_EARN_WNE_P10 ~ STABBR, data = anova_data))

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## STABBR        58 1.248e+11 2.152e+09   4.548 <2e-16 ***
## Residuals   6484 3.068e+12 4.732e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After the test, we can see that the p-value is smaller than 2e-16, which is less than 0.05. This leads us to reject the null hypothesis that all states have the same mean earnings and conclude that at least one state’s average earnings differ from the others.

Therefore, we can conclude that the states of the colleges have a significant impact on the graduates’ earnings 10 years after their enrollment.
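The ANOVA code above sets missing earnings to 0 before fitting, so all 6,543 rows enter the test. An alternative treatment is to drop institutions without reported earnings. The following is a minimal sketch on simulated data that compares the two; the state codes "AA", "BB", "CC" and all values are made up for illustration, and the sketch says nothing about how the reported result would change.

```r
# Simulated stand-in data (NOT the real data): three hypothetical states.
set.seed(7)
sim <- data.frame(
  STABBR = rep(c("AA", "BB", "CC"), each = 30),
  MD_EARN_WNE_P10 = c(rnorm(30, 40000, 8000),
                      rnorm(30, 45000, 8000),
                      rnorm(30, 38000, 8000))
)
# Knock out 15 earnings values to mimic missing or suppressed data.
sim$MD_EARN_WNE_P10[sample(nrow(sim), 15)] <- NA

# Option 1: drop institutions with missing earnings before the test.
fit_drop <- aov(MD_EARN_WNE_P10 ~ STABBR,
                data = sim[!is.na(sim$MD_EARN_WNE_P10), ])

# Option 2: code missing earnings as 0 (the approach used in the code above).
sim_zero <- transform(sim,
                      MD_EARN_WNE_P10 = ifelse(is.na(MD_EARN_WNE_P10), 0, MD_EARN_WNE_P10))
fit_zero <- aov(MD_EARN_WNE_P10 ~ STABBR, data = sim_zero)

# Residual degrees of freedom show how many observations each fit uses.
c(drop = df.residual(fit_drop), zero = df.residual(fit_zero))
summary(fit_drop)
summary(fit_zero)
```

Coding missing values as 0 inflates the residual variance and mixes "earnings are low" with "earnings are not reported", so comparing the two fits is a useful sensitivity check.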

6. Conclusion

Financially, the regression model suggests that all finance-related predictors are significant when predicting earnings. In particular, the fraction of undergraduates who received a federal Pell grant is negatively related to median earnings. The fraction of undergraduates receiving a federal student loan, net tuition revenue, and median debt are positively related to median earnings. Lastly, type of school is related to median earnings: private schools have lower predicted median earnings than public schools, the baseline.

The states and territories where the colleges are located are also a significant predictor of graduates’ earnings 10 years after enrollment, as verified by the ANOVA test. There is a large difference in earnings between some states, such as between Washington, D.C. and Puerto Rico.

7. Future Work

One area for potential further research is how these factors influence graduates’ earnings in historical data. We do not have time series data in the dataset we used, so this question cannot be investigated in our report. Another possible extension is a deeper investigation of the variable CONTROL (type of school). We notice that CONTROL is significant in the linear regression model when predicting median earnings with finance-related variables. However, it is not significant when predicting with selectivity-related variables. To investigate the relationship between CONTROL and earnings further, we might fit a separate linear regression that predicts median earnings with only the CONTROL variable. This may give a clearer estimate of the relationship between type of school and median earnings while avoiding errors like overfitting. The statebins map also suffers from insufficient data. For each college, only the median earnings of graduates, rather than the average earnings, are available. Therefore, the map shows the enrollment-weighted average of the median earnings in each state, which may not be completely accurate. We would have produced a more accurate map had the average earnings of every school been available.

A Statistical Investigation on College Characteristics Impacting Undergraduate Students’ Post-Graduation Income

Donna Huang, Daiyan Chen, Tom He, Peter Gui

2023-12-11