2.1 Exploratory Data Analysis (EDA)
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readr)
library(ggthemes)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(dendextend)
school = read.csv("/Users/haoyuangui/Downloads/Most-Recent-Cohorts-Institution 2.csv")
school <- school[!is.na(school$MD_EARN_WNE_P10), ]
school$MD_EARN_WNE_P10 = as.numeric(school$MD_EARN_WNE_P10)
school$PCTPELL = as.numeric(school$PCTPELL)
school$PCTFLOAN = as.numeric(school$PCTFLOAN)
school$TUITFTE = as.numeric(school$TUITFTE)
school$DEBT_MDN = as.numeric(school$DEBT_MDN)
school$CONTROL <- factor(school$CONTROL, levels = c(1, 2, 3), labels = c("Public", "Private Nonprofit", "Private For-Profit"))
college <- read_csv("/Users/haoyuangui/Downloads/Most-Recent-Cohorts-Institution 2.csv")
college$CONTROL <- factor(college$CONTROL,
levels = c(1, 2, 3),
labels = c("Public", "Private Nonprofit", "Private For-Profit"))
college <- college %>%
mutate(ADM_RATE = as.numeric(ADM_RATE),
SAT_AVG = as.numeric(SAT_AVG)) %>%
filter(is.finite(ADM_RATE), is.finite(SAT_AVG))
college <- college[!is.na(college$MD_EARN_WNE_P10), ]
college$MD_EARN_WNE_P10 <- as.numeric(college$MD_EARN_WNE_P10)
ggplot(school, aes(x = MD_EARN_WNE_P10)) +
geom_histogram(binwidth = 5000, fill = "skyblue", color = "white", aes(y = ..density..)) +
labs(title = "Histogram of Median Earning",
x = "Median Earning",
y = "Density") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 5))
We can see that the variable median earning(MD_EARN_WNE_P10)
approximately follows a normal distribution that is slightly right
skewed with mean around 38000. There are no sign of a need for
transformation. Since we intend to predict relationship between Median
earnings and several variables, we can plot scatter plots to see the
overall trend.
We can also look at the data by differentiating the school types.
ggplot(college, aes(x = MD_EARN_WNE_P10, fill = CONTROL)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Median Earnings by School Type",
x = "Median Earnings (10 years after enrollment)",
y = "Density",
fill = "School Type") +
scale_fill_manual(values = c("Public" = "maroon4",
"Private Nonprofit" = "palegreen4",
"Private For-Profit" = "blue3")) +
theme_minimal()
We can take a closer look at the relationship of the variables
through a density plot. From the plot, it is evident that there are both
differences and similarities across the distributions for each school
type. The distributions overlap significantly, indicating that there’s a
common earnings range where graduates from all types of institutions are
likely to fall. However, there’s a noticeable difference in the spread
and skewness of these distributions, suggesting variability in the
economic outcomes for graduates from these institutions. The tails of
the distributions extend towards higher earnings, especially for Private
For-Profit institutions, indicating that there are graduates who earn
significantly more than the median, which is particularly notable for
stakeholders interested in the potential for high post-graduation
earnings.
The plot also suggests that the distribution for Private For-Profit
schools may have a slightly higher mode than Public and Private
Nonprofit schools, which could be interpreted as a higher concentration
of graduates from Private For-Profit schools in the higher earnings
bracket. The wider bases for each distribution indicate a broader range
of earnings among graduates, with the implication that there’s less
consistency in earnings within each school type. This variability and
the presence of higher earners among Private For-Profit graduates might
be factors to explore in further detail, especially when considering the
impact of educational institution types on long-term earnings. Such
insights can be valuable for students making educational choices, for
policymakers focusing on education and employment, and for institutions
working towards improving graduate outcomes.
df <- read.csv("/Users/haoyuangui/Downloads/Most-Recent-Cohorts-Institution 2.csv")
df$ADM_RATE <- as.numeric(df$ADM_RATE)
df$SAT_AVG <- as.numeric(df$SAT_AVG)
df$TUITFTE <- as.numeric(df$TUITFTE)
df$PCTFLOAN <- as.numeric(df$PCTFLOAN)
df$PCTPELL <- as.numeric(df$PCTPELL)
df$DEBT_MDN <- as.numeric(df$DEBT_MDN)
df$MD_EARN_WNE_P10 <- as.numeric(df$MD_EARN_WNE_P10)
2.2 Principal Component Analysis (PCA)
To gain an initial insight into how various quantitative factors -
such as admission rates, SAT scores, tuition costs, and financial aid -
might correlate and collectively influence future median earnings, we
will apply principal component analysis (PCA). As seen
above, we have six key quantitative factors of consider, excluding the
future earning variable, and it’s difficult to visualize all these
variables simultaneously. As a result, we wanted to apply PCA, an
effective tool for dimensionality reduction. By transforming our large
set of variables into a smaller, more manageable one, while still
retaining most of the information, it makes the data easier to
understand without significant loss of details. Then, to determine how
many principal components to use, we will create an elbow
plot, or a scree plot, as below.
library(dplyr)
library(tidyr)
library(factoextra)
df_quant <- df %>%
select(c(ADM_RATE, SAT_AVG, TUITFTE, PCTFLOAN, PCTPELL, DEBT_MDN)) %>%
drop_na(ADM_RATE, SAT_AVG, TUITFTE, PCTFLOAN, PCTPELL, DEBT_MDN)
pca <- prcomp(df_quant,
center = TRUE, scale. = TRUE)
fviz_eig(pca, addlabels = TRUE) +
geom_hline(yintercept = 100 * (1 / ncol(df_quant)),
linetype = "dashed", color = "red")
There are 6 quantitative variables in the dataset, and thus there are
6 principal components. The elbow plot suggests that
the first principal component accounts for almost half of the variation
in the dataset (47.3%), while the second accounts for about 24.4% of the
variation. After the first two principal components, the proportion of
explained variation drops substantially and starts to become flat.
Additionally, only the first two components are above the horizontal
line (at 1 divided by the number of variables). Therefore, it’s
reasonable for us to choose \(k = 2\)
and plot the first two principal components only.
However, the principal components by themselves aren’t directly
interpretable. To explore in what ways they are related to the original
variables in the data, we will create a biplot of the
first two principal components. We will also color the data points by
their earning category based on the median earnings ten years after
enrollment. This allows us to examine how future earnings are associated
with the original quantitative variables in the data. The biplot is
shown below:
df_quant <- df %>%
select(c(ADM_RATE, SAT_AVG, TUITFTE, PCTFLOAN, PCTPELL, DEBT_MDN, MD_EARN_WNE_P10)) %>%
drop_na(ADM_RATE, SAT_AVG, TUITFTE, PCTFLOAN, PCTPELL, DEBT_MDN)
percentiles <- quantile(df_quant$MD_EARN_WNE_P10, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
df_quant <- df_quant %>%
mutate(earning_category = case_when(
is.na(MD_EARN_WNE_P10) ~ NA_character_,
MD_EARN_WNE_P10 <= percentiles[1] ~ '1',
MD_EARN_WNE_P10 <= percentiles[2] ~ '2',
MD_EARN_WNE_P10 <= percentiles[3] ~ '3',
TRUE ~ '4')) %>%
mutate(earning_category = factor(earning_category, levels = c('NA', '1', '2', '3', '4'),
labels = c('NA', '0-25%', '25-50%', '50-75%', '75-100%')))
pc_matrix <- pca$x
df_quant <- df_quant %>%
mutate(pc1 = pc_matrix[,1],
pc2 = pc_matrix[,2])
fviz_pca_biplot(pca,
label = "var",
alpha.ind = .5,
repel = TRUE,
habillage = df_quant$earning_category,
pointshape = 19)
Based on the biplot above, we can make the following conclusions:
Since two vectors whose angle is less than 90 degrees are
positively correlated, it appears that SAT averages and tuition costs
are positively correlated. Admission rates, percents of undergraduates
receiving Pell Grants, and percents receiving federal student loans are
positively correlated. The median debt is positively correlated with
both tuition costs and percents of students receiving federal student
loans.
Higher SAT scores and higher tuition costs are associated with
higher future earnings.
Higher admission rates and higher percents of undergraduates
receiving Pell Grants as well as federal student loans are associated
with lower future earnings.
We will now take a deeper look of our research questions.