Introduction

Dataset Description

This dataset covers 139 universities in the United States and was collected by Opportunity Insights. It contains 81 variables detailing application rates, admissions, test score brackets, student income, and university prestige, which makes it well suited to examining how admissions, applications, scores, and income vary across tiers of universities and between public and private institutions. Each row represents the student matriculation and admission outcomes for one of the 14 parental income brackets at a specific university, giving 1,946 rows in total (139 universities × 14 brackets). The 14 income brackets are based on the percentile of the student's parents' income in the US national income distribution.

Of the many variables in this dataset, this report focuses on income level, relative application and attendance rates, university tier, whether the school is public, and whether it is a flagship university. Relative attendance starts from the proportion of all test-takers within a specific parent income bin who attend a given college; this rate is then expressed relative to the mean attendance rate across all income bins for that university. Relative application is the corresponding relative proportion of all standardized test takers who sent their scores to that college.

University tiers in this dataset run from 1 to 6, representing Ivy-Plus, Other elite college, Highly selective public college, Highly selective private college, Selective public college, and Selective private college, respectively. These tiers are based on Barron's selectivity groupings. For some of the exploratory data analysis, tiers were combined so that private and public schools in the same Barron's grouping are represented together. Additionally, for some plots and models, a dummy variable was created that takes the value 1 if a school is in the top tier grouping and 0 otherwise.
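As a rough illustration of the derived variables described above, the following Python sketch computes a relative rate and the tier-based variables on made-up data. The rows, the merged tier labels, and the choice of tier 1 as the "top" grouping are assumptions made for illustration; the report itself was produced in R, and the published relative rates may involve additional weighting that is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (university, parent income bin, raw attendance rate).
rows = [
    ("Example U", 30, 0.002),
    ("Example U", 75, 0.004),
    ("Example U", 100, 0.012),
]

def relative_rates(rows):
    """Divide each bin's rate by the university's mean rate across its bins."""
    by_school = defaultdict(list)
    for school, _income_bin, rate in rows:
        by_school[school].append(rate)
    school_mean = {school: mean(rates) for school, rates in by_school.items()}
    return {(school, b): rate / school_mean[school] for school, b, rate in rows}

# Tier codes 1-6 as listed in the text; the merged labels are illustrative.
TIER_GROUP = {
    1: "Ivy-Plus",
    2: "Other elite",
    3: "Highly selective",  # public
    4: "Highly selective",  # private
    5: "Selective",         # public
    6: "Selective",         # private
}

def prestigious(tier, top_tiers=(1,)):
    """Dummy variable: 1 if the tier is in the assumed top grouping, else 0."""
    return 1 if tier in top_tiers else 0

rel = relative_rates(rows)
```

With this definition, a relative rate above 1 means the income bin attends at a higher rate than the university's average bin, and the relative rates for one university average to 1.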

Research Questions

This report focuses on the following three research questions:

1.) Does income level affect the application or attendance rate to a given college, or to more/less prestigious colleges as a group? Does this change relative to SAT scores?

2.) How does university tier (variable rearranged to be based on prestige) vary based on whether the school is public, private, or a flagship school?

3.) What does the relationship between relative attendance and relative application look like? How does income affect this relationship?

To answer these questions, we first create several plots for exploratory data analysis, then move on to graphical analysis, and finally test some of our hypotheses using a few regressions.

Exploratory Data Analysis

Figure 1: MDS Plot, Colored by Tier

The Multidimensional Scaling (MDS) plot above represents the relationships between universities, colored by prestige tier. MDS is a dimensionality reduction technique that maps high-dimensional data to a lower-dimensional space while preserving the pairwise distances between observations as closely as possible. Each university is a point whose position is determined by how similar its attributes are to those of the other universities, and the color of each point indicates the prestige tier to which the university belongs.
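To make "preserving pairwise distances" concrete, the sketch below measures how far a low-dimensional embedding departs from the original distances, using a normalized stress value. It does not reproduce the MDS fit used for Figure 1 (the variant used there is not stated); the function name and the toy points are assumptions for illustration.

```python
from itertools import combinations
from math import dist, sqrt

def stress(high, low):
    """Normalized mismatch between original and embedded pairwise distances."""
    num = den = 0.0
    for i, j in combinations(range(len(high)), 2):
        d_high = dist(high[i], high[j])  # distance in the original space
        d_low = dist(low[i], low[j])     # distance in the embedding
        num += (d_high - d_low) ** 2
        den += d_high ** 2
    return sqrt(num / den)

# Three toy points that already lie in a plane of 3-D space.
high = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
exact = [(0, 0), (3, 0), (0, 4)]      # keeps every distance
collapsed = [(0, 0), (3, 0), (0, 0)]  # puts the third point on top of the first

print(stress(high, exact), stress(high, collapsed))
```

A stress of 0 means every pairwise distance is kept exactly; larger values mean the two-dimensional picture distorts the original distances more.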

Points that lie close together correspond to universities with similar attributes, while points that are far apart indicate universities that differ more, whether in prestige level or in other characteristics. In the plot we observe two primary clusters with similar color distributions: in both, the majority of points are other elite schools, with the remaining tiers mixed in. There is also a smaller group of other elite schools between the two clusters, which suggests somewhat more variation among other elite schools than among the other tiers. Overall, the more selective schools share many similarities, but some elite schools still differ from the rest of the schools in this dataset of higher-tier schools.

Figure 2: Bar Plot of University Prestige, Facetted by School Type and Colored by Flagship Status(Public)

This graph was created as part of the exploratory data analysis to show the distribution of public and private schools within each university tier, as well as the distribution of flagship schools, which relates to the second research question. It also bears on the research questions involving the income of students' parents, since public schools tend to have lower tuition. The graph shows that most of the schools in the dataset are private. The Ivy-Plus schools are all private, while schools in the highly selective tier are about 40% public and those in the selective tier are majority public. The other elite schools are mostly private. The highest count of flagship schools is in the selective tier, which makes sense given that this tier has the highest concentration of public schools.

Figure 3: Heatmap of Relative Application and Relative Attendance

The heatmap above shows the counts of observations across combinations of relative application and relative attendance. Most of the data lies where both relative application and relative attendance are low. There are few points where relative application is low but relative attendance is high, meaning there are not many cases where a college has a low application rate but a high attendance rate. From this map, it seems that the colleges in the dataset may be medium sized, on the reasoning that most students tend to apply to and attend larger colleges. With this information, we can further investigate the third research question regarding the relationship between relative application and relative attendance through the lens of parental income.

Figure 4: Histogram of Application Rates, Facetted by Income

Here, it appears that higher application rates occur more often among higher parental income groups, though this does not obviously change with prestige. These variables were included primarily to shed light on the first research question, though more analysis is needed.

Graphical Analysis

Figure 5: Relationship between Relative Attendance Rate over Relative Application

The plot above shows linear regression lines, fit separately for each school selectivity tier, relating relative application rate to relative attendance rate. It is informative because it reveals how selectivity tier affects the ratio of attendance to application rates. The plot shows a clear positive correlation between relative application rate and relative attendance rate across all schools, but the slopes of the regression lines differ by tier. The slopes show that private schools and more competitive tiers tend to have higher ratios of relative attendance to relative application. Although highly selective public schools are more competitive than selective private schools, their slope appears lower, which raises further questions about the dataset: what effect does public versus private status have on attendance rates, why do private schools show higher application and attendance rates (their regression lines extend further), and why do people tend to attend private schools even when they are less selective?
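The per-tier regression lines described above can be sketched as one simple least-squares fit per tier. This is a minimal Python illustration on invented numbers, not the code or data behind Figure 5; the helper names and the toy rows are assumptions.

```python
from collections import defaultdict

def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def slopes_by_tier(rows):
    """Fit one slope per tier from (tier, rel_apply, rel_attend) rows."""
    groups = defaultdict(lambda: ([], []))
    for tier, rel_apply, rel_attend in rows:
        groups[tier][0].append(rel_apply)
        groups[tier][1].append(rel_attend)
    return {tier: ols_slope(xs, ys) for tier, (xs, ys) in groups.items()}

# Invented rows, chosen only so that the two tiers have different slopes.
toy_rows = [
    ("Ivy-Plus", 0.5, 0.4), ("Ivy-Plus", 1.0, 1.4), ("Ivy-Plus", 1.5, 2.4),
    ("Selective public", 0.5, 0.7), ("Selective public", 1.0, 1.0),
    ("Selective public", 1.5, 1.3),
]
slopes = slopes_by_tier(toy_rows)
```

A steeper slope for a tier means relative attendance rises faster per unit of relative application in that tier, which is the comparison the figure makes visually.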

Figure 6: Correlation between Relative Application and Relative Attendance by Income

The scatterplot above was created to address the third research question regarding the relationship between relative attendance and relative application. It shows a positive correlation between the two: as one variable increases, so does the other. The points are colored by parental income, with bluer points indicating higher income and blacker points indicating lower income. The bluer points have a greater spread, meaning that higher-income groups vary more in both relative application and relative attendance. The darker points, which represent lower-income groups, cluster where relative application is about 1 and relative attendance is less than 2.5, a much smaller spread. With regard to the research question, there is at minimum a positive correlation between relative attendance and relative application, and possibly a stronger relationship. As for how income affects this, the relationship between relative attendance and relative application does not seem to be changed by income in this plot, so the plot alone does not show a relationship between relative attendance/application and parental income.

Figure 7: Dendrogram Displaying Relationship Between Relative Attendance and Application, Colored by Tier

The dendrogram above was created to observe the similarity between colleges in their relative attendance and application levels and to help answer research question 1. It was built from the averages for each college so that general trends could be analyzed and the labels could be read. The blue labels indicate the “most prestigious” colleges, which are tightly clustered in one area of the dendrogram. This indicates a high degree of similarity among prestigious colleges in relative attendance and relative application rates, while less prestigious colleges show more variation. In turn, this suggests a difference between more and less prestigious colleges in these rates.
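For readers unfamiliar with how a dendrogram is built, the sketch below runs a small agglomerative clustering on per-college averages and records the order and height of each merge. Complete linkage with Euclidean distance is an assumption (the linkage used for Figure 7 is not stated), and the college names and values are invented.

```python
from itertools import combinations
from math import dist

def complete_linkage(points):
    """Return the merge history (cluster_a, cluster_b, height), closest first."""
    clusters = {name: [name] for name in points}  # label -> member colleges
    merges = []
    while len(clusters) > 1:
        def gap(pair):
            # Complete linkage: distance between the two farthest members.
            a, b = pair
            return max(dist(points[p], points[q])
                       for p in clusters[a] for q in clusters[b])
        a, b = min(combinations(clusters, 2), key=gap)
        height = gap((a, b))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, height))
    return merges

# Invented per-college averages: (mean rel_apply, mean rel_attend).
colleges = {
    "A": (1.0, 1.0),
    "B": (1.1, 1.0),
    "C": (2.0, 2.5),
    "D": (2.0, 2.9),
}
merges = complete_linkage(colleges)
```

Colleges that merge at a low height, such as A and B here, sit next to each other in the dendrogram, which is how tight clustering of the prestigious colleges would appear.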

Hypothesis Testing and Analysis of Variance

Model Fitting and Summaries

Four models were fit to help answer our research questions, primarily the first. Two predict relative attendance rate, while two predict relative application rate. All four use only parental income and the dummy indicating college prestige as predictors. For each dependent variable, however, the two regressions differ: one includes only the main effects of both variables, while the other adds an interaction between the two. This was done to see whether the effect of one variable depends on the level of the other, for example whether the attendance rate for the 99.5th percentile income group is lower at prestigious colleges than at non-prestigious colleges.
## 
## Relative Application Rates, with an Interaction with College Prestige
## ========================================================================================================
##                                                                      Dependent variable:                
##                                                      ---------------------------------------------------
##                                                                           rel_apply                     
##                                                                 (1)                       (2)           
## --------------------------------------------------------------------------------------------------------
## factor(par_income_bin)30                                   0.035 (0.056)             0.035 (0.059)      
## factor(par_income_bin)50                                  -0.030 (0.056)            -0.027 (0.059)      
## factor(par_income_bin)65                                  -0.064 (0.056)            -0.056 (0.059)      
## factor(par_income_bin)75                                  -0.053 (0.056)            -0.043 (0.059)      
## factor(par_income_bin)85                                   0.036 (0.056)             0.052 (0.059)      
## factor(par_income_bin)92.5                               0.198*** (0.056)          0.223*** (0.059)     
## factor(par_income_bin)95.5                               0.326*** (0.056)          0.359*** (0.059)     
## factor(par_income_bin)96.5                               0.406*** (0.056)          0.442*** (0.059)     
## factor(par_income_bin)97.5                               0.483*** (0.056)          0.522*** (0.059)     
## factor(par_income_bin)98.5                               0.586*** (0.056)          0.633*** (0.059)     
## factor(par_income_bin)99.400002                          0.730*** (0.056)          0.784*** (0.059)     
## factor(par_income_bin)99.5                               0.732*** (0.056)          0.786*** (0.059)     
## factor(par_income_bin)100                                0.765*** (0.056)          0.822*** (0.059)     
## factor(prestigious)1                                     -0.138*** (0.038)           0.179 (0.141)      
## factor(par_income_bin)30:factor(prestigious)1                                        0.001 (0.199)      
## factor(par_income_bin)50:factor(prestigious)1                                       -0.039 (0.199)      
## factor(par_income_bin)65:factor(prestigious)1                                       -0.098 (0.199)      
## factor(par_income_bin)75:factor(prestigious)1                                       -0.123 (0.199)      
## factor(par_income_bin)85:factor(prestigious)1                                       -0.180 (0.199)      
## factor(par_income_bin)92.5:factor(prestigious)1                                     -0.292 (0.199)      
## factor(par_income_bin)95.5:factor(prestigious)1                                     -0.377* (0.199)     
## factor(par_income_bin)96.5:factor(prestigious)1                                    -0.416** (0.199)     
## factor(par_income_bin)97.5:factor(prestigious)1                                    -0.457** (0.199)     
## factor(par_income_bin)98.5:factor(prestigious)1                                    -0.543*** (0.199)    
## factor(par_income_bin)99.400002:factor(prestigious)1                               -0.623*** (0.199)    
## factor(par_income_bin)99.5:factor(prestigious)1                                    -0.626*** (0.199)    
## factor(par_income_bin)100:factor(prestigious)1                                     -0.658*** (0.199)    
## Constant                                                 0.888*** (0.040)          0.861*** (0.041)     
## --------------------------------------------------------------------------------------------------------
## Observations                                                   1,946                     1,946          
## R2                                                             0.305                     0.319          
## Adjusted R2                                                    0.300                     0.309          
## Residual Std. Error                                      0.470 (df = 1931)         0.466 (df = 1918)    
## F Statistic                                          60.506*** (df = 14; 1931) 33.239*** (df = 27; 1918)
## ========================================================================================================
## Note:                                                                        *p<0.1; **p<0.05; ***p<0.01
## 
## Relative Attendance Rates, with an Interaction with College Prestige
## ========================================================================================================
##                                                                      Dependent variable:                
##                                                      ---------------------------------------------------
##                                                                          rel_attend                     
##                                                                 (1)                       (2)           
## --------------------------------------------------------------------------------------------------------
## factor(par_income_bin)30                                   0.014 (0.094)             0.020 (0.098)      
## factor(par_income_bin)50                                   0.004 (0.094)             0.015 (0.098)      
## factor(par_income_bin)65                                  -0.004 (0.094)             0.015 (0.098)      
## factor(par_income_bin)75                                   0.013 (0.094)             0.034 (0.098)      
## factor(par_income_bin)85                                   0.075 (0.094)             0.097 (0.098)      
## factor(par_income_bin)92.5                                0.232** (0.094)          0.266*** (0.098)     
## factor(par_income_bin)95.5                               0.380*** (0.094)          0.419*** (0.098)     
## factor(par_income_bin)96.5                               0.482*** (0.094)          0.523*** (0.098)     
## factor(par_income_bin)97.5                               0.645*** (0.094)          0.688*** (0.098)     
## factor(par_income_bin)98.5                               0.859*** (0.094)          0.910*** (0.098)     
## factor(par_income_bin)99.400002                          1.163*** (0.094)          1.207*** (0.098)     
## factor(par_income_bin)99.5                               1.186*** (0.094)          1.221*** (0.098)     
## factor(par_income_bin)100                                1.448*** (0.094)          1.439*** (0.099)     
## factor(prestigious)1                                      -0.106* (0.063)            0.190 (0.237)      
## factor(par_income_bin)30:factor(prestigious)1                                       -0.072 (0.335)      
## factor(par_income_bin)50:factor(prestigious)1                                       -0.120 (0.335)      
## factor(par_income_bin)65:factor(prestigious)1                                       -0.215 (0.335)      
## factor(par_income_bin)75:factor(prestigious)1                                       -0.241 (0.335)      
## factor(par_income_bin)85:factor(prestigious)1                                       -0.263 (0.335)      
## factor(par_income_bin)92.5:factor(prestigious)1                                     -0.400 (0.335)      
## factor(par_income_bin)95.5:factor(prestigious)1                                     -0.453 (0.335)      
## factor(par_income_bin)96.5:factor(prestigious)1                                     -0.475 (0.335)      
## factor(par_income_bin)97.5:factor(prestigious)1                                     -0.496 (0.335)      
## factor(par_income_bin)98.5:factor(prestigious)1                                     -0.596* (0.335)     
## factor(par_income_bin)99.400002:factor(prestigious)1                                -0.503 (0.335)      
## factor(par_income_bin)99.5:factor(prestigious)1                                     -0.410 (0.335)      
## factor(par_income_bin)100:factor(prestigious)1                                       0.096 (0.335)      
## Constant                                                 0.819*** (0.067)          0.794*** (0.070)     
## --------------------------------------------------------------------------------------------------------
## Observations                                                   1,944                     1,944          
## R2                                                             0.287                     0.291          
## Adjusted R2                                                    0.281                     0.281          
## Residual Std. Error                                      0.784 (df = 1929)         0.785 (df = 1916)    
## F Statistic                                          55.366*** (df = 14; 1929) 29.058*** (df = 27; 1916)
## ========================================================================================================
## Note:                                                                        *p<0.1; **p<0.05; ***p<0.01
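To read the interaction model in column (2) of the relative application table, it can help to add up the relevant coefficients for a given income bin and prestige status. The Python sketch below does this for three bins using coefficients copied from that table; the function names are illustrative, and the fitted values are point estimates that ignore the standard errors.

```python
# Coefficients copied from column (2) of the relative application table.
constant = 0.861
prestige_main = 0.179
income_main = {85: 0.052, 92.5: 0.223, 100: 0.822}
income_x_prestige = {85: -0.180, 92.5: -0.292, 100: -0.658}

def fitted(income_bin, prestigious):
    """Fitted rel_apply for an income bin at a (non-)prestigious school."""
    value = constant + income_main[income_bin]
    if prestigious:
        value += prestige_main + income_x_prestige[income_bin]
    return value

def income_effect(income_bin, prestigious):
    """Change in fitted rel_apply relative to the omitted (reference) bin."""
    effect = income_main[income_bin]
    if prestigious:
        effect += income_x_prestige[income_bin]
    return effect

for b in income_main:
    print(b, round(fitted(b, False), 3), round(fitted(b, True), 3))
```

For the top income bin, the estimated income effect is 0.822 at non-prestigious schools but 0.822 - 0.658 = 0.164 at prestigious schools, so most of the income gradient in relative application is offset at prestigious schools.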

Anova Results

Table 1: Anova Results

           Regressions for Relative Application   Regressions for Relative Attendance
F-stat     2.9974                                 0.8043
P-value    0.0002244                              0.6562

The model with an interaction between parental income and college prestige appears to explain the data and predict relative application rates better than the model with only the main effects of both variables. This can be seen in the F-test conducted by the ANOVA function: the p-value for the comparison between the two models was 0.0002244, much less than the typical .05 significance level. In contrast, adding the same interaction to the model for relative attendance rates did not significantly improve the fit: the F-statistic comparing the nested models was 0.8043, giving a p-value of 0.6562, much greater than .05. Thus, for the first comparison we reject the null hypothesis that the coefficients added by the interaction of prestige and income are all 0, while for the second we cannot reject it. This indicates that a school being prestigious changes the effect of parental income on relative application rates, while we find no evidence that it changes the effect of parental income on relative attendance rates. Moreover, nearly all of the interaction coefficients in both regressions are negative while most of the parental income coefficients are positive, particularly for the higher income bins, which suggests that the positive effect of parental income on relative application rates is partly offset when the school is prestigious. That is to say, application rates to prestigious colleges do not increase as much with higher parental income.
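The nested-model F-statistics can be approximately recovered from the R2 values and degrees of freedom printed in the regression tables. The Python sketch below is a sanity check under that approach, not the ANOVA call used in the report; because the printed R2 values are rounded to three decimals, the results only roughly match the reported statistics.

```python
def nested_f(r2_reduced, r2_full, n_obs, k_reduced, k_full):
    """F-statistic for comparing nested linear models via their R2 values."""
    q = k_full - k_reduced    # number of added (interaction) terms
    df_resid = n_obs - k_full  # residual df of the full model
    f_stat = ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)
    return f_stat, q, df_resid

# Parameter counts: intercept + 13 income dummies + 1 prestige dummy = 15;
# the interaction model adds 13 income-by-prestige terms, giving 28.
f_apply, q_apply, df_apply = nested_f(0.305, 0.319, 1946, 15, 28)
f_attend, q_attend, df_attend = nested_f(0.287, 0.291, 1944, 15, 28)

print(round(f_apply, 2), q_apply, df_apply)
print(round(f_attend, 2), q_attend, df_attend)
```

This gives roughly 3.03 for relative application and 0.83 for relative attendance, close to the reported 2.9974 and 0.8043, and the residual degrees of freedom (1918 and 1916) match those printed under the interaction models.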

Unanswered Questions and Future Work

Overall, with regard to the first research question, we observed a relationship between parental income, college prestige, and relative application and attendance rates. Higher parental income was associated with higher relative application and attendance rates, while college prestige was associated with lower values of both. However, we did not include SAT scores in our analysis, which could have led to different results. For the second question, we observed that the vast majority of the more prestigious schools were private, while many of the flagship schools (which are all public) were not among the most selective schools. For the third question, we observed that the relationship between relative application rates and relative attendance rates does appear to vary based on parental income and college prestige: highly prestigious schools such as Ivy-Plus schools and highly selective private schools had attendance rates that increased faster relative to application rates than other tiers of schools.