Final Project: Statistical Visualization and Analysis of QS College World Rankings Dataset

Introduction

By 2040, worldwide college attendance is expected to reach 600 million students at current growth rates, making the analysis and understanding of college-related data ever more important. One of the most significant factors that students consider when applying to a college is its ranking relative to other programs, a metric commonly associated with prestige and brand image. Many websites and publications release their own annual rankings of colleges in the United States and around the world, and each considers a different set of factors when creating its list. In this project, we aim to analyze the factors that Quacquarelli Symonds (QS), an organization approved by the International Ranking Expert Group (IREG), considers when making its annual rankings.

Description of the Dataset

The QS rankings are among the three most widely read university rankings in the world and are published in partnership with Elsevier. This particular dataset covers the years 2017 to 2022 and has 15 features. Three university-specific features (the name of the university, the link to its website, and the link to its logo) are not used in this project. There are four categorical features: year (2017 through 2022), type (Public vs. Private), size (S, M, L, XL), and research output (Low, Medium, High, Very High). There are also three geographic categorical features: country, city, and region. Finally, there are five quantitative features: rank, score, student-faculty ratio, number of international students, and number of faculty. Each row of the dataset represents one university in one year's rankings, and the rows are ordered by year: all of the 2017 rankings, then 2018, and so on. Each column represents one of the aforementioned features.

Research Questions

Based on these rankings and this dataset, we wanted to answer four specific questions to better understand the methodology behind the rankings and their overall implications. First, we wanted to know which features highly ranked universities share and how heavily each of them impacts the ranking. Second, we wanted to explore what differentiates private and public universities and whether those differences are reflected in the rankings. Third, we were curious whether the geographic location of a university, a common decision-making factor for most college students, has an impact on the rankings themselves. Finally, we were curious about how population metrics at a university, such as its size and student-faculty ratio, impact the rankings.

Question 1: What attributes do highly ranked universities have?

When thinking about what attributes highly ranked universities have, a few variables came to mind: the number of students and faculty at a university, and whether the university is public or private. The following graphs examine how these variables relate to the overall score of the university.

This graph depicts the log of each school's student count versus the score that the school received. One main takeaway is that schools with a higher student count tend to have a higher total score. Furthermore, private schools are skewed towards having a higher student count, although there are far more public schools in the dataset. The linear regression lines on the plot also tell an interesting story. Both lines show a positive linear relationship, but private schools with a high student count generally score higher than public schools with a high student count. One plausible explanation is that tuition at private schools is generally much higher than tuition at public schools. If a private school with high tuition still attracts a large number of students, those students presumably consider the education worth the cost.

This graph depicts the relationship between the faculty count of a school and the score it received. One main takeaway is that the two variables have a positive linear relationship: schools with a higher total faculty count tend to have higher scores. This trend follows the same general path across all six years in the dataset. The takeaway is also consistent with the student-faculty ratio, which we examine later: for a given number of students, a lower student-faculty ratio requires more faculty.

Overall, both graphs show a positive linear relationship. This leads us to the conclusion that high-scoring schools generally have high student and faculty counts. Furthermore, private schools with high counts tend to score higher than public schools with high counts.

Question 2: What differentiates private universities from public universities?

To approach this question, we first wanted to gain insight into which feature of our dataset differed most between private and public universities. Our initial guess was that the size feature would differ the most (size refers to geographical area). We reasoned that since private universities usually have fewer students, their physical size would be smaller as well. To examine whether this relationship holds, we found the conditional distribution of university type given size. For simplicity, this graphic only includes data from 2022, since the size of a university is not likely to change from year to year.
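A conditional distribution of type given size can be computed by counting. The sketch below is a minimal illustration on made-up records, not the original analysis code; `records` and `type_share_by_size` are hypothetical names, and the values are not taken from the QS data.

```python
from collections import Counter

# Hypothetical (size, type) pairs for a single ranking year;
# these are NOT rows from the QS dataset.
records = [
    ("S", "Private"), ("S", "Private"), ("S", "Private"), ("S", "Public"),
    ("M", "Private"), ("M", "Public"),
    ("L", "Private"), ("L", "Public"), ("L", "Public"), ("L", "Public"),
    ("XL", "Public"), ("XL", "Public"),
]


def type_share_by_size(records):
    """Return P(type | size) as a dict keyed by (size, type)."""
    pair_counts = Counter(records)
    size_totals = Counter(size for size, _ in records)
    return {
        (size, uni_type): count / size_totals[size]
        for (size, uni_type), count in pair_counts.items()
    }


shares = type_share_by_size(records)
for size in ("S", "M", "L", "XL"):
    # A missing key means no private universities of that size were observed.
    private_share = shares.get((size, "Private"), 0.0)
    print(f"{size}: {private_share:.0%} private")
```

Within each size level the shares sum to one, which is what distinguishes this conditional view from a table of raw counts.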

As we can see, the majority of small universities are private, but as we increase size from small to extra large, the proportion of private universities decreases at every subsequent level. This supports our initial hypothesis that private universities are physically smaller than public universities. Beyond the features that differentiate private and public universities, we next wanted to investigate whether there was a substantial difference in the scores they received. Below is a violin plot comparing the distribution of scores by university type.

It appears that the distribution of scores differs most between university types at the extremes. There appear to be many more private universities with scores between 80 and 100 compared to public universities, and many more public universities with scores between 20 and 40 compared to private universities. However, at the middle scores (between 40 and 80) there does not seem to be a substantial difference. In real-world terms, a higher proportion of private universities are ranked very high compared to public universities. Apart from the top universities in the world, however, the rankings of private and public universities do not differ much.
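A violin plot is read by eye, so a numeric summary can serve as a cross-check. The following is a minimal sketch that computes score quartiles per university type with the standard `statistics.quantiles` function. The scores are made up for illustration, the names `scores_by_type` and `quartiles_by_type` are hypothetical, and the choice of the "inclusive" method is an assumption.

```python
from statistics import quantiles

# Hypothetical scores per university type; NOT values from the QS dataset.
scores_by_type = {
    "Public": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Private": [20.0, 40.0, 60.0, 80.0, 100.0],
}


def quartiles_by_type(scores_by_type):
    """Return (Q1, median, Q3) of the scores for each university type."""
    return {
        uni_type: tuple(quantiles(scores, n=4, method="inclusive"))
        for uni_type, scores in scores_by_type.items()
    }


summary = quartiles_by_type(scores_by_type)
for uni_type, (q1, median, q3) in summary.items():
    print(f"{uni_type}: Q1={q1}, median={median}, Q3={q3}")
```

If the two types differ mainly at the extremes, the medians would be close while the upper or lower tails (for example, the maximum or a high percentile) would separate.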

Overall, it appears that there may be surface-level differences between private and public universities. However, from a ranking and score perspective, the difference is not especially noteworthy.

Question 3: How much does the geographic location of a university matter?

To examine this question, we looked at whether the geographical location of a university matters, using the university's score as the indicator of its success. Using visualizations, we can see whether location has an impact on the distribution of universities and their scores. To understand location, we first look at the number of universities on each continent and their respective scores. To do this, we use a stacked bar plot, primarily to compare the number of universities on each continent, but also to compare the quality of universities in terms of score. We look at the 2022 rankings since they are the most up-to-date and thus the most relevant.
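The counts behind a stacked bar plot of this kind come from binning scores into bands and tallying per region. The sketch below is a minimal illustration; the band edges (50, 70, 90), the names `BAND_EDGES`, `score_band`, and `band_counts`, and the sample rows are all assumptions rather than details from the original analysis.

```python
from bisect import bisect_right
from collections import Counter

# Assumed score bands; the lower edge of each band is inclusive,
# and a score of exactly 100 falls in the top band.
BAND_EDGES = [50, 70, 90]
BAND_LABELS = ["0-50", "50-70", "70-90", "90-100"]


def score_band(score):
    """Map a score in [0, 100] to its band label."""
    return BAND_LABELS[bisect_right(BAND_EDGES, score)]


def band_counts(rows):
    """Count universities per (region, score band); one bar segment each."""
    return Counter((region, score_band(score)) for region, score in rows)


# Hypothetical (region, score) rows; NOT values from the QS dataset.
rows = [
    ("Europe", 35.0), ("Europe", 42.0), ("Europe", 75.0),
    ("Asia", 48.0), ("Asia", 81.0),
    ("North America", 72.0), ("North America", 93.0),
]

counts = band_counts(rows)
for (region, band), n in sorted(counts.items()):
    print(f"{region:14s} {band:7s} {n}")
```

Each region's bar height is the sum of its band counts, and each colored segment corresponds to one (region, band) entry.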

When we look at our data for 2022, we see that Europe has the most ranked universities, followed by Asia and then North America. Oceania, Latin America, and Africa also have a fair number of ranked universities, but not nearly as many as the other three continents. Europe has close to double the number of ranked universities of Asia or North America, with most of the difference coming from universities scoring between 0 and 50. Asia and North America each have a similar number of universities scoring between 70 and 90 as Europe does. Among universities scoring between 90 and 100, North America has the most, despite having about half as many ranked universities as Europe. This tells us that even though North America has far fewer ranked universities, the quality of its universities, as measured by score, is higher than in Asia and Europe.

To understand the distribution of universities in more detail, we look at the United States specifically, since it is the most relevant to us and has the most highly ranked universities. We also look at the scores of these universities to see whether highly ranked universities are concentrated in particular areas. Through data wrangling, we converted each university's city into a longitude and latitude so that we could plot it on a map.

When we look at our map, we see that the majority of universities are located on the east and west coasts, in the South, and in the part of the Midwest closer to the east coast. We also see that much of the coasts and the major cities have a darker color, indicating that there are several universities in these cities. This suggests that population and cities play a large role in the number of universities.

When we look at rankings, we see that almost all of the highly ranked universities (score > 70) are located on the east coast, slightly inland toward the central United States, or on the west coast. Thus, being located on the coasts or in the Midwest is correlated with a high ranking: the majority of highly ranked universities lie in these areas, while universities in other parts of the United States have lower rankings on average.

Overall, we see that the highest ranked universities are in the United States, while Europe has the most ranked universities overall in our dataset. This might be unreliable, however, since China is notorious for not releasing data. When we looked at the distribution of universities in the United States, we saw that the majority of universities lie on the coasts, which are typically densely populated. Many of these universities are located in the same cities, which is much less common toward the Midwest.

Question 4: How do population metrics of a university impact its rankings?

To approach this question, we decided to look at the features most directly related to population: student-faculty ratio, size, faculty count, and international students. Our initial intuition was that universities with lower student-faculty ratios, more faculty, and fewer students overall would be ranked and scored higher, because they can offer more individual attention to each student. To find evidence for this theory, we began by making a cluster map of student-faculty ratio against score, faceted by size. We chose the cluster map because it would either support or contradict the theory by showing clusters around the most common combinations of student-faculty ratio and score across the various sizes.

Based on the cluster map for small schools, the data clusters around low student-faculty ratios, which was expected, but also around low scores, which goes against the proposed reasoning. For medium schools, the data clusters around the same area as for small schools, but forms a few clusters in that region rather than just one. For both large and extra-large schools, the data clusters around slightly higher student-faculty ratios, which was expected, but also around fairly low scores. Taken together, these cluster maps provide evidence against the original hypothesis and seem to indicate that student-faculty ratio and size are not very indicative of a school's ranking.

To continue examining population metrics, we turned our attention to international students, with the intuition that schools with more international students would likely be ranked higher because they are able to attract talent that isn’t home-grown. In order to find evidence for this theory, we decided to make a linear regression graph by school size once again. This type of graph would allow us to see all of the points between international students and score, but also compare by size once again.

Based on the linear regression graph, the first thing to notice is that the scatterplot itself shows a positive association between the number of international students and the score of the university. Breaking it down by size, the small schools are most heavily influenced by this positive association, whereas the extra-large ones are not. Thus, it does seem that the population of international students impacts the score of a university, but it would also be helpful to have data on the percentage of the student body that international students make up rather than only the raw number.

Overall, the answer to our fourth research question seems to be that the number of international students has an impact on the rankings, but the student-faculty ratio does not. Additionally, the larger the school, the less impact the student data seems to have, likely because at that scale population becomes more of an afterthought and more qualitative factors like research output become more important.

Conclusions

By using this dataset, we were able to answer the four research questions we originally set out to explore, and our findings led us to some interesting conclusions. With regard to our first research question, we concluded that high-scoring schools generally have high student and faculty counts. Furthermore, private schools with high counts tend to score higher than public schools with high counts. With regard to our second research question, it appears that there may be surface-level differences between private and public universities. However, from a ranking and score perspective, the difference is not especially noteworthy. With regard to our third research question, we see that the highest ranked universities are in the United States, while Europe has the most ranked universities overall in our dataset. This might be unreliable, however, since China is notorious for not releasing data. When we looked at the distribution of universities in the United States, we saw that the majority of universities lie on the coasts, which are typically densely populated. Many of these universities are located in the same cities, which is much less common toward the Midwest. Finally, with regard to our fourth research question, the number of international students seems to have an impact on the rankings, but the student-faculty ratio does not. Additionally, the larger the school, the less impact the student data seems to have, likely because at that scale population becomes more of an afterthought and more qualitative factors like research output become more important. All of these findings go a long way toward understanding the QS ranking system better and help in using it to make other decisions as well.

Further Research and Potential Implications

Although this analysis of the QS dataset is a good start, there are multiple ways in which it can be extended and strengthened. First, similar analyses could be run on ranking datasets from sources like US News, Forbes, and Poets and Quants, three other ranking sources that people tend to trust when choosing a college. Additionally, some variables in this dataset would have been more useful in altered forms, such as the percentage of international students rather than the raw number, and additional data, such as the actual geographical size of the campus in square area, would also have helped. Regardless, this analysis has implications for both policy and personnel decisions. An incoming student could use this data to better understand what makes up their own school's score and to create their own ranking. Additionally, a school could use this data to understand how to improve its own ranking and to prioritize different areas of its budget. This dataset definitely provided some interesting conclusions and has a lot of potential for further analysis, making it a great choice for a project like this one.