Introduction

In this paper, we study how scores, and consequently rankings, are determined for various universities. University rankings are commonly consulted when choosing a university to attend or when judging the prestige of a school, so the score a university receives matters. For our exploration, we use data from the Quacquarelli Symonds (QS) World University Rankings and scores.

Data

The dataset we are using is from the Quacquarelli Symonds World University Rankings for the years 2017 to 2022. The QS ranking is approved by the International Ranking Expert Group and is known as one of the three most widely read university rankings. The dataset contains 6482 rows representing 1368 unique universities, with 15 features for each record.

The data has 15 features:

  1. university - name of the university
  2. year - year of ranking
  3. rank_display - rank given to the university
  4. score - score of the university based on six key QS metrics [0-100]
  5. link - link to the university profile page on QS website
  6. country - country in which the university is located
  7. city - city in which the university is located
  8. region - continent in which the university is located
  9. logo - link to the logo of the university
  10. type - type of university (public or private)
  11. research_output - quality of research at the university
  12. student_faculty_ratio - number of students per faculty member
  13. international_students - number of international students enrolled at the university
  14. size - size of the university in terms of area
  15. faculty_count - number of faculty or academic staff at the university

Research Questions

For our dataset, we came up with three main research questions we wanted to answer.

The first is finding out which factors affect a university’s score the most, since the score is the value used to determine a university’s final overall ranking in the Quacquarelli Symonds World University Rankings. By identifying how each factor plays a role in this score, we can see how universities can change to increase their score, or what types of universities are preferred by this scale.

Second, we want to learn more about how scores are affected over time. This dataset includes rankings of universities across the years from 2017 to 2022, so we can see how the rankings might change as the years go by, as there might be some discrepancy in how the scoring occurs.

Lastly, we are interested in seeing how the location of the university, whether it be the overall region or specific country, affects its score. Do certain areas tend to have higher scoring universities? Furthermore, is there a country or region that is preferred by international students in particular? Through our data exploration and visualizations, we hope to answer these questions.

Which factors affect a university’s score the most?

We performed principal component analysis with the 4 quantitative variables in our dataset: score, faculty_count, international_students, and student_faculty_ratio. The principal components are linear combinations of the original variables. The angles between the variable vectors tell us about the correlation between variables. We are specifically interested in which variables are correlated with score. In this case, the vectors for score and faculty count point in very similar directions, indicating a strongly positive relationship between these two variables. In other words, higher scores for universities are strongly associated with higher faculty count. The angle between the vectors for score and international students is less than 90 degrees, which also indicates a positive relationship: higher scores are also associated with more international students. Since the vectors for score and student-faculty ratio are orthogonal, these two variables may not be correlated.
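The following is a minimal sketch of how such a PCA and biplot could be produced in base R. The data frame is simulated purely for illustration (it is not the QS data), and the name `universities` and the simulated relationship between score and faculty count are assumptions. Note that the angles in a biplot only approximate correlations, and the approximation is best when the first two components explain most of the variance.

```r
# Hypothetical stand-in data: simulated, NOT the QS dataset.
set.seed(1)
n <- 200
faculty_count <- rnorm(n, mean = 2000, sd = 600)
universities <- data.frame(
  score                  = 40 + 0.01 * faculty_count + rnorm(n, sd = 5),
  faculty_count          = faculty_count,
  international_students = rnorm(n, mean = 3000, sd = 1000),
  student_faculty_ratio  = rnorm(n, mean = 13, sd = 4)
)

# Center and scale so that variables measured in different units
# contribute comparably to the components.
pca <- prcomp(universities, center = TRUE, scale. = TRUE)

# Loadings: each column expresses one principal component as a
# linear combination of the standardized original variables.
print(round(pca$rotation, 2))

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 2))

# Biplot: arrows in similar directions suggest positive correlation;
# roughly perpendicular arrows suggest little correlation.
biplot(pca, cex = 0.6)
```

Checking `var_explained` before reading the biplot is a useful habit: if the first two components explain only a small share of the variance, the angles between arrows should be interpreted with caution.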

Based on the scatterplot above, we can see that in general, there appears to be a negative relationship between the student-faculty ratio of a university and its score. In other words, it seems that when the ratio of students to faculty is higher (i.e. more students per faculty member), the score of the university is lower. Another observation from this visualization is that most universities in this dataset have a lower student-faculty ratio, with very few at the high end near a 40:1 ratio. We also added an estimate of the linear regression line for this relationship, and it appears to have a non-negligible negative slope, which motivates a linear regression analysis of this data.

## 
## Call:
## lm(formula = score ~ student_faculty_ratio, data = universities)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.768 -14.075  -3.615  10.988  45.439 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           59.33157    0.79374   74.75   <2e-16 ***
## student_faculty_ratio -1.19271    0.06709  -17.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.83 on 2714 degrees of freedom
## Multiple R-squared:  0.1043, Adjusted R-squared:  0.104 
## F-statistic: 316.1 on 1 and 2714 DF,  p-value: < 2.2e-16

From the summary of the linear regression of score against the student-faculty ratio, we can see that the estimated slope for the student-faculty ratio is -1.19 with a p-value near 0, which is less than 0.05, meaning that the slope is statistically significant. This means that each one-unit increase in the student-faculty ratio is associated with an estimated decrease in score of 1.19 points. The R-squared of 0.104 indicates, however, that the ratio alone explains only about 10% of the variation in score. The intercept is irrelevant in this situation because a student-faculty ratio of 0 is impossible, so we will not interpret this value.
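To make the slope interpretation concrete, the sketch below plugs the coefficients from the regression output above into the fitted line and compares two example ratios. The helper name `predict_score` and the example ratios of 10 and 20 are illustrative assumptions; the confidence interval is an approximation built from the reported standard error and degrees of freedom.

```r
# Coefficients, standard error, and degrees of freedom copied from
# the regression summary output above.
intercept <- 59.33157
slope     <- -1.19271
slope_se  <- 0.06709
df_resid  <- 2714

# Hypothetical helper: fitted score for a given student-faculty ratio.
predict_score <- function(ratio) intercept + slope * ratio

# Fitted scores at two example ratios (chosen for illustration only).
score_at_10 <- predict_score(10)
score_at_20 <- predict_score(20)
print(c(ratio_10 = score_at_10, ratio_20 = score_at_20))

# The difference equals 10 * slope: ten more students per faculty
# member is associated with roughly a 12-point lower fitted score.
print(score_at_20 - score_at_10)

# Approximate 95% confidence interval for the slope.
slope_ci <- slope + c(-1, 1) * qt(0.975, df = df_resid) * slope_se
print(round(slope_ci, 3))
```

Because the whole interval lies below zero, it agrees with the very small p-value reported for the slope.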

The visualization above is a density histogram of university scores, filled by research activity level. The x-axis shows the score from 20 to 100, where 20 is the lowest score given to a university. The y-axis shows the density of universities at each score level, ranging from 0 to 0.5. The fill variable specifies the research activity level, with possible values of “Low”, “Medium”, “High”, and “Very High”. The plot shows that higher-scored universities primarily have very high research activity, while lower scores have large densities of low and medium research activity. Overall, the density is concentrated among lower-scored universities rather than higher-scored ones. This is consistent with a previous visualization, which showed that lower-scored universities also have higher student-faculty ratios. With respect to our research question about the factors behind score, research activity appears to be associated with a university’s score: very high research activity dominates at the top of the score range, while low and medium activity appear mainly at lower scores.

Next, we can explore how the size of a university relates to the score it is given. Based on the boxplot of score by university size, the distribution of score differs most notably for large and extra-large universities. The median score for small and medium universities is the same, at around 34, but the median for a large university is 42, and the median for an extra-large university is even higher, at around 48. Therefore, it appears that in general, the distribution of score is shifted right, with a higher median, for larger universities.
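A comparison of medians like the one above can be computed directly alongside the boxplot. The sketch below uses a small toy data frame whose values were constructed by hand so that the group medians mirror the ones reported above; it is not the QS data, and the size labels are assumed.

```r
# Hypothetical toy data (NOT the QS data), built so that the group
# medians match the approximate values reported in the text.
size_levels <- c("Small", "Medium", "Large", "Extra Large")
universities <- data.frame(
  size  = factor(rep(size_levels, each = 3), levels = size_levels),
  score = c(30, 34, 38,   # Small
            31, 34, 40,   # Medium
            38, 42, 50,   # Large
            41, 48, 60)   # Extra Large
)

# Median score within each size category.
medians <- tapply(universities$score, universities$size, median)
print(medians)

# Side-by-side boxplots of score for each size category.
boxplot(score ~ size, data = universities,
        xlab = "University size", ylab = "Score")
```

Declaring `size` as a factor with explicit levels keeps the categories in their natural order on the plot instead of alphabetical order.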

How are scores affected over time?

The visualization above is a time-series plot of the average university score for each year. The x-axis ranges from the year 2017 to the year 2022, while the y-axis ranges from an average score of 0 to 100. Overall, the average score doesn’t change significantly from year to year. However, it does fall below 50 between 2018 and 2019 and hasn’t gone above 50 since. With respect to our research question about score over time, the average score alone shows very little change over time.

This graph shows, by year, the number of universities for the 10 countries with the most universities in our dataset. The highest count comes from the United States. This makes sense because universities in the United States are popular globally. The United Kingdom follows in second place and Germany in third. Past Germany, the counts are quite small, with fewer than about 100 universities per country. The distribution of the counts is also similar across all 6 years for each country.

We focus on the 3 countries with the most universities in our dataset: the United States, the United Kingdom, and Germany. We plotted a time-series graph for these countries to answer the question of how scores change over time. We specifically looked at the average score across an entire year. Across all three countries, there is a significant dip in average score of about 10 points from 2017 to 2022. The United States and United Kingdom had similar scores across the years. Germany’s scores, however, were about 10 points lower, starting with an average score below 50 in 2017.
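The yearly averages behind a plot like this can be computed with a grouped mean. The sketch below is a minimal illustration on a hand-made toy data frame (not the QS data); the scores are invented so that the arithmetic is easy to follow.

```r
# Hypothetical toy data (NOT the QS data): two universities per
# country in each of two years.
universities <- data.frame(
  country = rep(c("United States", "Germany"), each = 4),
  year    = rep(c(2017, 2017, 2022, 2022), times = 2),
  score   = c(60, 56, 50, 46,   # United States
              48, 44, 38, 36)   # Germany
)

# Average score for each country-year combination.
avg <- aggregate(score ~ country + year, data = universities, FUN = mean)
print(avg)

# Hypothetical helper: look up one country-year average.
avg_score <- function(country, year) {
  avg$score[avg$country == country & avg$year == year]
}

# Change in average score from 2017 to 2022 for each country.
change_us <- avg_score("United States", 2022) - avg_score("United States", 2017)
change_de <- avg_score("Germany", 2022) - avg_score("Germany", 2017)
print(c(united_states = change_us, germany = change_de))
```

Each row of `avg` corresponds to one point on a per-country time-series line.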

Does the location of a university play a role in its score?

First, we can explore how the scores look on a map of the world for different countries represented in this dataset.

We wanted to explore how the location of a university can play a role in its score, which prompted us to analyze the country and score variables. The graph above represents the median scores of universities from each country. Since the distribution of universities is skewed, meaning that some countries have more university scores in the data than others, we decided it would be appropriate to compare the median score per country as opposed to the average. As noted in the graph, the color red indicates a higher score, which is more common in European countries like Switzerland and the Netherlands, as well as in Asian locations like Singapore and Hong Kong. The North American region received mid-range scores compared to countries in other regions. We also note that few universities from Africa appear in the dataset, which explains the lack of color in the Africa region.

Then, we wanted to further examine the relationship between scores and the number of international students for each region, which leads us to examine the international_students, score, and region variables. From this faceted scatterplot, we can see that there is a slight positive correlation between the number of international students and score. The positive relationship is most notable in Europe, North America, and Oceania, implying that as the scores for universities increase in these regions, the number of international students who attend the universities tends to increase as well. This graph also shows that there are more international students in Asia, Europe, and North America, which suggests that these are the regions most attractive to international students for university education.
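A per-region correlation gives a numeric counterpart to the faceted scatterplot. The sketch below is a minimal illustration on simulated data (not the QS data): one region is simulated with a built-in positive relationship and the other with none, so the region labels here carry no real-world meaning.

```r
# Hypothetical simulated data (NOT the QS data).
set.seed(42)
n <- 60
region <- rep(c("Europe", "Asia"), each = n)
score  <- runif(2 * n, min = 20, max = 100)
international_students <- c(
  50 * score[1:n] + rnorm(n, sd = 500),   # "Europe": built-in positive link
  rnorm(n, mean = 3000, sd = 800)         # "Asia": no built-in link
)
universities <- data.frame(region, score, international_students)

# Pearson correlation between score and international students,
# computed separately within each region.
cors <- sapply(
  split(universities, universities$region),
  function(d) cor(d$score, d$international_students)
)
print(round(cors, 2))
```

A correlation near zero in a facet suggests that the overall positive trend is driven by other regions.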

Main Takeaways

We used principal component analysis to explore preliminary relationships between score and the other quantitative variables. From this, we assessed that score had the strongest correlation with faculty count. One notable variable was the student-faculty ratio, which we observed to have a statistically significant negative relationship with the score of the university, suggesting that it is beneficial for universities to have a lower student-faculty ratio to achieve a higher ranking. Furthermore, we found that higher-scoring universities had a higher level of research activity. Lastly, we found that the distribution of score for larger universities has a somewhat higher center.

It was interesting to see that there was a general decrease in score for universities from 2017 to 2022. This became more apparent when the data was split by country, where each country showed a noticeable decrease from 2018 to 2019. Since 2019, the average score hasn’t gone above 50.

Taking a look at the final question, we analyzed the scores of universities by country, and we found that European and Asian countries tend to score higher than other countries. Then, looking deeper into how international students are affected by location, we noticed that students are drawn to universities in Asia, Europe, and North America. Furthermore, the number of international students has a positive relationship with the university score in the Europe, North America, and Oceania regions.

Future Work

In future work, we can include other measures of performance for universities, such as the rate at which graduates are employed after graduation. This could also lead to other metrics such as the average salary received upon graduation for each university. Many people weigh these metrics when choosing which universities to apply to or attend, so these metrics may have a relationship with the scores and rankings in the dataset.

Additionally, another factor that would be interesting to look into is how the primary fields of study at these universities affect the score. For example, do universities that focus on the medical field or technology receive a higher score? One university could allocate the most resources or have the biggest faculty for a particular college or major, which might be preferred by the scoring and ranking system of the Quacquarelli Symonds World University Rankings. In the future, we could look into datasets with information about the breakdown of fields of study for each university.

It would also be interesting to explore how COVID-19 affected university rankings. In many of our conclusions, there was a clear change from 2020 to 2022. It’s possible these changes were influenced by or related to the effects COVID-19 had on higher education, with many universities shifting to remote workflows.

Taking into account and exploring these various aspects can lead to a more robust dataset and more concrete interpretation of what influences a university’s ranking on a global scale.