India is one of the world’s largest hubs for engineering talent. The country produces one million engineering graduates annually, and its technical education infrastructure includes about 3500 engineering colleges (Wikipedia). As such, India provides a very rich base for profiling and analyzing engineering students.
This project aims to explore the factors that affect an Indian engineering graduate’s salary using the Engineering Graduate Salary Prediction dataset on Kaggle. This dataset contains 2998 instances of engineering graduates in India. For each graduate, the dataset records 34 attributes, which can be divided into six categories:
Demographic Information: gender, date of birth
High School Performance Records: grade 10 exam score, grade 12 score, year of high school graduation
College Information: degree pursued, specialization, GPA, city ID, city tier, state, graduation year
AMCAT (Aspiring Minds Computer Adaptive Test) Subject Scores: English, Logical, Quantitative, Domain, ComputerProgramming, ElectronicsAndSemicon, ComputerScience, MechanicalEngg, ElectricalEngg, TelecomEngg, CivilEngg
Personality Test Scores: conscientiousness, agreeableness, extraversion, neuroticism, openness to experience
Salary: annual CTC (total salary package) offered to the candidate, in Indian rupees
Given that the overall objective of the project is to determine what factors affect an engineering graduate’s salary, we decided to explore the following research questions (for questions 2 and 3, salary is our y-variable):
\(\;\;\;\;\) 1. What does a typical engineering student from the dataset look like?
\(\;\;\;\;\) 2. What quantitative factors affect an engineering graduate’s salary?
\(\;\;\;\;\) 3. What qualitative factors affect an engineering graduate’s salary?
To study the typical profile of an engineering student, we started by exploring their academic domains and personalities. We explored the following questions:
From a brief look at the students’ AMCAT Engineering subject test scores, we noticed that many students took ComputerProgramming together with other subjects, while fewer students took each of the other subjects, and those subjects tend to be closer to mutually exclusive (though not completely). We therefore plotted the number of students taking each subject, colored by whether they also took ComputerProgramming. As expected, most of the students who took ComputerScience also took ComputerProgramming, and only a fraction of the students who took MechanicalEngg took ComputerProgramming. We can also see that CivilEngg isn’t a very popular subject overall.
We also wanted to explore the top personality traits of engineers. To do so, we created heat maps for each pair of the traits conscientiousness, agreeableness, and openness to experience.
From the graphs above, we can see that the engineering students in our dataset have highly similar personalities: all three heat maps have only one mode. The top two personality traits of these students are openness to experience and agreeableness. The first heat map, which plots these two variables, is densest where both openness to experience and agreeableness are extremely high. The other two heat maps, with conscientiousness on the x-axis, show that the level of conscientiousness is more spread out among these students, suggesting that conscientiousness is a less characteristic trait of the engineers here.
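A heat map of two trait scores amounts to counting students in a two-dimensional grid of bins. The following base R sketch illustrates the idea on simulated scores; the data frame `traits`, its column names, and the bin edges are illustrative assumptions, not the code or data used for the figures above.

```r
# Simulated stand-in for two standardized personality scores (NOT the real data).
set.seed(1)
traits <- data.frame(
  agreeableness = rnorm(500, mean = 0.1, sd = 1),
  openness      = rnorm(500, mean = 0.0, sd = 1)
)

# Bin each score into equal-width intervals; the edges are an arbitrary choice.
edges <- seq(-6, 6, by = 0.5)
bin_x <- cut(traits$agreeableness, breaks = edges, include.lowest = TRUE)
bin_y <- cut(traits$openness, breaks = edges, include.lowest = TRUE)

# 2D table of counts: one cell per (agreeableness bin, openness bin).
counts <- table(bin_x, bin_y)

# Location of the fullest cell(s); a single mode appears as one dense region.
peak <- which(counts == max(counts), arr.ind = TRUE)

# Draw the heat map using the bin midpoints as axis positions.
mids <- head(edges, -1) + 0.25
image(mids, mids, unclass(counts), xlab = "agreeableness", ylab = "openness")
```

With real data, a second dense region away from the first would indicate more than one personality cluster.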
For the quantitative factors, we first conducted a principal component analysis (PCA) to identify the quantitative features most strongly associated with salary.
Since we have many X variables, we first conducted a PCA to see which variables are most correlated with the students’ salaries and whether the correlations are positive or negative. We can see from the biplot that a student’s test scores, both in school and in standardized tests, are the most positively correlated with salary. Among the personality traits, neuroticism is negatively correlated with salary, which makes sense, as we would expect engineers to be calm and emotionally stable at work. Other traits, such as conscientiousness, agreeableness, and openness to experience, are positively correlated with salary. One thing to note is that the vector for extraversion is roughly 90 degrees away from the vector for salary, which suggests that these two variables are approximately uncorrelated. Some limitations of the PCA are:
All of the above conclusions are drawn from the biplot, which only visualizes Dim1 and Dim2. According to the scree plot, roughly 6 dimensions are needed to capture most of the variation in the data, so we should use other plots to look more closely at specific variables.
Because many students didn’t take the Electronics and Semiconductor, Computer Science, Mechanical Engineering, Electrical Engineering, Telecom Engineering, and Civil Engineering tests, these variables are excluded from the PCA to ensure that we have enough data.
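The reading of a biplot described above, and the first limitation, can be illustrated with a minimal base R sketch. It runs on simulated data, so the data frame `stu_sim`, its columns, and all numbers are assumptions made for illustration, not the project's data or results.

```r
# Simulated stand-in data (NOT the real dataset): a shared "ability" factor
# drives scores and salary, while extraversion is generated independently.
set.seed(42)
n <- 300
ability <- rnorm(n)
stu_sim <- data.frame(
  Salary       = 3e5 + 1e5 * ability + rnorm(n, sd = 1e5),
  collegeGPA   = 70 + 5 * ability + rnorm(n, sd = 5),
  Quant        = 500 + 80 * ability + rnorm(n, sd = 80),
  extraversion = rnorm(n)
)

# Center and scale so variables measured in different units are comparable.
pca <- prcomp(stu_sim, center = TRUE, scale. = TRUE)

# Scree information: share of total variance captured by each component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Biplot arrows are proportional to the loadings scaled by each component's
# standard deviation; only the first two components are drawn.
arrows2d <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

# Cosine of the angle between two arrows in the PC1-PC2 plane.
# It only approximates the correlation when PC1 and PC2 capture most variance.
cos_angle <- function(a, b) {
  sum(arrows2d[a, ] * arrows2d[b, ]) /
    sqrt(sum(arrows2d[a, ]^2) * sum(arrows2d[b, ]^2))
}

print(round(cum_share, 3))
print(cos_angle("Salary", "collegeGPA"))
print(cos_angle("Salary", "extraversion"))
biplot(pca)
```

A cosine near 1 corresponds to arrows pointing the same way, and a cosine near 0 to arrows at roughly 90 degrees. Comparing `cum_share[2]` with 1 shows how much variation the two-dimensional view leaves out.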
In both tiers of college, salary increases with GPA, and the increase is similarly steep for tier 2 and tier 1 college students. However, the plot shows that tier 1 college students generally have a higher salary than tier 2 college students with the same GPA.
To further validate our observation from the scatter plot, we decided to conduct a regression analysis.
##
## Call:
## lm(formula = Salary ~ collegeGPA * CollegeTier, data = stu)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -347591 -112007  -19363   62394 3725799
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)               295482     105312   2.806  0.00505 **
## collegeGPA                  1918       1410   1.360  0.17384
## CollegeTier2             -229399     111180  -2.063  0.03917 *
## collegeGPA:CollegeTier2     1283       1495   0.858  0.39077
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207500 on 2994 degrees of freedom
## Multiple R-squared:  0.04551,  Adjusted R-squared:  0.04455
## F-statistic: 47.58 on 3 and 2994 DF,  p-value: < 2.2e-16
We first tested whether the two lines in the above graph differ significantly, and the regression results indicate that they do. The coefficient involving only the categorical variable (here, the CollegeTier2 term) is the difference in intercepts between the two lines. We observe an estimate of -229399 with a p-value of 0.03917, which is below 0.05 and provides sufficient evidence to reject the null hypothesis that the true difference in intercepts is zero. We also observed that the slopes of the two lines look similar, so we tested whether they differ significantly. The p-value for the collegeGPA:CollegeTier2 interaction is 0.39077, which is larger than 0.05, so we cannot reject the null hypothesis that the true difference in slopes is zero. The data are therefore consistent with a similar slope for students in tier 1 and tier 2 colleges.
AMCAT Engineering subject test scores are hard to include in the PCA because the tests are optional. However, they still play a role in the students’ salaries.
From the scatter plot of salary against the average score across all the AMCAT Engineering subject tests that each student took, we see a positive correlation overall. Although the slope of the trend line is not steep, the trend is steady, with a very narrow 99% confidence band (barely visible). Notably, most salaries cluster below 1,000,000 rupees, with some high-salary outliers.
Besides scores, the number of AMCAT Engineering subject tests a student chooses to take is also interesting to look at. People with interdisciplinary backgrounds are in demand nowadays, so do students who take multiple subjects have a salary advantage? Surprisingly, no. The distribution of average scores across the subject-count categories provides a plausible explanation.
As mentioned, the majority of students have a salary under 1,000,000 rupees. Across the number of subjects, the medians and interquartile ranges are quite similar. However, most high-salary outliers belong to the 1-2 subject categories. This result makes more sense when we take the average score into consideration as well: students achieve higher scores on average when they choose to take tests in fewer subjects. None of the students in the 3-4 subject categories has an average score above 600, while some students taking only one test and a few students taking two tests reach that score range. This suggests that, to earn a higher salary, it may be better to focus one’s time and energy on building a specialization.
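The two fitted lines can be read directly off the coefficient table. As a worked example, write \(G\) for collegeGPA and \(T_2\) for an indicator equal to 1 for tier 2 colleges and 0 for tier 1; these symbols are our own shorthand and do not appear in the R output.

\[
\begin{aligned}
\widehat{\text{Salary}} &= 295482 + 1918\,G - 229399\,T_2 + 1283\,(G \times T_2) \\
\text{Tier 1 } (T_2 = 0):\quad \widehat{\text{Salary}} &= 295482 + 1918\,G \\
\text{Tier 2 } (T_2 = 1):\quad \widehat{\text{Salary}} &= (295482 - 229399) + (1918 + 1283)\,G = 66083 + 3201\,G \\
\text{Tier 1} - \text{Tier 2}:\quad \Delta(G) &= 229399 - 1283\,G
\end{aligned}
\]

The estimated gap \(\Delta(G)\) only reaches zero at \(G = 229399 / 1283 \approx 178.8\). If collegeGPA is recorded on a 0 to 100 scale (an assumption here, since the scale is not stated above), the fitted tier 1 line lies above the fitted tier 2 line over the whole observable range.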
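The two per-student summaries used in this comparison, the number of subject tests taken and the average score over those tests, could be computed along the following lines. This is a sketch on four made-up rows; it assumes that a score of -1 marks a test that was not taken, which should be checked against the dataset documentation.

```r
# Toy stand-in rows (NOT the real data). Assumption: -1 marks a test not taken.
scores <- data.frame(
  ComputerProgramming = c(455, -1, 525, 385),
  ComputerScience     = c(407, -1, -1, 346),
  MechanicalEngg      = c(-1, 623, -1, -1),
  CivilEngg           = c(-1, -1, -1, 292)
)

# Recode the "not taken" marker as NA so it is excluded from all summaries.
taken <- as.matrix(scores)
taken[taken == -1] <- NA

# Number of subject tests each student took, and the mean over those tests only.
n_subjects <- rowSums(!is.na(taken))
avg_score  <- rowMeans(taken, na.rm = TRUE)

summary_df <- data.frame(n_subjects, avg_score)
print(summary_df)

# Median average score within each subject-count group.
print(tapply(avg_score, n_subjects, median))
```

Averaging only over the tests actually taken matters here: leaving the -1 markers in place would pull down the average of every student who skipped a subject.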
The plot shows that the median salaries offered to engineering graduates in the most frequently occurring specializations in the dataset (those with more than 100 students) are all somewhat short of 5000 USD per year. Among these specializations, computer engineering graduates appear to have a slightly higher median salary than other engineering students. This specialization also has the largest outliers, suggesting that its graduates can potentially earn a lot more than students in other fields.
In terms of gender, the distributions of female and male graduates’ salaries within any particular field are not drastically different. However, if we look at the outliers, which represent graduates with significantly higher salaries than their peers, there are far fewer female graduates than male graduates among them. This suggests that female graduates are probably less likely than male graduates to be offered a salary beyond the common salary range.
The plot above shows that average salary varies widely across the states in which the colleges are located. Students who attend colleges in Meghalaya have the highest average salary in USD, followed by Assam, Sikkim, and Bihar (in northeastern and eastern India), Jammu and Kashmir (in northern India), and Goa (the small state in southwestern India).
To check whether the differences in salary across college states are more than random variation, we performed a visual randomization test. Consider the null hypothesis that average salary does not depend on college state. We informally evaluate how likely it is that the areal distribution would look like the observed figure above just by chance.
To do this, we randomly permuted the average salaries in the areal data eight times and then plotted all the permuted salary distributions together with the real observed distribution, which was inserted at a random position among the nine graphs. The resulting graph is shown below.
Since we can immediately spot the observed salary distribution in the figure (second row, first column), the observed geographic pattern appears unlikely to be due to chance alone. We therefore tentatively conclude that students enrolled in engineering schools in Meghalaya, Assam, Sikkim, Bihar, Jammu and Kashmir, and Goa tend to have a higher salary than students from other states.
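The permutation step behind such a lineup can be sketched in base R as follows. The state labels and salary values are randomly generated placeholders, and bar charts stand in for the maps, so this only illustrates the procedure and is not the code used for the figure.

```r
# Toy stand-in values (NOT the real data): average salary for ten regions.
set.seed(7)
state_avg <- data.frame(
  state      = paste0("state_", 1:10),   # hypothetical region labels
  avg_salary = round(runif(10, 2e5, 5e5))
)

# Eight null panels: shuffle which average salary is attached to which state.
n_null <- 8
panels <- replicate(n_null, sample(state_avg$avg_salary), simplify = FALSE)

# Insert the observed values at a random position among the nine panels.
true_pos <- sample(n_null + 1, 1)
panels <- append(panels, list(state_avg$avg_salary), after = true_pos - 1)

# Each panel would be drawn as its own map; here a bar chart stands in for it.
op <- par(mfrow = c(3, 3), mar = c(2, 2, 1, 1))
for (p in panels) barplot(p)
par(op)

# If the null is true, a viewer picks the real panel with probability 1/9.
chance_of_guess <- 1 / length(panels)
```

Because a single viewer would pick the real panel by luck one time in nine, the evidence from one lineup is informal. Asking several independent viewers, or using more null panels, would make it stronger.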
Based on our analysis and visualization, we are able to answer our three research questions. To start, regarding the typical engineering student’s profile, we found that Electronics & Semiconductor Engineering and Computer Science are the two top majors among engineering students, based on our visualization of the distribution of AMCAT subjects. Most of the students in Computer Science also take Computer Programming. Additionally, based on our heat maps, we found that engineering students have highly similar personalities, with the top two traits being openness to experience and agreeableness.
In terms of the quantitative factors that might affect engineering students’ salaries, we first performed a PCA, which told us that a student’s salary is most correlated with his/her academic performance. Then, using scatter plots and a regression, we found that a higher college GPA is associated with a higher salary. While this increase is similar for students in tier 1 and tier 2 colleges (the difference in slopes is not statistically significant), tier 1 college students generally earn more than tier 2 college students with the same GPA.
As for qualitative factors, we examined the impact of specialization, gender, and geography on salary. We found that although salaries do not differ much across specializations, male graduates appear among the high-salary outliers more often than female graduates, and the state in which an engineering graduate attended college is also related to his/her salary.
One limitation of our analysis is that the dataset only contains engineering students in India, so some of our conclusions may not generalize. For example, while we conclude that engineering students have highly similar personalities based on this dataset, a sample covering various nationalities might show more variation in the personality traits of engineering students. One possible direction for future work would be to collect data on engineering students from different countries to see whether the patterns we found here apply in a broader context.
Another thing to note is that many students in our data didn’t take the AMCAT subject tests. If we can collect more data on students’ scores in these subject tests, we can further analyze the correlation between a student’s score in particular subjects and his/her salary level.