Introduction

For this project we worked with a Kaggle-curated dataset of Glassdoor records on pay differences by gender across various job titles. The gender pay gap, in which women are paid less than men holding the same job titles, is an ongoing global concern. Using this data, we intend to determine the depth of the gender pay gap and quantify the difference in pay. In particular, we aim to answer the following questions:

  • What is the overall difference in pay between males and females?
  • Which job title/department has the highest difference in pay between males and females?
  • How do education levels and age affect differences in pay?
  • Which quantitative variables in the dataset seem to be the most important in determining the gender pay gap?
  • How much more would women need to earn to reach the same level as men for the same job title/department and the same demographic variables?

Data Description

The dataset broadly contains information regarding the demographics of workers in several job fields. Each row corresponds to a unique worker and contains the following variables:

  • JobTitle: Type of job.

  • Gender: Worker’s reported sex (female or male).

  • Age: Worker’s reported age in years.

  • PerfEval: Performance Evaluation score (1-5).

  • Education: Worker’s reported education level (High School, College, Masters, PhD).

  • Dept: Type of job department.

  • Seniority: Number of years worked (1-5).

  • BasePay: Annual Basic Pay in dollars ($).

  • Bonus: Annual Bonus Pay in dollars ($).

There are 1000 observations of 9 variables, both quantitative and categorical. We look at how these variables interact with each other and the combined effect they have on the gender pay gap. We also create a separate dataset that isolates the quantitative variables in order to examine the most important factors in determining this pay gap. Our focus is to examine the depth of the gender pay gap and consider other variables that may be affecting it. Towards the end, we build a simple linear model that uses all the variables in the dataset to quantify the difference in pay between men and women. By doing so, we hope to create a more holistic view of pay differences and show our audience that the wage gap remains a concern to this day.

Results

General Difference in Pay

First, we look at the overall difference in annual base pay by gender. To do so, we plotted a density curve of BasePay for each gender to compare the distribution of base pay between men and women. As we can see, although the difference is not drastic, there is a higher density of women than men at base pay below 100,000 dollars, and a higher density of men than women at base pay above 100,000 dollars. We already see a difference in pay between the two genders, and we test whether this difference is significant in the next part.

Two-Sample KS test

Using the two-sample Kolmogorov-Smirnov (KS) test, we tested whether the distributions of base pay differ between the two genders. Since the p-value of 0.0002856 is less than our alpha level of 0.05, we reject the null hypothesis that the two distributions are the same. We have sufficient evidence to suggest that the distribution of men's earnings differs from that of women's earnings.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  women_pay and men_pay
## D = 0.13335, p-value = 0.0002856
## alternative hypothesis: two-sided
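
As a rough illustration of what this test computes, the following Python sketch calculates the KS statistic D as the largest gap between the two empirical distribution functions, together with an asymptotic two-sided p-value from the Kolmogorov series. The output above appears to come from R; Python is used here only for illustration. The function name ks_two_sample and the small pay samples are hypothetical and are not taken from the Glassdoor data.

```python
import math
from bisect import bisect_right


def ks_two_sample(x, y):
    """Return (D, asymptotic two-sided p-value) for two samples."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # D is the largest absolute gap between the two empirical CDFs,
    # which can only occur at one of the observed values.
    d = 0.0
    for v in xs + ys:
        gap = abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        d = max(d, gap)
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The Kolmogorov CDF is numerically zero here, so p is about 1.
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2 * series))


# Hypothetical base pay values, for illustration only.
women_pay = [62000, 71000, 80000, 88000, 93000, 101000]
men_pay = [70000, 85000, 96000, 104000, 112000, 125000]

d_stat, p_value = ks_two_sample(women_pay, men_pay)
print(f"D = {d_stat:.4f}, p-value = {p_value:.4f}")
```

With samples this small an exact p-value would normally be preferred; the asymptotic form is shown because it is the variant named in the output above.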

However, the KS test does not tell us the direction of the difference. To see whether it is women who earn less than men, we carry out another statistical test, outlined below.

Welch T-test

Using the Welch two-sample t-test, we test whether the mean base pay for female workers is smaller than the mean base pay for male workers. The p-value is 4.359e-08, which is less than our alpha level of 0.05. We have sufficient evidence to suggest that women on average earn less than men, as seen in the difference in mean base pay between the two groups (89,942.82 dollars for women and 98,457.55 dollars for men). These results match our initial impression from the density plots.

## 
##  Welch Two Sample t-test
## 
## data:  BasePay by Gender
## t = -5.3918, df = 991.22, p-value = 4.359e-08
## alternative hypothesis: true difference in means between group Female and group Male is less than 0
## 95 percent confidence interval:
##       -Inf -5914.768
## sample estimates:
## mean in group Female   mean in group Male 
##             89942.82             98457.55
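
To make the test statistic concrete, here is a minimal Python sketch of the Welch t statistic and the Welch-Satterthwaite degrees of freedom. The helper welch_t and the toy samples are hypothetical. Because the Python standard library has no t distribution, the one-sided p-value is approximated with a standard normal applied to the reported t statistic, which is only reasonable because the reported degrees of freedom (991.22) are large.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance test of mean(x) - mean(y)."""
    # Squared standard error contributed by each group.
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Toy samples, for illustration only (not the Glassdoor data).
t_toy, df_toy = welch_t([1, 2, 3, 4], [2, 3, 4, 5])
print(f"toy example: t = {t_toy:.4f}, df = {df_toy:.2f}")

# Normal approximation to the one-sided p-value (alternative: difference < 0)
# using the t statistic reported in the output above.
reported_t = -5.3918
p_approx = NormalDist().cdf(reported_t)
print(f"approximate one-sided p-value: {p_approx:.3e}")
```

The approximate p-value is of the same order of magnitude as the reported 4.359e-08; the exact figure uses the t distribution and is therefore slightly larger.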

Differences in Pay Averages by Job Title and Department

Of the 10 unique job titles in the dataset, men on average earn more than women in 5 and women on average earn more than men in the other 5, an even 50-50 split in the direction of the wage difference. Specifically, men on average earn more than women as software engineers, sales associates, marketing associates, IT workers, and drivers. One possible explanation is that these occupations have historically been male dominated, so women tend to earn less than their male counterparts. On the other hand, women earn more than men as warehouse associates, managers, graphic designers, financial analysts, and data scientists, possibly for similar reasons. The largest difference is in software engineer roles, where men earn 106,371.49 dollars per year and women earn only 94,701 dollars per year, a difference of 11,670.49 dollars. In contrast, the smallest difference in earnings is in IT (approximately 550 dollars), followed by financial analysts (approximately 850 dollars).

Across the jobs where men earn more than women, the average difference is 5,166.70 dollars, whereas across the jobs where women earn more than men, the average difference is only 3,651.80 dollars. In other words, the pay differences are on average larger where they disadvantage women. To examine whether certain jobs themselves favor certain genders, we used a mosaic plot shaded by Pearson residuals.
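
The per-title comparison can be sketched in Python as a group-by on job title and gender followed by a difference of means. The rows below are hypothetical and exist only to show the computation; the function name mean_gap_by_title is likewise an assumption, and the sketch assumes every job title has at least one worker of each gender.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (JobTitle, Gender, BasePay) rows, for illustration only.
rows = [
    ("Software Engineer", "Male", 108000),
    ("Software Engineer", "Male", 104000),
    ("Software Engineer", "Female", 95000),
    ("Software Engineer", "Female", 93000),
    ("Manager", "Male", 120000),
    ("Manager", "Female", 126000),
    ("Manager", "Female", 122000),
]


def mean_gap_by_title(rows):
    """Return {job title: mean male pay - mean female pay}."""
    pay = defaultdict(lambda: defaultdict(list))
    for title, gender, base_pay in rows:
        pay[title][gender].append(base_pay)
    return {title: mean(groups["Male"]) - mean(groups["Female"])
            for title, groups in pay.items()}


gaps = mean_gap_by_title(rows)
# Average gap over titles where men earn more, and over titles where women do.
men_higher = [g for g in gaps.values() if g > 0]
women_higher = [-g for g in gaps.values() if g < 0]
print(gaps)
print(mean(men_higher), mean(women_higher))
```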

Relationship between Gender and Job Titles

The mosaic plot examines the independence of gender (female or male) and job title. From it, we see that there are fewer women in managerial roles than expected under independence, while there are more men working as managers than expected. This may reflect a societal perception of women as less suited to leadership positions and less competent than men. There are many more female marketing associates than expected, and fewer male ones. This probably relates to males being placed in other, more senior positions. Lastly, there are far fewer females working as software engineers than expected, and more males in that field than expected.
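
The shading in such a mosaic plot is driven by Pearson residuals, (observed - expected) / sqrt(expected), where the expected count assumes gender and job title are independent. A minimal Python sketch follows; the counts are hypothetical and the function name pearson_residuals is an assumption made for illustration.

```python
import math


def pearson_residuals(table):
    """Pearson residuals for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals


# Hypothetical counts: rows are job titles, columns are (Female, Male).
counts = [
    [20, 60],  # e.g. Manager
    [60, 40],  # e.g. Marketing Associate
]
residuals = pearson_residuals(counts)
for row in residuals:
    print([round(r, 2) for r in row])
```

A negative residual marks a cell with fewer workers than expected and a positive residual marks a cell with more than expected, which is how the statements about managers and marketing associates above should be read.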

In the overlapping ridge plots, we see the difference in pay by gender across departments, where the pink density curve represents the distribution of base pay for women and the blue density curve represents the distribution of base pay for men. The density curve for men is consistently shifted to the right of the curve for women in all departments, indicating that men on average earn higher pay than women in every department. Administration and Management have the smallest differences, as seen by the similarity of their overlapping density curves, whereas the Sales and Engineering departments have the largest differences.

Differences in Pay Averages by Age and Education Level

Now we incorporate the variables Age and Education to understand how the difference in annual base pay varies with the education level and age of the workers. We facet the graph by the four education levels (College, High School, Masters, and PhD), with ages ranging from 18 to 65. From this graph, we see that women consistently earn less than men regardless of their education level. In general, the wage gap widens as age increases (with the exception of master's degree graduates). The gap is most pronounced for workers whose highest level of education is a college degree.

Effect of Quantitative Variables on Pay and Gender

The correlation matrix shows the pairwise correlations among the quantitative variables for men (right) and women (left).

From the correlation matrix, we see that the bonus a woman earns is negatively correlated with her age, whereas the corresponding correlation for men, although also negative, is weaker. On the other hand, the base pay of women increases with seniority, and this association is stronger than the one between base pay and seniority for men. This suggests that women on average need to display a higher level of seniority to earn at the same level as men. Interestingly, for both genders, and more so for women, a higher performance evaluation score is associated with a lower base pay.
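
Each cell of such a matrix is a Pearson correlation coefficient computed separately within each gender. The Python sketch below shows that computation for one pair of variables (Seniority and BasePay); the worker records are hypothetical and the helper pearson_r is written out by hand so that the example needs only the standard library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (Gender, Seniority, BasePay) records, for illustration only.
workers = [
    ("Female", 1, 61000), ("Female", 2, 72000), ("Female", 3, 79000),
    ("Female", 4, 93000), ("Female", 5, 99000),
    ("Male", 1, 78000), ("Male", 2, 74000), ("Male", 3, 95000),
    ("Male", 4, 90000), ("Male", 5, 104000),
]

r_by_gender = {}
for gender in ("Female", "Male"):
    seniority = [s for g, s, _ in workers if g == gender]
    base_pay = [p for g, _, p in workers if g == gender]
    r_by_gender[gender] = pearson_r(seniority, base_pay)

print({g: round(r, 3) for g, r in r_by_gender.items()})
```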

Principal Component Analysis

We used a principal component analysis (PCA) biplot of the quantitative variables in our dataset to see how Gender is associated with them. While there does seem to be an association between base pay and gender, with males trending towards higher base pay than females, the separation between the two groups is small.

Modeling

We used a linear model to predict base pay from gender while accounting for all other variables in the dataset. As an example, we compared the predicted base pay of the two genders while holding every other variable constant: an individual working in IT in the Operations department who is 30 years of age, has a college education, a performance score of 3, a seniority level of 3, and receives no bonus. With these inputs, the model predicts a base pay of 73,373.49 dollars for men but only 72,502.19 dollars for women. In other words, a typical 30-year-old college-educated woman working in IT in the Operations department with a mid-range performance score and seniority level would need to earn an extra 871.30 dollars (perhaps as a bonus) for her pay to equal that of a similarly profiled man. Note that the GenderMale coefficient that produces this difference is not statistically significant at the 0.05 level in this model (p = 0.248).

## 
## Call:
## lm(formula = BasePay ~ ., data = GPG)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33198  -6993    395   6971  27779 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.661e+04  3.313e+03   5.015 6.31e-07 ***
## JobTitleDriver              -3.712e+03  1.449e+03  -2.562  0.01056 *  
## JobTitleFinancial Analyst    3.965e+03  1.413e+03   2.807  0.00510 ** 
## JobTitleGraphic Designer    -2.498e+03  1.419e+03  -1.760  0.07868 .  
## JobTitleIT                  -1.745e+03  1.439e+03  -1.213  0.22554    
## JobTitleManager              3.155e+04  1.469e+03  21.468  < 2e-16 ***
## JobTitleMarketing Associate -1.649e+04  1.408e+03 -11.709  < 2e-16 ***
## JobTitleSales Associate      2.074e+02  1.433e+03   0.145  0.88497    
## JobTitleSoftware Engineer    1.326e+04  1.412e+03   9.387  < 2e-16 ***
## JobTitleWarehouse Associate -9.420e+02  1.463e+03  -0.644  0.51978    
## GenderMale                   8.713e+02  7.533e+02   1.157  0.24772    
## Age                          1.012e+03  3.854e+01  26.256  < 2e-16 ***
## PerfEval                    -3.195e+02  7.687e+02  -0.416  0.67780    
## EducationHigh School        -1.298e+03  9.099e+02  -1.427  0.15404    
## EducationMasters             4.727e+03  9.130e+02   5.177 2.73e-07 ***
## EducationPhD                 5.776e+03  9.366e+02   6.167 1.02e-09 ***
## DeptEngineering              3.117e+03  1.038e+03   3.003  0.00274 ** 
## DeptManagement               2.642e+03  1.043e+03   2.533  0.01147 *  
## DeptOperations              -4.243e+02  1.013e+03  -0.419  0.67552    
## DeptSales                    6.319e+03  1.021e+03   6.189 8.88e-10 ***
## Seniority                    9.554e+03  2.894e+02  33.012  < 2e-16 ***
## Bonus                        2.426e-01  6.179e-01   0.393  0.69471    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10110 on 978 degrees of freedom
## Multiple R-squared:  0.8443, Adjusted R-squared:  0.8409 
## F-statistic: 252.5 on 21 and 978 DF,  p-value: < 2.2e-16

Note: All assumptions for a linear model were checked and met before conducting the analysis.
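
The example prediction can be reproduced by hand from the coefficient table. The Python sketch below plugs the example profile into the printed estimates; College is the reference level for Education, so it contributes no term. The dictionary and function names are hypothetical, and because the printed coefficients are rounded to four significant figures the results differ from the reported predictions by about two dollars.

```python
# Estimates copied from the coefficient table (only the terms that apply
# to the example profile; all other dummy variables are zero).
coefficients = {
    "(Intercept)": 1.661e4,
    "JobTitleIT": -1.745e3,
    "GenderMale": 8.713e2,
    "Age": 1.012e3,
    "PerfEval": -3.195e2,
    "DeptOperations": -4.243e2,
    "Seniority": 9.554e3,
    "Bonus": 2.426e-1,
}


def predict_base_pay(is_male):
    """Predicted BasePay for the example profile from the text."""
    profile = {
        "(Intercept)": 1,
        "JobTitleIT": 1,           # works in IT
        "GenderMale": 1 if is_male else 0,
        "Age": 30,
        "PerfEval": 3,
        "DeptOperations": 1,       # Operations department
        "Seniority": 3,
        "Bonus": 0,                # no bonus
    }
    return sum(coefficients[name] * value for name, value in profile.items())


pay_female = predict_base_pay(is_male=False)
pay_male = predict_base_pay(is_male=True)
print(f"female: {pay_female:.2f}, male: {pay_male:.2f}, "
      f"gap: {pay_male - pay_female:.2f}")
```

This gives roughly 72,504 dollars for the woman and 73,375 dollars for the man. Since every other input is held fixed, the gap between the two predictions is simply the GenderMale coefficient.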

Conclusion

Overall, our report shows that the difference in pay between men and women is statistically significant. In general, men tend to earn more than women in male-dominated fields by an average of 5,166.70 dollars, with the largest difference in software engineering roles. In male-dominated occupations such as sales associate, men on average earn more than women. But even though there is a disproportionately large number of women working as marketing associates, women in this group still tend to earn less than men, which is evidence of a bias in the system. Women also on average earn consistently less than their male counterparts in all departments. Similarly, when accounting for education level, women still earn less than men, with the gap widening with age. The disparity is especially large for college graduates and small for high school graduates.

Furthermore, women on average need to display a higher level of seniority to earn at the same level as men. Although the bonus a worker receives decreases with age, this decrease is more pronounced for women than for men, i.e., women tend to lose more bonus with each additional year of age than men do. Although the quantitative factors (age, performance evaluation score, bonus, seniority) only slightly favor men over women in terms of receiving a higher base pay (as visualized in the PCA biplot), the picture changes once job titles are taken into account. Our model towards the end of the report helps quantify how much more women would need to earn to reach the same level as men for the same job title/department and the same demographics.

While our report showcases the existence and the level of wage disparity between men and women, we must be cautious about extrapolating from our analyses. Our sample of 1000 observations may not be representative of the entire population of working-class individuals in the US, so we caution readers to interpret our analyses only within the scope of this subset of data collected through Glassdoor. Even so, our results are consistent with the global trend of wage inequality between men and women.

Future Research

For continued research, we hope to include the effects of seasonality and time trends (such as time of hire and periods of global economic instability) as well as other demographic variables such as race, ethnicity and geographic location. This is important as workers from various minority groups and low socioeconomic backgrounds are the most likely to experience the issue of wage inequality.