Introduction

By 2030, technology-related jobs are expected to grow by 13%, according to the United States Bureau of Labor Statistics. With worldwide demand for new-age technologies and digital skills increasing, job searching and analysis of the current job market have become more significant over the course of the year. One factor people consider when applying for jobs is salary, which many weigh more heavily than other factors such as location because of financial needs. Many websites, such as Glassdoor, Data.gov, and LinkedIn, provide reports that list job titles with their descriptions, each focusing on different characteristics such as rating, type of ownership, and location. As Carnegie Mellon University (CMU) students, we are geared towards seeking a real-world job. Our assumption is that a student who graduated from CMU has a working knowledge of either Excel or Python. Beyond that assumption, we wanted to figure out which criteria matter most when seeking an occupation. The response variable is average salary. The research questions in this report draw on at least six features as predictor variables, including state, python, excel, age, minimum salary, and maximum salary.

Description of the Dataset

This dataset comes from Glassdoor, a free digital platform that gathers information and reviews from current and former employees about companies, salaries, and job openings. Specifically, the dataset covers the years 2017 to 2018. It includes 14 features: 10 nominal variables and 4 numeric variables. All of the features relate directly to specific characteristics of jobs, such as the state the job is located in, whether or not the job requires Python, and the average salary along with its minimum and maximum. Each row represents a different job, and each column represents one of the characteristics mentioned above.

Research Questions

Based on the descriptions of different jobs in this dataset, we wanted to answer four specific questions to better understand which characteristics of a job impact salary. First, we wanted to see how knowledge of Python or Excel impacts average salary. Second, we wanted to see whether knowing Python or Excel influences the relationship between age and salary. Third, we checked how salary is distributed across different states in the United States. Lastly, we were curious about the hierarchical relationship between average salary as the response variable and minimum salary and maximum salary as the predictor variables.

Question 1: Would knowing Python or Excel impact average salary?

Our first question is whether a job requiring Python knowledge has an impact on job salaries. Looking at the conditional density plot, we can see that jobs that require Python knowledge tend to have higher salaries. Jobs with python = Yes tend to have salaries around 120-130K, while jobs with python = No tend to have salaries around 80-90K. Also, when we perform a one-way ANOVA, we get a p-value of less than 0.05, which means we can reject the null hypothesis that the mean of the average salaries is the same across both yes and no. We can conclude that there is a difference in the mean of the average salaries.

The second part of this question is whether a job requiring Excel knowledge has an impact on job salaries. Looking at the conditional density plot, we can see that jobs that require Excel knowledge tend to have somewhat lower salaries. Jobs with excel = Yes tend to have salaries around 80-90K, while jobs with excel = No tend to have salaries around 100-105K. However, when we perform a one-way ANOVA, we get a p-value of greater than 0.05, which means we cannot reject the null hypothesis that the mean of the average salaries is the same across both yes and no. We therefore do not have sufficient evidence of a difference in the mean of the average salaries.
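The one-way ANOVA described above can be sketched in Python as follows. This is only a minimal illustration: the salary values are made up, the helper name `salaries_differ` is hypothetical, and we assume SciPy is available. It is not the code or data used to produce the results in this report.

```python
from scipy.stats import f_oneway

# Hypothetical average salaries (in $K), NOT taken from the Glassdoor dataset.
salary_skill_yes = [125, 130, 118, 122, 135, 127]  # postings that require the skill
salary_skill_no = [85, 90, 82, 88, 95, 80]         # postings that do not


def salaries_differ(group_a, group_b, alpha=0.05):
    """Run a one-way ANOVA on two groups.

    Returns the F statistic, the p-value, and whether the null hypothesis
    of equal group means is rejected at significance level alpha.
    """
    f_stat, p_value = f_oneway(group_a, group_b)
    return float(f_stat), float(p_value), bool(p_value < alpha)


f_stat, p_value, reject = salaries_differ(salary_skill_yes, salary_skill_no)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}, reject H0: {reject}")
```

With only two groups, the one-way ANOVA is equivalent to a two-sample t-test with equal variances. A p-value above the threshold means the test failed to detect a difference, not that the two means are shown to be equal.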

Question 2: Would knowing Python or Excel influence the relationship between age and salary?

Above we have a graph that shows age vs. avg_salary given python and excel. Each regression line corresponds to whether or not the job requires Python: 1 means yes and 0 means no. Likewise, the triangular points are 1 (yes), whereas the circular points are 0 (no). For the python = 0 regression line, we observe a weak positive linear relationship, as the slope is slightly positive. For the python = 1 regression line, we observe an even weaker positive linear relationship than for python = 0. The relationship between average salary and age is almost negligible in both cases. From this observation, we infer that knowing Python does not meaningfully change how salary relates to age. This is relevant to our overarching question about the average salary one can receive, because we are asking whether a certain skill set impacts one’s salary.
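Comparing the two regression lines amounts to fitting avg_salary against age separately for each value of python and comparing the slopes. A minimal Python sketch, assuming NumPy is available and using made-up rows (the helper name `slope_by_group` and all values are hypothetical and chosen only to show the mechanics):

```python
import numpy as np

# Hypothetical rows: (age, avg_salary in $K, python flag). Not real data.
rows = [
    (10, 95, 0), (25, 98, 0), (40, 104, 0), (60, 108, 0),
    (8, 120, 1), (20, 122, 1), (35, 121, 1), (55, 124, 1),
]


def slope_by_group(rows):
    """Fit avg_salary ~ age separately for each python flag.

    Returns {flag: (slope, intercept, correlation)}.
    """
    fits = {}
    for flag in sorted({r[2] for r in rows}):
        age = np.array([r[0] for r in rows if r[2] == flag], dtype=float)
        salary = np.array([r[1] for r in rows if r[2] == flag], dtype=float)
        slope, intercept = np.polyfit(age, salary, 1)  # least-squares line
        corr = np.corrcoef(age, salary)[0, 1]          # Pearson correlation
        fits[flag] = (float(slope), float(intercept), float(corr))
    return fits


fits = slope_by_group(rows)
for flag, (slope, intercept, corr) in fits.items():
    print(f"python = {flag}: slope = {slope:.3f}, intercept = {intercept:.1f}, r = {corr:.2f}")
```

A slope close to zero in both groups would match the reading that age has little bearing on average salary regardless of Python knowledge.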

Question 3: How does the average salary differ across the United States?

When determining the average salary of a specific job, many variables come to mind, such as location, job requirements, and company size. The main variable we inspected was the state in which the job is located, and we drew a heatmap of average salary across the United States to check for outliers or large numbers of missing values. Since our objective is to find variables that impact salary, mapping average salary by state is one way to gauge how much the state of a job matters. As seen above, no state has an extreme average salary compared to the others, apart from Missouri and New Jersey, which appear slightly darker in the heatmap and therefore sit closer to the higher end of the salary range. We have data for 38 states and are missing 12, which gives coverage of 76% of states. Based on this coverage, we conclude that average salary in the United States does not appear to differ substantially between states.
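The per-state averages behind such a heatmap, and the 76% coverage figure, can be reproduced with a short calculation. The sketch below uses only the Python standard library; the postings and the helper names `mean_salary_by_state` and `state_coverage` are hypothetical, while the counts of 38 and 50 states come from the text above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (state, avg_salary in $K) pairs. Not real data.
postings = [("MO", 118), ("NJ", 121), ("PA", 99), ("PA", 103), ("CA", 104), ("NJ", 117)]


def mean_salary_by_state(postings):
    """Group postings by state and return {state: mean average salary}."""
    by_state = defaultdict(list)
    for state, salary in postings:
        by_state[state].append(salary)
    return {state: mean(values) for state, values in by_state.items()}


def state_coverage(n_states_with_data, n_states_total=50):
    """Fraction of states that have at least one posting."""
    return n_states_with_data / n_states_total


state_means = mean_salary_by_state(postings)
coverage = state_coverage(38)  # 38 states with data, 12 without
print(state_means)
print(f"coverage = {coverage:.0%}")
```

Note that the coverage figure describes the share of states represented, not the share of job postings.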

Question 4: What is the hierarchical relationship between average salary as the response variable and minimum and maximum salary as the predictor variables?

We ask this as a follow-up to question 1 about salary. By analyzing the dendrogram that includes minimum and average salary, we may be able to answer whether there is a hierarchical relationship between salaries. With single linkage, the dendrogram is predominantly pink, which means the leaves of the tree are most likely to be the minimum salary. With complete linkage and k = 3, we take the pink branch to be minimum salary, the green branch to be maximum salary, and the blue branch to be average salary. The branches separate at about 400K and 300K. We can say that fewer jobs are considered to give the minimum salary than the maximum salary or average salary. From the complete linkage graph, we can see that jobs predominantly give their employees the average salary amount. Here, complete linkage provides insight into how jobs offer the average salary amount. It is also worth noting that more jobs offer the maximum salary than the minimum salary.
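The single-linkage and complete-linkage clusterings compared above, cut into k = 3 groups, can be sketched in Python as follows. This assumes SciPy and NumPy are available; the salary rows and the helper name `cluster_labels` are hypothetical, and the sketch is not the code used to draw the dendrograms in this report.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical (min_salary, max_salary, avg_salary) rows in $K. Not real data.
salaries = np.array([
    [50, 80, 65], [55, 85, 70], [60, 90, 75],
    [100, 150, 125], [105, 155, 130],
    [200, 300, 250], [210, 310, 260],
], dtype=float)


def cluster_labels(data, method, k=3):
    """Build a hierarchical clustering tree and cut it into at most k clusters."""
    tree = linkage(data, method=method, metric="euclidean")
    return fcluster(tree, t=k, criterion="maxclust")


single_labels = cluster_labels(salaries, "single")
complete_labels = cluster_labels(salaries, "complete")
print("single:  ", single_labels)
print("complete:", complete_labels)
```

Single linkage merges clusters by their closest pair of points and tends to chain observations together, while complete linkage merges by the farthest pair and tends to give more compact groups. That difference is one reason the two dendrograms can look so different on the same data. The tree returned by `linkage` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw it.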

Conclusions

Based on this report, we postulate that for a CMU student to get a high-paying job, it is better to seek jobs in the states that already pay higher salaries, namely Missouri and New Jersey. We reach this conclusion by analyzing the graphs generated while answering the research questions above. Note that this all rests on the assumption that a CMU student is capable of using Excel or Python. The question that remains unanswered is what percentage of CMU students know Excel, Python, or any other programming language that will be useful in the real world.