36315 Final Project

Introduction

Data Science is often described as one of the most attractive fields of the 21st century. Reports suggest that almost every industry is looking for people with expertise in Data Science, and big names such as Google and Meta often offer high pay to attract such experts. Many people therefore choose to pursue a career in Data Science without much scrutiny, and many new graduates with a relevant major in statistics and data science, or even largely unrelated majors such as business or history, also look forward to breaking into the field. However, is this the whole picture? Is a career in Data Science really worth pursuing? Do most industries really have many openings for data science positions? Is the pay really that high? What skill sets are required to get a high-paying data scientist job? These questions are worth examining. In our project, we’ll specifically dig into different factors associated with salary. We’ll break down our analysis into three parts.

First, what does the overall job market look like? Specifically, we’ll dig into the number of job openings by job title category and by industry.

Second, is there any relationship between the characteristics of a company and the salary it can offer? Such characteristics include a company’s size, rating, location, and industry, as well as the job titles it hires for. In this report we focus on industry, job title, and location: for instance, which industry offers the highest pay, and which specific job title within the field of data science gets paid the most?

Finally, we’ll look at the core skills associated with salary. Which skills are needed to land a high-paying job, and what characteristics do most data science positions look for in candidates? We’ll examine the first question by looking at multiple skill set variables and the second by looking at the job descriptions and performing text analysis.

We use a dataset from Kaggle to answer these questions and to offer some insight to students who wish to go into the field of Data Science.

Data Description

Our dataset contains 742 rows. Each row represents one job posting related to the position of ‘Data Scientist’. There are 42 variables, listed below:

Index: index of the observation.
Job.Title: Title of the job, e.g. data scientist, junior data scientist, etc.
Salary.Estimate: Range of salary and the source.
Job.Description: The qualities that the companies want and what is expected out of the job title.
Rating: The rating of the company. Ranging from 1 to 5.
Company.Name: The name of the company.
Location: Location of the job.
Headquarters: Location of the headquarters of the companies.
Size: Range of the number of employees working in the company.
Founded: The year that the company was founded.
Type.of.ownership: The type of the company’s ownership.
Industry: The industry that the job is in.
Sector: The sector that the company works in.
Revenue: Total Revenue of the company each year (in billions USD)
Competitors: The current competitors of the companies.
Hourly: 1: If salary was reported in the hourly rate. 0: Otherwise.
Employer.provided: 1: If the salary was provided by the employer 0: Otherwise.
Lower.Salary: Minimum Salary reported for the job in a particular company (in K USD).
Upper.Salary: Maximum salary reported for the job in a particular company (in K USD).
Avg.Salary.K.: Average salary reported for the job in a particular company (in K USD).
company_txt: The name of the company.
Job.Location: Job location’s state abbreviation.
Age: Age of the company.
Python: 1: If Python skill is required 0: Otherwise.
spark: 1: If Spark skill is required 0: Otherwise.
aws: 1: If AWS skill is required 0: Otherwise.
excel: 1: If Excel skill is required 0: Otherwise.
sql: 1: If SQL skill is required 0: Otherwise.
sas: 1: If SAS skill is required 0: Otherwise.
keras: 1: If Keras skill is required 0: Otherwise.
pytorch: 1: If PyTorch skill is required 0: Otherwise.
scikit: 1: If scikit-learn skill is required 0: Otherwise.
tensor: 1: If TensorFlow skill is required 0: Otherwise.
hadoop: 1: If Hadoop skill is required 0: Otherwise.
tableau: 1: If Tableau skill is required 0: Otherwise.
bi: 1: If Power BI skill is required 0: Otherwise.
flink: 1: If Flink skill is required 0: Otherwise.
mongo: 1: If MongoDB skill is required 0: Otherwise.
google_an: 1: If Google Analytics certificate is required 0: Otherwise.
job_title_sim: Simplified job title.
seniority_by_title: Seniority in the title (na or sr).
Degree: M: If the job title requires it or provides experience years for having it. P: If the job title requires it or provides experience years. na: not available or not applicable.

In this dataset, a value of -1 indicates that the information is not available or not applicable. Since the rows containing -1 make up only a small proportion of the dataset, we drop them.
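As a rough illustration of this cleaning step, the sketch below drops every row in which one of the checked columns holds -1. The toy data frame and the choice of columns in `check_cols` are assumptions made for illustration; the report does not state which columns were checked.

```r
# Hypothetical miniature of the dataset; the real data would come from the Kaggle file.
salary <- data.frame(
  Rating   = c(3.8, -1, 4.2, 3.5, 4.0),
  Founded  = c(1973, 2010, -1, 1998, 2005),
  Industry = c("Aerospace & Defense", "Health Care", "IT Services", "-1", "Biotech"),
  stringsAsFactors = FALSE
)

# Columns to check for the -1 placeholder (an assumption, not taken from the report).
check_cols <- c("Rating", "Founded", "Industry")

# A field counts as missing if it is -1 (numeric) or "-1" (character).
is_missing <- function(x) as.character(x) == "-1"

# TRUE for every row with at least one missing field among the checked columns.
has_missing <- Reduce(`|`, lapply(salary[check_cols], is_missing))

salary_clean <- salary[!has_missing, , drop = FALSE]
print(salary_clean)
```

Comparing `nrow(salary)` with `nrow(salary_clean)` gives a quick check of how many rows the filter removes.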

Research Questions

In this project, we want to ask the following research questions:

What does the overall job market look like?

What does the conditional distribution of data science jobs given job title category look like?
What does the conditional distribution of data science jobs given industry look like?

Is there any relationship between characteristics of a company and the salary it can offer?

How will salary range differ for different industries?
How will salary range differ for different job titles?
How will average salary differ for different areas?

What are the core skills associated with data scientists’ salary?

What skill requirements are more predominant among the jobs?
How will the skill requirements be related to salaries?
- Detailed Analysis: Average salary and Python requirement
- Detailed Analysis: Average salary and Excel requirement
- Detailed Analysis: Average salary and sql requirement
What are the most common words in Job Descriptions?

What does the overall job market look like?

The conditional distribution of data science jobs given job title category

We would like to understand how data science job openings are distributed across job titles and industries. We first examine the distribution of openings across job title categories.

The above graph indicates that most data science openings fall under the category of “analyst,” and that the number of openings decreases for more senior and more specialized titles. Categories such as “analyst”, “data analyst”, “data engineer”, and “data modeler” have many openings, while categories that demand more expertise and experience have far fewer: there are only a few openings for “director” and “machine learning engineer.”

The conditional distribution of data science jobs given industries

We would also like to better understand the distribution of job openings across industries. To do this, we plot the conditional distribution of data science jobs given industry as a bar graph.

The above graph suggests that Staffing and Outsourcing, Social and Governmental Organizations, and Research and Development industries create the most job vacancies for data science, while Consumer Products and Manufacturing and Commodity and Manufacturing have the fewest job openings for data scientists. Meanwhile, IT-related, Healthcare, and Finance-and-Insurance-related industries have approximately the same number of job openings, less than the R&D industry but more than the Entertainment industry.
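The counts behind bar graphs like the two in this section can be tabulated with base R. This is only a sketch on made-up rows: the column names `job_title_sim` and `Industry` come from the data description, while the values are invented for illustration.

```r
# Toy postings (values are invented; column names follow the data description).
jobs <- data.frame(
  job_title_sim = c("analyst", "analyst", "data scientist",
                    "analyst", "data engineer", "director"),
  Industry      = c("IT Services", "Biotech", "IT Services",
                    "Insurance", "IT Services", "Biotech"),
  stringsAsFactors = FALSE
)

# Number of openings per job title category and per industry, largest first.
title_counts    <- sort(table(jobs$job_title_sim), decreasing = TRUE)
industry_counts <- sort(table(jobs$Industry), decreasing = TRUE)

# Share of all openings in each job title category.
title_share <- prop.table(title_counts)

print(title_counts)
print(industry_counts)
print(round(title_share, 2))
```

Passing `title_counts` or `industry_counts` to `barplot()` would draw the corresponding bar graph.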

Is there any relationship between characteristics of a company and the salary it can offer?

Moreover, we would like to understand whether there is a relationship between the characteristics of a company and the salary it can offer. To this end, we first examine salary range by industry.

Salary Range by Industry

The plot shows the range of salaries by industry. From the plot we can see that the IT-related industry has the highest upper, mean, and lower salary. Because the original dataset stores the upper, average, and lower salary as three separate variables, this single plot lets us combine them and examine the relationship between salary and industry.

We then look at salary range by job title.
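One way to bring the three salary variables together per industry is sketched below. It assumes each industry is summarised by the mean of `Lower.Salary`, `Avg.Salary.K.`, and `Upper.Salary`; the report does not say which summary the plot uses, and the numbers here are invented.

```r
# Toy data (invented values, in K USD; column names follow the data description).
salary <- data.frame(
  Industry      = c("IT", "IT", "Health", "Health", "Finance"),
  Lower.Salary  = c(90, 110, 60, 70, 80),
  Avg.Salary.K. = c(120, 140, 80, 90, 100),
  Upper.Salary  = c(150, 170, 100, 110, 120),
  stringsAsFactors = FALSE
)

# Mean lower, average, and upper salary for each industry.
ranges <- aggregate(
  cbind(Lower.Salary, Avg.Salary.K., Upper.Salary) ~ Industry,
  data = salary,
  FUN  = mean
)

# Order industries from highest to lowest average salary.
ranges <- ranges[order(-ranges$Avg.Salary.K.), ]
print(ranges)
```

Each row of `ranges` then supplies the lower end, midpoint, and upper end of one industry's bar in a range plot.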

Salary Range by Job Title

Although they are all considered “data-related” titles, these positions differ a lot in their specific functions, required skills, and salaries. Here we examine the salary range for each job title. The two titles with the highest salary ranges are “Director” and “Machine Learning Engineer”. Both have a maximum salary level beyond 150k and a minimum salary level of almost 100k. The lowest two are “Analysts” and “Data Analysts”. Their maximum salary level is only around 85k, which does not even reach the minimum level for machine learning engineers. It is striking to find such a large difference in salary across job titles within the field of data science. So why do machine learning engineers earn this much? What is actually causing this difference? We’ll examine this further in the following sections.

It is also an interesting question to understand how salary varies across different states.

Average Salary By State

We would like to understand how average salaries differ across states. Hence, we overlay a heatmap on the US map to show the average salary of “data-related” jobs in each state. Gray areas indicate states with no data in this dataset, and brighter orange indicates a higher average salary. From this plot, we can see that California and Illinois have higher average salaries than the other states. One possible explanation is that these states have higher demand for data science jobs, which may push up average salaries for “data-related” jobs. When looking for such jobs, these states could be good choices to consider. On the other hand, states like Idaho and Nebraska have relatively lower average salaries. This doesn’t necessarily mean that these states are bad choices for data-related jobs, as they might also have lower living costs.

What is the relationship between skill requirements and salaries?

Now we want to understand the relationship between skill requirements and salaries. We first look at how often each skill is required across the data science jobs.

What skill requirements are more predominant among the jobs?

This plot displays how many of the data scientist jobs in our dataset require each skill. From this plot, we can see that Python, Excel, and SQL are the most in-demand skills in our dataset, while Flink is the least in demand. The plot makes it easy to see which skills are in great demand and which are less important when seeking a data scientist job.
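Because each skill variable is a 0/1 indicator, the number of postings requiring a skill is simply the column sum. The sketch below shows this on an invented indicator table; only the column names come from the data description.

```r
# Toy 0/1 skill indicators for five postings (invented values).
skills <- data.frame(
  Python = c(1, 1, 1, 1, 0),
  excel  = c(1, 0, 1, 1, 0),
  sql    = c(1, 1, 0, 0, 0),
  flink  = c(0, 0, 0, 0, 1)
)

# Number of postings that require each skill, most common first.
skill_counts <- sort(colSums(skills), decreasing = TRUE)
print(skill_counts)
```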

How will the skill requirements be related to salaries?

To answer the question we posed in the previous section about what is actually causing the difference in salary among job titles, we use a correlation plot. We’ll specifically look at the last three rows, which show the correlation between salaries and skill sets. We can see that the skill most strongly correlated with salary is Python, followed by scikit, tensor, and keras, which are all Python-related skills. Besides Python, SQL, AWS, Hadoop, and MongoDB, which are either cloud or database skills, show a somewhat strong correlation of around 0.5. This is consistent with what machine learning engineers do: their daily work is building models with machine learning algorithms, and for now Python, especially the sklearn, tensor, and keras packages, is the major tool for machine learning. This likely explains why those skills are so closely tied to high pay. Cloud and database skills belong more to data engineering, and from the salary range plots we can see that data engineers are also getting high salaries.

Then, we continue to examine the relationship between required skills and salary. We analyze whether there is a salary difference between jobs that do not demand Python, Excel, or SQL and jobs that do.
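The salary rows of such a correlation plot can be reproduced numerically with `cor()`. This is a minimal sketch on invented data; a Pearson correlation between a 0/1 indicator and a numeric salary is assumed here, since the report does not name the correlation measure it used.

```r
# Toy data (invented values; column names follow the data description).
salary <- data.frame(
  Avg.Salary.K. = c(70, 85, 95, 120, 140, 150),
  Python        = c(0, 0, 1, 1, 1, 1),
  excel         = c(1, 1, 0, 1, 0, 0)
)

skill_cols <- c("Python", "excel")

# Pearson correlation of each skill indicator with average salary.
cors <- sapply(salary[skill_cols], function(x) cor(x, salary$Avg.Salary.K.))
print(round(cors, 2))
```

A positive value means postings that require the skill tend to list higher salaries; it does not by itself show that the skill causes the higher pay.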

Average Salary & Python

We begin by comparing the average salary of jobs that require Python with that of jobs that do not:

Although the two distributions overlap, in general the average salary for jobs that require Python is higher than for those that do not. Both the lowest and the highest average salary for jobs requiring Python are higher than those for jobs not requiring Python, and the peak of the average salary distribution for jobs requiring Python is also higher.

## 
##  Pearson's Chi-squared test
## 
## data:  salary$Avg.Salary.K. and salary$Python
## X-squared = 401.68, df = 218, p-value = 4.921e-13

To check this result, we conduct a chi-square test of independence between average salary and the Python skill requirement. The test gives a p-value of 4.921e-13, which is quite low. Thus, we have strong evidence of a relationship between average salary and the Python skill requirement.
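The chi-square test above treats every distinct salary value as its own category. As a complementary check, and not the method used in this report, one could compare the mean salary of the two groups directly with Welch's two-sample t-test. The sketch below uses invented numbers.

```r
# Toy data (invented values): average salary in K USD and the Python indicator.
salary <- data.frame(
  Avg.Salary.K. = c(60, 70, 75, 80, 65, 100, 110, 120, 130, 115),
  Python        = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Welch two-sample t-test (the default in t.test): is mean salary different
# between postings with Python = 0 and postings with Python = 1?
welch <- t.test(Avg.Salary.K. ~ Python, data = salary)

print(welch$estimate)  # group means
print(welch$p.value)   # small p-value suggests the means differ
```

Unlike the chi-square test, this uses the ordering and size of the salary values, so it also indicates which group earns more.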

Average Salary & Excel

Then we compare the average salary of jobs that require Excel with that of jobs that do not:

There does not appear to be a significant difference in average salary between jobs that require Excel and those that do not. Both the highest and the lowest average salary for jobs requiring Excel are lower than those for jobs not requiring Excel, and the peaks almost overlap.

Average Salary & sql

Finally, we compare the average salary of jobs that require SQL with that of jobs that do not:

We can see that there is an even smaller difference in average salary between jobs requiring SQL and jobs not requiring SQL. The range of average salary for jobs requiring SQL falls completely within the range for jobs not requiring SQL, and the peak for jobs requiring SQL is only slightly higher than that for jobs not requiring SQL.

In conclusion, among the three most commonly required skills, Python makes the most difference in average salary, while Excel and SQL do not seem to be related to a higher average salary.

What are the most common words in Job Descriptions?

Apart from the required technical skills, what other qualifications are companies looking for in candidates? We examine this through a word cloud of job descriptions. Unsurprisingly, the most common word is “data”. Another interesting word is “experience”. It seems that companies are looking for candidates with relevant past experience, which is discouraging for interns and new grads who might have none. “Analysis” and “Development” show up without any surprise, but “business” also appears quite often. In fact, most companies hire data workers not only to process messy data, but also to generate useful business insights from the data and thus drive the growth of the company. In that sense, data workers serve a business need. This suggests that core technical skills are indeed required, but that some knowledge or experience in business might also be helpful.
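A word cloud is driven by word frequencies, which can be computed in base R as sketched below. The two job descriptions and the short stop-word list are invented for illustration; the report does not describe its own text-processing steps.

```r
# Two invented job descriptions standing in for the Job.Description column.
descriptions <- c(
  "Experience with data analysis and business data",
  "Data development experience required"
)

# Lower-case the text and split it on anything that is not a letter.
words <- unlist(strsplit(tolower(descriptions), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Remove a few common filler words (a hypothetical, minimal stop-word list).
stopwords <- c("with", "and", "the", "of", "a", "to", "in", "required")
words <- words[!words %in% stopwords]

# Word frequencies, most common first.
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The resulting `freq` table is the kind of input a word cloud uses, with more frequent words drawn larger.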

Conclusion

Here we have the salary range of entry-level software engineers reported by Glassdoor. We can see that the range is from 57K to 205K. When we look back at the salary range by job title graph, we can see that data analysts, modelers, and project managers in particular, regardless of their level, could not make as much as an entry-level software engineer. The required skills and qualifications, on the other hand, are a lot more diverse, indicating that you might need to master Python programming, machine learning, databases, and even cloud tools to secure a high-paying job in the field of data science. Oh, and don’t forget to acquire relevant experience before you apply! From our analysis of salary, the job market, and relevant skills, the most important insight is that data science is not as favorable a field as it sounds. With most openings being analyst-level positions with relatively low pay, while the better-paid positions require the most technical skills, new grads might want to think twice before going after data science positions.

Future Research

One question we have not addressed, but which is worth considering, is the level of education one needs to secure a data science job. With intensifying competition, what kind of degree do you need to land a job with a big name such as Google? What level of education do you need if you want to go into a specific industry such as Research and Development? Is a bachelor’s degree adequate? Is a Ph.D. necessary? Do candidates with a master’s degree get hired more easily and get paid more than those with a bachelor’s degree? Questions like these are particularly intriguing and important for students who would like to eventually go into the field of Data Science, especially into a specific industry or a specific company. However, due to the lack of relevant data in our dataset, we are unable to answer these questions for now. With more information and data, future work could examine the relationship between education level and data science jobs, so that students who aspire to become data scientists or to work in similar roles could have a more reliable and realistic perspective.

Post-credits scene