How Much Do Data Scientists Earn?

This article examines the salary distribution of data science positions worldwide. It analyzes how job title, country, work experience, employment type, and company size affect salary, and offers tips for job hunting and job-hopping.

💡 Author: Han Nobuko@ShowMeAI
📘 Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
📘 AI Jobs & Strategy series: https://www.showmeai.tech/tutorials/47
📘 Address of this article: https://www.showmeai.tech/article-detail/402
📢 Statement: All rights reserved. Please contact the platform and the author before reprinting, and indicate the source.
📢 Bookmark ShowMeAI to see more content

💡 Introduction

Data science continues to gain popularity in fields such as the Internet, healthcare, telecommunications, retail, sports, aviation, and the arts. Data science jobs rank third on 📘 Glassdoor's list of the best jobs in America, with roughly 10,071 job openings for 2022.

Beyond the appeal of working with data, salaries for data science positions also attract a lot of attention. In this article, ShowMeAI uses data to answer the following questions:

  • What are the most common jobs in data science?
  • Which country has the highest salaries and the most opportunities?
  • What is a typical salary range?
  • How important is job level to a data scientist?
  • How does full-time data science work compare with freelance work?
  • What are the highest paying jobs in data science?
  • What are the highest paying jobs in data science on average?
  • What are the minimum and maximum salaries for data science professionals?
  • What size are the companies that hire data science professionals?
  • Is salary related to company size?
  • What is the ratio of WFH (working from home) to WFO (working from the office)?
  • How do salaries for data science jobs grow each year?
  • If someone is looking for a data science job, what should they search for online?
  • If you have a few years of experience as an entry-level employee, what size of company should you consider moving to?

💡 Data Description

The dataset used in this article is the 🏆 data science job salary dataset, which can be downloaded from ShowMeAI's Baidu Netdisk address.

🏆 Practice dataset download (Baidu Netdisk): reply "actual combat" to the official account "ShowMeAI Research Center", or click here to get this article's [37] Data scientist salary analysis and visualization based on pandasql and plotly "ds_salaries dataset"

⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub

The dataset contains 11 columns, the corresponding names and meanings are as follows:

| Parameter | Meaning |
| --- | --- |
| work_year | Year the salary was paid |
| experience_level | Experience level at the time the salary was paid |
| employment_type | Type of employment |
| job_title | Job title |
| salary | Total salary paid |
| salary_currency | Currency of the salary paid |
| salary_in_usd | Salary converted to USD |
| employee_residence | Employee's primary country of residence |
| remote_ratio | Overall amount of work done remotely |
| company_location | Country where the employer's principal office is located |
| company_size | Company size based on number of employees |

This analysis uses Pandas and SQL. For systematic learning and hands-on practice, see ShowMeAI's data analysis tutorials and the corresponding tool cheat sheets:

📘 Illustrated Data Analysis: a tutorial series from getting started to mastery

📘 Programming Language Cheat Sheet | SQL Cheat Sheet

📘 Pandas Cheat Sheet

📘 Matplotlib Cheat Sheet

💡 Importing the Libraries

We first import the libraries we need: pandas to read the data, and Plotly and matplotlib for visualization. Since the analysis in this article is written in SQL, we also use the 📘 pandasql library.

# For loading data
import pandas as pd
import numpy as np

# For SQL queries
import pandasql as ps

# For plotting graphs / visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.offline import iplot
import plotly.figure_factory as ff

import plotly.io as pio
import seaborn as sns
import matplotlib.pyplot as plt

# To show graph below the code or on same notebook
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# To convert country code to country name
import country_converter as coco

import warnings
warnings.filterwarnings('ignore')

💡 Loading the Dataset

The dataset we downloaded is in CSV format, so we can use the read_csv method to read our dataset.

# Loading data
salaries = pd.read_csv('ds_salaries.csv')

To see the first five records, we can use the salaries.head() method.

Using pandasql to accomplish the same task is like this:

# Helper function that executes a SQL query against the DataFrames in scope
def query(sql):
    return ps.sqldf(sql)

# Showing Top 5 rows of data
query("""
        SELECT * 
        FROM salaries 
        LIMIT 5
""")
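
For intuition: by default, pandasql's sqldf copies the DataFrames named in the query into an in-memory SQLite database and runs the SQL there. The sketch below mimics that flow with Python's built-in sqlite3 module. The three sample rows are made up for illustration and are not taken from ds_salaries.csv.

```python
import sqlite3

# Hypothetical sample rows (illustrative values, not from the dataset)
rows = [
    (2020, "Data Scientist", 70000),
    (2021, "Data Engineer", 90000),
    (2022, "Data Analyst", 60000),
]

# An in-memory SQLite database plays the role of pandasql's default backend
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salaries (work_year INTEGER, job_title TEXT, salary_in_usd INTEGER)"
)
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", rows)

# Same shape of query as above, limited to two rows
first_two = conn.execute("SELECT * FROM salaries LIMIT 2").fetchall()
print(first_two)
```

Because the queries are executed by SQLite, the SQL used throughout this article follows SQLite's dialect.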

💡 Data Preprocessing

The first column, "Unnamed: 0", is not useful for the analysis, so we remove it first:

salaries = salaries.drop('Unnamed: 0', axis = 1)

Let's look at the missing values in the dataset:

salaries.isna().sum()

output:

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64
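
Since the rest of the article leans on SQL, the same check can also be written as a query. The sketch below counts NULLs per column on a small made-up table using the standard-library sqlite3 module; the rows are hypothetical and only there to make the counts visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (job_title TEXT, salary_in_usd INTEGER)")

# Hypothetical rows; Python's None is stored as SQL NULL
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [
        ("Data Scientist", 70000),
        ("Data Engineer", None),
        (None, 60000),
        ("Data Analyst", None),
    ],
)

# In SQLite, `col IS NULL` evaluates to 1 or 0, so SUM counts the NULLs per column
missing = conn.execute("""
    SELECT SUM(job_title IS NULL)     AS job_title_missing,
           SUM(salary_in_usd IS NULL) AS salary_in_usd_missing
    FROM salaries
""").fetchone()
print(missing)
```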

The dataset has no missing values, so no missing-value handling is needed. The employee_residence and company_location columns use short country codes, which we replace with full country names for easier reading:

# Converting countries code to country names
salaries["employee_residence"] = coco.convert(names=salaries["employee_residence"], to="name")
salaries["company_location"] = coco.convert(names=salaries["company_location"], to="name")

The experience_level column uses the following abbreviations for experience levels:

  • EN: Entry Level
  • MI: Mid level
  • SE: Senior Level
  • EX: Expert Level

For easier understanding, we also replace these abbreviations with full names.

# Replacing values in column - experience_level :
salaries['experience_level'] = query("""SELECT 
                                          REPLACE(
                                            REPLACE(
                                              REPLACE(
                                                REPLACE(
                                                  experience_level, 'MI', 'Mid level'), 
                                                                    'SE', 'Senior Level'), 
                                                                    'EN', 'Entry Level'), 
                                                                    'EX', 'Expert Level') 
                                        FROM 
                                          salaries""")

In the same way, we replace the employment type abbreviations with their full names:

  • FT: Full Time
  • PT: Part Time
  • CT: Contract
  • FL: Freelance

# Replacing values in column - employment_type :
salaries['employment_type'] = query("""SELECT 
                                          REPLACE(
                                            REPLACE(
                                              REPLACE(
                                                REPLACE(
                                                  employment_type, 'PT', 'Part Time'), 
                                                                    'FT', 'Full Time'), 
                                                                    'FL', 'Freelance'), 
                                                                    'CT', 'Contract') 
                                        FROM 
                                          salaries""")

The company_size column uses the following abbreviations, which we also expand:

  • S: Small
  • M: Medium
  • L: Large

# Replacing values in column - company_size :
salaries['company_size'] = query("""SELECT 
                                       REPLACE(
                                         REPLACE(
                                           REPLACE(
                                             company_size, 'M', 'Medium'), 
                                                           'L', 'Large'), 
                                                           'S', 'Small') 
                                    FROM 
                                       salaries""")

We also replace the numeric values of the remote_ratio column with descriptive labels:

# Replacing values in column - remote_ratio :
salaries['remote_ratio'] = query("""SELECT 
                                        REPLACE(
                                          REPLACE(
                                            REPLACE(
                                              remote_ratio, '100', 'Fully Remote'), 
                                                            '50', 'Partially Remote'), 
                                                            '0', 'Non Remote Work') 
                                    FROM 
                                      salaries""")
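
One caveat with nested REPLACE calls is that they operate on substrings, so the result depends on the order of the calls: the query above handles '100' before '0', and reversing that order would corrupt the labels. A CASE expression compares whole values instead. The sketch below contrasts the two on a made-up three-row table, using the standard-library sqlite3 module; it is an illustration of the pitfall, not a change to the steps above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (remote_ratio INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?)", [(0,), (50,), (100,)])

# Replacing '0' first also rewrites the zeros inside 50 and 100
wrong_order = [row[0] for row in conn.execute("""
    SELECT REPLACE(remote_ratio, '0', 'Non Remote Work')
    FROM salaries
    ORDER BY remote_ratio
""")]

# CASE compares the whole value, so the order of the branches does not matter
with_case = [row[0] for row in conn.execute("""
    SELECT CASE remote_ratio
             WHEN 0   THEN 'Non Remote Work'
             WHEN 50  THEN 'Partially Remote'
             WHEN 100 THEN 'Fully Remote'
           END
    FROM salaries
    ORDER BY remote_ratio
""")]

print(wrong_order)
print(with_case)
```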

This is the final output after preprocessing.

💡 Data Analysis & Visualization

💦 What are the most common jobs in data science?

top10_jobs = query("""
                    SELECT job_title,
                    Count(*) AS job_count
                    FROM salaries
                    GROUP BY job_title
                    ORDER BY job_count DESC
                    LIMIT 10
""")
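
As a cross-check on what this query computes: GROUP BY with COUNT(*), ORDER BY ... DESC and LIMIT is the SQL way of saying "count each value and keep the most frequent ones". A minimal pure-Python sketch of the same idea with collections.Counter, on a hypothetical list of titles:

```python
from collections import Counter

# Hypothetical job titles standing in for the job_title column
job_titles = [
    "Data Scientist", "Data Engineer", "Data Scientist",
    "Data Analyst", "Data Engineer", "Data Scientist",
]

# most_common(n) returns the n most frequent values together with their counts
top2_jobs = Counter(job_titles).most_common(2)
print(top2_jobs)
```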

Let's draw a bar chart for a more intuitive understanding:

data = go.Bar(x = top10_jobs['job_title'], y = top10_jobs['job_count'],
             text = top10_jobs['job_count'], textposition = 'inside',
             textfont = dict(size = 12,
                            color = 'white'),
             marker = dict(color = px.colors.qualitative.Alphabet,
                          opacity = 0.9,
                          line_color = 'black',
                          line_width = 1))


layout = go.Layout(title = {'text': "<b>Top 10 Data Science Jobs</b>", 
                            'x':0.5, 'xanchor': 'center'},
                   xaxis = dict(title = '<b>Job Title</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Total</b>'),
                   width = 900,
                   height = 600)


fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Market Distribution of Data Science Jobs

fig = px.pie(top10_jobs, values='job_count', 
              names='job_title', 
              color_discrete_sequence = px.colors.qualitative.Alphabet)


fig.update_layout(title = {'text': "<b>Distribution of job positions</b>", 
                            'x':0.5, 'xanchor': 'center'},
                   width = 900,
                   height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Countries with the Most Data Science Jobs

top10_com_loc = query("""
                    SELECT company_location AS company,
                    Count(*) AS job_count
                    FROM salaries
                    GROUP BY company
                    ORDER BY job_count DESC
                    LIMIT 10
""")


data = go.Bar(x = top10_com_loc['company'], y = top10_com_loc['job_count'],
             textfont = dict(size = 12,
                            color = 'white'),
             marker = dict(color = px.colors.qualitative.Alphabet,
                          opacity = 0.9,
                          line_color = 'black',
                          line_width = 1))


layout = go.Layout(title = {'text': "<b>Top 10 Data Science Countries</b>", 
                            'x':0.5, 'xanchor': 'center'},
                   xaxis = dict(title = '<b>Countries</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Total</b>'),
                   width = 900,
                   height = 600)


fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

The chart above shows that the United States has the most data science job opportunities. Next we look at salaries around the world; run the code below to see the visualization.

# Note: df is an alias of salaries, so the new column is also added to salaries
df = salaries
df["company_country"] = coco.convert(names = salaries["company_location"], to = 'name_short')

# Total salary per country, plus a log10 version of that total for the color scale
temp_df = df.groupby('company_country')['salary_in_usd'].sum().reset_index()
temp_df['salary_scale'] = np.log10(temp_df['salary_in_usd'])


fig = px.choropleth(temp_df, locationmode = 'country names', locations = "company_country",
                   color = "salary_scale", hover_name = "company_country",
                   hover_data = ['salary_in_usd'], 
                    color_continuous_scale = 'Jet',
                   )


fig.update_layout(title={'text':'<b>Salaries across the World</b>', 
                         'xanchor': 'center','x':0.5})
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()
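
A likely reason for the log10 step is that salary totals per country span several orders of magnitude: on a linear color scale, one very large total pushes almost every other country into the same shade. The sketch below compares where three values land on a linear and on a log10 scale. The country names and totals are invented for illustration, and scale_positions is a hypothetical helper, not part of Plotly.

```python
import math

# Hypothetical total salaries per country in USD (invented for illustration)
totals = {"Country A": 50_000_000, "Country B": 2_000_000, "Country C": 100_000}

def scale_positions(values):
    """Map values to [0, 1] as a continuous color scale does (min -> 0, max -> 1)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

linear = scale_positions(list(totals.values()))
logged = scale_positions([math.log10(v) for v in totals.values()])

for country, lin, log in zip(totals, linear, logged):
    print(f"{country}: linear={lin:.2f}, log10={log:.2f}")
```

On the linear scale the middle country sits almost at the bottom of the range, while on the log10 scale it lands near the middle and stays distinguishable.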

💦 Average Salary (by Currency)

df = salaries[['salary_currency','salary_in_usd']].groupby(['salary_currency'], as_index = False).mean().set_index('salary_currency').reset_index().sort_values('salary_in_usd', ascending = False)

#Selecting top 14
df = df.iloc[:14]
fig = px.bar(df, x = 'salary_currency',
            y = 'salary_in_usd',
            color = 'salary_currency',
            color_discrete_sequence = px.colors.qualitative.Safe,
            )

fig.update_layout(title={'text':'<b>Average salary as a function of currency</b>', 
                         'xanchor': 'center','x':0.5},
                 xaxis_title = '<b>Currency</b>',
                 yaxis_title = '<b>Mean Salary</b>')
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

On average, salaries paid in US dollars are the highest, followed by those paid in Swiss francs and Singapore dollars.

df = salaries[['company_country','salary_in_usd']].groupby(['company_country'], as_index = False).mean().set_index('company_country').reset_index().sort_values('salary_in_usd', ascending = False)


#Selecting top 14
df = df.iloc[:14]
fig = px.bar(df, x = 'company_country',
            y = 'salary_in_usd',
            color = 'company_country',
            color_discrete_sequence = px.colors.qualitative.Dark2,
            )


fig.update_layout(title = {'text': "<b>Average salary as a function of company location</b>", 
                            'x':0.5, 'xanchor': 'center'},
                   xaxis = dict(title = '<b>Company Location</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Mean Salary</b>'),
                   width = 900,
                   height = 600)


fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Data Science Work Experience Level Distribution

job_exp = query("""
            SELECT experience_level, Count(*) AS job_count
            FROM salaries
            GROUP BY experience_level
            ORDER BY job_count ASC
""")



data = go.Bar(x = job_exp['job_count'], y = job_exp['experience_level'],
              orientation = 'h', text = job_exp['job_count'],
             marker = dict(color = px.colors.qualitative.Alphabet,
                          opacity = 0.9,
                          line_color = 'white',
                          line_width = 2))


layout = go.Layout(title = {'text': "<b>Jobs on Experience Levels</b>",
                           'x':0.5, 'xanchor':'center'},
                  xaxis = dict(title='<b>Total</b>', tickmode = 'array'),
                  yaxis = dict(title='<b>Experience lvl</b>'),
                  width = 900,
                  height = 600)

fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2', 
                  paper_bgcolor = '#f1e7d2')
fig.show()

As the chart above shows, most data science positions are at the senior level, with very few at the expert level.

💦 Data Science Jobs Employment Type Distribution

job_emp = query("""
SELECT employment_type,
COUNT(*) AS job_count
FROM salaries
GROUP BY employment_type
ORDER BY job_count ASC
""")


data =  go.Bar(x = job_emp['job_count'], y = job_emp['employment_type'], 
               orientation ='h',text = job_emp['job_count'],
               textposition ='outside',
               marker = dict(color = px.colors.qualitative.Alphabet,
                             opacity = 0.9,
                             line_color = 'white',
                             line_width = 2))


layout = go.Layout(title = {'text': "<b>Jobs on Employment Type</b>",
                           'x':0.5, 'xanchor': 'center'},
                   xaxis = dict(title='<b>Total</b>', tickmode = 'array'),
                   yaxis =dict(title='<b>Emp Type lvl</b>'),
                   width = 900,
                   height = 600)


fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2', 
                  paper_bgcolor = '#f1e7d2')
fig.show()

The chart above shows that the majority of data scientists work full-time, with far fewer contract workers and freelancers.

💦 Data Science Jobs Trends

job_year = query("""
    SELECT work_year, COUNT(*) AS job_count
    FROM salaries
    GROUP BY work_year
    ORDER BY work_year
""")


# Rows are ordered by year so the line is drawn chronologically
data = go.Scatter(x = job_year['work_year'], y = job_year['job_count'],
                  marker = dict(size = 20,
                                line_width = 1.5,
                                line_color = 'white',
                                color = px.colors.qualitative.Alphabet),
                  line = dict(color = '#ED7D31', width = 4), mode = 'lines+markers')


layout  = go.Layout(title = {'text' : "<b><i>Data Science jobs Growth (2020 to 2022)</i></b>",
                             'x' : 0.5, 'xanchor' : 'center'},
                    xaxis = dict(title = '<b>Year</b>'),
                    yaxis = dict(title = '<b>Jobs</b>'),
                    width = 900,
                    height = 600)


fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(tickvals = ['2020','2021','2022'])
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Data Science Job Salary Distribution

salary_usd = query("""
                    SELECT salary_in_usd 
                    FROM salaries
""")


import matplotlib.pyplot as plt

plt.figure(figsize = (20, 8))
sns.set(rc = {'axes.facecolor' : '#f1e7d2',
             'figure.facecolor' : '#f1e7d2'})

p = sns.histplot(salary_usd["salary_in_usd"], 
                kde = True, alpha = 1, fill = True,
                edgecolor = 'black', linewidth = 1)
p.axes.lines[0].set_color("orange")
plt.title("Data Science Salary Distribution \n", fontsize = 25)
plt.xlabel("Salary", fontsize = 18)
plt.ylabel("Count", fontsize = 18)
plt.show()
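
A histogram is easier to discuss alongside a few summary numbers, such as the quartiles that bound the interquartile range (IQR). The sketch below computes them with the standard-library statistics module on made-up salaries. We use method="inclusive", which interpolates linearly between data points; to our understanding this corresponds to the default behavior of pandas' quantile, but treat that equivalence as an assumption.

```python
from statistics import quantiles

# Hypothetical salaries in USD (invented for illustration)
salaries_usd = [40000, 60000, 80000, 100000, 120000, 150000, 200000, 450000]

# n=4 returns the three cut points: Q1, the median, and Q3
q1, q2, q3 = quantiles(salaries_usd, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Note how the single very large value barely moves the quartiles, which is why the IQR is a robust way to describe a typical salary range.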

💦 Top 10 Highest Paying Data Science Jobs

salary_hi10 = query("""
    SELECT job_title,
    MAX(salary_in_usd) AS salary
    FROM salaries
    GROUP BY job_title
    ORDER BY salary DESC
    LIMIT 10
""")

data = go.Bar(x = salary_hi10['salary'],
             y = salary_hi10['job_title'],
             orientation = 'h',
             text = salary_hi10['salary'],
             textposition = 'inside',
             insidetextanchor = 'middle',
              textfont = dict(size = 13,
                             color = 'black'),
              marker = dict(color = px.colors.qualitative.Alphabet,
                           opacity = 0.9,
                           line_color = 'black',
                           line_width = 1))

layout = go.Layout(title = {'text': "<b>Top 10 Highest paid Data Science Jobs</b>",
                           'x':0.5,
                           'xanchor': 'center'},
                   xaxis = dict(title = '<b>salary</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Job Title</b>'),
                   width = 900,
                   height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Principal Data Engineer is the highest-paying data science job in this dataset.

💦 Average Salary Ranking by Position

salary_av10 = query("""
    SELECT job_title,
    ROUND(AVG(salary_in_usd)) AS salary
    FROM salaries
    GROUP BY job_title
    ORDER BY salary DESC
    LIMIT 10
""")

data = go.Bar(x = salary_av10['salary'],
             y = salary_av10['job_title'],
             orientation = 'h',
             text = salary_av10['salary'],
             textposition = 'inside',
             insidetextanchor = 'middle',
              textfont = dict(size = 13,
                             color = 'white'),
              marker = dict(color = px.colors.qualitative.Alphabet,
                           opacity = 0.9,
                           line_color = 'white',
                           line_width = 2))

layout = go.Layout(title = {'text': "<b>Top 10 Average paid Data Science Jobs</b>",
                           'x':0.5,
                           'xanchor': 'center'},
                   xaxis = dict(title = '<b>salary</b>', tickmode = 'array'),
                   yaxis = dict(title = '<b>Job Title</b>'),
                   width = 900,
                   height = 600)
fig = go.Figure(data = data, layout = layout)
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()
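
Ranking job titles by their single highest salary and ranking them by average salary answer different questions: one outlier is enough to put a title at the top of the first list without making it well paid on average. A toy illustration with the standard-library sqlite3 module follows; the titles and amounts are invented, and ranking is a hypothetical helper written for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (job_title TEXT, salary_in_usd INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?, ?)", [
    # One outlier, otherwise modest pay
    ("Title A", 600000), ("Title A", 50000), ("Title A", 55000), ("Title A", 45000),
    # Consistently high pay
    ("Title B", 200000), ("Title B", 180000),
])

def ranking(aggregate):
    """Return (job_title, salary) rows ordered by MAX or AVG salary, highest first."""
    if aggregate not in ("MAX", "AVG"):
        raise ValueError("aggregate must be 'MAX' or 'AVG'")
    sql = f"""
        SELECT job_title, {aggregate}(salary_in_usd) AS salary
        FROM salaries
        GROUP BY job_title
        ORDER BY salary DESC
    """
    return conn.execute(sql).fetchall()

by_max = ranking("MAX")
by_avg = ranking("AVG")
print(by_max)
print(by_avg)
```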

💦 Data Science Salary Trends

salary_year = query("""
    SELECT ROUND(AVG(salary_in_usd)) AS salary,
    work_year AS year
    FROM salaries
    GROUP BY year
    ORDER BY salary DESC
""")

data = go.Scatter(x = salary_year['year'],
                 y = salary_year['salary'],
                 marker = dict(size = 20,
                 line_width = 1.5,
                 line_color = 'black',
                 color = '#ED7D31'),
                 line = dict(color = 'black', width = 4), mode = 'lines+markers')

layout = go.Layout(title = {'text' : "<b>Data Science Salary Growth (2020 to 2022) </b>",
                            'x' : 0.5,
                            'xanchor' : 'center'},
                   xaxis = dict(title = '<b>Year</b>'),
                   yaxis = dict(title = '<b>Salary</b>'),
                   width = 900,
                   height = 600)


fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(tickvals = ['2020','2021','2022'])
fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Experience Level & Salary

salary_exp = query("""
    SELECT experience_level AS 'Experience Level',
    salary_in_usd AS Salary
    FROM salaries
""")

fig = px.violin(salary_exp, x = 'Experience Level', y = 'Salary', color = 'Experience Level', box = True)

fig.update_layout(title = {'text': "<b>Salary on Experience Level</b>",
                            'xanchor': 'center','x':0.5},
                   xaxis = dict(title = '<b>Experience level</b>'),
                   yaxis = dict(title = '<b>salary</b>', 
                                ticktext = [-300000, 0, 100000, 200000, 300000, 400000, 500000, 600000, 700000]),
                   width = 900,
                   height = 600)

fig.update_layout(paper_bgcolor= '#f1e7d2', 
                  plot_bgcolor = '#f1e7d2', 
                  showlegend = False)
fig.show()

💦 Salary Trends by Experience Level

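# Median salary per year and experience level (select the numeric column explicitly,
# since recent pandas versions raise an error when asked for the median of text columns)
tmp_df = salaries.groupby(['work_year', 'experience_level'])['salary_in_usd'].median().reset_index()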

fig = px.line(tmp_df, x='work_year', y='salary_in_usd', color='experience_level', symbol="experience_level")

fig.update_layout(title = {'text': "<b>Median Salary Trend By Experience Level</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Working Year</b>', tickvals = [2020, 2021, 2022], tickmode = 'array'),
                  yaxis = dict(title = '<b>Salary</b>'),
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

Observations:

  1. During the COVID-19 pandemic (2020-2021), expert-level salaries were very high, though they declined somewhat.
  2. After 2021, salaries at the expert and senior levels increased.
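
The trend lines in this section are built from per-group medians. One plausible reason to prefer the median over the mean here is that a handful of very large salaries can pull a group's mean far above what most people in the group earn. A pure-Python sketch of the per-group calculation, on invented records:

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (work_year, experience_level, salary_in_usd) records
records = [
    (2021, "Senior Level", 120000),
    (2021, "Senior Level", 130000),
    (2021, "Senior Level", 500000),  # outlier
    (2022, "Senior Level", 140000),
    (2022, "Senior Level", 150000),
]

# Collect the salaries of each (year, level) group
groups = defaultdict(list)
for year, level, salary in records:
    groups[(year, level)].append(salary)

medians = {key: median(values) for key, values in groups.items()}
means = {key: mean(values) for key, values in groups.items()}

for key in groups:
    print(key, "median:", medians[key], "mean:", means[key])
```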

💦 Year & Salary Distribution

year_gp = salaries.groupby('work_year')
hist_data = [year_gp.get_group(2020)['salary_in_usd'],
             year_gp.get_group(2021)['salary_in_usd'],
            year_gp.get_group(2022)['salary_in_usd']]
group_labels = ['2020', '2021', '2022']

fig = ff.create_distplot(hist_data, group_labels, show_hist = False)


fig.update_layout(title = {'text': "<b>Salary Distribution By Working Year</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Salary</b>'),
                  yaxis = dict(title = '<b>Kernel Density</b>'),
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Employment Type & Salary

salary_emp = query("""
    SELECT employment_type AS 'Employment Type',
    salary_in_usd AS Salary
    FROM salaries
""")

fig = px.box(salary_emp,x='Employment Type',y='Salary',
       color = 'Employment Type')


fig.update_layout(title = {'text': "<b>Salary by Employment Type</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Employment Type</b>'),
                  yaxis = dict(title = '<b>Salary</b>'),
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Company Size Distribution

comp_size = query("""
                SELECT company_size,
                COUNT(*) AS count
                FROM salaries
                GROUP BY company_size
""")


import plotly.graph_objects as go
data = go.Pie(labels = comp_size['company_size'], 
              values = comp_size['count'].values,
              hoverinfo = 'label',
              hole = 0.5,
              textfont_size = 16,
              textposition = 'auto')
fig = go.Figure(data = data)


fig.update_layout(title = {'text': "<b>Company Size</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b></b>'),
                  yaxis = dict(title = '<b></b>'),
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Experience Level Ratio by Company Size

df = salaries.groupby(['company_size', 'experience_level']).size()
comp_s = np.round(df['Small'].values / df['Small'].values.sum(),2)
comp_m = np.round(df['Medium'].values / df['Medium'].values.sum(),2)
comp_l = np.round(df['Large'].values / df['Large'].values.sum(),2)

fig = go.Figure()
categories = ['Entry Level', 'Expert Level','Mid level','Senior Level']

fig.add_trace(go.Scatterpolar(
    r = comp_s,
    theta = categories,
    fill = 'toself',
    name = 'Company Size S'))

fig.add_trace(go.Scatterpolar(
    r = comp_m,
    theta = categories,
    fill = 'toself',
    name = 'Company Size M'))

fig.add_trace(go.Scatterpolar(
    r = comp_l,
    theta = categories,
    fill = 'toself',
    name = 'Company Size L'))

fig.update_layout(
    polar = dict(
    radialaxis = dict(range = [0, 0.6])),
    showlegend = True,
)


fig.update_layout(title = {'text': "<b>Proportion of Experience Level In Different Company Sizes</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b></b>'),
                  yaxis = dict(title = '<b></b>'),
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()
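
The proportions plotted on the radar chart are each experience level's share within one company size. The code above reads them positionally, which assumes that every company size contains all four experience levels in the same alphabetical order as the categories list. The pure-Python sketch below shows the same calculation with explicit lookups, so a level that is absent from a group becomes 0.0 instead of shifting the other values. The pairs are invented for illustration.

```python
from collections import Counter

# Hypothetical (company_size, experience_level) pairs
pairs = [
    ("Small", "Entry Level"), ("Small", "Entry Level"),
    ("Small", "Mid level"), ("Small", "Senior Level"),
    ("Large", "Senior Level"), ("Large", "Senior Level"),
    ("Large", "Mid level"), ("Large", "Expert Level"),
]
categories = ["Entry Level", "Expert Level", "Mid level", "Senior Level"]

counts = Counter(pairs)                            # rows per (size, level)
size_totals = Counter(size for size, _ in pairs)   # rows per size

# Share of each experience level within a company size, in the order of `categories`
proportions = {
    size: [round(counts[(size, level)] / total, 2) for level in categories]
    for size, total in size_totals.items()
}
print(proportions)
```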

💦 Company Size & Salary

salary_size = query("""
    SELECT company_size AS 'Company size',
    salary_in_usd AS Salary
    FROM salaries
""")

fig = px.box(salary_size, x='Company size', y = 'Salary',
             color = 'Company size')



fig.update_layout(title = {'text': "<b>Salary by Company size</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Company size</b>'),
                  yaxis = dict(title = '<b>Salary</b>'),
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Ratio of WFH (Teleworking) and WFO

rem_type = query("""
    SELECT remote_ratio,
    COUNT(*) AS total
    FROM salaries
    GROUP BY remote_ratio
""")


data = go.Pie(labels = rem_type['remote_ratio'], values = rem_type['total'].values,
             hoverinfo = 'label',
             hole = 0.4,
             textfont_size = 18,
             textposition = 'auto')

fig = go.Figure(data = data)

fig.update_layout(title = {'text': "<b>Remote Ratio</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()
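
The pie chart converts the counts into percentages for display. If the percentages themselves are needed, they can also be computed in the query. A sketch with the standard-library sqlite3 module, on a made-up table of ten rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (remote_ratio TEXT)")

# Hypothetical labels: 5 fully remote, 3 non-remote, 2 partially remote
labels = ["Fully Remote"] * 5 + ["Non Remote Work"] * 3 + ["Partially Remote"] * 2
conn.executemany("INSERT INTO salaries VALUES (?)", [(label,) for label in labels])

# 100.0 forces floating-point division; the subquery supplies the overall row count
shares = conn.execute("""
    SELECT remote_ratio,
           ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM salaries), 1) AS percent
    FROM salaries
    GROUP BY remote_ratio
    ORDER BY percent DESC
""").fetchall()
print(shares)
```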

💦 Salary by Remote Type

salary_remote = query("""
    SELECT remote_ratio AS 'Remote type',
    salary_in_usd AS Salary
    From salaries
""")

fig = px.box(salary_remote, x = 'Remote type', y = 'Salary', color = 'Remote type')



fig.update_layout(title = {'text': "<b>Salary by Remote Type</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Remote type</b>'),
                  yaxis = dict(title = '<b>Salary</b>'),
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💦 Experience Levels & Remote Ratios

exp_remote = salaries.groupby(['experience_level', 'remote_ratio']).count()
exp_remote.reset_index(inplace = True)

fig = px.histogram(exp_remote, x = 'experience_level',
                  y = 'work_year', color = 'remote_ratio',
                  barmode = 'group',
                  text_auto = True)


fig.update_layout(title = {'text': "<b>Respondent Count In Different Experience Level Based on Remote Ratio</b>", 
                            'x':0.5, 'xanchor': 'center'},
                  xaxis = dict(title = '<b>Experience Level</b>'),
                  yaxis = dict(title = '<b>Number of Respondents</b>'),
                  width = 900,
                  height = 600)

fig.update_layout(plot_bgcolor = '#f1e7d2',
                 paper_bgcolor = '#f1e7d2')
fig.show()

💡 Analysis Conclusion

  • The three most common jobs in data science are data scientist, data engineer, and data analyst.
  • Data science jobs are becoming more and more popular: the share of employees in the dataset grew from 11.9% in 2020 to 52.4% in 2022.
  • The United States is the country with the most data science companies.
  • The interquartile range (IQR) of the salary distribution runs from 62.7k to 150k USD.
  • Most data science employees are at the senior level, with fewer at the expert level.
  • Most data science employees work full-time, with few contractors and freelancers.
  • Principal Data Engineer is the highest-paying data science job.
  • The minimum salary in data science (with entry-level experience) is $4,000, and the maximum salary (with expert-level experience) is $600,000.
  • Company composition: 53.7% medium-sized companies, 32.6% large companies, and 13.7% small companies.
  • Salaries are also affected by company size, with larger companies paying higher salaries.
  • 62.8% of data science jobs are fully remote, 20.9% are non-remote, and 16.3% are partially remote.
  • Data science salaries grow with time and experience.

Tags: SQL Data Analysis Cyber Security Visualization https plotly

Posted by ztkirby on Fri, 09 Dec 2022 15:11:40 +0530