We'll analyze the Stack Overflow developer survey dataset. The dataset contains responses to an annual survey conducted by Stack Overflow. You can find the raw data & official analysis here: https://insights.stackoverflow.com/survey.
The libraries used in this notebook are:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Let's load the CSV files using the Pandas library. We'll use the name survey_raw_df for the data frame to indicate this is unprocessed data that we might clean, filter, and modify to prepare a data frame ready for analysis.
survey_raw_df = pd.read_csv('survey_results_public.csv')
survey_raw_df
| | Respondent | MainBranch | Hobbyist | Age | Age1stCode | CompFreq | CompTotal | ConvertedComp | Country | CurrencyDesc | ... | SurveyEase | SurveyLength | Trans | UndergradMajor | WebframeDesireNextYear | WebframeWorkedWith | WelcomeChange | WorkWeekHrs | YearsCode | YearsCodePro |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | I am a developer by profession | Yes | NaN | 13 | Monthly | NaN | NaN | Germany | European Euro | ... | Neither easy nor difficult | Appropriate in length | No | Computer science, computer engineering, or sof... | ASP.NET Core | ASP.NET;ASP.NET Core | Just as welcome now as I felt last year | 50.0 | 36 | 27 |
1 | 2 | I am a developer by profession | No | NaN | 19 | NaN | NaN | NaN | United Kingdom | Pound sterling | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | NaN | NaN | Somewhat more welcome now than last year | NaN | 7 | 4 |
2 | 3 | I code primarily as a hobby | Yes | NaN | 15 | NaN | NaN | NaN | Russian Federation | NaN | ... | Neither easy nor difficult | Appropriate in length | NaN | NaN | NaN | NaN | Somewhat more welcome now than last year | NaN | 4 | NaN |
3 | 4 | I am a developer by profession | Yes | 25.0 | 18 | NaN | NaN | NaN | Albania | Albanian lek | ... | NaN | NaN | No | Computer science, computer engineering, or sof... | NaN | NaN | Somewhat less welcome now than last year | 40.0 | 7 | 4 |
4 | 5 | I used to be a developer by profession, but no... | Yes | 31.0 | 16 | NaN | NaN | NaN | United States | NaN | ... | Easy | Too short | No | Computer science, computer engineering, or sof... | Django;Ruby on Rails | Ruby on Rails | Just as welcome now as I felt last year | NaN | 15 | 8 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64456 | 64858 | NaN | Yes | NaN | 16 | NaN | NaN | NaN | United States | NaN | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | NaN | NaN | NaN | NaN | 10 | Less than 1 year |
64457 | 64867 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Morocco | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
64458 | 64898 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Viet Nam | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
64459 | 64925 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Poland | NaN | ... | NaN | NaN | NaN | NaN | Angular;Angular.js;React.js | NaN | NaN | NaN | NaN | NaN |
64460 | 65112 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Spain | NaN | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | ASP.NET Core;jQuery | Angular;Angular.js;ASP.NET Core;jQuery | NaN | NaN | NaN | NaN |
64461 rows × 61 columns
The dataset contains over 64,000 responses to 60 questions (although many questions are optional). The responses have been anonymized to remove personally identifiable information, and each respondent has been assigned a randomized respondent ID.
Let's view the list of columns in the data frame.
survey_raw_df.columns
Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
'MiscTechDesireNextYear', 'MiscTechWorkedWith',
'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
'WebframeWorkedWith', 'WelcomeChange', 'WorkWeekHrs', 'YearsCode',
'YearsCodePro'],
dtype='object')
It appears that shortcodes for questions have been used as column names.
We can refer to the schema file to see the full text of each question. The schema file contains only two columns: Column and QuestionText. We can load it as a Pandas Series with Column as the index and QuestionText as the value.
survey_re_schema = pd.read_csv('survey_results_schema.csv', index_col='Column')
survey_re_schema
Column | QuestionText |
---|---|
Respondent | Randomized respondent ID number (not in order ... |
MainBranch | Which of the following options best describes ... |
Hobbyist | Do you code as a hobby? |
Age | What is your age (in years)? If you prefer not... |
Age1stCode | At what age did you write your first line of c... |
... | ... |
WebframeWorkedWith | Which web frameworks have you done extensive d... |
WelcomeChange | Compared to last year, how welcome do you feel... |
WorkWeekHrs | On average, how many hours per week do you wor... |
YearsCode | Including any education, how many years have y... |
YearsCodePro | NOT including education, how many years have y... |
61 rows × 1 columns
schema_raw = survey_re_schema.QuestionText
schema_raw
Column
Respondent Randomized respondent ID number (not in order ...
MainBranch Which of the following options best describes ...
Hobbyist Do you code as a hobby?
Age What is your age (in years)? If you prefer not...
Age1stCode At what age did you write your first line of c...
...
WebframeWorkedWith Which web frameworks have you done extensive d...
WelcomeChange Compared to last year, how welcome do you feel...
WorkWeekHrs On average, how many hours per week do you wor...
YearsCode Including any education, how many years have y...
YearsCodePro NOT including education, how many years have y...
Name: QuestionText, Length: 61, dtype: object
We can now use schema_raw to retrieve the full question text for any column in survey_raw_df.
schema_raw['YearsCodePro']
'NOT including education, how many years have you coded professionally (as a part of your work)?'
We've now loaded the dataset. We're ready to move on to the next step of preprocessing & cleaning the data for our analysis.
While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:
- Demographics of the survey respondents and the global programming community
- Distribution of programming skills, experience, and preferences
- Employment-related information, preferences, and opinions
Let's select a subset of columns with the relevant data for our analysis.
selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]
len(selected_columns)
20
Let's extract a copy of the data from these columns into a new data frame survey_df. We can then modify it further without affecting the original data frame.
survey_df = survey_raw_df[selected_columns].copy()
schema = schema_raw[selected_columns]
Let's view some basic information about the data frame.
survey_df.shape
(64461, 20)
survey_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 64072 non-null object
1 Age 45446 non-null float64
2 Gender 50557 non-null object
3 EdLevel 57431 non-null object
4 UndergradMajor 50995 non-null object
5 Hobbyist 64416 non-null object
6 Age1stCode 57900 non-null object
7 YearsCode 57684 non-null object
8 YearsCodePro 46349 non-null object
9 LanguageWorkedWith 57378 non-null object
10 LanguageDesireNextYear 54113 non-null object
11 NEWLearn 56156 non-null object
12 NEWStuck 54983 non-null object
13 Employment 63854 non-null object
14 DevType 49370 non-null object
15 WorkWeekHrs 41151 non-null float64
16 JobSat 45194 non-null object
17 JobFactors 49349 non-null object
18 NEWOvertime 43231 non-null object
19 NEWEdImpt 48465 non-null object
dtypes: float64(2), object(18)
memory usage: 9.8+ MB
Most columns have the data type object, either because they contain values of different types or contain empty values (NaN). It appears that every column contains some empty values since the Non-Null count for every column is lower than the total number of rows (64461). We'll need to deal with empty values and manually adjust the data type for each column on a case-by-case basis.
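To quantify how much data is missing per column, one option is to combine isna() with mean(). The sketch below uses a small made-up data frame (toy_df and its values are illustrative assumptions, not survey data); the same expression could be applied to survey_df.

```python
import numpy as np
import pandas as pd

# Toy data frame standing in for survey_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'Country': ['India', 'Germany', None, 'Spain'],
    'Age': [25.0, np.nan, np.nan, 31.0],
    'Hobbyist': ['Yes', 'No', 'Yes', 'Yes'],
})

# isna() marks missing cells as True; the column-wise mean of those
# booleans is the fraction of missing values in each column
missing_pct = toy_df.isna().mean().sort_values(ascending=False) * 100
print(missing_pct)
```

Sorting in descending order puts the columns with the most missing values first, which helps decide where cleaning effort is needed.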
Only two of the columns were detected as numeric columns (Age and WorkWeekHrs), even though a few other columns have mostly numeric values. To make our analysis easier, let's convert some other columns into numeric data types while ignoring any non-numeric values. Non-numeric values are converted to NaN.
survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')
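To see what errors='coerce' does in isolation, here is a minimal sketch on a toy Series (the name years and its values are assumptions for illustration, modelled on the "Less than 1 year" answer visible in the raw data above).

```python
import pandas as pd

# Toy values modelled on YearsCodePro, where some answers are text
years = pd.Series(['27', '4', 'Less than 1 year', None])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
years_numeric = pd.to_numeric(years, errors='coerce')
print(years_numeric)

# Number of values that ended up as NaN (the text answer and the missing one)
print(years_numeric.isna().sum())
```

Note that a text answer such as "Less than 1 year" is discarded rather than mapped to a number, so the converted column slightly under-represents respondents at the extremes.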
Let's now view some basic statistics about numeric columns.
survey_df.describe()
| | Age | Age1stCode | YearsCode | YearsCodePro | WorkWeekHrs |
---|---|---|---|---|---|
count | 45446.000000 | 57473.000000 | 56784.000000 | 44133.000000 | 41151.000000 |
mean | 30.834111 | 15.476572 | 12.782051 | 8.869667 | 40.782174 |
std | 9.585392 | 5.114081 | 9.490657 | 7.759961 | 17.816383 |
min | 1.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 |
25% | 24.000000 | 12.000000 | 6.000000 | 3.000000 | 40.000000 |
50% | 29.000000 | 15.000000 | 10.000000 | 6.000000 | 40.000000 |
75% | 35.000000 | 18.000000 | 17.000000 | 12.000000 | 44.000000 |
max | 279.000000 | 85.000000 | 50.000000 | 50.000000 | 475.000000 |
There seems to be a problem with the age column, as the minimum value is 1 and the maximum is 279. This is a common issue with surveys: responses may contain invalid values due to accidental or intentional errors while responding. A simple fix would be to ignore the rows where the age is higher than 90 years or lower than 10 years as invalid survey responses. We can do this using the .drop method.
survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 90].index, inplace=True)
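The two drop calls above can also be written as a single boolean mask. This is an alternative sketch on toy data (toy_df, invalid_age, and cleaned_df are illustrative names, not part of the notebook); it keeps rows with a missing age, just as the drop calls do.

```python
import numpy as np
import pandas as pd

# Toy ages, including the implausible extremes seen in describe()
toy_df = pd.DataFrame({'Age': [1.0, 25.0, np.nan, 279.0, 40.0]})

# Flag implausible ages; comparisons with NaN evaluate to False,
# so respondents who skipped the question are not flagged
invalid_age = (toy_df.Age < 10) | (toy_df.Age > 90)

# Keep everything that is not flagged
cleaned_df = toy_df[~invalid_age].copy()
print(cleaned_df)
```

Building the mask first makes it easy to inspect how many rows would be removed (invalid_age.sum()) before committing to the filter.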
The same holds for WorkWeekHrs. Let's ignore entries where the value for the column is higher than 140 hours (~20 hours per day).
survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)
survey_df.describe()
| | Age | Age1stCode | YearsCode | YearsCodePro | WorkWeekHrs |
---|---|---|---|---|---|
count | 45304.000000 | 57315.000000 | 56625.000000 | 43987.000000 | 40995.000000 |
mean | 30.810193 | 15.475635 | 12.784336 | 8.873099 | 40.024497 |
std | 9.429350 | 5.115102 | 9.494409 | 7.762089 | 10.628110 |
min | 10.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 |
25% | 24.000000 | 12.000000 | 6.000000 | 3.000000 | 40.000000 |
50% | 29.000000 | 15.000000 | 10.000000 | 6.000000 | 40.000000 |
75% | 35.000000 | 18.000000 | 17.000000 | 12.000000 | 43.000000 |
max | 89.000000 | 85.000000 | 50.000000 | 50.000000 | 140.000000 |
The gender column also allows for picking multiple options. We'll remove values containing more than one option to simplify our analysis.
survey_df.Gender.value_counts()
Man 45891
Woman 3833
Non-binary, genderqueer, or gender non-conforming 382
Man;Non-binary, genderqueer, or gender non-conforming 121
Woman;Non-binary, genderqueer, or gender non-conforming 92
Woman;Man 73
Woman;Man;Non-binary, genderqueer, or gender non-conforming 23
Name: Gender, dtype: int64
survey_df.where(~(survey_df.Gender.str.contains(';', na=False)), np.nan, inplace=True)
survey_df.Gender.value_counts()
Man 45891
Woman 3833
Non-binary, genderqueer, or gender non-conforming 382
Name: Gender, dtype: int64
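Because where is called on the whole data frame above, it may blank the other columns of the matching rows as well, not only Gender. If the goal is to clear just the Gender value and keep the rest of each response, a possible alternative (a sketch on toy data, not the notebook's method; toy_df and multiple are illustrative names) is to apply mask to the column itself.

```python
import numpy as np
import pandas as pd

# Toy data: one single answer, one multi-option answer, one missing answer
toy_df = pd.DataFrame({
    'Country': ['India', 'Spain', 'Poland'],
    'Gender': ['Man', 'Woman;Man', np.nan],
})

# True where the respondent picked more than one option
multiple = toy_df.Gender.str.contains(';', na=False)

# Series.mask replaces values with NaN wherever the condition is True
toy_df['Gender'] = toy_df.Gender.mask(multiple)
print(toy_df)
```

With this approach the Country value for the second row is preserved, and only its Gender entry becomes NaN.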
We've now cleaned up and prepared the dataset for analysis. Let's take a look at a sample of rows from the data frame.
survey_df.sample(10)
| | Country | Age | Gender | EdLevel | UndergradMajor | Hobbyist | Age1stCode | YearsCode | YearsCodePro | LanguageWorkedWith | LanguageDesireNextYear | NEWLearn | NEWStuck | Employment | DevType | WorkWeekHrs | JobSat | JobFactors | NEWOvertime | NEWEdImpt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29326 | Switzerland | 39.0 | Man | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | No | 33.0 | 4.0 | 1.0 | Bash/Shell/PowerShell;Java;SQL | Bash/Shell/PowerShell;Java;Python;Rust;SQL | Once a year | Call a coworker or friend;Visit Stack Overflow... | Employed full-time | Developer, back-end | 42.0 | Slightly satisfied | Diversity of the company or organization;Langu... | Sometimes: 1-2 days per month but less than we... | Critically important |
28044 | United States | 33.0 | Man | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | A humanities discipline (such as literature, h... | Yes | 27.0 | 6.0 | 2.0 | HTML/CSS;JavaScript;Python;SQL;TypeScript | Swift | Once every few years | Call a coworker or friend;Visit Stack Overflow... | Employed full-time | Developer, front-end;Developer, full-stack;Dev... | 41.0 | Very dissatisfied | Flex time or a flexible schedule;Languages, fr... | Rarely: 1-2 days per year or less | Fairly important |
24261 | United States | 37.0 | Man | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | A humanities discipline (such as literature, h... | Yes | 33.0 | 4.0 | 3.0 | HTML/CSS;Java;JavaScript;SQL | JavaScript;Python;TypeScript | Once a year | Meditate;Call a coworker or friend;Visit Stack... | Employed full-time | Developer, back-end;Developer, front-end;Devel... | 35.0 | Very satisfied | Flex time or a flexible schedule;Office enviro... | Sometimes: 1-2 days per month but less than we... | Not at all important/not necessary |
27715 | Germany | 37.0 | Man | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | Yes | 17.0 | 16.0 | 9.0 | Bash/Shell/PowerShell;C;C#;Java;JavaScript;Obj... | Rust;TypeScript | Once a year | Visit Stack Overflow;Watch help / tutorial vid... | Employed full-time | Developer, desktop or enterprise applications;... | 40.0 | Slightly dissatisfied | NaN | Sometimes: 1-2 days per month but less than we... | Very important |
20230 | United States | 15.0 | Man | Secondary school (e.g. American high school, G... | NaN | Yes | 9.0 | 7.0 | 3.0 | Bash/Shell/PowerShell;C;C++;Dart;HTML/CSS;Java... | C;C++;Dart;Java;JavaScript;Kotlin;SQL;Swift | Every few months | Call a coworker or friend;Visit Stack Overflow... | Independent contractor, freelancer, or self-em... | Database administrator;Designer;Developer, bac... | 25.0 | Slightly satisfied | Languages, frameworks, and other technologies ... | Never | Fairly important |
8387 | Lithuania | 29.0 | Man | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | Information systems, information technology, o... | Yes | 20.0 | 2.0 | 4.0 | Bash/Shell/PowerShell;HTML/CSS;Python | C | Once a year | Play games;Visit Stack Overflow;Go for a walk ... | Independent contractor, freelancer, or self-em... | Developer, back-end;Developer, front-end;Devel... | 50.0 | Slightly satisfied | Languages, frameworks, and other technologies ... | Often: 1-2 days per week or more | Somewhat important |
40096 | Sweden | NaN | NaN | NaN | NaN | Yes | NaN | NaN | NaN | C;C#;Java;JavaScript | JavaScript;Rust;Scala;TypeScript | Every few months | NaN | Employed full-time | NaN | NaN | NaN | NaN | NaN | NaN |
2318 | Ukraine | 27.0 | Man | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | Yes | 11.0 | 6.0 | 4.0 | HTML/CSS;Java;Kotlin | C#;Kotlin | Once a year | Play games;Visit Stack Overflow;Go for a walk ... | Employed full-time | Developer, mobile | 40.0 | Slightly satisfied | Flex time or a flexible schedule;Languages, fr... | Sometimes: 1-2 days per month but less than we... | Fairly important |
25393 | India | 23.0 | Man | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Computer science, computer engineering, or sof... | Yes | 17.0 | 5.0 | NaN | C;C++;Python | C++;Python | Once a year | Meditate;Call a coworker or friend;Visit Stack... | Student | NaN | NaN | NaN | Languages, frameworks, and other technologies ... | NaN | NaN |
41107 | United Kingdom | NaN | NaN | NaN | NaN | Yes | NaN | NaN | NaN | Assembly;C | Assembly;C | Once a year | NaN | Employed full-time | NaN | NaN | NaN | NaN | NaN | NaN |
Before we ask questions about the survey responses, it would help to understand the respondents' demographics, i.e., country, age, gender, education level, employment level, etc. It's essential to explore these variables to understand how representative the survey is of the worldwide programming community.
Let's start by setting up some parameters for the plots that we are going to create.
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (13, 8)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Let's look at the number of countries from which there are responses in the survey and plot the fifteen countries with the highest number of responses.
schema.Country
'Where do you live?'
survey_df.Country.nunique()
183
We can identify the countries with the highest number of respondents using the value_counts method.
top_countries = survey_df.Country.value_counts().head(15)
top_countries
United States 12370
India 8360
United Kingdom 3880
Germany 3864
Canada 2174
France 1884
Brazil 1804
Netherlands 1332
Poland 1259
Australia 1199
Spain 1157
Italy 1115
Russian Federation 1085
Sweden 879
Pakistan 802
Name: Country, dtype: int64
We can visualize this information using a bar chart.
plt.xticks(rotation=75)
plt.title(schema.Country)
sns.barplot(x=top_countries.index, y=top_countries);
It appears that a disproportionately high number of respondents are from the US and India, probably because the survey is in English and these countries have the largest English-speaking populations. We can already see that the survey may not be representative of the global programming community: programmers from non-English-speaking countries are almost certainly underrepresented.
The distribution of respondents' age is another crucial factor to look at. We can use a histogram to visualize it.
plt.title(schema.Age)
plt.xlabel('Age')
plt.ylabel('Number of respondents')
plt.hist(survey_df.Age, bins=np.arange(10, 90, 5));
It appears that a large percentage of respondents are 20-45 years old. It's somewhat representative of the programming community in general. Many young people have taken up computer science as their field of study or profession in the last 20 years.
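One detail of the bins argument used for the histogram above is worth checking: np.arange excludes its stop value, so the last bin edge is 85 rather than 90. The sketch below uses made-up ages (ages and wider_edges are illustrative assumptions) to show which values fall outside the bins.

```python
import numpy as np

# Same bin edges as in the histogram call: 10, 15, ..., 85
bin_edges = np.arange(10, 90, 5)
print(bin_edges)

# Made-up ages; 87 and 89 lie beyond the last edge
ages = np.array([12, 24, 29, 35, 84, 87, 89])

# np.histogram ignores values outside the outermost edges
counts, _ = np.histogram(ages, bins=bin_edges)
print(counts.sum(), "of", len(ages), "ages counted")

# Extending the stop value adds a final edge at 90 that covers them
wider_edges = np.arange(10, 95, 5)
wider_counts, _ = np.histogram(ages, bins=wider_edges)
print(wider_counts.sum(), "of", len(ages), "ages counted")
```

Since the cleaned data has a maximum age of 89, extending the edges to 90 would include the oldest respondents in the plot, though they are few enough that the overall shape barely changes.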
Let's look at the distribution of responses for the Gender column. It's a well-known fact that women and non-binary genders are underrepresented in the programming community, so we might expect to see a skewed distribution here.
schema.Gender
'Which of the following describe you, if any? Please check all that apply. If you prefer not to answer, you may leave this question blank.'
gender_counts = survey_df.Gender.value_counts()
gender_counts
Man 45891
Woman 3833
Non-binary, genderqueer, or gender non-conforming 382
Name: Gender, dtype: int64
A pie chart would be a great way to visualize the distribution.
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=180)
plt.title(schema.Gender)
Text(0.5, 1.0, 'Which of the following describe you, if any? Please check all that apply. If you prefer not to answer, you may leave this question blank.')
Only about 8% of survey respondents who have answered the question identify as women or non-binary. This number is lower than the overall percentage of women & non-binary genders in the programming community - which is estimated to be around 12%.
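The 8% figure can be reproduced from the counts shown above. In this sketch, gender_counts_example is a hypothetical name holding the three counts copied from the value_counts output.

```python
import pandas as pd

# Counts copied from the value_counts output above
gender_counts_example = pd.Series({
    'Man': 45891,
    'Woman': 3833,
    'Non-binary, genderqueer, or gender non-conforming': 382,
})

# Share of each category among respondents who answered the question
gender_pct = gender_counts_example * 100 / gender_counts_example.sum()

# Combined share of everyone who did not answer "Man"
women_nonbinary_pct = gender_pct.drop('Man').sum()
print(round(women_nonbinary_pct, 1))
```

The denominator here is the 50,106 respondents with a single-option answer, so respondents who skipped the question or chose multiple options are not part of the percentage.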
Formal education in computer science is often considered an essential requirement for becoming a programmer. However, there are many free resources & tutorials available online to learn programming. Let's compare the education levels of respondents to gain some insight into this. We'll use a horizontal bar plot here.
Ed_pct = survey_df.EdLevel.value_counts() * 100 / survey_df.EdLevel.count()
sns.barplot(x=Ed_pct, y=Ed_pct.index)
plt.title(schema['EdLevel'])
plt.ylabel(None);
It appears that well over half of the respondents hold a bachelor's or master's degree, so most programmers seem to have some college education. However, it's not clear from this graph alone if they hold a degree in computer science.
Let's also plot undergraduate majors, again converting the counts into percentages. Since value_counts returns the values sorted in descending order, the bars appear in order of popularity.
schema.UndergradMajor
'What was your primary field of study?'
UnderM_pct = survey_df.UndergradMajor.value_counts() * 100 / survey_df.UndergradMajor.count()
sns.barplot(x=UnderM_pct, y=UnderM_pct.index)
plt.title(schema.UndergradMajor)
plt.ylabel(None);
plt.xlabel('Percentage');
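The percentage calculation above divides the counts by the number of non-null answers. The same result can be obtained with the normalize parameter of value_counts, as this sketch on toy data suggests (majors and its values are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Toy answers with one missing value
majors = pd.Series(['CS', 'CS', 'Math', np.nan, 'CS', 'Physics'])

# Manual approach: counts divided by the number of non-null answers
manual_pct = majors.value_counts() * 100 / majors.count()

# Built-in approach: normalize=True returns fractions instead of counts
normalized_pct = majors.value_counts(normalize=True) * 100

print(manual_pct)
print(normalized_pct)
```

Both approaches skip missing values by default, so the percentages describe only respondents who answered the question.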
It turns out that 40% of programmers holding a college degree have a field of study other than computer science - which is very encouraging. It seems to suggest that while a college education is helpful in general, you do not need to pursue a major in computer science to become a successful programmer.
Freelancing or contract work is a common choice among programmers, so it would be interesting to compare the breakdown between full-time, part-time, and freelance work. Let's visualize the data from the Employment column.
schema.Employment
'Which of the following best describes your current employment status?'
(survey_df.Employment.value_counts(normalize=True, ascending=True)*100.).plot(kind='barh')
plt.title(schema.Employment)
plt.xlabel('Percentage');
It appears that close to 10% of respondents are employed part-time or as freelancers.
The DevType field contains information about the roles held by respondents. Since the question allows multiple answers, the column contains lists of values separated by a semi-colon ;, making it a bit harder to analyze directly.
schema.DevType
'Which of the following describe you? Please select all that apply.'
survey_df.DevType.value_counts()
Developer, full-stack 4395
Developer, back-end 3056
Developer, back-end;Developer, front-end;Developer, full-stack 2214
Developer, back-end;Developer, full-stack 1465
Developer, front-end 1390
...
Database administrator;Developer, back-end;Developer, front-end;Developer, full-stack;Developer, QA or test;Senior executive/VP 1
Database administrator;Developer, back-end;Developer, front-end;Developer, full-stack;Product manager;Senior executive/VP 1
Developer, back-end;Developer, full-stack;Developer, mobile;DevOps specialist;Educator;System administrator 1
Data or business analyst;Database administrator;Developer, back-end;Developer, desktop or enterprise applications;Developer, front-end;Developer, mobile;Engineering manager 1
Data or business analyst;Developer, mobile;Senior executive/VP;System administrator 1
Name: DevType, Length: 8212, dtype: int64
Let's define a helper function that turns a column containing lists of values (like survey_df.DevType) into a data frame with one column for each possible option.
def split_multicolumn(col_series):
    result_df = col_series.to_frame()
    options = []
    # Iterate over the non-null values in the column
    # (.items() replaces .iteritems(), which was removed in pandas 2.0)
    for idx, value in col_series[col_series.notnull()].items():
        # Break each value into a list of options
        for option in value.split(';'):
            # Add the option as a new column the first time it is seen
            if option not in result_df.columns:
                options.append(option)
                result_df[option] = False
            # Mark the value in the option column as True
            result_df.at[idx, option] = True
    return result_df[options]
dev_type_df = split_multicolumn(survey_df.DevType)
dev_type_df
| | Developer, desktop or enterprise applications | Developer, full-stack | Developer, mobile | Designer | Developer, front-end | Developer, back-end | Developer, QA or test | DevOps specialist | Developer, game or graphics | Database administrator | ... | System administrator | Engineering manager | Product manager | Data or business analyst | Academic researcher | Data scientist or machine learning specialist | Scientist | Senior executive/VP | Engineer, site reliability | Marketing or sales professional |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | True | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64456 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
64457 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
64458 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
64459 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
64460 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
64291 rows × 23 columns
The dev_type_df has one column for each option that can be selected as a response. If a respondent has chosen an option, the corresponding column's value is True. Otherwise, it is False.
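For comparison, pandas has a built-in string method that produces a similar one-column-per-option layout. The sketch below is an alternative to the helper function, not a replacement the notebook uses; roles and roles_df are illustrative names and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy multi-answer column in the same ';'-separated format as DevType
roles = pd.Series([
    'Developer, back-end;Developer, full-stack',
    np.nan,
    'Designer',
    'Developer, back-end',
])

# str.get_dummies splits on the separator and returns 0/1 indicator columns;
# rows with a missing value get 0 in every column
roles_df = roles.str.get_dummies(sep=';').astype(bool)
print(roles_df)

# Column-wise totals, as with dev_type_df.sum()
print(roles_df.sum().sort_values(ascending=False))
```

One visible difference is column order: get_dummies sorts the option names alphabetically, while the helper function keeps them in the order they first appear.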
We can now use the column-wise totals to identify the most common roles.
dev_type_totals = dev_type_df.sum().sort_values(ascending=False)
dev_type_totals
Developer, back-end 26991
Developer, full-stack 26910
Developer, front-end 18124
Developer, desktop or enterprise applications 11686
Developer, mobile 9404
DevOps specialist 5913
Database administrator 5655
Designer 5260
System administrator 5183
Developer, embedded applications or devices 4700
Data or business analyst 3969
Data scientist or machine learning specialist 3937
Developer, QA or test 3892
Engineer, data 3699
Academic researcher 3501
Educator 2894
Developer, game or graphics 2749
Engineering manager 2698
Product manager 2470
Scientist 2058
Engineer, site reliability 1920
Senior executive/VP 1291
Marketing or sales professional 624
dtype: int64
plt.figure(figsize=(12, 12))
sns.barplot(x=dev_type_totals, y=dev_type_totals.index)
plt.title('How Developers identify their roles?')
plt.xlabel('Count')
plt.ylabel(None);
As one might expect, the most common roles include "Developer" in the name.
We've already gained several insights about the respondents and the programming community by exploring individual columns of the dataset. Let's ask some specific questions and try to answer them using data frame operations and visualizations.
To find out which programming languages respondents have worked with, we can use the LanguageWorkedWith column. As with DevType, respondents were allowed to choose multiple options here.
survey_df.LanguageWorkedWith
0 C#;HTML/CSS;JavaScript
1 JavaScript;Swift
2 Objective-C;Python;Swift
3 NaN
4 HTML/CSS;Ruby;SQL
...
64456 NaN
64457 Assembly;Bash/Shell/PowerShell;C;C#;C++;Dart;G...
64458 NaN
64459 HTML/CSS
64460 C#;HTML/CSS;Java;JavaScript;SQL
Name: LanguageWorkedWith, Length: 64291, dtype: object
languages_worked_df = split_multicolumn(survey_df.LanguageWorkedWith)
languages_worked_df
| | C# | HTML/CSS | JavaScript | Swift | Objective-C | Python | Ruby | SQL | Java | PHP | ... | VBA | Perl | Scala | C++ | Go | Haskell | Rust | Dart | Julia | Assembly |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | True | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | True | True | True | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | True | False | False | False | False | True | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64456 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
64457 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
64458 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
64459 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
64460 | True | True | True | False | False | False | False | True | True | False | ... | False | False | False | False | False | False | False | False | False | False |
64291 rows × 25 columns
It appears that a total of 25 languages were included among the options. Let's aggregate these to identify the percentage of respondents who selected each language.
languages_worked_pct = languages_worked_df.mean().sort_values(ascending=False) * 100
languages_worked_pct
JavaScript 59.896409
HTML/CSS 55.805634
SQL 48.445350
Python 39.002349
Java 35.620849
Bash/Shell/PowerShell 29.240485
C# 27.801714
PHP 23.126099
TypeScript 22.463486
C++ 21.111820
C 19.234419
Go 7.756918
Kotlin 6.885878
Ruby 6.223266
Assembly 5.442441
VBA 5.389557
Swift 5.224682
R 5.059806
Rust 4.496741
Objective-C 3.600815
Dart 3.513711
Scala 3.148186
Perl 2.754662
Haskell 1.858736
Julia 0.779269
dtype: float64
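The aggregation above works because the mean of a boolean column is the fraction of True values in it. Here is a minimal sketch on made-up data (toy_df and its values are hypothetical, not taken from the survey):

```python
import pandas as pd

# Hypothetical toy data: one row per respondent, one boolean column per language.
toy_df = pd.DataFrame({
    'Python': [True, False, True, True],
    'Rust': [False, False, True, False],
})

# True counts as 1 and False as 0, so each column mean is the share of True values.
toy_pct = toy_df.mean().sort_values(ascending=False) * 100
print(toy_pct)
```

Multiplying by 100 only converts the fraction to a percentage; the sort puts the most selected language first.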
We can plot this information using a horizontal bar chart.
plt.figure(figsize=(12, 12))
sns.barplot(x=languages_worked_pct, y=languages_worked_pct.index)
plt.title("Languages used in the past year");
plt.xlabel('Percentage');
Perhaps unsurprisingly, JavaScript & HTML/CSS come out on top, as web development is one of today's most sought-after skills. It also happens to be one of the easiest to get started with. SQL is necessary for working with relational databases, so it's no surprise that nearly half of the respondents work with it regularly. Python seems to be the popular choice for other forms of development, beating out Java, which was the industry standard for server & application development for over two decades.
To find out which languages respondents want to learn over the next year, we can use the LanguageDesireNextYear column, with processing similar to the previous one.
languages_interested_df = split_multicolumn(survey_df.LanguageDesireNextYear)
languages_interested_pct = languages_interested_df.mean().sort_values(ascending=False) * 100
languages_interested_pct
Python 41.150394
JavaScript 40.430231
HTML/CSS 32.032477
SQL 30.803689
TypeScript 26.456269
C# 21.060491
Java 20.464762
Go 19.433513
Bash/Shell/PowerShell 18.058515
Rust 16.271329
C++ 15.014543
Kotlin 14.761009
PHP 10.945544
C 9.362119
Swift 8.693285
Dart 7.308955
R 6.571682
Ruby 6.423916
Scala 5.327340
Haskell 4.594733
Assembly 3.767246
Julia 2.541569
Objective-C 2.339363
Perl 1.760744
VBA 1.608312
dtype: float64
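For readers who want to reproduce this step without the notebook's split_multicolumn helper, pandas offers Series.str.get_dummies, which expands a delimited text column into indicator columns. The sketch below assumes the survey's semicolon-separated format and uses made-up responses; whether it matches split_multicolumn in every detail is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical responses in the semicolon-separated format; NaN means no answer.
responses = pd.Series(['Python;Rust', 'JavaScript', np.nan, 'Python;JavaScript'])

# One indicator column per option; an unanswered row becomes all False.
dummies = responses.str.get_dummies(sep=';').astype(bool)

# Same aggregation as above: share of all respondents selecting each option.
interested_pct = dummies.mean().sort_values(ascending=False) * 100
print(interested_pct)
```

Because unanswered rows are kept as all-False rows, the percentages are relative to every respondent rather than only those who answered the question.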
plt.figure(figsize=(12, 12))
sns.barplot(x=languages_interested_pct, y=languages_interested_pct.index)
plt.title("Languages people are interested in learning over the next year");
plt.xlabel('Percentage');
Once again, it's not surprising that Python is the language most people are interested in learning - since it is an easy-to-learn general-purpose programming language well suited for a variety of domains: application development, numerical computing, data analysis, machine learning, big data, cloud automation, web scraping, scripting, etc. We're using Python for this very analysis, so we're in good company!
Q: Which are the most loved languages, i.e., a high percentage of people who have used the language want to continue learning & using it over the next year?
While this question may seem tricky at first, it's straightforward to solve using Pandas array operations. Here's what we can do:
- Create a new data frame languages_loved_df that contains a True value for a language only if the corresponding values in languages_worked_df and languages_interested_df are both True
- Take the column-wise sum of languages_loved_df and divide it by the column-wise sum of languages_worked_df to get the percentage of respondents who "love" the language
- Sort the results in decreasing order and plot a horizontal bar graph
languages_loved_df = languages_worked_df & languages_interested_df
languages_loved_pct = (languages_loved_df.sum() * 100 / languages_worked_df.sum()).sort_values(ascending=False)
plt.figure(figsize=(12, 12))
sns.barplot(x=languages_loved_pct, y=languages_loved_pct.index)
plt.title("Most loved languages")
plt.xlabel('Percentage');
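To see why the division uses languages_worked_df as the denominator, here is a small sketch of the same computation on made-up data (the frames worked and interested, and their values, are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: four respondents, two languages.
worked = pd.DataFrame({'Rust': [True, True, False, False],
                       'PHP': [True, True, True, True]})
interested = pd.DataFrame({'Rust': [True, True, True, False],
                           'PHP': [True, False, False, False]})

# Element-wise AND: True only where a respondent both used and wants the language.
loved = worked & interested

# Divide by the number of users of each language, not by all respondents.
loved_pct = (loved.sum() * 100 / worked.sum()).sort_values(ascending=False)
print(loved_pct)
```

In this toy case Rust has few users but all of them want to keep using it, so it ranks above PHP, which everyone has used but only one respondent wants to continue with.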
Rust has been StackOverflow's most-loved language for five years in a row. The second most-loved language is TypeScript, a popular alternative to JavaScript for web development.
Python features at number 3, despite already being one of the most widely used languages in the world. Python has a solid foundation, is easy to learn & use, has a large ecosystem of domain-specific libraries, and a massive worldwide community.
Q: In which countries do developers work the highest number of hours per week? Consider countries with more than 250 responses only.
To answer this question, we'll need to use the groupby data frame method to aggregate the rows for each country. We'll also need to filter the results to only include the countries with more than 250 respondents.
countries_df = survey_df.groupby('Country')[['WorkWeekHrs']].mean().sort_values('WorkWeekHrs', ascending=False)
h_response_countries_df = countries_df.loc[survey_df.Country.value_counts() > 250].head(15)
h_response_countries_df
Country | WorkWeekHrs |
---|---|
Iran | 44.337748 |
Israel | 43.915094 |
China | 42.150000 |
United States | 41.799858 |
Greece | 41.402724 |
Viet Nam | 41.391667 |
South Africa | 41.023460 |
Turkey | 40.982143 |
Sri Lanka | 40.612245 |
New Zealand | 40.457551 |
Belgium | 40.444444 |
Canada | 40.208837 |
Hungary | 40.194340 |
India | 40.100349 |
Bangladesh | 40.097458 |
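The same result can also be obtained by computing the mean and the response count in a single groupby call, which avoids relying on index alignment between two separately built objects. This is an alternative sketch on made-up data, not the approach used above; toy_df, min_responses, and the country names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for survey_df.
toy_df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'WorkWeekHrs': [40.0, 50.0, 45.0, 38.0, 42.0, 60.0],
})
min_responses = 2  # hypothetical threshold; the analysis above uses 250

# 'size' counts all rows per country, like value_counts() on the Country column.
stats = toy_df.groupby('Country')['WorkWeekHrs'].agg(['mean', 'size'])

# Keep only countries with enough responses, then rank by average hours.
result = stats[stats['size'] > min_responses].sort_values('mean', ascending=False)
print(result)
```

Country C has the highest average here but is dropped because it has a single response, which is exactly the kind of outlier the threshold is meant to exclude.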
h_response_countries_df.plot(kind='bar')
plt.title('Which countries do developers work the highest number of hours per week?')
plt.xticks(rotation=75);
Asian countries like Iran, Israel, and China have the highest working hours, followed by the United States. However, there isn't much variation overall, and the average working hours seem to be around 40 hours per week.
To see whether age matters when starting a programming career, let's create a scatter plot of Age vs. YearsCodePro (i.e., years of professional coding experience).
schema.YearsCodePro
'NOT including education, how many years have you coded professionally (as a part of your work)?'
sns.scatterplot(x='Age', y='YearsCodePro', hue='Hobbyist', data=survey_df)
plt.xlabel("Age")
plt.ylabel("Years of professional coding experience");
You can see points all over the graph, which indicates that you can start programming professionally at any age. Many people who have been coding for several decades professionally also seem to enjoy it as a hobby.
We can also view the distribution of the Age1stCode column to see when the respondents tried programming for the first time.
plt.title(schema.Age1stCode)
ax = sns.histplot(x=survey_df.Age1stCode, bins=30, kde=True);
ax.lines[0].set_color('crimson');
As you might expect, most people seem to have had some exposure to programming before the age of 40. However, there are people of all ages and walks of life learning to code.
We've drawn many inferences from the survey. Here's a summary of a few of them:
- Based on the survey respondents' demographics, we can infer that the survey is somewhat representative of the overall programming community. However, it has fewer responses from programmers in non-English-speaking countries and women & non-binary genders.
- The programming community is not as diverse as it can be. Although things are improving, we should make more efforts to support & encourage underrepresented communities, whether in terms of age, country, race, gender, or otherwise.
- Although most programmers hold a college degree, a reasonably large percentage did not have computer science as their college major. Hence, a computer science degree isn't compulsory for learning to code or building a career in programming.
- A significant percentage of programmers either work part-time or as freelancers, which can be a great way to break into the field, especially when you're just getting started.
- JavaScript & HTML/CSS are the most used programming languages in 2020, closely followed by SQL & Python.
- Python is the language most people are interested in learning, since it is an easy-to-learn general-purpose programming language well suited for various domains.
- Rust and TypeScript are the most "loved" languages in 2020, both of which have small but fast-growing communities. Python is a close third, despite already being a widely used language.
- Programmers worldwide seem to be working for around 40 hours a week on average, with slight variations by country.
- You can learn and start programming professionally at any age. You're likely to have a long and fulfilling career if you also enjoy programming as a hobby.