zain2525/Analysation-for-stackoverflow-survey-with-python

Analysis of the Stack Overflow Developer Survey with Python

Introduction

We'll analyze the Stack Overflow developer survey dataset, which contains responses to an annual survey conducted by Stack Overflow. You can find the raw data and official analysis here: https://insights.stackoverflow.com/survey.

Importing Libraries

The libraries used in this notebook are:

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Let's load the CSV files using the Pandas library. We'll use the name survey_raw_df for the data frame to indicate this is unprocessed data that we might clean, filter, and modify to prepare a data frame ready for analysis.

survey_raw_df = pd.read_csv('survey_results_public.csv')
survey_raw_df
Respondent MainBranch Hobbyist Age Age1stCode CompFreq CompTotal ConvertedComp Country CurrencyDesc ... SurveyEase SurveyLength Trans UndergradMajor WebframeDesireNextYear WebframeWorkedWith WelcomeChange WorkWeekHrs YearsCode YearsCodePro
0 1 I am a developer by profession Yes NaN 13 Monthly NaN NaN Germany European Euro ... Neither easy nor difficult Appropriate in length No Computer science, computer engineering, or sof... ASP.NET Core ASP.NET;ASP.NET Core Just as welcome now as I felt last year 50.0 36 27
1 2 I am a developer by profession No NaN 19 NaN NaN NaN United Kingdom Pound sterling ... NaN NaN NaN Computer science, computer engineering, or sof... NaN NaN Somewhat more welcome now than last year NaN 7 4
2 3 I code primarily as a hobby Yes NaN 15 NaN NaN NaN Russian Federation NaN ... Neither easy nor difficult Appropriate in length NaN NaN NaN NaN Somewhat more welcome now than last year NaN 4 NaN
3 4 I am a developer by profession Yes 25.0 18 NaN NaN NaN Albania Albanian lek ... NaN NaN No Computer science, computer engineering, or sof... NaN NaN Somewhat less welcome now than last year 40.0 7 4
4 5 I used to be a developer by profession, but no... Yes 31.0 16 NaN NaN NaN United States NaN ... Easy Too short No Computer science, computer engineering, or sof... Django;Ruby on Rails Ruby on Rails Just as welcome now as I felt last year NaN 15 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
64456 64858 NaN Yes NaN 16 NaN NaN NaN United States NaN ... NaN NaN NaN Computer science, computer engineering, or sof... NaN NaN NaN NaN 10 Less than 1 year
64457 64867 NaN Yes NaN NaN NaN NaN NaN Morocco NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
64458 64898 NaN Yes NaN NaN NaN NaN NaN Viet Nam NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
64459 64925 NaN Yes NaN NaN NaN NaN NaN Poland NaN ... NaN NaN NaN NaN Angular;Angular.js;React.js NaN NaN NaN NaN NaN
64460 65112 NaN Yes NaN NaN NaN NaN NaN Spain NaN ... NaN NaN NaN Computer science, computer engineering, or sof... ASP.NET Core;jQuery Angular;Angular.js;ASP.NET Core;jQuery NaN NaN NaN NaN

64461 rows × 61 columns

The dataset contains over 64,000 responses to 60 questions (although many questions are optional). The responses have been anonymized to remove personally identifiable information, and each respondent has been assigned a randomized respondent ID.

Let's view the list of columns in the data frame.

survey_raw_df.columns
Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
       'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
       'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'MiscTechDesireNextYear', 'MiscTechWorkedWith',
       'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
       'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
       'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
       'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
       'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
       'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
       'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
       'WebframeWorkedWith', 'WelcomeChange', 'WorkWeekHrs', 'YearsCode',
       'YearsCodePro'],
      dtype='object')

It appears that shortcodes for questions have been used as column names.

We can refer to the schema file to see the full text of each question. The schema file contains only two columns: Column and QuestionText. We can load it as a pandas Series with Column as the index and QuestionText as the value.

survey_re_schema = pd.read_csv('survey_results_schema.csv', index_col='Column')
survey_re_schema
QuestionText
Column
Respondent Randomized respondent ID number (not in order ...
MainBranch Which of the following options best describes ...
Hobbyist Do you code as a hobby?
Age What is your age (in years)? If you prefer not...
Age1stCode At what age did you write your first line of c...
... ...
WebframeWorkedWith Which web frameworks have you done extensive d...
WelcomeChange Compared to last year, how welcome do you feel...
WorkWeekHrs On average, how many hours per week do you wor...
YearsCode Including any education, how many years have y...
YearsCodePro NOT including education, how many years have y...

61 rows × 1 columns

schema_raw = survey_re_schema.QuestionText
schema_raw
Column
Respondent            Randomized respondent ID number (not in order ...
MainBranch            Which of the following options best describes ...
Hobbyist                                        Do you code as a hobby?
Age                   What is your age (in years)? If you prefer not...
Age1stCode            At what age did you write your first line of c...
                                            ...                        
WebframeWorkedWith    Which web frameworks have you done extensive d...
WelcomeChange         Compared to last year, how welcome do you feel...
WorkWeekHrs           On average, how many hours per week do you wor...
YearsCode             Including any education, how many years have y...
YearsCodePro          NOT including education, how many years have y...
Name: QuestionText, Length: 61, dtype: object

We can now use schema_raw to retrieve the full question text for any column in survey_raw_df.

schema_raw['YearsCodePro']
'NOT including education, how many years have you coded professionally (as a part of your work)?'
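The lookup pattern can be tried without the real file. The following is only a sketch: the two-row CSV text is a hypothetical stand-in for survey_results_schema.csv, reusing two question texts shown elsewhere in this notebook, and the names toy_csv and toy_schema are made up for the example.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for survey_results_schema.csv
toy_csv = io.StringIO(
    "Column,QuestionText\n"
    "Hobbyist,Do you code as a hobby?\n"
    "Country,Where do you live?\n"
)

# index_col makes 'Column' the index; selecting a single column yields a Series
toy_schema = pd.read_csv(toy_csv, index_col="Column")["QuestionText"]

# Both bracket and attribute access return the question text
print(toy_schema["Hobbyist"])
print(toy_schema.Country)
```

Attribute access (toy_schema.Country) only works when the label is a valid Python identifier and does not clash with a Series method, so bracket access is the safer default.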

We've now loaded the dataset. We're ready to move on to the next step of preprocessing & cleaning the data for our analysis.

Data Preparation & Cleaning

While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:

  • Demographics of the survey respondents and the global programming community
  • Distribution of programming skills, experience, and preferences
  • Employment-related information, preferences, and opinions

Let's select a subset of columns with the relevant data for our analysis.

selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]
len(selected_columns)
20

Let's extract a copy of the data from these columns into a new data frame, survey_df, which we can continue to modify without affecting the original data frame.

survey_df = survey_raw_df[selected_columns].copy()
schema = schema_raw[selected_columns]
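To see why the .copy() call matters, here is a small sketch on made-up data (the frame raw and its values are hypothetical, not taken from the survey). Changes to the copied subset leave the source frame untouched.

```python
import pandas as pd

# Hypothetical miniature of the raw survey frame
raw = pd.DataFrame({
    "Country": ["Germany", "Spain"],
    "Age": [25.0, 31.0],
    "Hobbyist": ["Yes", "No"],
})

# Selecting columns and copying gives an independent data frame
subset = raw[["Country", "Age"]].copy()
subset.loc[0, "Age"] = 99.0

print(raw.loc[0, "Age"])     # the original value is unchanged
print(subset.loc[0, "Age"])  # only the copy was modified
```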

Let's view some basic information about the data frame.

survey_df.shape
(64461, 20)
survey_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 64072 non-null  object 
 1   Age                     45446 non-null  float64
 2   Gender                  50557 non-null  object 
 3   EdLevel                 57431 non-null  object 
 4   UndergradMajor          50995 non-null  object 
 5   Hobbyist                64416 non-null  object 
 6   Age1stCode              57900 non-null  object 
 7   YearsCode               57684 non-null  object 
 8   YearsCodePro            46349 non-null  object 
 9   LanguageWorkedWith      57378 non-null  object 
 10  LanguageDesireNextYear  54113 non-null  object 
 11  NEWLearn                56156 non-null  object 
 12  NEWStuck                54983 non-null  object 
 13  Employment              63854 non-null  object 
 14  DevType                 49370 non-null  object 
 15  WorkWeekHrs             41151 non-null  float64
 16  JobSat                  45194 non-null  object 
 17  JobFactors              49349 non-null  object 
 18  NEWOvertime             43231 non-null  object 
 19  NEWEdImpt               48465 non-null  object 
dtypes: float64(2), object(18)
memory usage: 9.8+ MB

Most columns have the data type object, either because they contain values of different types or contain empty values (NaN). It appears that every column contains some empty values since the Non-Null count for every column is lower than the total number of rows (64461). We'll need to deal with empty values and manually adjust the data type for each column on a case-by-case basis.
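One way to quantify the empty values per column is to count them directly instead of reading them off the info() output. This is a sketch on a made-up frame; toy and its contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with some missing entries
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, np.nan],
    "Country": ["Germany", "Spain", None, "Poland"],
})

# isna() marks missing cells; sum() counts them and mean() gives the fraction
missing_count = toy.isna().sum()
missing_pct = toy.isna().mean() * 100

print(missing_count)
print(missing_pct)
```

The same two lines applied to survey_df would give the number and percentage of missing responses for each selected column.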

Only two of the columns were detected as numeric (Age and WorkWeekHrs), even though a few other columns have mostly numeric values. To make our analysis easier, let's convert some other columns into numeric data types. Passing errors='coerce' converts any non-numeric value to NaN.

survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')
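The effect of errors='coerce' is easiest to see on a tiny example. The Series below is made up, although the text answer 'Less than 1 year' does appear in the YearsCodePro column of the raw data shown earlier.

```python
import pandas as pd

# Hypothetical responses: numbers stored as text, one text answer, one missing value
years = pd.Series(["7", "4", "Less than 1 year", None])

# Values that cannot be parsed as numbers become NaN instead of raising an error
converted = pd.to_numeric(years, errors="coerce")

print(converted)
print(converted.isna().sum())  # text answer and missing value are both NaN
```

This is why the count for YearsCodePro falls from 46349 non-null entries in the info() output to 44133 in the statistics that follow: roughly 2,200 text answers are coerced to NaN.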

Let's now view some basic statistics about numeric columns.

survey_df.describe()
                Age    Age1stCode     YearsCode  YearsCodePro   WorkWeekHrs
count  45446.000000  57473.000000  56784.000000  44133.000000  41151.000000
mean      30.834111     15.476572     12.782051      8.869667     40.782174
std        9.585392      5.114081      9.490657      7.759961     17.816383
min        1.000000      5.000000      1.000000      1.000000      1.000000
25%       24.000000     12.000000      6.000000      3.000000     40.000000
50%       29.000000     15.000000     10.000000      6.000000     40.000000
75%       35.000000     18.000000     17.000000     12.000000     44.000000
max      279.000000     85.000000     50.000000     50.000000    475.000000

There seems to be a problem with the Age column, as the minimum value is 1 and the maximum is 279. This is a common issue with surveys: responses may contain invalid values due to accidental or intentional errors while responding. A simple fix is to treat rows where the age is higher than 90 years or lower than 10 years as invalid survey responses and ignore them. We can do this using the .drop method.

survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 90].index, inplace=True)
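The same filtering can also be written as a single boolean mask. The sketch below uses made-up ages (toy is hypothetical) and shows a detail worth knowing: comparisons with NaN evaluate to False, so rows with a missing age are kept, just as they are with the .drop calls.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including two invalid values and one missing value
toy = pd.DataFrame({"Age": [1.0, 25.0, np.nan, 279.0, 40.0]})

# True only for rows whose age is outside the accepted 10-90 range;
# NaN compares as False on both sides, so missing ages are not flagged
invalid = (toy.Age < 10) | (toy.Age > 90)
cleaned = toy[~invalid]

print(cleaned.index.tolist())
```

A filter such as toy[toy.Age.between(10, 90)] would behave differently, because it would also discard the rows where Age is missing.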

The same holds for WorkWeekHrs. Let's ignore entries where the value for the column is higher than 140 hours (~20 hours per day).

survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)
survey_df.describe()
                Age    Age1stCode     YearsCode  YearsCodePro   WorkWeekHrs
count  45304.000000  57315.000000  56625.000000  43987.000000  40995.000000
mean      30.810193     15.475635     12.784336      8.873099     40.024497
std        9.429350      5.115102      9.494409      7.762089     10.628110
min       10.000000      5.000000      1.000000      1.000000      1.000000
25%       24.000000     12.000000      6.000000      3.000000     40.000000
50%       29.000000     15.000000     10.000000      6.000000     40.000000
75%       35.000000     18.000000     17.000000     12.000000     43.000000
max       89.000000     85.000000     50.000000     50.000000    140.000000

The Gender column allows respondents to pick multiple options. We'll remove responses containing more than one option to simplify our analysis.

survey_df.Gender.value_counts()
Man                                                            45891
Woman                                                           3833
Non-binary, genderqueer, or gender non-conforming                382
Man;Non-binary, genderqueer, or gender non-conforming            121
Woman;Non-binary, genderqueer, or gender non-conforming           92
Woman;Man                                                         73
Woman;Man;Non-binary, genderqueer, or gender non-conforming       23
Name: Gender, dtype: int64
survey_df.where(~(survey_df.Gender.str.contains(';', na=False)), np.nan, inplace=True)
survey_df.Gender.value_counts()
Man                                                  45891
Woman                                                 3833
Non-binary, genderqueer, or gender non-conforming      382
Name: Gender, dtype: int64
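It is worth noting what the where call does to the rest of the row. With a row-wise boolean condition, DataFrame.where replaces every column of a non-matching row, not just Gender. The sketch below contrasts this with an assignment through .loc that touches only the Gender cell; the frame toy and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one multi-option Gender response
toy = pd.DataFrame({
    "Country": ["Germany", "Spain", "Poland"],
    "Gender": ["Man", "Woman;Man", np.nan],
})
multi = toy.Gender.str.contains(";", na=False)

# Row-wise condition: the whole row at index 1 becomes NaN
whole_row = toy.where(~multi, np.nan)

# Assigning through .loc blanks only the Gender cell of that row
gender_only = toy.copy()
gender_only.loc[multi, "Gender"] = np.nan

print(whole_row)
print(gender_only)
```

Which behavior is preferable depends on the analysis: blanking the whole row removes those respondents from every later chart, while blanking only Gender keeps their other answers.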

We've now cleaned up and prepared the dataset for analysis. Let's take a look at a sample of rows from the data frame.

survey_df.sample(10)
Country Age Gender EdLevel UndergradMajor Hobbyist Age1stCode YearsCode YearsCodePro LanguageWorkedWith LanguageDesireNextYear NEWLearn NEWStuck Employment DevType WorkWeekHrs JobSat JobFactors NEWOvertime NEWEdImpt
29326 Switzerland 39.0 Man Master’s degree (M.A., M.S., M.Eng., MBA, etc.) Computer science, computer engineering, or sof... No 33.0 4.0 1.0 Bash/Shell/PowerShell;Java;SQL Bash/Shell/PowerShell;Java;Python;Rust;SQL Once a year Call a coworker or friend;Visit Stack Overflow... Employed full-time Developer, back-end 42.0 Slightly satisfied Diversity of the company or organization;Langu... Sometimes: 1-2 days per month but less than we... Critically important
28044 United States 33.0 Man Bachelor’s degree (B.A., B.S., B.Eng., etc.) A humanities discipline (such as literature, h... Yes 27.0 6.0 2.0 HTML/CSS;JavaScript;Python;SQL;TypeScript Swift Once every few years Call a coworker or friend;Visit Stack Overflow... Employed full-time Developer, front-end;Developer, full-stack;Dev... 41.0 Very dissatisfied Flex time or a flexible schedule;Languages, fr... Rarely: 1-2 days per year or less Fairly important
24261 United States 37.0 Man Bachelor’s degree (B.A., B.S., B.Eng., etc.) A humanities discipline (such as literature, h... Yes 33.0 4.0 3.0 HTML/CSS;Java;JavaScript;SQL JavaScript;Python;TypeScript Once a year Meditate;Call a coworker or friend;Visit Stack... Employed full-time Developer, back-end;Developer, front-end;Devel... 35.0 Very satisfied Flex time or a flexible schedule;Office enviro... Sometimes: 1-2 days per month but less than we... Not at all important/not necessary
27715 Germany 37.0 Man Master’s degree (M.A., M.S., M.Eng., MBA, etc.) Computer science, computer engineering, or sof... Yes 17.0 16.0 9.0 Bash/Shell/PowerShell;C;C#;Java;JavaScript;Obj... Rust;TypeScript Once a year Visit Stack Overflow;Watch help / tutorial vid... Employed full-time Developer, desktop or enterprise applications;... 40.0 Slightly dissatisfied NaN Sometimes: 1-2 days per month but less than we... Very important
20230 United States 15.0 Man Secondary school (e.g. American high school, G... NaN Yes 9.0 7.0 3.0 Bash/Shell/PowerShell;C;C++;Dart;HTML/CSS;Java... C;C++;Dart;Java;JavaScript;Kotlin;SQL;Swift Every few months Call a coworker or friend;Visit Stack Overflow... Independent contractor, freelancer, or self-em... Database administrator;Designer;Developer, bac... 25.0 Slightly satisfied Languages, frameworks, and other technologies ... Never Fairly important
8387 Lithuania 29.0 Man Bachelor’s degree (B.A., B.S., B.Eng., etc.) Information systems, information technology, o... Yes 20.0 2.0 4.0 Bash/Shell/PowerShell;HTML/CSS;Python C Once a year Play games;Visit Stack Overflow;Go for a walk ... Independent contractor, freelancer, or self-em... Developer, back-end;Developer, front-end;Devel... 50.0 Slightly satisfied Languages, frameworks, and other technologies ... Often: 1-2 days per week or more Somewhat important
40096 Sweden NaN NaN NaN NaN Yes NaN NaN NaN C;C#;Java;JavaScript JavaScript;Rust;Scala;TypeScript Every few months NaN Employed full-time NaN NaN NaN NaN NaN NaN
2318 Ukraine 27.0 Man Master’s degree (M.A., M.S., M.Eng., MBA, etc.) Computer science, computer engineering, or sof... Yes 11.0 6.0 4.0 HTML/CSS;Java;Kotlin C#;Kotlin Once a year Play games;Visit Stack Overflow;Go for a walk ... Employed full-time Developer, mobile 40.0 Slightly satisfied Flex time or a flexible schedule;Languages, fr... Sometimes: 1-2 days per month but less than we... Fairly important
25393 India 23.0 Man Master’s degree (M.A., M.S., M.Eng., MBA, etc.) Computer science, computer engineering, or sof... Yes 17.0 5.0 NaN C;C++;Python C++;Python Once a year Meditate;Call a coworker or friend;Visit Stack... Student NaN NaN NaN Languages, frameworks, and other technologies ... NaN NaN
41107 United Kingdom NaN NaN NaN NaN Yes NaN NaN NaN Assembly;C Assembly;C Once a year NaN Employed full-time NaN NaN NaN NaN NaN NaN

Exploratory Analysis and Visualization

Before we ask questions about the survey responses, it would help to understand the respondents' demographics, i.e., country, age, gender, education level, employment level, etc. It's essential to explore these variables to understand how representative the survey is of the worldwide programming community.

Let's start by setting up some parameters for the plots that we are going to create.

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (13, 8)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Country

Let's look at the number of countries from which there are responses in the survey and plot the fifteen countries with the highest number of responses.

schema.Country
'Where do you live?'
survey_df.Country.nunique()
183

We can identify the countries with the highest number of respondents using the value_counts method.

top_countries = survey_df.Country.value_counts().head(15)
top_countries
United States         12370
India                  8360
United Kingdom         3880
Germany                3864
Canada                 2174
France                 1884
Brazil                 1804
Netherlands            1332
Poland                 1259
Australia              1199
Spain                  1157
Italy                  1115
Russian Federation     1085
Sweden                  879
Pakistan                802
Name: Country, dtype: int64

We can visualize this information using a bar chart.

plt.xticks(rotation=75)
plt.title(schema.Country)
sns.barplot(x=top_countries.index, y=top_countries);

A disproportionately high number of respondents are from the US and India, probably because the survey is in English and these countries have the largest English-speaking populations. The survey may therefore not be representative of the global programming community: programmers from non-English-speaking countries are almost certainly underrepresented.

Age

The distribution of respondents' age is another crucial factor to look at. We can use a histogram to visualize it.

plt.title(schema.Age)
plt.xlabel('Age')
plt.ylabel('Number of respondents')
# np.arange excludes the stop value, so a stop of 95 keeps the 85-90 bin
plt.hist(survey_df.Age, bins=np.arange(10, 95, 5));
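When bin edges are built with np.arange, keep in mind that the stop value is excluded, so the last edge is one step below it. The sketch below uses np.histogram on made-up ages (plt.hist bins data the same way) to show that a value of 89 is silently left out when the edges stop at 85.

```python
import numpy as np

# Hypothetical ages; 89 is the largest age left after cleaning
ages = np.array([12, 23, 27, 34, 58, 89])

# A stop of 90 produces edges 10, 15, ..., 85
edges_short = np.arange(10, 90, 5)
# A stop of 95 produces edges 10, 15, ..., 90
edges_full = np.arange(10, 95, 5)

# Values outside the outermost edges are not counted at all
counts_short, _ = np.histogram(ages, bins=edges_short)
counts_full, _ = np.histogram(ages, bins=edges_full)

print(counts_short.sum(), counts_full.sum())  # 5 6
```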

It appears that a large percentage of respondents are 20-45 years old. It's somewhat representative of the programming community in general. Many young people have taken up computer science as their field of study or profession in the last 20 years.

Gender

Let's look at the distribution of responses for the Gender column. Women and non-binary people are widely reported to be underrepresented in the programming community, so we might expect to see a skewed distribution here.

schema.Gender
'Which of the following describe you, if any? Please check all that apply. If you prefer not to answer, you may leave this question blank.'
gender_counts = survey_df.Gender.value_counts()
gender_counts
Man                                                  45891
Woman                                                 3833
Non-binary, genderqueer, or gender non-conforming      382
Name: Gender, dtype: int64

A pie chart would be a great way to visualize the distribution.

plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=180)
plt.title(schema.Gender)

Only about 8% of survey respondents who have answered the question identify as women or non-binary. This number is lower than the overall percentage of women & non-binary genders in the programming community - which is estimated to be around 12%.
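The 8% figure can be checked by hand from the value counts shown above. The snippet below is just that arithmetic; the variable names are chosen for the example.

```python
# Counts copied from the Gender value_counts output above
gender_counts = {
    "Man": 45891,
    "Woman": 3833,
    "Non-binary, genderqueer, or gender non-conforming": 382,
}

# Respondents who answered the question with a single option
answered = sum(gender_counts.values())
women_or_nonbinary = (
    gender_counts["Woman"]
    + gender_counts["Non-binary, genderqueer, or gender non-conforming"]
)

share = 100 * women_or_nonbinary / answered
print(f"{share:.1f}%")  # 8.4%
```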

Education Level

Formal education in computer science is often considered an essential requirement for becoming a programmer. However, there are many free resources & tutorials available online to learn programming. Let's compare the education levels of respondents to gain some insight into this. We'll use a horizontal bar plot here.

Ed_pct = survey_df.EdLevel.value_counts() * 100 / survey_df.EdLevel.count()
sns.barplot(x=Ed_pct, y=Ed_pct.index)
plt.title(schema['EdLevel'])
plt.ylabel(None);
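The percentage calculation divides each category count by the number of non-null responses. As a sketch on made-up data (the Series toy and its labels are hypothetical), the same numbers can be obtained with value_counts(normalize=True), which also ignores missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical education levels with one missing response
toy = pd.Series(["Bachelor", "Bachelor", "Master", np.nan], name="EdLevel")

# count() excludes NaN, so percentages are relative to those who answered
manual_pct = toy.value_counts() * 100 / toy.count()

# normalize=True returns fractions of the non-null values directly
normalized_pct = toy.value_counts(normalize=True) * 100

print(manual_pct)
print(normalized_pct)
```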

It appears that well over half of the respondents hold a bachelor's or master's degree, so most programmers seem to have some college education. However, it's not clear from this graph alone if they hold a degree in computer science.

Let's also plot undergraduate majors, again as percentages sorted in descending order to make the ranking easier to read.

schema.UndergradMajor
'What was your primary field of study?'
UnderM_pct = survey_df.UndergradMajor.value_counts() * 100 / survey_df.UndergradMajor.count()
sns.barplot(x=UnderM_pct, y=UnderM_pct.index)

plt.title(schema.UndergradMajor)
plt.ylabel(None);
plt.xlabel('Percentage');

It turns out that 40% of programmers holding a college degree have a field of study other than computer science - which is very encouraging. It seems to suggest that while a college education is helpful in general, you do not need to pursue a major in computer science to become a successful programmer.

Employment

Freelancing or contract work is a common choice among programmers, so it would be interesting to compare the breakdown between full-time, part-time, and freelance work. Let's visualize the data from the Employment column.

schema.Employment
'Which of the following best describes your current employment status?'
(survey_df.Employment.value_counts(normalize=True, ascending=True)*100.).plot(kind='barh')
plt.title(schema.Employment)
plt.xlabel('Percentage');

It appears that close to 10% of respondents are employed part-time or as freelancers.

The DevType field contains information about the roles held by respondents. Since the question allows multiple answers, the column contains lists of values separated by a semicolon (;), making it a bit harder to analyze directly.

schema.DevType
'Which of the following describe you? Please select all that apply.'
survey_df.DevType.value_counts()
Developer, full-stack                                                                                                                                                           4395
Developer, back-end                                                                                                                                                             3056
Developer, back-end;Developer, front-end;Developer, full-stack                                                                                                                  2214
Developer, back-end;Developer, full-stack                                                                                                                                       1465
Developer, front-end                                                                                                                                                            1390
                                                                                                                                                                                ... 
Database administrator;Developer, back-end;Developer, front-end;Developer, full-stack;Developer, QA or test;Senior executive/VP                                                    1
Database administrator;Developer, back-end;Developer, front-end;Developer, full-stack;Product manager;Senior executive/VP                                                          1
Developer, back-end;Developer, full-stack;Developer, mobile;DevOps specialist;Educator;System administrator                                                                        1
Data or business analyst;Database administrator;Developer, back-end;Developer, desktop or enterprise applications;Developer, front-end;Developer, mobile;Engineering manager       1
Data or business analyst;Developer, mobile;Senior executive/VP;System administrator                                                                                                1
Name: DevType, Length: 8212, dtype: int64

Let's define a helper function that turns a column containing lists of values (like survey_df.DevType) into a data frame with one column for each possible option.

def split_multicolumn(col_series):
    result_df = col_series.to_frame()
    options = []
    # Iterate over the non-null responses
    # (Series.items replaces iteritems, which was removed in pandas 2.0)
    for idx, value in col_series[col_series.notnull()].items():
        # Break each value into a list of options
        for option in value.split(';'):
            # Add the option as a new column the first time it appears
            if option not in result_df.columns:
                options.append(option)
                result_df[option] = False
            # Mark the value in the option column as True
            result_df.at[idx, option] = True
    return result_df[options]
dev_type_df = split_multicolumn(survey_df.DevType)
dev_type_df
Developer, desktop or enterprise applications Developer, full-stack Developer, mobile Designer Developer, front-end Developer, back-end Developer, QA or test DevOps specialist Developer, game or graphics Database administrator ... System administrator Engineering manager Product manager Data or business analyst Academic researcher Data scientist or machine learning specialist Scientist Senior executive/VP Engineer, site reliability Marketing or sales professional
0 True True False False False False False False False False ... False False False False False False False False False False
1 False True True False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
64456 False False False False False False False False False False ... False False False False False False False True False False
64457 False False False False False False False False False False ... False False False False False False False False False False
64458 False False False False False False False False False False ... False False False False False False False False False False
64459 False False False False False False False False False False ... False False False False False False False False False False
64460 False False False False False False False False False False ... False False False False False False False False False False

64291 rows × 23 columns

The dev_type_df has one column for each option that can be selected as a response. If a respondent has chosen an option, the corresponding column's value is True. Otherwise, it is False.
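pandas also ships a built-in that produces a similar indicator frame, Series.str.get_dummies. The sketch below is an alternative to the helper above, not the method used in this notebook, and the Series toy is made up. One visible difference is that get_dummies orders the columns alphabetically, whereas the helper keeps the order in which options first appear.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-select responses, including one missing response
toy = pd.Series(["Designer;Educator", "Educator", np.nan, "Scientist"])

# One indicator column per option; a missing response gives an all-False row
dummies = toy.str.get_dummies(sep=";").astype(bool)

# Column-wise totals, most common option first
totals = dummies.sum().sort_values(ascending=False)

print(dummies)
print(totals)
```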

We can now use the column-wise totals to identify the most common roles.

dev_type_totals = dev_type_df.sum().sort_values(ascending=False)
dev_type_totals
Developer, back-end                              26991
Developer, full-stack                            26910
Developer, front-end                             18124
Developer, desktop or enterprise applications    11686
Developer, mobile                                 9404
DevOps specialist                                 5913
Database administrator                            5655
Designer                                          5260
System administrator                              5183
Developer, embedded applications or devices       4700
Data or business analyst                          3969
Data scientist or machine learning specialist     3937
Developer, QA or test                             3892
Engineer, data                                    3699
Academic researcher                               3501
Educator                                          2894
Developer, game or graphics                       2749
Engineering manager                               2698
Product manager                                   2470
Scientist                                         2058
Engineer, site reliability                        1920
Senior executive/VP                               1291
Marketing or sales professional                    624
dtype: int64
plt.figure(figsize=(12, 12)) 
sns.barplot(x=dev_type_totals, y=dev_type_totals.index)
plt.title('How do developers identify their roles?')
plt.xlabel('Count')
plt.ylabel(None);

As one might expect, the most common roles include "Developer" in the name.

Asking and Answering Questions

We've already gained several insights about the respondents and the programming community by exploring individual columns of the dataset. Let's ask some specific questions and try to answer them using data frame operations and visualizations.

Q: What are the most popular programming languages in 2020?

To answer this, we can use the LanguageWorkedWith column. Similar to DevType, respondents were allowed to choose multiple options here.

survey_df.LanguageWorkedWith
0                                   C#;HTML/CSS;JavaScript
1                                         JavaScript;Swift
2                                 Objective-C;Python;Swift
3                                                      NaN
4                                        HTML/CSS;Ruby;SQL
                               ...                        
64456                                                  NaN
64457    Assembly;Bash/Shell/PowerShell;C;C#;C++;Dart;G...
64458                                                  NaN
64459                                             HTML/CSS
64460                      C#;HTML/CSS;Java;JavaScript;SQL
Name: LanguageWorkedWith, Length: 64291, dtype: object
languages_worked_df = split_multicolumn(survey_df.LanguageWorkedWith)
languages_worked_df
C# HTML/CSS JavaScript Swift Objective-C Python Ruby SQL Java PHP ... VBA Perl Scala C++ Go Haskell Rust Dart Julia Assembly
0 True True True False False False False False False False ... False False False False False False False False False False
1 False False True True False False False False False False ... False False False False False False False False False False
2 False False False True True True False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False True False False False False True True False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
64456 False False False False False False False False False False ... False False False False False False False False False False
64457 True True True True True True True True True True ... True True True True True True True True True True
64458 False False False False False False False False False False ... False False False False False False False False False False
64459 False True False False False False False False False False ... False False False False False False False False False False
64460 True True True False False False False True True False ... False False False False False False False False False False

64291 rows × 25 columns
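The split_multicolumn helper is defined earlier in the notebook. As a rough cross-check, and not as a replacement for it, pandas' built-in Series.str.get_dummies produces a similar one-column-per-option frame. The sketch below uses a small made-up series rather than the real survey column.

```python
import pandas as pd

# Hypothetical miniature of the LanguageWorkedWith column (not real survey rows).
langs = pd.Series(
    ["C#;HTML/CSS;JavaScript", "JavaScript;Swift", None, "HTML/CSS"],
    name="LanguageWorkedWith",
)

# str.get_dummies splits each entry on ';' and returns one 0/1 column per option;
# rows with a missing answer become all zeros.
langs_df = langs.str.get_dummies(sep=";").astype(bool)
print(langs_df)
```

Note that get_dummies orders the columns alphabetically, whereas the frame above lists languages in the order they first appear.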

It appears that a total of 25 languages were included among the options. Let's aggregate these to identify the percentage of respondents who selected each language.

languages_worked_pct = languages_worked_df.mean().sort_values(ascending=False) * 100
languages_worked_pct
JavaScript               59.896409
HTML/CSS                 55.805634
SQL                      48.445350
Python                   39.002349
Java                     35.620849
Bash/Shell/PowerShell    29.240485
C#                       27.801714
PHP                      23.126099
TypeScript               22.463486
C++                      21.111820
C                        19.234419
Go                        7.756918
Kotlin                    6.885878
Ruby                      6.223266
Assembly                  5.442441
VBA                       5.389557
Swift                     5.224682
R                         5.059806
Rust                      4.496741
Objective-C               3.600815
Dart                      3.513711
Scala                     3.148186
Perl                      2.754662
Haskell                   1.858736
Julia                     0.779269
dtype: float64
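The percentages above rely on the fact that the mean of a boolean column is the share of True values. Because respondents who skipped the question show up as all-False rows, they are included in the denominator. The sketch below, on a made-up frame, shows both that behaviour and one possible alternative denominator (only respondents who selected at least one language); the names toy_df, pct_all, and pct_answered are illustrative.

```python
import pandas as pd

# Hypothetical boolean frame in the shape produced above (four respondents).
toy_df = pd.DataFrame(
    {
        "JavaScript": [True, True, False, False],
        "Python": [True, False, False, False],
    }
)

# True counts as 1 and False as 0, so the column mean is the share of True values.
pct_all = toy_df.mean() * 100

# Alternative denominator: only rows where at least one language was selected.
answered = toy_df.any(axis=1)
pct_answered = toy_df[answered].mean() * 100

print(pct_all)
print(pct_answered)
```

Which denominator is appropriate depends on the question being asked; the analysis here uses all respondents.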

We can plot this information using a horizontal bar chart.

plt.figure(figsize=(12, 12))
sns.barplot(x=languages_worked_pct, y=languages_worked_pct.index)
plt.title("Languages used in the past year");
plt.xlabel('Percentage');

Perhaps unsurprisingly, JavaScript & HTML/CSS come out on top, as web development is one of today's most sought-after skills. It also happens to be one of the easiest fields to get started in. SQL is necessary for working with relational databases, so it's no surprise that nearly half of the respondents (about 48%) work with SQL regularly. Python seems to be the popular choice for other forms of development, beating out Java, which was the industry standard for server & application development for over two decades.

Q: Which languages are most people interested in learning over the next year?

For this, we can use the LanguageDesireNextYear column, with similar processing as the previous one.

languages_interested_df = split_multicolumn(survey_df.LanguageDesireNextYear)
languages_interested_pct = languages_interested_df.mean().sort_values(ascending=False) * 100
languages_interested_pct
Python                   41.150394
JavaScript               40.430231
HTML/CSS                 32.032477
SQL                      30.803689
TypeScript               26.456269
C#                       21.060491
Java                     20.464762
Go                       19.433513
Bash/Shell/PowerShell    18.058515
Rust                     16.271329
C++                      15.014543
Kotlin                   14.761009
PHP                      10.945544
C                         9.362119
Swift                     8.693285
Dart                      7.308955
R                         6.571682
Ruby                      6.423916
Scala                     5.327340
Haskell                   4.594733
Assembly                  3.767246
Julia                     2.541569
Objective-C               2.339363
Perl                      1.760744
VBA                       1.608312
dtype: float64
plt.figure(figsize=(12, 12))
sns.barplot(x=languages_interested_pct, y=languages_interested_pct.index)
plt.title("Languages people are interested in learning over the next year");
plt.xlabel('Percentage');

Once again, it's not surprising that Python is the language most people are interested in learning - since it is an easy-to-learn general-purpose programming language well suited for a variety of domains: application development, numerical computing, data analysis, machine learning, big data, cloud automation, web scraping, scripting, etc. We're using Python for this very analysis, so we're in good company!

Q: Which are the most loved languages, i.e., a high percentage of people who have used the language want to continue learning & using it over the next year?

While this question may seem tricky at first, it's straightforward to solve using Pandas array operations. Here's what we can do:

  • Create a new data frame languages_loved_df that contains a True value for a language only if the corresponding values in languages_worked_df and languages_interested_df are both True
  • Take the column-wise sum of languages_loved_df and divide it by the column-wise sum of languages_worked_df to get the percentage of respondents who "love" the language
  • Sort the results in decreasing order and plot a horizontal bar graph
languages_loved_df = languages_worked_df & languages_interested_df
languages_loved_pct = (languages_loved_df.sum() * 100 / languages_worked_df.sum()).sort_values(ascending=False)
plt.figure(figsize=(12, 12))
sns.barplot(x=languages_loved_pct, y=languages_loved_pct.index)
plt.title("Most loved languages")
plt.xlabel('Percentage');

Rust has been StackOverflow's most-loved language for five years in a row (2016 to 2020). The second most-loved language is TypeScript, a popular alternative to JavaScript for web development.

Python features at number 3, despite already being one of the most widely-used languages in the world. Python has a solid foundation, is easy to learn & use, has a large ecosystem of domain-specific libraries, and a massive worldwide community.
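The "loved" ratio computed above can be checked by hand on a tiny made-up example. The frames worked and interested below are hypothetical stand-ins for languages_worked_df and languages_interested_df.

```python
import pandas as pd

# Hypothetical data for three respondents and two languages.
worked = pd.DataFrame({"Rust": [True, True, False], "PHP": [True, True, True]})
interested = pd.DataFrame({"Rust": [True, True, True], "PHP": [False, True, False]})

# Element-wise AND: True only where a respondent both used the language
# and wants to keep using it.
loved = worked & interested

# Share of each language's users who want to continue with it.
loved_pct = (loved.sum() * 100 / worked.sum()).sort_values(ascending=False)
print(loved_pct)
```

The third respondent is interested in Rust but has not used it, so they do not affect Rust's score: only existing users count toward the ratio.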

Q: In which countries do developers work the highest number of hours per week? Consider countries with more than 250 responses only.

To answer this question, we'll need to use the groupby data frame method to aggregate the rows for each country. We'll also need to filter the results to only include the countries with more than 250 respondents.

countries_df = survey_df.groupby('Country')[['WorkWeekHrs']].mean().sort_values('WorkWeekHrs', ascending=False)
h_response_countries_df = countries_df.loc[survey_df.Country.value_counts() > 250].head(15)
h_response_countries_df
WorkWeekHrs
Country
Iran 44.337748
Israel 43.915094
China 42.150000
United States 41.799858
Greece 41.402724
Viet Nam 41.391667
South Africa 41.023460
Turkey 40.982143
Sri Lanka 40.612245
New Zealand 40.457551
Belgium 40.444444
Canada 40.208837
Hungary 40.194340
India 40.100349
Bangladesh 40.097458
h_response_countries_df.plot(kind='bar')
plt.title('In which countries do developers work the highest number of hours per week?')
plt.xticks(rotation=75);

Asian countries such as Iran, Israel, and China have the highest working hours, followed by the United States. However, there isn't much variation overall, and the average seems to be around 40 hours per week.
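As an alternative way to express the same group-and-filter step, the sketch below keeps the mean and the respondent count side by side using named aggregation, which makes the threshold explicit. It runs on made-up rows, and the column names mean_hours and responses and the variable min_responses are illustrative, not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows; the threshold is lowered to 1 so the toy data survives the filter.
toy_df = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", "C"],
        "WorkWeekHrs": [40.0, 44.0, None, 38.0, 42.0, 60.0],
    }
)

# Named aggregation: mean hours (NaN values skipped) plus respondents per country.
# "size" counts every row, matching Country.value_counts(); "count" would
# instead count only rows with a non-missing WorkWeekHrs.
by_country = toy_df.groupby("Country").agg(
    mean_hours=("WorkWeekHrs", "mean"),
    responses=("WorkWeekHrs", "size"),
)

min_responses = 1  # the analysis above uses 250
result = by_country[by_country["responses"] > min_responses].sort_values(
    "mean_hours", ascending=False
)
print(result)
```

Country C has the highest mean but only one response, so it is filtered out, which is the reason for applying a minimum-response threshold in the first place.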

Q: How important is it to start young to build a career in programming?

Let's create a scatter plot of Age vs. YearsCodePro (i.e., years of professional coding experience) to answer this question.

schema.YearsCodePro
'NOT including education, how many years have you coded professionally (as a part of your work)?'
sns.scatterplot(x='Age', y='YearsCodePro', hue='Hobbyist', data=survey_df)
plt.xlabel("Age")
plt.ylabel("Years of professional coding experience");

You can see points all over the graph, which indicates that you can start programming professionally at any age. Many people who have been coding for several decades professionally also seem to enjoy it as a hobby.
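One way to put a rough number on this reading of the scatter plot is to estimate the age at which each respondent started coding professionally. The sketch below uses made-up rows and assumes both columns are already numeric; the column name ProStartAge and the cutoff of 30 are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Hypothetical respondents; YearsCodePro is assumed to be numeric already.
toy_df = pd.DataFrame({"Age": [25, 40, 52, 33], "YearsCodePro": [3, 5, 30, 1]})

# Approximate age at which each person started coding professionally.
toy_df["ProStartAge"] = toy_df["Age"] - toy_df["YearsCodePro"]

# Share of respondents who started at 30 or later.
late_share = (toy_df["ProStartAge"] >= 30).mean() * 100
print(toy_df["ProStartAge"].describe())
print(late_share)
```

On the real data, rows with a missing Age or YearsCodePro would yield NaN and drop out of the comparison.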

We can also view the distribution of the Age1stCode column to see when the respondents tried programming for the first time.

plt.title(schema.Age1stCode)
ax = sns.histplot(x=survey_df.Age1stCode, bins=30, kde=True);
ax.lines[0].set_color('crimson');

As you might expect, most people seem to have had some exposure to programming before the age of 40. However, there are people of all ages and walks of life learning to code.

Summary

We've drawn many inferences from the survey. Here's a summary of a few of them:

  • Based on the survey respondents' demographics, we can infer that the survey is somewhat representative of the overall programming community. However, it has fewer responses from programmers in non-English-speaking countries and women & non-binary genders.

  • The programming community is not as diverse as it can be. Although things are improving, we should make more efforts to support & encourage underrepresented communities, whether in terms of age, country, race, gender, or otherwise.

  • Although most programmers hold a college degree, a reasonably large percentage did not have computer science as their college major. Hence, a computer science degree isn't compulsory for learning to code or building a career in programming.

  • A significant percentage of programmers either work part-time or as freelancers, which can be a great way to break into the field, especially when you're just getting started.

  • JavaScript & HTML/CSS are the most used programming languages in 2020, closely followed by SQL & Python.

  • Python is the language most people are interested in learning - since it is an easy-to-learn general-purpose programming language well suited for various domains.

  • Rust and TypeScript are the most "loved" languages in 2020, both of which have small but fast-growing communities. Python is a close third, despite already being a widely used language.

  • Programmers worldwide seem to be working for around 40 hours a week on average, with slight variations by country.

  • You can learn and start programming professionally at any age. You're likely to have a long and fulfilling career if you also enjoy programming as a hobby.
