Analyzing the Stack Overflow Developer Survey with Python

Introduction

We'll analyze the Stack Overflow Developer Survey dataset, which contains responses to an annual survey conducted by Stack Overflow. You can find the raw data and the official analysis here: https://insights.stackoverflow.com/survey.

Importing Libraries

The libraries used in this notebook are:

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Let's load the CSV file using the pandas library. We'll use the name survey_raw_df for the data frame to indicate that this is unprocessed data, which we might clean, filter, and modify to prepare a data frame ready for analysis.

survey_raw_df = pd.read_csv('survey_results_public.csv')
survey_raw_df

	Respondent	MainBranch	Hobbyist	Age	Age1stCode	CompFreq	CompTotal	ConvertedComp	Country	CurrencyDesc	...	SurveyEase	SurveyLength	Trans	UndergradMajor	WebframeDesireNextYear	WebframeWorkedWith	WelcomeChange	WorkWeekHrs	YearsCode	YearsCodePro
0	1	I am a developer by profession	Yes	NaN	13	Monthly	NaN	NaN	Germany	European Euro	...	Neither easy nor difficult	Appropriate in length	No	Computer science, computer engineering, or sof...	ASP.NET Core	ASP.NET;ASP.NET Core	Just as welcome now as I felt last year	50.0	36	27
1	2	I am a developer by profession	No	NaN	19	NaN	NaN	NaN	United Kingdom	Pound sterling	...	NaN	NaN	NaN	Computer science, computer engineering, or sof...	NaN	NaN	Somewhat more welcome now than last year	NaN	7	4
2	3	I code primarily as a hobby	Yes	NaN	15	NaN	NaN	NaN	Russian Federation	NaN	...	Neither easy nor difficult	Appropriate in length	NaN	NaN	NaN	NaN	Somewhat more welcome now than last year	NaN	4	NaN
3	4	I am a developer by profession	Yes	25.0	18	NaN	NaN	NaN	Albania	Albanian lek	...	NaN	NaN	No	Computer science, computer engineering, or sof...	NaN	NaN	Somewhat less welcome now than last year	40.0	7	4
4	5	I used to be a developer by profession, but no...	Yes	31.0	16	NaN	NaN	NaN	United States	NaN	...	Easy	Too short	No	Computer science, computer engineering, or sof...	Django;Ruby on Rails	Ruby on Rails	Just as welcome now as I felt last year	NaN	15	8
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
64456	64858	NaN	Yes	NaN	16	NaN	NaN	NaN	United States	NaN	...	NaN	NaN	NaN	Computer science, computer engineering, or sof...	NaN	NaN	NaN	NaN	10	Less than 1 year
64457	64867	NaN	Yes	NaN	NaN	NaN	NaN	NaN	Morocco	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
64458	64898	NaN	Yes	NaN	NaN	NaN	NaN	NaN	Viet Nam	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
64459	64925	NaN	Yes	NaN	NaN	NaN	NaN	NaN	Poland	NaN	...	NaN	NaN	NaN	NaN	Angular;Angular.js;React.js	NaN	NaN	NaN	NaN	NaN
64460	65112	NaN	Yes	NaN	NaN	NaN	NaN	NaN	Spain	NaN	...	NaN	NaN	NaN	Computer science, computer engineering, or sof...	ASP.NET Core;jQuery	Angular;Angular.js;ASP.NET Core;jQuery	NaN	NaN	NaN	NaN

64461 rows × 61 columns

The dataset contains over 64,000 responses to 60 questions (although many questions are optional). The responses have been anonymized to remove personally identifiable information, and each respondent has been assigned a randomized respondent ID.

Let's view the list of columns in the data frame.

survey_raw_df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'Age', 'Age1stCode', 'CompFreq',
       'CompTotal', 'ConvertedComp', 'Country', 'CurrencyDesc',
       'CurrencySymbol', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'DevType', 'EdLevel', 'Employment', 'Ethnicity', 'Gender', 'JobFactors',
       'JobSat', 'JobSeek', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'MiscTechDesireNextYear', 'MiscTechWorkedWith',
       'NEWCollabToolsDesireNextYear', 'NEWCollabToolsWorkedWith', 'NEWDevOps',
       'NEWDevOpsImpt', 'NEWEdImpt', 'NEWJobHunt', 'NEWJobHuntResearch',
       'NEWLearn', 'NEWOffTopic', 'NEWOnboardGood', 'NEWOtherComms',
       'NEWOvertime', 'NEWPurchaseResearch', 'NEWPurpleLink', 'NEWSOSites',
       'NEWStuck', 'OpSys', 'OrgSize', 'PlatformDesireNextYear',
       'PlatformWorkedWith', 'PurchaseWhat', 'Sexuality', 'SOAccount',
       'SOComm', 'SOPartFreq', 'SOVisitFreq', 'SurveyEase', 'SurveyLength',
       'Trans', 'UndergradMajor', 'WebframeDesireNextYear',
       'WebframeWorkedWith', 'WelcomeChange', 'WorkWeekHrs', 'YearsCode',
       'YearsCodePro'],
      dtype='object')

It appears that shortcodes for questions have been used as column names.

We can refer to the schema file to see the full text of each question. The schema file contains only two columns: Column and QuestionText. We can load it with Column as the index and then take the QuestionText column as a pandas Series.

survey_re_schema = pd.read_csv('survey_results_schema.csv', index_col='Column')
survey_re_schema

	QuestionText
Column
Respondent	Randomized respondent ID number (not in order ...
MainBranch	Which of the following options best describes ...
Hobbyist	Do you code as a hobby?
Age	What is your age (in years)? If you prefer not...
Age1stCode	At what age did you write your first line of c...
...	...
WebframeWorkedWith	Which web frameworks have you done extensive d...
WelcomeChange	Compared to last year, how welcome do you feel...
WorkWeekHrs	On average, how many hours per week do you wor...
YearsCode	Including any education, how many years have y...
YearsCodePro	NOT including education, how many years have y...

61 rows × 1 columns

schema_raw = survey_re_schema.QuestionText
schema_raw

Column
Respondent            Randomized respondent ID number (not in order ...
MainBranch            Which of the following options best describes ...
Hobbyist                                        Do you code as a hobby?
Age                   What is your age (in years)? If you prefer not...
Age1stCode            At what age did you write your first line of c...
                                            ...                        
WebframeWorkedWith    Which web frameworks have you done extensive d...
WelcomeChange         Compared to last year, how welcome do you feel...
WorkWeekHrs           On average, how many hours per week do you wor...
YearsCode             Including any education, how many years have y...
YearsCodePro          NOT including education, how many years have y...
Name: QuestionText, Length: 61, dtype: object

We can now use schema_raw to retrieve the full question text for any column in survey_raw_df.

schema_raw['YearsCodePro']

'NOT including education, how many years have you coded professionally (as a part of your work)?'
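The lookup pattern above can be tried without the real schema file. The following is a minimal sketch on a tiny in-memory CSV; the two rows are stand-ins that reuse question texts shown in this notebook, and the names csv_text and demo_schema are illustrative only.

```python
import io
import pandas as pd

# Two-row stand-in for survey_results_schema.csv (illustrative, not the real file)
csv_text = (
    "Column,QuestionText\n"
    "Hobbyist,Do you code as a hobby?\n"
    "Country,Where do you live?\n"
)

# index_col makes 'Column' the index; selecting a single column yields a Series
demo_schema = pd.read_csv(io.StringIO(csv_text), index_col='Column')['QuestionText']

# Look up the full question text by its short column code
print(demo_schema['Hobbyist'])
```

Because the Series is indexed by the column code, both demo_schema['Hobbyist'] and demo_schema.Hobbyist work as lookups.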

We've now loaded the dataset. We're ready to move on to the next step of preprocessing & cleaning the data for our analysis.

Data Preparation & Cleaning

While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:

Demographics of the survey respondents and the global programming community
Distribution of programming skills, experience, and preferences
Employment-related information, preferences, and opinions

Let's select a subset of columns with the relevant data for our analysis.

selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]

len(selected_columns)

Let's extract a copy of the data from these columns into a new data frame survey_df. We can continue to modify further without affecting the original data frame.

survey_df = survey_raw_df[selected_columns].copy()

schema = schema_raw[selected_columns]
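To see why the explicit .copy() matters, here is a small sketch on a made-up two-row frame (the names raw and subset are hypothetical). It shows that edits to the copy leave the original untouched; the explicit copy also avoids pandas' SettingWithCopyWarning on later assignments.

```python
import pandas as pd

# Made-up data standing in for the raw survey frame
raw = pd.DataFrame({'Age': [25, 31], 'Country': ['Albania', 'Spain']})

# .copy() produces an independent frame, so later edits do not reach `raw`
subset = raw[['Age']].copy()
subset.loc[0, 'Age'] = 99

print(raw.loc[0, 'Age'])     # original value is unchanged
print(subset.loc[0, 'Age'])  # only the copy was modified
```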

Let's view some basic information about the data frame.

survey_df.shape

(64461, 20)

survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64461 entries, 0 to 64460
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 64072 non-null  object 
 1   Age                     45446 non-null  float64
 2   Gender                  50557 non-null  object 
 3   EdLevel                 57431 non-null  object 
 4   UndergradMajor          50995 non-null  object 
 5   Hobbyist                64416 non-null  object 
 6   Age1stCode              57900 non-null  object 
 7   YearsCode               57684 non-null  object 
 8   YearsCodePro            46349 non-null  object 
 9   LanguageWorkedWith      57378 non-null  object 
 10  LanguageDesireNextYear  54113 non-null  object 
 11  NEWLearn                56156 non-null  object 
 12  NEWStuck                54983 non-null  object 
 13  Employment              63854 non-null  object 
 14  DevType                 49370 non-null  object 
 15  WorkWeekHrs             41151 non-null  float64
 16  JobSat                  45194 non-null  object 
 17  JobFactors              49349 non-null  object 
 18  NEWOvertime             43231 non-null  object 
 19  NEWEdImpt               48465 non-null  object 
dtypes: float64(2), object(18)
memory usage: 9.8+ MB

Most columns have the data type object, either because they contain values of different types or contain empty values (NaN). It appears that every column contains some empty values since the Non-Null count for every column is lower than the total number of rows (64461). We'll need to deal with empty values and manually adjust the data type for each column on a case-by-case basis.

Only two of the columns were detected as numeric (Age and WorkWeekHrs), even though a few others hold mostly numeric values. To make our analysis easier, let's convert those columns to numeric data types. With errors='coerce', any non-numeric value is converted to NaN.

survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')
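As a quick illustration of what errors='coerce' does, here is a sketch on a made-up Series (the values are illustrative; 'Less than 1 year' is one of the text answers visible in the raw data above). Text answers become NaN rather than a number, which is worth keeping in mind when reading the statistics that follow.

```python
import pandas as pd

# Made-up sample mixing numeric strings, a text answer, and a missing value
years = pd.Series(['10', 'Less than 1 year', '27', None])

# Values that cannot be parsed as numbers are replaced with NaN
converted = pd.to_numeric(years, errors='coerce')

print(converted)
```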

Let's now view some basic statistics about numeric columns.

survey_df.describe()

	Age	Age1stCode	YearsCode	YearsCodePro	WorkWeekHrs
count	45446.000000	57473.000000	56784.000000	44133.000000	41151.000000
mean	30.834111	15.476572	12.782051	8.869667	40.782174
std	9.585392	5.114081	9.490657	7.759961	17.816383
min	1.000000	5.000000	1.000000	1.000000	1.000000
25%	24.000000	12.000000	6.000000	3.000000	40.000000
50%	29.000000	15.000000	10.000000	6.000000	40.000000
75%	35.000000	18.000000	17.000000	12.000000	44.000000
max	279.000000	85.000000	50.000000	50.000000	475.000000

There seems to be a problem with the Age column, as the minimum value is 1 and the maximum is 279. This is a common issue with surveys: responses may contain invalid values due to accidental or intentional errors while responding. A simple fix is to treat rows where the age is higher than 90 years or lower than 10 years as invalid survey responses and ignore them. We can do this using the .drop method.

survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 90].index, inplace=True)

The same holds for WorkWeekHrs. Let's ignore entries where the value for the column is higher than 140 hours (~20 hours per day).

survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)
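The same filtering can also be written as a single boolean mask. The sketch below uses a small made-up frame (demo, age_ok, hours_ok, and cleaned are hypothetical names). Missing values have to be kept explicitly, because comparisons with NaN evaluate to False, whereas the .drop calls above leave rows with missing values in place.

```python
import numpy as np
import pandas as pd

# Made-up rows covering valid, invalid, and missing values
demo = pd.DataFrame({
    'Age': [1.0, 25.0, np.nan, 279.0, 40.0],
    'WorkWeekHrs': [40.0, 475.0, 35.0, 40.0, np.nan],
})

# Keep rows whose values are plausible or missing (between is inclusive)
age_ok = demo.Age.between(10, 90) | demo.Age.isna()
hours_ok = (demo.WorkWeekHrs <= 140) | demo.WorkWeekHrs.isna()

cleaned = demo[age_ok & hours_ok]
print(cleaned)
```

Unlike .drop with inplace=True, this builds a new frame and leaves demo unchanged, which can make it easier to compare row counts before and after.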

survey_df.describe()

	Age	Age1stCode	YearsCode	YearsCodePro	WorkWeekHrs
count	45304.000000	57315.000000	56625.000000	43987.000000	40995.000000
mean	30.810193	15.475635	12.784336	8.873099	40.024497
std	9.429350	5.115102	9.494409	7.762089	10.628110
min	10.000000	5.000000	1.000000	1.000000	1.000000
25%	24.000000	12.000000	6.000000	3.000000	40.000000
50%	29.000000	15.000000	10.000000	6.000000	40.000000
75%	35.000000	18.000000	17.000000	12.000000	43.000000
max	89.000000	85.000000	50.000000	50.000000	140.000000

The Gender column allows respondents to pick multiple options. We'll remove values containing more than one option to simplify our analysis.

survey_df.Gender.value_counts()

Man                                                            45891
Woman                                                           3833
Non-binary, genderqueer, or gender non-conforming                382
Man;Non-binary, genderqueer, or gender non-conforming            121
Woman;Non-binary, genderqueer, or gender non-conforming           92
Woman;Man                                                         73
Woman;Man;Non-binary, genderqueer, or gender non-conforming       23
Name: Gender, dtype: int64

survey_df.where(~(survey_df.Gender.str.contains(';', na=False)), np.nan, inplace=True)

survey_df.Gender.value_counts()

Man                                                  45891
Woman                                                 3833
Non-binary, genderqueer, or gender non-conforming      382
Name: Gender, dtype: int64
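One thing to be aware of: DataFrame.where with a row-level condition replaces every column of the matching rows with NaN, not only Gender. If the intent is to blank just the Gender value, one possible alternative is to assign through .loc. The sketch below compares the two on made-up data; it is an illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: one single answer, one multi-option answer, one missing
demo = pd.DataFrame({
    'Country': ['Spain', 'Poland', 'India'],
    'Gender': ['Man', 'Woman;Man', np.nan],
})
multi = demo.Gender.str.contains(';', na=False)

# Row-level where: every column of the matching row becomes NaN
blanked_rows = demo.where(~multi, np.nan)

# Column-level alternative: only the Gender value is replaced
gender_only = demo.copy()
gender_only.loc[multi, 'Gender'] = np.nan

print(blanked_rows)
print(gender_only)
```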

We've now cleaned up and prepared the dataset for analysis. Let's take a look at a sample of rows from the data frame.

survey_df.sample(10)

	Country	Age	Gender	EdLevel	UndergradMajor	Hobbyist	Age1stCode	YearsCode	YearsCodePro	LanguageWorkedWith	LanguageDesireNextYear	NEWLearn	NEWStuck	Employment	DevType	WorkWeekHrs	JobSat	JobFactors	NEWOvertime	NEWEdImpt
29326	Switzerland	39.0	Man	Master’s degree (M.A., M.S., M.Eng., MBA, etc.)	Computer science, computer engineering, or sof...	No	33.0	4.0	1.0	Bash/Shell/PowerShell;Java;SQL	Bash/Shell/PowerShell;Java;Python;Rust;SQL	Once a year	Call a coworker or friend;Visit Stack Overflow...	Employed full-time	Developer, back-end	42.0	Slightly satisfied	Diversity of the company or organization;Langu...	Sometimes: 1-2 days per month but less than we...	Critically important
28044	United States	33.0	Man	Bachelor’s degree (B.A., B.S., B.Eng., etc.)	A humanities discipline (such as literature, h...	Yes	27.0	6.0	2.0	HTML/CSS;JavaScript;Python;SQL;TypeScript	Swift	Once every few years	Call a coworker or friend;Visit Stack Overflow...	Employed full-time	Developer, front-end;Developer, full-stack;Dev...	41.0	Very dissatisfied	Flex time or a flexible schedule;Languages, fr...	Rarely: 1-2 days per year or less	Fairly important
24261	United States	37.0	Man	Bachelor’s degree (B.A., B.S., B.Eng., etc.)	A humanities discipline (such as literature, h...	Yes	33.0	4.0	3.0	HTML/CSS;Java;JavaScript;SQL	JavaScript;Python;TypeScript	Once a year	Meditate;Call a coworker or friend;Visit Stack...	Employed full-time	Developer, back-end;Developer, front-end;Devel...	35.0	Very satisfied	Flex time or a flexible schedule;Office enviro...	Sometimes: 1-2 days per month but less than we...	Not at all important/not necessary
27715	Germany	37.0	Man	Master’s degree (M.A., M.S., M.Eng., MBA, etc.)	Computer science, computer engineering, or sof...	Yes	17.0	16.0	9.0	Bash/Shell/PowerShell;C;C#;Java;JavaScript;Obj...	Rust;TypeScript	Once a year	Visit Stack Overflow;Watch help / tutorial vid...	Employed full-time	Developer, desktop or enterprise applications;...	40.0	Slightly dissatisfied	NaN	Sometimes: 1-2 days per month but less than we...	Very important
20230	United States	15.0	Man	Secondary school (e.g. American high school, G...	NaN	Yes	9.0	7.0	3.0	Bash/Shell/PowerShell;C;C++;Dart;HTML/CSS;Java...	C;C++;Dart;Java;JavaScript;Kotlin;SQL;Swift	Every few months	Call a coworker or friend;Visit Stack Overflow...	Independent contractor, freelancer, or self-em...	Database administrator;Designer;Developer, bac...	25.0	Slightly satisfied	Languages, frameworks, and other technologies ...	Never	Fairly important
8387	Lithuania	29.0	Man	Bachelor’s degree (B.A., B.S., B.Eng., etc.)	Information systems, information technology, o...	Yes	20.0	2.0	4.0	Bash/Shell/PowerShell;HTML/CSS;Python	C	Once a year	Play games;Visit Stack Overflow;Go for a walk ...	Independent contractor, freelancer, or self-em...	Developer, back-end;Developer, front-end;Devel...	50.0	Slightly satisfied	Languages, frameworks, and other technologies ...	Often: 1-2 days per week or more	Somewhat important
40096	Sweden	NaN	NaN	NaN	NaN	Yes	NaN	NaN	NaN	C;C#;Java;JavaScript	JavaScript;Rust;Scala;TypeScript	Every few months	NaN	Employed full-time	NaN	NaN	NaN	NaN	NaN	NaN
2318	Ukraine	27.0	Man	Master’s degree (M.A., M.S., M.Eng., MBA, etc.)	Computer science, computer engineering, or sof...	Yes	11.0	6.0	4.0	HTML/CSS;Java;Kotlin	C#;Kotlin	Once a year	Play games;Visit Stack Overflow;Go for a walk ...	Employed full-time	Developer, mobile	40.0	Slightly satisfied	Flex time or a flexible schedule;Languages, fr...	Sometimes: 1-2 days per month but less than we...	Fairly important
25393	India	23.0	Man	Master’s degree (M.A., M.S., M.Eng., MBA, etc.)	Computer science, computer engineering, or sof...	Yes	17.0	5.0	NaN	C;C++;Python	C++;Python	Once a year	Meditate;Call a coworker or friend;Visit Stack...	Student	NaN	NaN	NaN	Languages, frameworks, and other technologies ...	NaN	NaN
41107	United Kingdom	NaN	NaN	NaN	NaN	Yes	NaN	NaN	NaN	Assembly;C	Assembly;C	Once a year	NaN	Employed full-time	NaN	NaN	NaN	NaN	NaN	NaN

Exploratory Analysis and Visualization

Before we ask questions about the survey responses, it would help to understand the respondents' demographics, i.e., country, age, gender, education level, employment level, etc. It's essential to explore these variables to understand how representative the survey is of the worldwide programming community.

Let's start by setting up some parameters for the plots that we are going to create.

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (13, 8)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Country

Let's look at the number of countries from which there are responses in the survey and plot the fifteen countries with the highest number of responses.

schema.Country

'Where do you live?'

survey_df.Country.nunique()

We can identify the countries with the highest number of respondents using the value_counts method.

top_countries = survey_df.Country.value_counts().head(15)
top_countries

United States         12370
India                  8360
United Kingdom         3880
Germany                3864
Canada                 2174
France                 1884
Brazil                 1804
Netherlands            1332
Poland                 1259
Australia              1199
Spain                  1157
Italy                  1115
Russian Federation     1085
Sweden                  879
Pakistan                802
Name: Country, dtype: int64

We can visualize this information using a bar chart.

plt.xticks(rotation=75)
plt.title(schema.Country)
sns.barplot(x=top_countries.index, y=top_countries);

It appears that a disproportionately high number of respondents are from the US and India, probably because the survey is in English and these countries have the largest English-speaking populations. We can already see that the survey may not be representative of the global programming community: programmers from non-English-speaking countries are almost certainly underrepresented.

Age

The distribution of respondents' age is another crucial factor to look at. We can use a histogram to visualize it.

plt.title(schema.Age)
plt.xlabel('Age')
plt.ylabel('Number of respondents')
# np.arange excludes its stop value, so a stop of 95 gives a final bin edge at 90
plt.hist(survey_df.Age, bins=np.arange(10, 95, 5));

It appears that a large percentage of respondents are 20-45 years old. It's somewhat representative of the programming community in general. Many young people have taken up computer science as their field of study or profession in the last 20 years.
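The histogram's bin edges are built with np.arange, whose stop value is exclusive, so the last edge sits one step below stop and any larger values are left out of the counts. A small sketch on made-up ages, comparing two stop values (all names here are illustrative):

```python
import numpy as np

ages = [22, 30, 87]  # made-up ages

# Stop of 90 with step 5: the last edge is 85, so 87 falls outside the bins
edges_short = np.arange(10, 90, 5)
counts_short, _ = np.histogram(ages, bins=edges_short)

# Stop of 95 with step 5: the last edge is 90, so 87 is counted
edges_full = np.arange(10, 95, 5)
counts_full, _ = np.histogram(ages, bins=edges_full)

print(edges_short[-1], counts_short.sum())
print(edges_full[-1], counts_full.sum())
```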

Gender

Let's look at the distribution of responses for the Gender column. It's a well-known fact that women and non-binary people are underrepresented in the programming community, so we might expect to see a skewed distribution here.

schema.Gender

'Which of the following describe you, if any? Please check all that apply. If you prefer not to answer, you may leave this question blank.'

gender_counts = survey_df.Gender.value_counts()
gender_counts

Man                                                  45891
Woman                                                 3833
Non-binary, genderqueer, or gender non-conforming      382
Name: Gender, dtype: int64

A pie chart would be a great way to visualize the distribution.

plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=180)
plt.title(schema.Gender);

Only about 8% of survey respondents who have answered the question identify as women or non-binary. This number is lower than the overall percentage of women & non-binary genders in the programming community - which is estimated to be around 12%.
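The 8% figure can be checked directly from the counts printed above. A minimal sketch, with the counts copied by hand into a hypothetical demo_gender_counts Series:

```python
import pandas as pd

# Counts copied from the value_counts output shown earlier in this section
demo_gender_counts = pd.Series({
    'Man': 45891,
    'Woman': 3833,
    'Non-binary, genderqueer, or gender non-conforming': 382,
})

# Share of respondents who answered and did not select 'Man'
share = demo_gender_counts.drop('Man').sum() * 100 / demo_gender_counts.sum()
print(round(share, 1))
```

That is (3833 + 382) / 50106, or roughly 8.4% of those who answered the question.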

Education Level

Formal education in computer science is often considered an essential requirement for becoming a programmer. However, there are many free resources & tutorials available online to learn programming. Let's compare the education levels of respondents to gain some insight into this. We'll use a horizontal bar plot here.

ed_pct = survey_df.EdLevel.value_counts() * 100 / survey_df.EdLevel.count()
sns.barplot(x=ed_pct, y=ed_pct.index)
plt.title(schema['EdLevel'])
plt.xlabel('Percentage')
plt.ylabel(None);

It appears that well over half of the respondents hold a bachelor's or master's degree, so most programmers seem to have some college education. However, it's not clear from this graph alone if they hold a degree in computer science.

Let's also plot undergraduate majors, again as percentages. Since value_counts returns its result sorted in descending order, the ranking is easy to read from the chart.

schema.UndergradMajor

'What was your primary field of study?'

undergrad_pct = survey_df.UndergradMajor.value_counts() * 100 / survey_df.UndergradMajor.count()
sns.barplot(x=undergrad_pct, y=undergrad_pct.index)

plt.title(schema.UndergradMajor)
plt.ylabel(None)
plt.xlabel('Percentage');

It turns out that 40% of programmers holding a college degree have a field of study other than computer science - which is very encouraging. It seems to suggest that while a college education is helpful in general, you do not need to pursue a major in computer science to become a successful programmer.

Employment

Freelancing or contract work is a common choice among programmers, so it would be interesting to compare the breakdown between full-time, part-time, and freelance work. Let's visualize the data from the Employment column.

schema.Employment

'Which of the following best describes your current employment status?'

(survey_df.Employment.value_counts(normalize=True, ascending=True)*100.).plot(kind='barh')
plt.title(schema.Employment)
plt.xlabel('Percentage');

It appears that close to 10% of respondents are employed part-time or as freelancers.
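For reference, value_counts(normalize=True) returns fractions of the non-missing answers, so the percentages in the chart are relative to respondents who answered the question. A small sketch on made-up data (the names status and pct are illustrative):

```python
import pandas as pd

# Made-up answers, including one missing response
status = pd.Series(['Employed full-time', 'Employed full-time', 'Student', None])

# Missing answers are excluded by default, so the shares sum to 100
pct = status.value_counts(normalize=True) * 100

print(pct)
```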

The DevType field contains information about the roles held by respondents. Since the question allows multiple answers, the column contains lists of values separated by a semicolon (;), making it a bit harder to analyze directly.

schema.DevType

'Which of the following describe you? Please select all that apply.'

survey_df.DevType.value_counts()

Developer, full-stack                                                                                                                                                           4395
Developer, back-end                                                                                                                                                             3056
Developer, back-end;Developer, front-end;Developer, full-stack                                                                                                                  2214
Developer, back-end;Developer, full-stack                                                                                                                                       1465
Developer, front-end                                                                                                                                                            1390
                                                                                                                                                                                ... 
Database administrator;Developer, back-end;Developer, front-end;Developer, full-stack;Developer, QA or test;Senior executive/VP                                                    1
Database administrator;Developer, back-end;Developer, front-end;Developer, full-stack;Product manager;Senior executive/VP                                                          1
Developer, back-end;Developer, full-stack;Developer, mobile;DevOps specialist;Educator;System administrator                                                                        1
Data or business analyst;Database administrator;Developer, back-end;Developer, desktop or enterprise applications;Developer, front-end;Developer, mobile;Engineering manager       1
Data or business analyst;Developer, mobile;Senior executive/VP;System administrator                                                                                                1
Name: DevType, Length: 8212, dtype: int64

Let's define a helper function that turns a column containing lists of values (like survey_df.DevType) into a data frame with one column for each possible option.

def split_multicolumn(col_series):
    """Expand a ';'-separated column into one boolean column per option."""
    result_df = col_series.to_frame()
    options = []
    # Iterate over the non-null values of the column
    for idx, value in col_series[col_series.notnull()].items():
        # Break each value into a list of options
        for option in value.split(';'):
            # Add the option as a new column the first time it appears
            if option not in result_df.columns:
                options.append(option)
                result_df[option] = False
            # Mark the value in the option column as True
            result_df.at[idx, option] = True
    return result_df[options]

dev_type_df = split_multicolumn(survey_df.DevType)
dev_type_df

	Developer, desktop or enterprise applications	Developer, full-stack	Developer, mobile	Designer	Developer, front-end	Developer, back-end	Developer, QA or test	DevOps specialist	Developer, game or graphics	Database administrator	...	System administrator	Engineering manager	Product manager	Data or business analyst	Academic researcher	Data scientist or machine learning specialist	Scientist	Senior executive/VP	Engineer, site reliability	Marketing or sales professional
0	True	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	False	True	True	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
64456	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	True	False	False
64457	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
64458	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
64459	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
64460	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

64291 rows × 23 columns

The dev_type_df has one column for each option that can be selected as a response. If a respondent has chosen an option, the corresponding column's value is True. Otherwise, it is False.
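pandas also ships a built-in that performs a similar expansion, Series.str.get_dummies. The following is a sketch on made-up data, offered as a possible alternative to the helper rather than a replacement for it (the names roles and dummies are illustrative):

```python
import pandas as pd

# Made-up multi-select answers, including one missing response
roles = pd.Series(['Developer, back-end;Designer', 'Designer', None])

# One indicator column per option; missing answers become all-False rows
dummies = roles.str.get_dummies(sep=';').astype(bool)

# Column-wise totals give the number of respondents per option
print(dummies.sum().sort_values(ascending=False))
```

Because the splitting is done inside pandas rather than in a Python loop with repeated .at assignments, this is likely to be faster on a column with tens of thousands of rows.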

We can now use the column-wise totals to identify the most common roles.

dev_type_totals = dev_type_df.sum().sort_values(ascending=False)
dev_type_totals

Developer, back-end                              26991
Developer, full-stack                            26910
Developer, front-end                             18124
Developer, desktop or enterprise applications    11686
Developer, mobile                                 9404
DevOps specialist                                 5913
Database administrator                            5655
Designer                                          5260
System administrator                              5183
Developer, embedded applications or devices       4700
Data or business analyst                          3969
Data scientist or machine learning specialist     3937
Developer, QA or test                             3892
Engineer, data                                    3699
Academic researcher                               3501
Educator                                          2894
Developer, game or graphics                       2749
Engineering manager                               2698
Product manager                                   2470
Scientist                                         2058
Engineer, site reliability                        1920
Senior executive/VP                               1291
Marketing or sales professional                    624
dtype: int64

plt.figure(figsize=(12, 12)) 
sns.barplot(x=dev_type_totals, y=dev_type_totals.index)
plt.title('How do developers identify their roles?')
plt.xlabel('Count')
plt.ylabel(None);

As one might expect, the most common roles include "Developer" in the name.

Asking and Answering Questions

We've already gained several insights about the respondents and the programming community by exploring individual columns of the dataset. Let's ask some specific questions and try to answer them using data frame operations and visualizations.

Q: What are the most popular programming languages in 2020?

To answer this, we can use the LanguageWorkedWith column. Similar to DevType, respondents were allowed to choose multiple options here.

survey_df.LanguageWorkedWith

0                                   C#;HTML/CSS;JavaScript
1                                         JavaScript;Swift
2                                 Objective-C;Python;Swift
3                                                      NaN
4                                        HTML/CSS;Ruby;SQL
                               ...                        
64456                                                  NaN
64457    Assembly;Bash/Shell/PowerShell;C;C#;C++;Dart;G...
64458                                                  NaN
64459                                             HTML/CSS
64460                      C#;HTML/CSS;Java;JavaScript;SQL
Name: LanguageWorkedWith, Length: 64291, dtype: object

languages_worked_df = split_multicolumn(survey_df.LanguageWorkedWith)
languages_worked_df

	C#	HTML/CSS	JavaScript	Swift	Objective-C	Python	Ruby	SQL	Java	PHP	...	VBA	Perl	Scala	C++	Go	Haskell	Rust	Dart	Julia	Assembly
0	True	True	True	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	False	False	True	True	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	False	False	False	True	True	True	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
4	False	True	False	False	False	False	True	True	False	False	...	False	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
64456	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
64457	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
64458	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
64459	False	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
64460	True	True	True	False	False	False	False	True	True	False	...	False	False	False	False	False	False	False	False	False	False

64291 rows × 25 columns

It appears that a total of 25 languages were included among the options. Let's aggregate these to identify the percentage of respondents who selected each language.

languages_worked_pct = languages_worked_df.mean().sort_values(ascending=False) * 100
languages_worked_pct

JavaScript               59.896409
HTML/CSS                 55.805634
SQL                      48.445350
Python                   39.002349
Java                     35.620849
Bash/Shell/PowerShell    29.240485
C#                       27.801714
PHP                      23.126099
TypeScript               22.463486
C++                      21.111820
C                        19.234419
Go                        7.756918
Kotlin                    6.885878
Ruby                      6.223266
Assembly                  5.442441
VBA                       5.389557
Swift                     5.224682
R                         5.059806
Rust                      4.496741
Objective-C               3.600815
Dart                      3.513711
Scala                     3.148186
Perl                      2.754662
Haskell                   1.858736
Julia                     0.779269
dtype: float64
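
The aggregation works because the mean of a boolean column equals the fraction of True values in it. Here is a minimal sketch on a made-up frame (the values are invented for illustration and are not survey data):

```python
import pandas as pd

# Toy boolean frame: one row per respondent, one column per language.
toy_df = pd.DataFrame({
    'Python': [True, False, True, True],
    'Rust':   [False, False, True, False],
})

# mean() treats True as 1 and False as 0, so it returns the share of
# respondents who selected each language; multiply by 100 for a percentage.
toy_pct = toy_df.mean().sort_values(ascending=False) * 100
print(toy_pct)
```

Note that respondents who skipped the question count as False in every column, so they stay in the denominator.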

We can plot this information using a horizontal bar chart.

plt.figure(figsize=(12, 12))
sns.barplot(x=languages_worked_pct, y=languages_worked_pct.index)
plt.title("Languages used in the past year");
plt.xlabel('Percentage');

Perhaps unsurprisingly, JavaScript & HTML/CSS come out on top, as web development is one of today's most sought-after skills. It also happens to be one of the easiest to get started with. SQL is necessary for working with relational databases, so it's no surprise that nearly half of the respondents work with SQL regularly. Python seems to be the popular choice for other forms of development, beating out Java, which was the industry standard for server & application development for over two decades.

Q: Which languages are most people interested in learning over the next year?

For this, we can use the LanguageDesireNextYear column, with similar processing as the previous one.

languages_interested_df = split_multicolumn(survey_df.LanguageDesireNextYear)
languages_interested_pct = languages_interested_df.mean().sort_values(ascending=False) * 100
languages_interested_pct

Python                   41.150394
JavaScript               40.430231
HTML/CSS                 32.032477
SQL                      30.803689
TypeScript               26.456269
C#                       21.060491
Java                     20.464762
Go                       19.433513
Bash/Shell/PowerShell    18.058515
Rust                     16.271329
C++                      15.014543
Kotlin                   14.761009
PHP                      10.945544
C                         9.362119
Swift                     8.693285
Dart                      7.308955
R                         6.571682
Ruby                      6.423916
Scala                     5.327340
Haskell                   4.594733
Assembly                  3.767246
Julia                     2.541569
Objective-C               2.339363
Perl                      1.760744
VBA                       1.608312
dtype: float64

plt.figure(figsize=(12, 12))
sns.barplot(x=languages_interested_pct, y=languages_interested_pct.index)
plt.title("Languages people are interested in learning over the next year");
plt.xlabel('Percentage');

Once again, it's not surprising that Python is the language most people are interested in learning - since it is an easy-to-learn general-purpose programming language well suited for a variety of domains: application development, numerical computing, data analysis, machine learning, big data, cloud automation, web scraping, scripting, etc. We're using Python for this very analysis, so we're in good company!

Q: Which are the most loved languages, i.e., a high percentage of people who have used the language want to continue learning & using it over the next year?

While this question may seem tricky at first, it's straightforward to solve using Pandas array operations. Here's what we can do:

Create a new data frame languages_loved_df that contains a True value for a language only if the corresponding values in languages_worked_df and languages_interested_df are both True
Take the column-wise sum of languages_loved_df and divide it by the column-wise sum of languages_worked_df to get the percentage of respondents who "love" the language
Sort the results in decreasing order and plot a horizontal bar graph

languages_loved_df = languages_worked_df & languages_interested_df

languages_loved_pct = (languages_loved_df.sum() * 100 / languages_worked_df.sum()).sort_values(ascending=False)

plt.figure(figsize=(12, 12))
sns.barplot(x=languages_loved_pct, y=languages_loved_pct.index)
plt.title("Most loved languages")
plt.xlabel('Percentage');
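
To see why this calculation measures "love" rather than general interest, here is a small sketch of the same steps on hypothetical data for four respondents (the frames worked and interested are made up for illustration):

```python
import pandas as pd

# Hypothetical data: which languages each respondent has used...
worked = pd.DataFrame({
    'Python': [True, True, True, False],
    'Rust':   [True, False, False, False],
})
# ...and which ones they want to use over the next year.
interested = pd.DataFrame({
    'Python': [True, False, True, True],
    'Rust':   [True, True, False, False],
})

# Element-wise AND: True only where a respondent both used the language
# and wants to keep using it.
loved = worked & interested

# Share of current users who want to continue, per language.
loved_pct = (loved.sum() * 100 / worked.sum()).sort_values(ascending=False)
print(loved_pct)
```

In this toy data, the respondents who are interested in a language but have never used it do not affect the result, because the denominator counts only people who have worked with the language.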

Rust has been StackOverflow's most-loved language for five years in a row (2016 through 2020). The second most-loved language is TypeScript, a popular alternative to JavaScript for web development.

Python features at number 3, despite already being one of the most widely-used languages in the world. Python has a solid foundation, is easy to learn & use, has a large ecosystem of domain-specific libraries, and a massive worldwide community.

Q: In which countries do developers work the highest number of hours per week? Consider countries with more than 250 responses only.

To answer this question, we'll need to use the groupby data frame method to aggregate the rows for each country. We'll also need to filter the results to only include the countries with more than 250 respondents.

countries_df = survey_df.groupby('Country')[['WorkWeekHrs']].mean().sort_values('WorkWeekHrs', ascending=False)

h_response_countries_df = countries_df.loc[survey_df.Country.value_counts() > 250].head(15)
h_response_countries_df

	WorkWeekHrs
Country
Iran	44.337748
Israel	43.915094
China	42.150000
United States	41.799858
Greece	41.402724
Viet Nam	41.391667
South Africa	41.023460
Turkey	40.982143
Sri Lanka	40.612245
New Zealand	40.457551
Belgium	40.444444
Canada	40.208837
Hungary	40.194340
India	40.100349
Bangladesh	40.097458
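
The group-then-filter pattern can be easier to follow on a tiny example. The sketch below uses invented values, and a threshold of 2 responses stands in for the 250 used above. It also reindexes the response counts explicitly so that the boolean mask lines up with the grouped frame row by row, which is an illustrative variation rather than the exact code in this notebook.

```python
import pandas as pd

# Toy data with invented values.
toy_df = pd.DataFrame({
    'Country':     ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'WorkWeekHrs': [40.0, 44.0, 42.0, 38.0, 40.0, 36.0, 60.0],
})

# Mean hours per country, highest first.
mean_hours = (toy_df.groupby('Country')[['WorkWeekHrs']]
              .mean()
              .sort_values('WorkWeekHrs', ascending=False))

# Number of responses per country, indexed by country name.
counts = toy_df['Country'].value_counts()

# Keep only countries with more than 2 responses. Reindexing the counts to
# the order of mean_hours makes the alignment explicit.
enough = counts.reindex(mean_hours.index) > 2
filtered = mean_hours[enough]
print(filtered)
```

Country C has the highest mean but only one response, so it is dropped. This is the reason for the minimum-response filter: a handful of unusual answers should not put a country at the top of the ranking.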

h_response_countries_df.plot(kind='bar')
plt.title('In which countries do developers work the highest number of hours per week?')
plt.xticks(rotation=75);

Asian countries such as Iran, Israel, and China have the highest average working hours, followed by the United States. However, there isn't much variation overall, and the average is around 40 hours per week.

Q: How important is it to start young to build a career in programming?

Let's create a scatter plot of Age vs. YearsCodePro (i.e., years of professional coding experience) to answer this question.

schema.YearsCodePro

'NOT including education, how many years have you coded professionally (as a part of your work)?'

sns.scatterplot(x='Age', y='YearsCodePro', hue='Hobbyist', data=survey_df)
plt.xlabel("Age")
plt.ylabel("Years of professional coding experience");

You can see points all over the graph, which indicates that you can start programming professionally at any age. Many people who have been coding for several decades professionally also seem to enjoy it as a hobby.

We can also view the distribution of the Age1stCode column to see when the respondents tried programming for the first time.

plt.title(schema.Age1stCode)
ax = sns.histplot(x=survey_df.Age1stCode, bins=30, kde=True);
ax.lines[0].set_color('crimson');

As you might expect, most people seem to have had some exposure to programming before the age of 40. However, there are people of all ages and walks of life learning to code.

Summary

We've drawn many inferences from the survey. Here's a summary of a few of them:

Based on the survey respondents' demographics, we can infer that the survey is somewhat representative of the overall programming community. However, it has fewer responses from programmers in non-English-speaking countries and women & non-binary genders.
The programming community is not as diverse as it can be. Although things are improving, we should make more efforts to support & encourage underrepresented communities, whether in terms of age, country, race, gender, or otherwise.
Although most programmers hold a college degree, a reasonably large percentage did not have computer science as their college major. Hence, a computer science degree isn't compulsory for learning to code or building a career in programming.
A significant percentage of programmers either work part-time or as freelancers, which can be a great way to break into the field, especially when you're just getting started.
JavaScript & HTML/CSS are the most used programming languages in 2020, closely followed by SQL & Python.
Python is the language most people are interested in learning - since it is an easy-to-learn general-purpose programming language well suited for various domains.
Rust and TypeScript are the most "loved" languages in 2020, both of which have small but fast-growing communities. Python is a close third, despite already being a widely used language.
Programmers worldwide seem to be working for around 40 hours a week on average, with slight variations by country.
You can learn and start programming professionally at any age. You're likely to have a long and fulfilling career if you also enjoy programming as a hobby.

zain2525/Analysation-for-stackoverflow-survey-with-python

Folders and files

Latest commit

History

Repository files navigation

analyzation for stackoverflow-survey with python

Introduction

Importing Libraries

Data Preparation & Cleaning

Exploratory Analysis and Visualization

Country

Age

Gender

Education Level

Employment

Asking and Answering Questions

Q: What are the most popular programming languages in 2020?

Q: Which languages are the most people interested to learn over the next year?

Q: Which are the most loved languages, i.e., a high percentage of people who have used the language want to continue learning & using it over the next year?

Q: In which countries do developers work the highest number of hours per week? Consider countries with more than 250 responses only.

Q: How important is it to start young to build a career in programming?

summary

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages