RESUME_CLASSIFIER

A Multi-Class Classification NLP Project

Introduction

Natural Language Processing (NLP) is a popular and fast-growing technology that is here to stay. It deals with how machines understand the language humans speak and write in everyday life. In this repository, I walk through a simple project of that kind: classifying an applicant’s resume.

Conventional hiring techniques are becoming more labor-intensive, and therefore less efficient, as online recruitment grows. Companies receive a very large number of resumes across multiple categories for their vacant positions.

Using NLP and Machine Learning (ML) techniques, the categorization of applicants’ resumes for the available positions can be automated. In this repo, I developed a simplified version of such a multi-class classifier in Python.

Project Approach

1. Importing and Installing Necessary Libraries

I imported the necessary libraries, such as numpy, pandas, sklearn, and nltk, for use in the code.

2. Uploading and reading the csv file

I acquired the data from the link below and loaded it into a pandas DataFrame: https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset
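A minimal sketch of the loading step is shown below. The column names (ID, Resume_str, Category) and the two sample rows are assumptions made for illustration; an in-memory CSV stands in for the downloaded file so the snippet runs on its own.

```python
import io
import pandas as pd

# Assumption: the CSV has a free-text resume column and a category column.
# The column names and rows below are illustrative, not taken from the dataset.
# In practice, pass the path of the downloaded CSV file to pd.read_csv instead.
sample_csv = io.StringIO(
    "ID,Resume_str,Category\n"
    "1,Managed payroll and onboarding for 200 staff,HR\n"
    "2,Built REST APIs in Python and maintained Linux servers,INFORMATION-TECHNOLOGY\n"
)

df = pd.read_csv(sample_csv)

# Quick look at the size and the first rows of the DataFrame
print(df.shape)
print(df.head())
```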

3. Generic Cleaning of Data

I checked for null values, inspected the column types with dtypes, and removed duplicate rows with drop_duplicates as a general cleanup of the data.
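The cleanup could look roughly like the sketch below. The toy DataFrame and its column names are hypothetical, and dropping rows with missing text is one possible choice rather than necessarily what the notebook does.

```python
import pandas as pd

# Hypothetical toy data with one missing value and one duplicated row
df = pd.DataFrame({
    "Resume_str": ["python developer", "python developer", None, "sales manager"],
    "Category": ["IT", "IT", "HR", "SALES"],
})

# Count missing values per column and inspect the column types
null_counts = df.isnull().sum()
print(null_counts)
print(df.dtypes)

# Drop rows without resume text, then drop exact duplicate rows
clean_df = (
    df.dropna(subset=["Resume_str"])
      .drop_duplicates()
      .reset_index(drop=True)
)
print(clean_df)
```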

4. Encoding and analysing distribution of target column

The target column contains 24 resume categories. The information-technology and business-development classes are the largest with 119 resumes, whereas the bpo class is the smallest with 21 resumes. I encoded each category as an integer from 0 to 23 using the map function.
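The distribution check and integer encoding can be sketched as follows. The four-row DataFrame is a hypothetical stand-in for the real target column, and assigning integers in alphabetical order is an assumption; any fixed mapping works.

```python
import pandas as pd

# Hypothetical stand-in for the target column
df = pd.DataFrame({"Category": ["HR", "BPO", "HR", "INFORMATION-TECHNOLOGY"]})

# Class distribution: number of resumes per category
print(df["Category"].value_counts())

# Build a category -> integer mapping (alphabetical order is an assumption)
categories = sorted(df["Category"].unique())
label_map = {name: idx for idx, name in enumerate(categories)}

# Encode the target column with map
df["label"] = df["Category"].map(label_map)
print(df)
```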

My goal is to develop a machine learning classifier that correctly predicts the class of an applicant’s resume. Since there are 24 categories, this is a multi-class classification problem.

5. NLP steps to convert resume content to simplified text

The text data must be cleaned before it is vectorized. The steps I followed for cleaning the text are:

Removing punctuation and non-ASCII characters, keeping only letters and digits
Removing one- and two-letter words
Converting all uppercase text to lowercase
Removing stop words
Lemmatizing each word to reduce it to its root form
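The cleaning steps above can be sketched as a single function. To keep the snippet free of downloads, it uses a tiny hand-written stop-word set and an identity lemmatizer by default; both are assumptions for illustration. With nltk, the stopwords corpus and the lemmatize method of WordNetLemmatizer could be plugged in instead.

```python
import re

# Tiny illustrative stop-word set (assumption); a full list such as
# nltk's English stopwords would be used in practice.
STOP_WORDS = {"the", "and", "for", "with", "was", "are", "this", "that", "have", "has"}

def clean_text(text, lemmatize=lambda word: word):
    """Apply the cleaning steps in order and return the simplified text.

    `lemmatize` is any function mapping a word to its root form; the
    default leaves words unchanged.
    """
    # 1. Replace everything except ASCII letters and digits with a space
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # 2. Drop one- and two-letter words
    tokens = [t for t in text.split() if len(t) > 2]
    # 3. Convert to lowercase
    tokens = [t.lower() for t in tokens]
    # 4. Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. Lemmatize each remaining word
    return " ".join(lemmatize(t) for t in tokens)

print(clean_text("Managed the A/B testing of 3 web-apps, and led QA!"))
# prints: managed testing web apps led
```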

6. Fit XGBoost model on vectorized data

In NLP problems, text data must be converted to numbers before any machine learning model can be applied. This process is called vectorization. Here, vectorization is done with a Bag of Words (BoW) model, which represents each document by the words it contains and ignores word order.

I used a unigram (1-gram) TF-IDF approach with the maximum number of features set to 3000. After vectorizing the data, I split it into train and test sets and fit an Extreme Gradient Boosting (XGBoost) model on the training set.

The model was evaluated using the AUROC score. The resulting AUROC is 0.96, which indicates that the model separates the classes well; AUROC is a ranking metric, so this is not the same as 96% accuracy.
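For a multi-class problem, AUROC can be computed from predicted class probabilities with scikit-learn's roc_auc_score. The sketch below uses hand-written probabilities for three classes; the one-vs-rest strategy and macro averaging are assumptions, since the averaging used in the notebook is not stated here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# True labels and made-up predicted probabilities (each row sums to 1)
y_true = np.array([0, 1, 2, 0, 1, 2])
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# One-vs-rest AUROC, averaged over classes with equal weight
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")
print(auc)  # 1.0 here, because every true class is ranked above the others
```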

Refer to resume_classifier.ipynb for the code for the steps above.

Further Scope

This project built and evaluated a machine learning model that predicts the category a resume belongs to. However, there is still room for improvement and further scope, including:

Imbalanced data: the dataset is not balanced, since the classes are not equally represented. Balancing techniques need to be applied to the data to get more accurate results.
Hyperparameter tuning: only a basic XGBoost model has been implemented in this project. Its hyperparameters need further tuning to improve it.
Model comparison: in addition to the model evaluated here, other classification models could be implemented and compared to identify the best-performing model for this problem.
Exploratory Data Analysis (EDA): a more detailed EDA is needed to better understand the data and to train the models on it.
Deployment: the project needs to be deployed as a proper app using an open-source app framework such as Streamlit.
NLP: only very basic NLP techniques have been applied in this project. More advanced techniques need to be applied to reduce the size of the data further.
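As one possible starting point for the imbalanced-data item, the sketch below computes per-sample weights that up-weight the smaller classes. This is a suggestion, not something implemented in the project, and the label array is made up.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Made-up imbalanced labels: class 0 has four samples, classes 1 and 2 have two
y = np.array([0, 0, 0, 0, 1, 1, 2, 2])

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_sample_weight(class_weight="balanced", y=y)
print(weights)
```

The resulting array could be passed as sample_weight when fitting the classifier, so that errors on rare classes count more during training.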

Anitha-K-0711/resume_classifier
