The objective of this project is to build a machine learning model that predicts whether a job applicant is suitable for a particular role based on their background. This supports recruitment platforms and HR tech firms in streamlining candidate screening, reducing manual effort, and improving the quality of shortlisted applicants.
This analysis explores a dataset of job applicants to classify them as "suitable" or "not suitable" for a job role. The dataset, obtained from Kaggle, includes a variety of candidate attributes such as education, major, years of experience, and company background. The project applies several classification algorithms to create a predictive model that can assist hiring platforms and HR systems in making faster, data-driven decisions.
- Data cleaning and preprocessing
- Exploratory data analysis (EDA) to identify patterns
- Classification modeling and evaluation
- Hyperparameter tuning and feature importance analysis
Recruiters are overwhelmed by large volumes of job applications, making manual resume screening inefficient, inconsistent, and prone to bias. Companies need a solution to:
- Automate early-stage screening
- Identify top candidates faster
- Reduce time-to-hire and hiring costs
This project seeks to answer:
- Can we predict applicant suitability reliably?
- Which candidate features are most predictive?
- Which classification model performs best for this task?
The dataset is sourced from Kaggle (HR Analytics: Job Change of Data Scientists) and includes:
- Features: Education level, major, years of experience, gender, company type, university enrollment, etc.
- Target: Suitability (1 = Suitable, 0 = Not Suitable)
- Combined training and test sets for exploratory analysis
- Handled missing values and performed categorical encoding
- Balanced the dataset to account for class imbalance
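The preprocessing steps above can be sketched as follows. This is a minimal sketch, not the exact notebook code: the file path is hypothetical, the column names follow the Kaggle dataset's schema, and simple upsampling with `sklearn.utils.resample` stands in for whatever balancing method was actually used.

```python
# Minimal preprocessing sketch (assumptions: file path, column names, upsampling strategy).
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("aug_train.csv")  # hypothetical path to the Kaggle training file

# Fill missing categorical values with an explicit "Unknown" category
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("Unknown")

# One-hot encode the categorical features
df = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)

# Upsample the minority (suitable) class to address class imbalance
majority = df[df["target"] == 0]
minority = df[df["target"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
```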
We applied both descriptive analysis and classification modeling. The following classification algorithms were compared:
- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting
Each model was evaluated using:
- Accuracy
- Precision
- Recall
- F1-Score
- ROC-AUC Score
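A minimal sketch of the comparison loop, continuing from the preprocessed `df_balanced` frame above. The hyperparameters shown are scikit-learn defaults and are not necessarily those that produced the results table below.

```python
# Sketch of the model comparison loop (assumption: default hyperparameters, 80/20 split).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X = df_balanced.drop(columns=["target"])
y = df_balanced["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: acc={accuracy_score(y_test, pred):.3f} "
          f"prec={precision_score(y_test, pred):.3f} "
          f"rec={recall_score(y_test, pred):.3f} "
          f"f1={f1_score(y_test, pred):.3f} "
          f"auc={roc_auc_score(y_test, proba):.3f}")
```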
| Model | Accuracy | Precision (Class 1) | Recall (Class 1) | ROC-AUC |
|---|---|---|---|---|
| Logistic Regression | 75.2% | 53% | 36% | 0.70 |
| Decision Tree | 71.8% | 49% | 38% | 0.66 |
| Random Forest | 76.3% | 54% | 39% | 0.71 |
| Gradient Boosting | 77.1% | 56% | 40% | 0.72 |
- Applicants with relevant experience are significantly more likely to be labeled suitable.
- Clear class imbalance exists: fewer candidates are marked suitable.
- Candidates enrolled in full-time programs show higher suitability scores.
- Suitable applicants mostly come from private companies; missing data correlates with unsuitability.
- Feature Importance: Relevant experience, education, and company background are strong predictors of suitability.
- Model Effectiveness: Gradient Boosting performed best overall, with the highest accuracy and the most balanced precision and recall for suitable applicants.
- Business Value: The model can help automate early-stage applicant filtering, saving time and improving shortlisting quality.
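To illustrate the feature-importance finding, a short sketch that ranks features using the fitted Gradient Boosting model from the comparison loop above; the exact ranking or plotting code used in the project may differ.

```python
# Sketch: rank features by importance from the fitted Gradient Boosting model above.
import pandas as pd

gb = models["Gradient Boosting"]  # fitted in the comparison loop
importances = (pd.Series(gb.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
print(importances.head(10))  # top predictors, e.g. experience- and education-related columns
```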
- Integrate the Gradient Boosting model into recruitment platforms for initial screening.
- Use key features (experience, education, company type) to prioritize applicants.
- Periodically retrain the model with new data to adapt to changing hiring trends.
- Include human oversight for final candidate decisions to avoid missing qualified applicants.
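As a sketch of how such prioritization could work in practice, the helper below ranks already-encoded applicant records by predicted suitability. The `shortlist` function and the `new_applicants` DataFrame are hypothetical names introduced for illustration, not part of the project code.

```python
# Sketch: score new, already-preprocessed applicants and keep the highest-probability ones.
import pandas as pd

def shortlist(model, new_applicants: pd.DataFrame, top_n: int = 50) -> pd.DataFrame:
    """Return the top_n applicants ranked by predicted suitability probability."""
    scored = new_applicants.copy()
    scored["suitability_score"] = model.predict_proba(new_applicants)[:, 1]
    return scored.sort_values("suitability_score", ascending=False).head(top_n)
```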
- Fairness & Bias Testing: Evaluate the model's performance across gender and university tiers.
- Model Deployment: Build an applicant scoring dashboard.
- Feature Expansion: Include skill tags or resume text (via NLP) to improve accuracy.
- Hyperparameter Tuning: Further refine Gradient Boosting for optimal performance.
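A possible starting point for that tuning step, assuming scikit-learn's `GridSearchCV`; the parameter grid below is illustrative, not a result from the project.

```python
# Sketch of the proposed Gradient Boosting tuning step (grid values are assumptions).
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3, 4],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```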