Hirefit

HIRE FIT: Resume-Job Matching Platform

Project Overview

HIRE FIT is a Big Data and machine learning application that automates resume screening and matches candidates to job descriptions. It uses Hadoop, Hive, Amazon S3, and Amazon SageMaker to preprocess, store, query, and analyze large volumes of semi-structured data. An XGBoost binary classifier predicts resume-job matches based on skill overlap.

Motivation

In today's fast-paced recruitment environment, finding the right candidate quickly is critical. Preprocessing resumes and job postings with Big Data technologies and scoring them with machine learning models can reduce hiring time and improve candidate-job fit.

Objectives

Preprocess and clean semi-structured resume and job posting datasets.
Structure the data using Hive and store it in Amazon S3.
Build binary vectors representing the skills in each resume and job posting.
Train an XGBoost classifier on Amazon SageMaker to predict resume-job matches.
Evaluate and visualize the model's predictions.

Architecture

(Diagram Flow):

Data Ingestion → Hadoop (Cleaning) → Hive (Structuring) → Amazon S3 (Storage) → Amazon SageMaker (Feature Engineering + Model Training) → Predictions

Datasets

Source: Bright Data and Hugging Face.
Job Postings Dataset: Includes company names, job locations, and descriptions.
Resumes Dataset: Semi-structured resumes categorized by job domain.

Technology Stack

Big Data Tools: Hadoop, Hive
Cloud Services: Amazon S3, Amazon SageMaker, Amazon EMR
Programming & Libraries: Java (Maven Project), Python (Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn)
Machine Learning: XGBoost
Visualization: Matplotlib, Seaborn

Workflow

Data Preprocessing
- Create Maven projects for both datasets.
- Run Hadoop MapReduce jobs.
- Clean, structure, and store the data.
Data Transformation
- Create Hive tables.
- Perform queries: filter, aggregate, and standardize data.
Storage
- Save transformed datasets in Amazon S3.
Feature Engineering
- Generate binary skill vectors.
- Calculate Cosine Similarity.
Model Building
- Train an XGBoost model on labeled pairs.
- Optimize thresholds and handle data imbalance.
Evaluation and Visualization
- Analyze performance using confusion matrix, scatter plots, and boxplots.
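
The Feature Engineering step above can be illustrated with a minimal sketch. It assumes a small, hypothetical skill vocabulary (`SKILL_VOCAB`) and simple whitespace tokenization; the project's actual vocabulary and parsing are not shown in this README, so treat the names and sample texts below as placeholders.

```python
import math

# Hypothetical skill vocabulary; the project's real vocabulary is not shown in this README.
SKILL_VOCAB = ["python", "java", "sql", "hadoop", "hive", "aws"]


def skill_vector(text, vocab=SKILL_VOCAB):
    """Return a 0/1 list marking which vocabulary skills appear in the text."""
    tokens = set(text.lower().replace(",", " ").split())
    return [1 if skill in tokens else 0 for skill in vocab]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up resume and job texts, for illustration only.
resume_vec = skill_vector("Experienced in Python, SQL and Hive")
job_vec = skill_vector("Looking for Python, SQL, AWS")
similarity = cosine_similarity(resume_vec, job_vec)
```

With binary vectors, the dot product counts the skills shared by the resume and the job posting, so the similarity grows with skill overlap and is normalized by how many skills each side lists.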

How to Run

Prerequisites:

AWS account (S3, EMR, SageMaker access)

Python 3.8+

Java 8+

Maven installed

1. Preprocessing and Cleaning:

# Build Maven project
mvn clean install

# Submit Hadoop jobs
hadoop jar your-preprocessing-jar.jar input_path output_path

2. Structuring Data:

-- Create tables in Hive
CREATE TABLE resume (...);
CREATE TABLE job_listings (...);

-- Perform transformations
INSERT INTO standardized_job_listings SELECT ...;

3. Storage:

Save Hive outputs to Amazon S3.

4. Feature Extraction and Modeling:

# Load data
import pandas as pd
import xgboost as xgb

# Feature engineering
# Train XGBoost classifier
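
The Workflow mentions optimizing thresholds and handling data imbalance. The sketch below shows one possible way to do both in plain Python, under stated assumptions: F1 is used as the selection criterion (the README does not say which metric was used), and the imbalance is expressed as a negative-to-positive ratio, which is the usual starting value for XGBoost's `scale_pos_weight` parameter. Class weighting is shown here as an alternative to the resampling the project reports using. The labels and scores are made up.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 score when every score >= threshold is predicted as a match (1)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, scores, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(labels, scores, t))


def positive_class_weight(labels):
    """Ratio of negative to positive examples in a list of 0/1 labels."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return (len(labels) - positives) / positives


# Made-up labels and predicted match probabilities, for illustration only.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.55, 0.8]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]

threshold = best_threshold(labels, scores, candidates)
weight = positive_class_weight(labels)
```

In a real run, `scores` would be the classifier's predicted probabilities on a held-out validation set, and the threshold would be chosen on that set rather than on the test data.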

5. Visualization:

# Generate scatter plots, heatmaps, confusion matrices
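
As a rough illustration of what goes into the confusion matrix heatmap, the sketch below counts the four outcomes for binary labels in plain Python. The labels and predictions are made up, and the row/column layout (rows are actual classes, columns are predicted classes) is an assumption chosen to match the common convention.

```python
def confusion_matrix(labels, predictions):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels and predictions."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(labels, predictions):
        matrix[actual][predicted] += 1
    return matrix


# Made-up ground truth and model output, for illustration only.
labels = [0, 0, 0, 1, 1, 0, 1, 0]
predictions = [0, 1, 0, 1, 0, 0, 1, 0]

matrix = confusion_matrix(labels, predictions)
accuracy = (matrix[0][0] + matrix[1][1]) / len(labels)
```

A nested list like `matrix` can typically be passed to `seaborn.heatmap` with `annot=True` to draw the annotated heatmap.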

Results and Visualizations

Cleaned datasets stored in Amazon S3.
Cosine Similarity vs Matching Score plots.
Top 10 matched candidates visualized.
Scatter plots for Resume vs Job Description Length.
Confusion Matrix with heatmap.
Boxplots showing Matching Score Distribution.

Challenges and Solutions

Data Preprocessing Complexity: Maven build and debug steps resolved JAR file issues.
Cluster Setup: Carefully configured EMR clusters.
Handling Data Imbalance: Applied resampling techniques.
Semantic Limitations: Future work includes semantic embeddings.

Contributors

Name	Contributions
Arun Govindgari	Data Preprocessing, Model Building
Chinmai Kaveti	Data Preprocessing, Visualizations
Litesh Perumalla	Data Preprocessing, Visualizations
Pavan Kalyan Natukula	Model Building, Analysis, Documentation

All members contributed equally to the success of this project.

License

This project is licensed under the MIT License.

Thank you for exploring our project! Feel free to star ⭐ the repository if you find it useful!
