HIRE FIT is a Big Data and Machine Learning-powered platform that automates resume screening and predicts candidate-job fit using Hadoop, Hive, Amazon S3, AWS SageMaker, and an XGBoost model trained on skill-based binary vectors. Built for efficient hiring at scale.

liteshperumalla/Hirefit

Hirefit

HIRE FIT: Resume-Job Matching Platform

Project Overview

HIRE FIT is a Big Data and Machine Learning application that automates resume screening and matches candidates to job descriptions. It uses Hadoop, Hive, Amazon S3, and AWS SageMaker to preprocess, store, query, and analyze large volumes of semi-structured data. An XGBoost binary classifier then predicts whether a resume matches a job posting based on skill overlap.


Motivation

In today's fast-paced recruitment environment, finding the right candidate quickly is critical. HIRE FIT preprocesses resumes and job postings with Big Data technologies and applies machine learning models to them, with the goal of reducing hiring time and improving candidate-job fit.


Objectives

  • Preprocess and clean semi-structured resume and job posting datasets.
  • Structure the data using Hive and store it in Amazon S3.
  • Build binary vectors representing skills.
  • Train an XGBoost classifier on AWS SageMaker to predict resume-job matches.
  • Evaluate and visualize the model's predictions.

Architecture

Hire Fit Architecture Image

(Diagram Flow):

  • Data Ingestion → Hadoop (Cleaning) → Hive (Structuring) → Amazon S3 (Storage) → AWS SageMaker (Feature Engineering + Model Training) → Predictions

Datasets

  • Source: Bright Data and Hugging Face.
  • Job Postings Dataset: Includes company names, job locations, and descriptions.
  • Resumes Dataset: Semi-structured resumes categorized by job domain.

Technology Stack

  • Big Data Tools: Hadoop, Hive
  • Cloud Services: Amazon S3, Amazon SageMaker, Amazon EMR
  • Programming & Libraries: Java (Maven Project), Python (Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn)
  • Machine Learning: XGBoost
  • Visualization: Matplotlib, Seaborn

Workflow

  1. Data Preprocessing

    • Create Maven projects for both datasets.
    • Run Hadoop MapReduce jobs.
    • Clean, structure, and store the data.
  2. Data Transformation

    • Create Hive tables.
    • Perform queries: filter, aggregate, and standardize data.
  3. Storage

    • Save transformed datasets in Amazon S3.
  4. Feature Engineering

    • Generate binary skill vectors.
    • Calculate cosine similarity between resume and job skill vectors.
  5. Model Building

    • Train an XGBoost model on labeled pairs.
    • Optimize thresholds and handle data imbalance.
  6. Evaluation and Visualization

    • Analyze performance using confusion matrix, scatter plots, and boxplots.
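
The feature engineering step (binary skill vectors plus cosine similarity) can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the vocabulary, the skill lists, and the helper names `skill_vector` and `cosine_similarity` are all hypothetical.

```python
import math

def skill_vector(skills, vocabulary):
    """Return a 0/1 vector marking which vocabulary skills appear in `skills`.

    Assumes `vocabulary` is already lowercase; `skills` is normalized here.
    """
    present = {s.strip().lower() for s in skills}
    return [1 if skill in present else 0 for skill in vocabulary]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary and skill lists, for illustration only.
VOCABULARY = ["python", "sql", "hadoop", "hive", "java"]
resume_vec = skill_vector(["Python", "SQL", "Hive"], VOCABULARY)
job_vec = skill_vector(["python", "hadoop", "hive"], VOCABULARY)
similarity = cosine_similarity(resume_vec, job_vec)
```

For binary vectors the dot product is just the number of shared skills, so the similarity here is the shared-skill count divided by the geometric mean of the two skill counts.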

How to Run

Prerequisites:

  • AWS account (S3, EMR, SageMaker access)
  • Python 3.8+
  • Java 8+
  • Maven installed

1. Preprocessing and Cleaning:

# Build Maven project
mvn clean install

# Submit Hadoop jobs
hadoop jar your-preprocessing-jar.jar input_path output_path
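
The project's cleaning logic runs as Java MapReduce jobs and is not shown in this README. As a rough illustration of the kind of per-record normalization such a job might apply, here is a Python sketch. The specific steps (dropping HTML tags, unescaping entities, collapsing whitespace, lowercasing) and the name `clean_record` are assumptions, not the project's confirmed behavior.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")

def clean_record(text):
    """Normalize one free-text field (hypothetical cleaning rules).

    Tags are removed before entities are unescaped so that an escaped
    "&lt;" in the text is not mistaken for the start of a tag.
    """
    text = _TAG.sub(" ", text)          # drop HTML-like tags
    text = html.unescape(text)          # "&amp;" -> "&"
    text = _WHITESPACE.sub(" ", text)   # collapse runs of whitespace
    return text.strip().lower()

cleaned = clean_record("<p>Python &amp;  SQL</p>\n")
```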

2. Structuring Data:

-- Create tables in Hive
CREATE TABLE resume (...);
CREATE TABLE job_listings (...);

-- Perform transformations
INSERT INTO standardized_job_listings SELECT ...;

3. Storage:

  • Save Hive outputs to Amazon S3.
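
One way to copy exported Hive output files into S3 from Python is sketched below. It assumes the files have already been written to a local directory, and that the client passed in behaves like a boto3 S3 client (which provides `upload_file(Filename, Bucket, Key)`). The helper name `upload_directory` is hypothetical.

```python
from pathlib import Path

def upload_directory(s3_client, local_dir, bucket, prefix):
    """Upload every file under local_dir to the bucket, preserving relative paths.

    s3_client is expected to expose boto3's upload_file(Filename, Bucket, Key).
    Returns the list of object keys that were uploaded, in sorted path order.
    """
    root = Path(local_dir)
    uploaded = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        key = f"{prefix.rstrip('/')}/{path.relative_to(root).as_posix()}"
        s3_client.upload_file(Filename=str(path), Bucket=bucket, Key=key)
        uploaded.append(key)
    return uploaded
```

With boto3 installed and AWS credentials configured, a call might look like `upload_directory(boto3.client("s3"), "hive_output", "my-bucket", "hirefit/cleaned")`, where the directory, bucket, and prefix are placeholders.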

4. Feature Extraction and Modeling:

# Load data
import pandas as pd
import xgboost as xgb

# Feature engineering
# Train XGBoost classifier

5. Visualization:

# Generate scatter plots, heatmaps, confusion matrices
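
Before plotting a confusion matrix, the underlying counts have to be computed at some decision threshold. The sketch below does this in plain Python and also picks the threshold with the best F1 from a candidate list, which is one simple way to approach threshold optimization. The labels, scores, and function names are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when that denominator is zero."""
    _, fp, fn, tp = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores, candidates):
    """Return (threshold, f1) for the candidate with the highest F1 (first wins ties)."""
    best, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best, best_f1 = t, score
    return best, best_f1

# Toy labels and scores, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6]
threshold, f1 = best_threshold(y_true, scores, [0.3, 0.5, 0.7])
matrix = confusion_counts(y_true, [1 if s >= threshold else 0 for s in scores])
```

The counts can be arranged as `[[tn, fp], [fn, tp]]` and passed to `seaborn.heatmap(..., annot=True)` to draw the heatmap.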

Results and Visualizations

  • Cleaned datasets stored in Amazon S3.
  • Cosine Similarity vs Matching Score plots.
  • Top 10 matched candidates visualized.
  • Scatter plots for Resume vs Job Description Length.
  • Confusion Matrix with heatmap.
  • Boxplots showing Matching Score Distribution.

Challenges and Solutions

  • Data Preprocessing Complexity: Resolved JAR file issues by iterating on the Maven build and debugging the jobs.
  • Cluster Setup: Carefully configured EMR clusters.
  • Handling Data Imbalance: Applied resampling techniques.
  • Semantic Limitations: Binary skill vectors capture only exact skill overlap, so future work includes semantic embeddings.
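
The README does not say which resampling technique was applied. As one simple possibility, random oversampling of the minority class can be sketched as below. The function name `random_oversample` and the toy data are hypothetical.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are the same size.

    Labels must be 0/1. Original rows are kept in order; duplicates are appended.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    if not by_class[minority]:
        raise ValueError("cannot oversample a class with no examples")
    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return list(rows) + extra, list(labels) + [minority] * deficit

# Toy data: three non-matches and one match.
rows = [[0.1], [0.2], [0.3], [0.9]]
labels = [0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Resampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.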

Contributors

| Name | Contributions |
| --- | --- |
| Arun Govindgari | Data Preprocessing, Model Building |
| Chinmai Kaveti | Data Preprocessing, Visualizations |
| Litesh Perumalla | Data Preprocessing, Visualizations |
| Pavan Kalyan Natukula | Model Building, Analysis, Documentation |

All members contributed equally to the success of this project.


License

This project is licensed under the MIT License.


Thank you for exploring our project! Feel free to star ⭐ the repository if you find it useful!
