HIRE FIT is a Big Data and Machine Learning application that automates resume screening and matches candidates to job descriptions. It uses Hadoop, Hive, Amazon S3, and AWS SageMaker to preprocess, store, query, and analyze large volumes of semi-structured data. A binary classification model built with XGBoost predicts job-resume matches based on skill overlap.
- Motivation
- Objectives
- Architecture
- Datasets
- Technology Stack
- Workflow
- Implementation Details
- How to Run
- Results and Visualizations
- Challenges and Solutions
- Contributors
- References
In today's fast-paced recruitment environment, finding the right candidate quickly is critical. By preprocessing resumes and job postings with Big Data technologies and applying machine learning models, HIRE FIT aims to reduce screening time and improve candidate-job fit.
- Preprocess and clean semi-structured resume and job posting datasets.
- Structure the data using Hive and store it in Amazon S3.
- Build binary vectors representing skills.
- Train an XGBoost classifier on AWS SageMaker to predict resume-job matches.
- Evaluate and visualize the model's predictions.
(Diagram Flow):
- Data Ingestion → Hadoop (Cleaning) → Hive (Structuring) → Amazon S3 (Storage) → AWS SageMaker (Feature Engineering + Model Training) → Predictions
- Sources: Bright Data and Hugging Face.
- Job Postings Dataset: Includes company names, job locations, and descriptions.
- Resumes Dataset: Semi-structured resumes categorized by job domain.
- Big Data Tools: Hadoop, Hive
- Cloud Services: Amazon S3, Amazon SageMaker, Amazon EMR
- Programming & Libraries: Java (Maven Project), Python (Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn)
- Machine Learning: XGBoost
- Visualization: Matplotlib, Seaborn
- Data Preprocessing
  - Create Maven projects for both datasets.
  - Run Hadoop MapReduce jobs.
  - Clean, structure, and store the data.
- Data Transformation
  - Create Hive tables.
  - Perform queries: filter, aggregate, and standardize data.
- Storage
  - Save transformed datasets in Amazon S3.
- Feature Engineering
  - Generate binary skill vectors.
  - Calculate Cosine Similarity.
- Model Building
  - Train an XGBoost model on labeled pairs.
  - Optimize thresholds and handle data imbalance.
- Evaluation and Visualization
  - Analyze performance using confusion matrix, scatter plots, and boxplots.
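
The Feature Engineering step can be illustrated with a minimal sketch. It assumes each resume and job posting has already been reduced to a list of skill strings. The vocabulary, the sample skills, and the helper names `to_binary_vector` and `cosine_similarity` are hypothetical and are not taken from the project code.

```python
import math

def to_binary_vector(skills, vocabulary):
    """Return a 0/1 list marking which vocabulary skills are present."""
    present = {s.strip().lower() for s in skills}
    return [1 if skill in present else 0 for skill in vocabulary]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical lowercase vocabulary and inputs, for illustration only.
VOCABULARY = ["python", "java", "sql", "hadoop", "hive"]
resume_vec = to_binary_vector(["Python", "SQL", "Hive"], VOCABULARY)
job_vec = to_binary_vector(["python", "sql", "hadoop"], VOCABULARY)
similarity = cosine_similarity(resume_vec, job_vec)
```

Here the resume and the job share two of three skills each, so the similarity is 2/3. Because the vectors are binary, the score only reflects exact skill overlap.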
Prerequisites:
- AWS account (S3, EMR, SageMaker access)
- Python 3.8+
- Java 8+
- Maven installed
1. Preprocessing and Cleaning:
# Build Maven project
mvn clean install
# Submit Hadoop jobs
hadoop jar your-preprocessing-jar.jar input_path output_path
2. Structuring Data:
-- Create tables in Hive
CREATE TABLE resume (...);
CREATE TABLE job_listings (...);
-- Perform transformations
INSERT INTO standardized_job_listings SELECT ...;
3. Storage:
- Save Hive outputs to Amazon S3.
4. Feature Extraction and Modeling:
# Load data
import pandas as pd
import xgboost as xgb
# Feature engineering
# Train XGBoost classifier
5. Visualization:
# Generate scatter plots, heatmaps, confusion matrices
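
The confusion matrix depends on the decision threshold applied to the predicted match probabilities. The sketch below, written in plain Python, counts the four confusion-matrix cells and picks the candidate threshold with the highest F1 score. The labels, scores, candidate thresholds, and helper names are made up for illustration, and F1 is an assumed selection criterion rather than the one the project necessarily used.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) for scores binarized at the given threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """F1 from confusion counts; 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1 (first wins ties)."""
    def f1_at(t):
        tp, fp, fn, _ = confusion_counts(y_true, scores, t)
        return f1_score(tp, fp, fn)
    return max(candidates, key=f1_at)

# Made-up labels and predicted match probabilities.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]
chosen = best_threshold(y_true, scores, [0.3, 0.5, 0.7])
tp, fp, fn, tn = confusion_counts(y_true, scores, chosen)
```

The four counts can then be arranged as a 2x2 array and passed to a heatmap function for plotting. In practice the threshold should be chosen on a validation split, not on the data used for the final evaluation.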
- Cleaned datasets stored in Amazon S3.
- Cosine Similarity vs Matching Score plots.
- Top 10 matched candidates visualized.
- Scatter plots for Resume vs Job Description Length.
- Confusion Matrix with heatmap.
- Boxplots showing Matching Score Distribution.
- Data Preprocessing Complexity: JAR file issues were resolved through repeated Maven builds and debugging.
- Cluster Setup: Carefully configured EMR clusters.
- Handling Data Imbalance: Applied resampling techniques.
- Semantic Limitations: Matching on binary skill overlap does not capture semantic similarity between skills; future work includes semantic embeddings.
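
The list above mentions resampling without naming a specific technique. As one possible example, the sketch below performs random oversampling of the minority class using only the standard library. The function name `random_oversample` and the sample data are hypothetical, and this may differ from the method the project actually applied.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # The class with fewer samples is treated as the minority.
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    if not by_class[minority]:
        raise ValueError("minority class has no samples to resample")

    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    return list(rows) + extra, list(labels) + [minority] * deficit

# Made-up feature rows: four non-matches (0) and one match (1).
rows = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.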
Name | Contributions |
---|---|
Arun Govindgari | Data Preprocessing, Model Building |
Chinmai Kaveti | Data Preprocessing, Visualizations |
Litesh Perumalla | Data Preprocessing, Visualizations |
Pavan Kalyan Natukula | Model Building, Analysis, Documentation |
All members contributed equally to the success of this project.
- Apache Hadoop Documentation
- Apache Hive Documentation
- Amazon S3 Documentation
- AWS SageMaker Documentation
- XGBoost Documentation
This project is licensed under the MIT License.
Thank you for exploring our project! Feel free to star ⭐ the repository if you find it useful!