This project predicts employee salaries based on features such as education level, gender, age, years of experience, and job title. It demonstrates a complete end-to-end machine learning pipeline using Python and scikit-learn.
The dataset contains the following columns:
Gender
Education Level
Job Title
Years of Experience
Age
Salary
(Target)
- Checked for missing values
- Dropped rows with <1% missing data
- Standardized inconsistent categorical values (e.g., “PhD” vs “phD”)
- Statistical summaries
- Distribution plots for numerical & categorical variables
- Identified outliers and data imbalance
OneHotEncoder
for categorical columnsStandardScaler
for numeric features (when required)- Combined using
ColumnTransformer
andPipeline
Tested multiple models:
- Linear Regression
- Random Forest Regressor
- Decision Tree Regressor
- ElasticNet
- KNN
- SVR
- XGBoost
Metrics used:
- MAE
- MSE
- RMSE
- R² Score
Cross-validation applied for robustness.
- Used
GridSearchCV
- Best model saved using
joblib
- Model simulation prediction