Welcome to our submission for the Credit Underwriting Innovation Hackathon. Our system estimates the income and repayment capability of financially underserved individuals using non-banking, privacy-compliant data sources. It is built with an emphasis on modularity, explainability, and real-world deployability.
The objective is to predict an individual’s creditworthiness in the absence of formal income declarations, using publicly available features and a two-stage machine learning model.
- Goal: Predict emi_to_income as a proxy for income using non-sensitive, publicly observable features.
- Model: CatBoostRegressor trained on synthetic EMI and alternate data.
- Input Features: Region, employment patterns, loan counts, balance-to-limit ratios, etc.
- Goal: Use predicted emi_to_income + engineered financial features to predict target_income.
- Model: CatBoostRegressor fine-tuned on aligned features with robust validation.
- Safety: No use of personally identifiable or directly declared income data.
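The two stages above can be sketched as a small wrapper that feeds the stage-one prediction into stage two as an extra feature column. This is a minimal sketch, not the project's training code: `TwoStageRegressor` and `MeanRegressor` are hypothetical names, and `MeanRegressor` is a dependency-free stand-in for `CatBoostRegressor` (any object exposing `fit(X, y)` and `predict(X)` should slot in the same way). The toy numbers are made up for illustration.

```python
from typing import List, Protocol, Sequence


class Regressor(Protocol):
    """Anything with fit/predict, e.g. a CatBoostRegressor."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> object: ...
    def predict(self, X: Sequence[Sequence[float]]) -> Sequence[float]: ...


class MeanRegressor:
    """Hypothetical stand-in model: always predicts the training-target mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class TwoStageRegressor:
    """Stage 1 predicts emi_to_income; stage 2 uses it to predict target_income."""

    def __init__(self, proxy_model: Regressor, main_model: Regressor):
        self.proxy_model = proxy_model
        self.main_model = main_model

    def _augment(self, X) -> List[List[float]]:
        # Append the predicted emi_to_income as an extra feature column.
        proxy_pred = self.proxy_model.predict(X)
        return [list(row) + [float(p)] for row, p in zip(X, proxy_pred)]

    def fit(self, X, emi_to_income, target_income):
        self.proxy_model.fit(X, emi_to_income)
        self.main_model.fit(self._augment(X), target_income)
        return self

    def predict(self, X):
        return list(self.main_model.predict(self._augment(X)))


# Toy data: two features per applicant (illustrative values only).
X = [[0.2, 3.0], [0.5, 1.0], [0.8, 4.0]]
emi_ratio = [0.10, 0.30, 0.20]
income = [30000.0, 50000.0, 40000.0]

model = TwoStageRegressor(MeanRegressor(), MeanRegressor()).fit(X, emi_ratio, income)
predictions = model.predict(X[:2])
```

One design point worth noting: this sketch trains stage two on in-sample stage-one predictions for brevity. Using out-of-fold predictions instead is a common way to reduce leakage between the stages.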
Install dependencies using:
pip install -r requirements.txt
A machine learning pipeline to estimate EMI-to-Income ratio and underwrite credit applicants using bureau data.
Developed during the Credit Underwriting Innovation Hackathon by Team: Teen Titans
Lead Developer & Model Architect: Kanishka Kumar Singh
Credit_UnderWriting_Model/
├── data/
│ ├── bureau_data_10000_without_target.csv
│ └── participant_col_mapping.csv
├── outputs/
│ └── final_predictions.csv
├── models/
│ ├── catboost_model.pkl
│ └── emi_to_income_proxy_model.pkl
├── scripts/
│ ├── train_proxy_model.py
│ ├── train_main_model.py
│ └── run_inference.py
├── requirements.txt
└── README.md
Model Stage | Metric | Value |
---|---|---|
Proxy Model | RMSE | ~0.00003 |
Final Model | R² Score | ~0.68–0.82 (varied) |
⚠️ Note: Accuracy degraded on blind test due to schema drift and noisy categorical encodings.
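For reference, the two metrics in the table can be computed as follows. This is a minimal pure-Python sketch of the standard definitions; the function names and sample values are illustrative, and the project's own evaluation code is not shown in this README.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not project results.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
error = rmse(y_true, y_pred)
fit_quality = r2_score(y_true, y_pred)
```

RMSE is expressed in the units of the target, so a very small RMSE on a ratio such as emi_to_income is not directly comparable to an R² reported on target_income.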
- debt_to_credit: Ratio of outstanding debt to credit limit.
- pin_region: First two digits of pin code to map regional economics.
- count_features: Total active accounts, limits, balances.
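The engineered features above could be derived along these lines. This is a sketch under assumptions: the input keys (`outstanding_balance`, `credit_limit`, `pin_code`, `active_accounts`) are hypothetical placeholders rather than the actual bureau column names, and returning `None` for a non-positive credit limit is an illustrative choice, not something the project specifies.

```python
from typing import Dict, Optional


def debt_to_credit(outstanding_balance: float, credit_limit: float) -> Optional[float]:
    """Ratio of outstanding debt to credit limit; None when the limit is not positive."""
    if credit_limit is None or credit_limit <= 0:
        return None
    return outstanding_balance / credit_limit


def pin_region(pin_code) -> Optional[str]:
    """First two digits of the pin code, used as a coarse regional key."""
    digits = "".join(ch for ch in str(pin_code) if ch.isdigit())
    return digits[:2] if len(digits) >= 2 else None


def engineer_features(row: Dict[str, object]) -> Dict[str, object]:
    """Build the engineered features for one applicant record."""
    return {
        "debt_to_credit": debt_to_credit(
            float(row["outstanding_balance"]), float(row["credit_limit"])
        ),
        "pin_region": pin_region(row["pin_code"]),
        "active_accounts": int(row["active_accounts"]),
    }


# Hypothetical applicant record (values are made up).
example = engineer_features(
    {
        "outstanding_balance": 25000,
        "credit_limit": 100000,
        "pin_code": "560034",
        "active_accounts": 3,
    }
)
# example == {"debt_to_credit": 0.25, "pin_region": "56", "active_accounts": 3}
```

Keeping `pin_region` as a string rather than an integer makes it easier to treat as a categorical feature consistently between training and inference.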
Used SHAP (SHapley Additive exPlanations) for interpreting model predictions.
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)
shap.summary_plot(shap_values, X_sample)
To generate predictions:
python scripts/run_inference.py
- Loads test CSV and applies preprocessing
- Predicts emi_to_income using proxy model
- Applies imputations and final prediction
- Saves output to outputs/final_predictions.csv
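The steps listed above can be sketched end to end as below. This is not the contents of scripts/run_inference.py: `run_inference`, the plain callables standing in for the two saved models, the median imputation strategy, and the column names are all assumptions made for illustration.

```python
import csv
import io
import statistics
from typing import Callable, Dict, List, Optional, TextIO

Row = Dict[str, Optional[float]]


def load_rows(source: TextIO, feature_names: List[str]) -> List[Row]:
    """Load the test CSV and parse features, keeping blank cells as None."""
    rows: List[Row] = []
    for record in csv.DictReader(source):
        rows.append({
            name: float(record[name]) if (record.get(name) or "").strip() else None
            for name in feature_names
        })
    return rows


def impute_median(rows: List[Row], feature_names: List[str]) -> None:
    """Fill missing values in place with the column median (assumed strategy)."""
    for name in feature_names:
        observed = [r[name] for r in rows if r[name] is not None]
        fill = statistics.median(observed) if observed else 0.0
        for r in rows:
            if r[name] is None:
                r[name] = fill


def run_inference(
    source: TextIO,
    sink: TextIO,
    feature_names: List[str],
    proxy_model: Callable[[Row], float],
    main_model: Callable[[Row], float],
) -> List[float]:
    rows = load_rows(source, feature_names)
    # Stage 1: the predicted emi_to_income becomes an input to stage 2.
    for row in rows:
        row["emi_to_income"] = proxy_model(row)
    impute_median(rows, feature_names)
    # Stage 2: final prediction on the imputed, augmented rows.
    predictions = [main_model(row) for row in rows]
    writer = csv.writer(sink)
    writer.writerow(["target_income"])
    writer.writerows([p] for p in predictions)
    return predictions


# In-memory stand-ins for the input and output files; the models are toy lambdas.
test_csv = io.StringIO("debt_to_credit,active_accounts\n0.25,3\n,5\n0.75,1\n")
output = io.StringIO()
predictions = run_inference(
    test_csv,
    output,
    ["debt_to_credit", "active_accounts"],
    proxy_model=lambda r: r["active_accounts"] / 10,
    main_model=lambda r: 1000.0 * r["emi_to_income"] + 100.0 * r["debt_to_credit"],
)
```

With real files, `source` and `sink` would be file handles opened with `newline=""`, with the sink pointing at outputs/final_predictions.csv. Note that the stand-in proxy model here only reads a column with no missing values; a model that cannot handle missing inputs would need imputation before stage one as well.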
- ✅ No personal income data used
- ✅ No user-identifiable information
- ✅ Focused on transparency and fairness
- Kanishka Kumar Singh – Lead ML Developer & Model Architect
- Varun Kant – Frontend Developer
- Pranjal Agarwal – Lead Developer
- Naman Jaju – Backend Developer
- Organized by: [LenDenClub]
- Dataset provided by: [LenDenClub]
- Special thanks to mentors, professors, and open-source contributors ❤️
- Data quality > model complexity
- Pipeline alignment & categorical consistency are critical
- SHAP is 🔥 for debugging blind spots
- And most importantly… sleep is underrated 😴
- All required packages are listed in requirements.txt (e.g., CatBoost, SHAP, pandas, NumPy).
- Trained models are saved in joblib format in the models/ directory.