AUC-ROC Score: 0.96037
Competition Link | GitHub Code
- Objective: Predict loan approval probability (`loan_status`) using financial and demographic features.
- Challenge: Synthetic dataset mimicking real-world loan-approval patterns while protecting test labels.
- Evaluation: AUC-ROC (Area Under the ROC Curve), well suited to imbalanced binary classification.
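The competition metric can be computed directly with scikit-learn's `roc_auc_score`; a minimal sketch on toy labels and scores (illustrative values, not the competition data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative only).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# AUC-ROC: probability that a random positive is ranked above
# a random negative; threshold-free, so it suits imbalanced data.
auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))  # → 0.889
```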
## Feature Engineering

- Created `loan_amnt_to_income`: debt-to-income ratio.
  Why? Directly measures repayment capacity.
- Added `emp_length_credit_ratio`: employment length relative to credit history length.
  Why? Captures the stability of a financial profile.
- Alternatives considered:
  - Income bucketing (rejected: loses granularity)
  - Loan term features (not available in the data)
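The two engineered ratios can be sketched in pandas; column names are assumed from the competition schema, and the `+ 1` denominator smoothing is an assumption to guard against empty credit histories, not necessarily the author's exact formula:

```python
import pandas as pd

# Toy frame; column names assumed from the competition schema.
df = pd.DataFrame({
    "loan_amnt": [10000, 5000],
    "person_income": [50000, 25000],
    "person_emp_length": [4.0, 10.0],
    "cb_person_cred_hist_length": [2, 5],
})

# Debt-to-income ratio: direct proxy for repayment capacity.
df["loan_amnt_to_income"] = df["loan_amnt"] / df["person_income"]

# Employment length relative to credit history length
# (+1 is an assumed guard against division by zero).
df["emp_length_credit_ratio"] = (
    df["person_emp_length"] / (df["cb_person_cred_hist_length"] + 1)
)

print(df[["loan_amnt_to_income", "emp_length_credit_ratio"]])
```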
## Preprocessing

- Numerical features:
  - Median imputation (robust to outliers)
  - Standard scaling (needed for linear baselines; harmless for tree ensembles)
- Categorical features:
  - Mode imputation (preserves category distribution)
  - One-hot encoding (avoids ordinal assumptions)
- Alternatives rejected:
  - Target encoding (risk of overfitting)
  - KNN imputation (computationally expensive)
## Model Choice: XGBoost

- Handles mixed data types effectively
- Native support for missing values
- Robust to moderate overfitting
- Why not alternatives:
  - Logistic regression: poor with non-linear relationships
  - Random forest: less tunable than XGBoost
  - Neural networks: overkill for tabular data
## Hyperparameter Tuning

- RandomizedSearchCV: sampled roughly 10% of the parameter space.
  Why? About 5x faster than GridSearchCV with comparable results.
- StratifiedKFold: maintains class balance in each split.
- Key parameters tuned:
  - `learning_rate`: balances training speed vs. accuracy
  - `colsample_bytree`: controls feature randomness per tree
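The search setup can be sketched as follows. The parameter ranges and data are illustrative, and scikit-learn's `GradientBoostingClassifier` stands in so the sketch runs without xgboost installed; for the real pipeline, substitute `xgboost.XGBClassifier` (which exposes `colsample_bytree` rather than `subsample` for column sampling):

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Toy imbalanced data standing in for the loan table.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.7, 0.3], random_state=0)

# Illustrative distributions, not the competition's actual grid.
param_dist = {
    "learning_rate": uniform(0.01, 0.09),  # speed vs. accuracy trade-off
    "max_depth": [2, 3, 4],
    "subsample": uniform(0.6, 0.4),        # row subsampling per tree
}

# StratifiedKFold keeps the class ratio identical in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,            # sample only a small slice of the space
    scoring="roc_auc",   # optimize the competition metric directly
    cv=cv,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```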
## Class Imbalance

- Original dataset had a ~30% rejection rate
- Addressed via stratified sampling rather than SMOTE (preserves the natural distribution)
## Feature Importance

- Top predictors:
  - `loan_percent_income`
  - `loan_int_rate`
  - `person_income`
- Engineered features ranked in the top 10
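Importance rankings like the one above come from the fitted model's `feature_importances_` attribute, which xgboost exposes just like scikit-learn's tree ensembles. A sketch on toy data (feature names assumed from the schema; `GradientBoostingClassifier` stands in for the XGBoost model):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data; names assumed from the competition schema.
names = ["loan_percent_income", "loan_int_rate",
         "person_income", "loan_amnt_to_income"]
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Normalized importances, sorted descending for a ranking table.
imp = pd.Series(model.feature_importances_, index=names)
imp = imp.sort_values(ascending=False)
print(imp)
```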
## Threshold Optimization

- Default 0.5 threshold maintained.
  Why? ROC analysis showed a balanced TPR/FPR trade-off at this level.
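Checking the TPR/FPR trade-off at a candidate threshold is a short exercise with scikit-learn's `roc_curve`; a sketch on toy labels and scores (illustrative values):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted scores (illustrative only).
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.2, 0.4, 0.6, 0.7, 0.8, 0.3, 0.1, 0.9])

# roc_curve returns the (fpr, tpr, threshold) triples a ROC
# analysis would inspect when choosing an operating point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Rates at the default 0.5 cut-off specifically:
pred = y_score >= 0.5
tp = ((pred == 1) & (y_true == 1)).sum()
fp = ((pred == 1) & (y_true == 0)).sum()
tpr_05 = tp / (y_true == 1).sum()
fpr_05 = fp / (y_true == 0).sum()
print(tpr_05, fpr_05)  # → 0.75 0.25
```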
1. Environment setup:

   ```bash
   pip install pandas scikit-learn xgboost matplotlib seaborn
   ```

2. Data preparation:
   - Download `train.csv` and `test.csv` from Kaggle
   - Place them in the project root directory

3. Run the model:

   ```bash
   jupyter notebook loan_approval_prediction.ipynb
   ```

4. Expected outputs:
   - `submission.csv`: final predictions
   - Feature importance plots in the notebook
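The submission file follows the usual Kaggle two-column format; a minimal sketch with hypothetical ids and probabilities (the actual id values and column header should match the competition's `sample_submission.csv`):

```python
import pandas as pd

# Hypothetical predicted probabilities for two test-set rows.
submission = pd.DataFrame({
    "id": [1001, 1002],
    "loan_status": [0.87, 0.12],  # predicted approval probability
})

# index=False keeps the file to exactly the two expected columns.
submission.to_csv("submission.csv", index=False)
```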
| Alternative Approach | Reason for Exclusion |
|---|---|
| CatBoost | Minimal AUC gain (<0.002) in validation |
| Stacking Models | Complexity vs. ROI analysis unfavorable |
| Feature Selection | XGBoost's inherent selection sufficient |
| Deep Learning | Limited data (~10k rows) |
| Cost-Sensitive Learning | Class imbalance not severe enough |
- 0.96037 AUC: among the top competition entries
- Critical success factors:
  - Debt-to-income ratio engineering
  - Careful handling of missing values
  - Learning-rate tuning over the 0.01-0.1 range
MIT License - Code free for academic/commercial use with attribution