- Overview
- Data Source
- Data Preprocessing
- Exploratory Data Analysis
- Model Selection
- Technical Architecture
- Dependencies
- Deployment
- Contributing
- Acknowledgements
- License
- Contact
Problem Statement:
Financial institutions face risks when granting personal loans, as there is always a possibility of default. This project aims to develop a machine learning model to predict loan default likelihood, enabling banks to make more informed lending decisions. The dataset includes customers' demographic details, previous loan history, and information on their repeat loans, for which we need to predict performance.
The project involves key steps such as data preprocessing, feature engineering, exploratory data analysis (EDA), feature selection, and model selection. Various classification models are implemented and optimized to identify the most suitable approach for accurate and reliable loan default prediction.
Use Cases:
Banks and financial institutions rely on credit risk assessment to determine borrowers' likelihood of defaulting on a loan. An inaccurate assessment can lead to significant financial losses or missed lending opportunities. This project provides a machine learning-based loan default prediction system that helps lenders make data-driven decisions. By analyzing customer demographics, loan history, and repeat loan behaviours, the model identifies high-risk borrowers, enabling banks to take proactive measures such as adjusting interest rates, requiring additional guarantees, or rejecting risky applications. This improves risk management, enhances financial stability, and ensures fairer lending practices, benefiting both financial institutions and borrowers.
The datasets used in this project are from Zindi's Loan Default Prediction Challenge, publicly available on Zindi and in the Data folder of this project. Three datasets will be used: demographic, previous loans, and performance data. The features in each dataset are shown in the ER diagram below. Note that all the datasets can be joined using the customerid feature.
Given three separate datasets, the first step was merging, unifying, and creating a single dataset for building the loan default prediction model. This involved:
- dealing with missing values and changing data types,
- engineering new features,
- aggregating certain features, such as the number of loans, amount due, and loan term for each customer, and
- finally, merging all three datasets by customerid.
The first part (Data Ingestion) of the notebook.ipynb notebook focuses on data preprocessing.
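The preprocessing steps above can be sketched in pandas. The toy data below is illustrative; column names are modelled on the feature list given later, and the real tables live in the Data folder.

```python
import pandas as pd

# Hypothetical miniatures of the three datasets, joinable on customerid.
demographics = pd.DataFrame({
    "customerid": ["a1", "a2"],
    "age": [34, 29],
    "bank_name_clients": ["GT Bank", "UBA"],
})
prev_loans = pd.DataFrame({
    "customerid": ["a1", "a1", "a2"],
    "loannumber": [1, 2, 1],
    "totaldue": [11500.0, 23000.0, 11500.0],
    "termdays": [30, 30, 15],
})
performance = pd.DataFrame({
    "customerid": ["a1", "a2"],
    "good_bad_flag": ["Good", "Bad"],
})

# Aggregate each customer's previous-loan history into per-customer features.
agg = prev_loans.groupby("customerid").agg(
    loannumber_max=("loannumber", "max"),
    totaldue_min=("totaldue", "min"),
    termdays_min=("termdays", "min"),
).reset_index()

# Join everything on customerid into a single modelling table.
df = demographics.merge(agg, on="customerid").merge(performance, on="customerid")
print(df.shape)  # one row per customer
```

The named-aggregation form of `groupby().agg()` keeps the derived column names explicit, which makes the later feature list easier to trace back to its source tables.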
Once the dataset is in good shape, the next step is EDA to ensure the quality of the data and extract preliminary insights into the problem. Below are plots that convey some insights from the EDA.
- There is a class imbalance: the majority (76%) of the instances in the dataset are good, while a minority (24%) are bad.
- Given this class imbalance, metrics like AUC, recall, and F1 are more suitable than plain accuracy.
- Most loans are for low amounts (10,000 naira) and 30-day terms.
- The distribution of most of the features is not Gaussian (age is the closest to having a Gaussian distribution).
- For the categorical variables:
  - Most applicants have savings accounts (64%), are in permanent employment (68%), and bank with GT Bank (37%).
The second part (Exploratory Data Analysis) of the notebook.ipynb notebook focuses on EDA.
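A minimal sketch of the class-balance check that motivates the metric choice. The toy series below reproduces the 76/24 split; in the notebook, the real `good_bad_flag` column from the merged dataset would be used instead.

```python
import pandas as pd

# Stand-in target column with the same 76/24 split reported in the EDA.
flags = pd.Series(["Good"] * 76 + ["Bad"] * 24, name="good_bad_flag")

# Class proportions: when one class dominates, accuracy alone is
# misleading, hence the preference for AUC, recall, and F1.
proportions = flags.value_counts(normalize=True)
print(proportions)
```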
For this project, nine classification algorithms were compared before selecting a final one. They are:
- Logistic regression with lasso regularization
- Support vector machine
- Stochastic gradient descent
- K-Nearest Neighbours
- Naive Bayes
- Decision Tree
- Random Forest
- Gradient Boosting
- XGBoost
Based on the AUC and accuracy scores, the XGBoost Classifier was the best-performing algorithm and was therefore selected to build the final model. The code for model selection is available in notebook.ipynb under the Model Selection section.
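A sketch of what such a comparison loop can look like, using a synthetic imbalanced dataset and three of the nine candidates; the actual notebook may structure this differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the merged loan dataset (~76/24 imbalance).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.76], random_state=42
)

candidates = {
    "logreg_lasso": LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Score each candidate on cross-validated ROC AUC, the primary
# metric under class imbalance, and rank them.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {auc:.3f}")
```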
- Algorithm: XGBoost Classifier
- Features: totaldue, termdays, loannumber_max, totaldue_min, termdays_min, first_payment_default_sum, longitude_gps, latitude_gps, age, is_GT Bank, is_First Bank, is_Access_Diamond, is_UBA, is_Zenith Bank, is_Permanent, is_Self-Employed, bank_account_num
- Target Variable: good_bad_flag
- Model Performance: AUC: 0.64 and Accuracy: 0.79
- App Framework: Streamlit
- ML Framework: Scikit-learn
- Model: XGBoost Classifier (max_depth=5, n_estimators=400, min_child_weight=4, gamma=0.1)
- Deployment: Streamlit
- API Protocol: REST
- Input/Output: JSON
- Python 3.11
- scikit-learn
- numpy
- pandas
- Streamlit
- plotly
The use case for this model focuses on existing customers with previous loan history, and aggregates of this previous-loan data are part of the features used to train the model. So, to test the model, all that is needed is the customerid, for which the test.py script will fetch the associated demographic and previous loan data, preprocess it, and make a prediction. Test customerids can be found in testperf.csv in the data folder.
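A hypothetical sketch of the lookup-and-featurize flow that test.py performs for a single customerid; the function name and column choices here are illustrative, not the script's actual API.

```python
import pandas as pd

def build_features(customerid, demographics, prev_loans):
    """Fetch one customer's rows and aggregate their loan history."""
    demo = demographics.loc[demographics["customerid"] == customerid]
    hist = prev_loans.loc[prev_loans["customerid"] == customerid]
    feats = demo.copy()
    feats["loannumber_max"] = hist["loannumber"].max()
    feats["totaldue_min"] = hist["totaldue"].min()
    return feats  # one row, ready for the trained model

# Toy tables standing in for the files in the data folder.
demographics = pd.DataFrame({"customerid": ["c1"], "age": [41]})
prev_loans = pd.DataFrame({
    "customerid": ["c1", "c1"],
    "loannumber": [1, 2],
    "totaldue": [11500.0, 34500.0],
})

row = build_features("c1", demographics, prev_loans)
print(row[["age", "loannumber_max", "totaldue_min"]].to_dict("records"))
```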
- Clone the repository:
git clone https://github.com/F-U-Njoku/loan-default-prediction
cd loan-default-prediction
- Install dependencies using Pipenv:
pip install pipenv
pipenv install
- Run the application:
pipenv run streamlit run app.py
- Use the application:
Try some of these customer IDs:
8a858f5b5bee1b11015bf1b4ffea5abb
8a858f3e5885ffa301588ccdf1b437ef
8a858faf56b7821c0156cdaa248222fd
8a858f305c8dd672015c93b1db645db4
8a8589c253ace09b0153af6ba58f1f31
8a858e225a28c713015a30db5c48383d
- Build the Docker image:
docker build -t loan-default-app .
- Run the container:
docker run -p 8501:8501 loan-default-app
- Use the application:
Now, visit http://localhost:8501 in your browser to access your Streamlit app.
To use the app in the cloud, visit the Streamlit website where the model is deployed and try the sample customer IDs above.
I welcome contributions to the Loan Default Application project! Here's how you can help:
- Fork the repository
- Create a new branch (git checkout -b feature/AmazingFeature)
- Make your changes
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Please ensure your code adheres to the project's coding standards and includes appropriate tests.
Special thanks to DataTalks.Club for providing a practical and free course on Machine Learning. Gratitude to Alexey and the entire team for their efforts.
This project is licensed under the MIT License - see the LICENSE file for details.
- LinkedIn: Uchechukwu Njoku