- Overview
- Data Source
- Data Preprocessing
- Exploratory Data Analysis
- Model Selection
- Technical Architecture
- Dependencies
- Deployment
- Contributing
- Acknowledgements
- License
- Contact
Problem Statement:
Financial institutions face risks when granting personal loans, as there is always a possibility of default. This project aims to develop a machine learning model to predict loan default likelihood, enabling banks to make more informed lending decisions. The dataset includes customers' demographic details, previous loan history, and information on their repeat loans, for which we need to predict performance.
The project involves key steps such as data preprocessing, feature engineering, exploratory data analysis (EDA), feature selection, and model selection. Various classification models are implemented and optimized to identify the most suitable approach for accurate and reliable loan default prediction.
Use Cases:
Banks and financial institutions rely on credit risk assessment to determine borrowers' likelihood of defaulting on a loan. An inaccurate assessment can lead to significant financial losses or missed lending opportunities. This project provides a machine learning-based loan default prediction system that helps lenders make data-driven decisions. By analyzing customer demographics, loan history, and repeat loan behaviours, the model identifies high-risk borrowers, enabling banks to take proactive measures such as adjusting interest rates, requiring additional guarantees, or rejecting risky applications. This improves risk management, enhances financial stability, and ensures fairer lending practices, benefiting both financial institutions and borrowers.
The datasets used in this project are from Zindi's Loan Default Prediction Challenge, publicly available on Zindi and in the Data folder of this project. Three datasets will be used: demographic, previous loans, and performance data. The features in each dataset are shown in the ER diagram below. Note that all the datasets can be joined using the customerid feature.
Given three separate datasets, the first step was merging, unifying, and creating a single dataset for building the loan default prediction model. This involved:
- dealing with missing values and changing data types,
- engineering new features,
- aggregating certain features, such as the number of loans, amount due, and loan term for each customer, and
- finally, merging all three datasets by customerid.
The first part (Data Ingestion) of the notebook.ipynb notebook focuses on data preprocessing.
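The preprocessing steps above can be sketched in pandas. The toy data below is illustrative; column names are modelled on the feature list given later, and the real tables live in the Data folder.

```python
import pandas as pd

# Hypothetical miniatures of the three datasets, joinable on customerid.
demographics = pd.DataFrame({
    "customerid": ["a1", "a2"],
    "age": [34, 29],
    "bank_name_clients": ["GT Bank", "UBA"],
})
prev_loans = pd.DataFrame({
    "customerid": ["a1", "a1", "a2"],
    "loannumber": [1, 2, 1],
    "totaldue": [11500.0, 23000.0, 11500.0],
    "termdays": [30, 30, 15],
})
performance = pd.DataFrame({
    "customerid": ["a1", "a2"],
    "good_bad_flag": ["Good", "Bad"],
})

# Aggregate each customer's previous-loan history into per-customer features.
agg = prev_loans.groupby("customerid").agg(
    loannumber_max=("loannumber", "max"),
    totaldue_min=("totaldue", "min"),
    termdays_min=("termdays", "min"),
).reset_index()

# Join everything on customerid into a single modelling table.
df = demographics.merge(agg, on="customerid").merge(performance, on="customerid")
print(df.shape)  # one row per customer
```

The named-aggregation form of `groupby().agg()` keeps the derived column names explicit, which makes the later feature list easier to trace back to its source tables.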
Once the dataset is in good shape, the next step is EDA to ensure the quality of the data and extract preliminary insights into the problem. Below are plots that convey some insights from the EDA.
- There is a class imbalance: the majority (76%) of the instances in the dataset are good, while a minority (24%) are bad.
- Given this class imbalance, metrics like AUC, recall, and F1 are more suitable than plain accuracy.
- Most loans are for low amounts (10,000 naira) and 30-day terms.
- The distribution of most of the features is not Gaussian (age is the closest to having a Gaussian distribution).
- For the categorical variables:
  - Most applicants have savings accounts (64%), are in permanent employment (68%), and bank with GT Bank (37%).
The second part (Exploratory Data Analysis) of the notebook.ipynb notebook focuses on EDA.
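A minimal sketch of the class-balance check that motivates the metric choice. The toy series below reproduces the 76/24 split; in the notebook, the real `good_bad_flag` column from the merged dataset would be used instead.

```python
import pandas as pd

# Stand-in target column with the same 76/24 split reported in the EDA.
flags = pd.Series(["Good"] * 76 + ["Bad"] * 24, name="good_bad_flag")

# Class proportions: when one class dominates, accuracy alone is
# misleading, hence the preference for AUC, recall, and F1.
proportions = flags.value_counts(normalize=True)
print(proportions)
```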
For this project, nine classification algorithms were compared before selecting a final one. They are:
- Logistic regression with lasso regularization
- Support vector machine
- Stochastic gradient descent
- K-Nearest Neighbours
- Naive Bayes
- Decision Tree
- Random Forest
- Gradient Boosting
- XGBoost
Based on the AUC and accuracy scores, the XGBoost Classifier was the best-performing algorithm and was therefore selected to build the final model. The code for model selection is available in notebook.ipynb under the Model Selection section.
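A sketch of what such a comparison loop can look like, using a synthetic imbalanced dataset and three of the nine candidates; the actual notebook may structure this differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the merged loan dataset (~76/24 imbalance).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.76], random_state=42
)

candidates = {
    "logreg_lasso": LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Score each candidate on cross-validated ROC AUC, the primary
# metric under class imbalance, and rank them.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {auc:.3f}")
```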
- Algorithm: XGBoost Classifier
- Features: totaldue, termdays, loannumber_max, totaldue_min, termdays_min, first_payment_default_sum, longitude_gps, latitude_gps, age, is_GT Bank, is_First Bank, is_Access_Diamond, is_UBA, is_Zenith Bank, is_Permanent, is_Self-Employed, bank_account_num
- Target Variable: good_bad_flag
- Model Performance: AUC: 0.64 and Accuracy: 0.79
- App Framework: Streamlit
- ML Framework: Scikit-learn
- Model: XGBoost Classifier (max_depth=5, n_estimators=400, min_child_weight=4, gamma=0.1)
- Deployment: Streamlit
- API Protocol: REST
- Input/Output: JSON
- Python 3.11
- scikit-learn
- numpy
- pandas
- Streamlit
- plotly
The use case for this model focuses on existing customers with previous loan history, and aggregates of this previous-loan data are part of the features used to train the model. So, to test the model, all that is needed is the customerid, for which the test.py script will fetch the associated demographic and previous loan data, preprocess it, and make a prediction. Test customerids can be found in testperf.csv in the data folder.
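A hypothetical sketch of the lookup-and-featurize flow that test.py performs for a single customerid; the function name and column choices here are illustrative, not the script's actual API.

```python
import pandas as pd

def build_features(customerid, demographics, prev_loans):
    """Fetch one customer's rows and aggregate their loan history."""
    demo = demographics.loc[demographics["customerid"] == customerid]
    hist = prev_loans.loc[prev_loans["customerid"] == customerid]
    feats = demo.copy()
    feats["loannumber_max"] = hist["loannumber"].max()
    feats["totaldue_min"] = hist["totaldue"].min()
    return feats  # one row, ready for the trained model

# Toy tables standing in for the files in the data folder.
demographics = pd.DataFrame({"customerid": ["c1"], "age": [41]})
prev_loans = pd.DataFrame({
    "customerid": ["c1", "c1"],
    "loannumber": [1, 2],
    "totaldue": [11500.0, 34500.0],
})

row = build_features("c1", demographics, prev_loans)
print(row[["age", "loannumber_max", "totaldue_min"]].to_dict("records"))
```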
- Clone the repository:
git clone https://github.com/F-U-Njoku/loan-default-prediction
cd loan-default-prediction
- Install dependencies using Pipenv:
pip install pipenv
pipenv install
- Run the application:
pipenv run streamlit run app.py
- Use the application:
Try some of these customer IDs:
8a858f5b5bee1b11015bf1b4ffea5abb
8a858f3e5885ffa301588ccdf1b437ef
8a858faf56b7821c0156cdaa248222fd
8a858f305c8dd672015c93b1db645db4
8a8589c253ace09b0153af6ba58f1f31
8a858e225a28c713015a30db5c48383d
- Build the Docker image:
docker build -t loan-default-app .
- Run the container:
docker run -p 8501:8501 loan-default-app
- Use the application:
Now, visit http://localhost:8501 in your browser to access your Streamlit app.
To use the app in the cloud, visit the Streamlit website where the model is deployed and try the sample customer IDs above.
I welcome contributions to the Loan Default Application project! Here's how you can help:
- Fork the repository
- Create a new branch (git checkout -b feature/AmazingFeature)
- Make your changes
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Please ensure your code adheres to the project's coding standards and includes appropriate tests.
Special thanks to DataTalks.Club for providing a practical and free course on Machine Learning. Gratitude to Alexey and the entire team for their efforts.
This project is licensed under the MIT License - see the LICENSE file for details.
- LinkedIn: Uchechukwu Njoku