This document serves as a comprehensive guide for the credit default prediction model developed using Machine Learning (ML) techniques and visualized through Power BI. The model aims to predict the likelihood of loan defaults based on various borrower attributes and loan characteristics, providing valuable insights for financial institutions. Credit default prediction is a critical task in the financial industry. Acurate prediction of credit defaults enables financial institutions to minimize losses, optimize lending decisions, and improve overall portfolio risk management.
The primary objective of this project is to develop a robust predictive model that can accurately assess the risk of loan defaults. This is achieved by utilizing ML algorithms, specifically Logistic Regression and Random Forest, to analyze historical data. The results are then visualized in Power BI to facilitate decision-making processes for stakeholders.
Loading the Dataset: The dataset was collected from the Kaggle Website. Reading the Dataset: The dataset was read into a Pandas DataFrame for further analysis and manipulation.
Dataset Link - https://www.kaggle.com/datasets/laotse/credit-risk-dataset/data
Dropping Data: Unnecessary or irrelevant rows are removed based on specific criteria.
Handling Missing Values: Missing values in numerical columns are filled with their respective mean values.
Categorical to Numerical Encoding:Categorical variables are transformed into numerical format using encoding techniques like one-hot encoding.
Feature Engineering: Relevant features are selected and concatenated with the dataset for further analysis.
The dataset was split into training and testing sets using a 80:20 ratio. The training set was used to train the machine learning models, while the testing set was used to evaluate their performance on unseen data.
Data scaling is performed to normalize feature values, ensuring that the model trains effectively without bias towards any specific feature due to its scale. The model trains effectively and prevents features with larger scales from dominating the learning process.
Logistic Regression :
A Logistic Regression model was trained on the training data. The models are trained on the scaled training data. The trained model was used to predict the probability of default for the test data. Accuracy, precision, recall and F1 score are calculated using the test data to assess model performance.
Random Forest :
A Random Forest model was trained on the training data. The models are trained on the scaled training data. The trained model was used to predict the probability of default for the test data. Accuracy, precision, recall and F1 score are calculated using the test data to assess model performance.
Feature importance was analyzed to identify the most influential factors contributing to credit default predictions. A new feature is created containing true indices alongside predicted values for analysis.
The final datasets with predicted probabilities, actual labels, and features was saved as an Excel file.
The saved Excel file is imported into Power BI to create a visual dashboard that includes:
Accuracy Metrics: Displaying model accuracy scores.
Loan Status Scores: Visual representation of predicted loan statuses.
Charts: Various charts illustrating key insights derived from the model predictions.
This comprehensive approach ensures that stakeholders can visualize and interpret the credit risk assessment effectively, aiding in informed decision-making regarding loan approvals and risk management strategies.
The project successfully demonstrated the development and evaluation of machine learning models for credit default prediction. By combining Python for model training and Power BI for visualization and analysis, I gained valuable insights into the factors driving credit risk. The findings from this project can inform lending decisions, optimize risk assessment processes, and ultimately enhance the profitability and stability of financial institutions.