Predictive Analysis Using Social Profile in Online P2P Lending

1. Introduction

Prosper is a global market lending platform that has funded over $8 billion in loans. Prosper enables people to invest in one another in a financially and socially rewarding way. Borrowers list loan requests ranging from $2,000 to $35,000 on Prosper, and individual investors can invest as little as $25 in each loan listing they choose. Prosper manages loan servicing on behalf of the matched borrowers and investors. We will clean up the data and perform exploratory data analysis on the loan data in the following sections, using univariate, bivariate, and multivariate graphs and summaries. The analysis section highlights the most intriguing observations from the plots section. In Final Plots and Summary section, we will identify top three charts and provide final reflections regarding the dataset.

Individual consumers can borrow and lend money to one another directly through online peer-to-peer (P2P) lending platforms. In an experiment, we look at the borrowers, loans, and groups that influence performance predictability. by conceptualizing financial and social strength for the online P2P lending industry. forecast the borrower's rate and if the loan will be paid on time as a result of using a database of 9479 completed P2P transactions, we conducted an empirical investigation. Transactions in the calendar year 2007 support the suggested. This study used a conceptual model. The findings revealed that searching for financial files.P2P performance prediction can be improved by using social indicators. The loan market Although social factors influence borrowing rates, when compared to financial strength, the effects of and status are quite minimal.

What we Do:
Tasks: EDA (Exploratory Data Analysis):
1. Perform Data Exploration.
2. Data Cleaning.
3. Data Visualization and Manipulation.
4. Perform EDA on the file separately.
5. Trained the model using machine learning algorithm

Understanding the Dataset

The dataset under consideration contains information from loans taken out between 2005 and 2014. There are variables related to the borrower's loan history, credit history, occupation, income range, and so on.

A few of the dataset's columns are listed below:
ListingNumber
ListingCreationDate
CreditGrade
Term
LoanStatus
ClosedDate
BorrowerAPR
BorrowerRate
LenderYield
EstimatedEffectiveYield
EstimatedLoss
EstimatedReturn
ProsperRating

2. Exploratory data Analysis (EDA)

2.1 Data Wraglling or Data Cleaning

Let's start with the Introduction "According to the techtarget.com the data cleaning or data Wraglling or data scrubbing, is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying data errors and then changing, updating or removing data to correct them. Data cleansing improves data quality and helps provide more accurate, consistent and reliable information for decision-making in an organization.

A few of the dataset's columns are listed below:
ListingNumber
ListingCreationDate
CreditGrade
Term
LoanStatus
ClosedDate
BorrowerAPR
BorrowerRate
LenderYield
EstimatedEffectiveYield
EstimatedLoss
EstimatedReturn
ProsperRating

This data table stored as 'df' has 113937 rows and 81 columns. Following that, because we have 81 variables and some of the cells may have missing data, I will remove the columns with more than 80% NA's.
Cleaning the dataset

Our dataset contains null values mainly in the form of "?" however because pandas cannot read these values, we will convert them to np.nan form.

df.duplicated().sum()

for col in df.columns:
    df[col].replace({'?':np.NaN},inplace=True)

Now, that the data appears to be clean, let's plot some of the most important plotted graphs.

Univariate Plots Section

A univariate plot depicts and summarizes the data's distribution. Individual observations are displayed on a dot plot, also known as a strip plot. A box plot depicts the data in five numbers: minimum, first quartile, median, third quartile, and maximum.

Research Question 1 : What are the most number of borrowers Credit Grade?

Check the univariate relationship of Credit Grade

sns.set_style("whitegrid", {"grid.color": ".6", "grid.linestyle": ":"})
sns.countplot(y='CreditGrade',data=df)

Research Question 2 : Since there are so much low Credit Grade such as C and D , does it lead to a higher amount of deliquency?

Check the univariate relationship of Loan Status

df['LoanStatus'].hist(bins=100)

Research Question 3 : What is the highest number of BorrowerRate?

Check the univariate relationship of Borrower rate

df['BorrowerRate'].hist(bins=100)

Research Question 4 : Since the highest number of Borrower Rate is between 0.1 and 0.2, does the highest number of Lender Yield is between 0.1 and 0.2?

Check the univariate relationship of Lender Yield on Loan

 df['LenderYield'].hist(bins=100)

Bivariate Plots Section

Bivariate analysis is a type of statistical analysis in which two variables are compared to one another. One variable will be dependent, while the other will be independent. X and Y represent the variables. The differences between the two variables are analyzed to determine the extent to which the change has occurred.

Discuss some of the relationships you discovered during this phase of the investigation. What differences did the feature(s) of interest have with other features in the dataset?

My main point of interest was the borrower rate, which had a strong correlation with the Prosper Rating. Borrower rate increased linearly as rating decreased.Listed below are some good example of research questions that have been subjected to bivariate analysis.

Research Question 1 : Is the Credit Grade really accurate? Does higher Credit Grade leads to higher Monthly Loan Payment? As for Higher Credit Grade we mean from Grade AA to B.

Check the Bivariate Relationship between CreditGarde and MonthlyLoan Payment.

base_color = sns.color_palette()[3]
plt.figure(figsize = [20, 5])
plt.subplot(1, 2, 2)
sns.boxplot(data=df,x='CreditGrade',y='MonthlyLoanPayment',color=base_color);
plt.xlabel('CreditGrade');
plt.ylabel('Monthly Loan Payment');
plt.title(' Relationship between Creditgrade and MonthlyLoan Payment');

Research Question 2 : Here we look at the Completed Loan Status and Defaulted Rate to determine the accuracy of Credit Grade.

Check the Bivariate Relatonship between CreditGrade and LoanStatus

base_color = sns.color_palette()[3]
plt.figure(figsize = [20, 5])
plt.subplot(1, 2, 2)
sns.boxplot(data=df,x='CreditGrade',y='MonthlyLoanPayment',color=base_color);
plt.xlabel('CreditGrade');
plt.ylabel('Monthly Loan Payment');
plt.title(' Relationship between Creditgrade and MonthlyLoan Payment');

Multivariate Plots

Multivariate analysis is traditionally defined as the statistical study of experiments in which multiple measurements are taken on each experimental unit and the relationship between multivariate measurements and their structure is critical for understanding the experiment.
So, let's look at an example of a research question that we solved using multivariate plots and matplotlib.

Research Question 1 : Now we know the Credit Grade is accurate and is a tool that is used by the organization in determining the person’s creditworthiness. Now we need to understand does the ProsperScore, the custom built risk assesment system is being used in determing borrower’s rate?

Check the Multivariate Relationship between BorrowerRate and BorrowerAPR.

plt.figure(figsize = [20, 5])
plt.subplot(1, 2, 1)
plt.scatter(data=df,x='BorrowerRate',y='BorrowerAPR',color=base_color);
plt.xlabel('Borrower Rate');
plt.ylabel('Borrower APR');
plt.title('Relationship Between Borrower Rate and BorrowerAPR');

g = sb.FacetGrid(data = df, col = 'Term', height = 5,
                margin_titles = True)
g.map(plt.scatter, 'BorrowerRate', 'BorrowerAPR');
plt.colorbar()

![image](https://user-images.githubusercontent.com/88158022/185348841-d9b04...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!