Final Project - Fraud Detection and Prevention Model

Project Description

Objective

In this project, I build a classification model with Random Forest to detect and prevent fraud. The training data comes from fictional clients of an e-commerce company.
The model takes the following input fields to make a prediction:

orderAmount: Float
orderState: String
paymentMethodRegistrationFailure: String
paymentMethodType: String
paymentMethodProvider: String
paymentMethodIssuer: String
transactionAmount: Integer
transactionFailed: Boolean
emailDomain: String
emailProvider: String
customerIPAddressSimplified: String
sameCity: String

The model returns one of the following predictions:

No
Sí
Warning

Process

We receive a dataset in JSON format, which we transform into CSV format. After this transformation, some columns need to be deleted or manipulated. For example, we remove the ID columns, and for the CustomerEmail column we keep the most important/common values and group rare values under a new label called "weird".
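
The wrangling step above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's code (the project does this with Pandas). The records, the `min_count` threshold, the `group_rare` helper, and the choice to keep the part of the email after "@" are all assumptions made for the example.

```python
import csv
import io
import json
from collections import Counter


def group_rare(values, min_count, other="weird"):
    """Keep values seen at least `min_count` times; relabel the rest."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical records: the real dataset's fields and nesting may differ.
raw_json = """[
  {"customerId": "c1", "customerEmail": "ana@gmail.com", "orderAmount": 18.0},
  {"customerId": "c2", "customerEmail": "luis@gmail.com", "orderAmount": 26.0},
  {"customerId": "c3", "customerEmail": "eva@yahoo.com", "orderAmount": 45.0},
  {"customerId": "c4", "customerEmail": "x@q7z.biz", "orderAmount": 33.0}
]"""

records = json.loads(raw_json)

# Assumption: the part after "@" is the value that is kept or relabeled.
domains = [r["customerEmail"].rsplit("@", 1)[-1] for r in records]
providers = group_rare(domains, min_count=2)

# Write the CSV without the ID column.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["emailProvider", "orderAmount"])
writer.writeheader()
for record, provider in zip(records, providers):
    writer.writerow({"emailProvider": provider, "orderAmount": record["orderAmount"]})

csv_text = buffer.getvalue()
print(csv_text)
```

With a frequency threshold like this, the set of "common" values is decided once on the training data and should be reused unchanged at prediction time.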

After this data adaptation and manipulation process, we carry out an Exploratory Data Analysis (EDA), with univariate analysis, bivariate analysis, and correlations for the variables that stood out. At the end of the EDA, we highlight the specific insights found.

After completing the EDA, we prepare the data by discretizing variables, handling missing values, and interpreting and modifying certain variables.
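
As a rough illustration of two of these preparation steps, the sketch below fills missing values with the median and then discretizes a numeric column into labeled bins. The median strategy, the bin edges, the labels, and the helper names `fill_missing` and `discretize` are assumptions for the example; the notebooks may use different rules.

```python
import bisect
import statistics


def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def discretize(value, edges, labels):
    """Map a number to a bin label; each bin includes its upper edge."""
    return labels[bisect.bisect_left(edges, value)]


# Hypothetical amounts, bin edges, and labels (not taken from the project).
amounts = [18.0, None, 45.0, 70.0]
edges = [20, 50]
labels = ["low", "medium", "high"]

filled = fill_missing(amounts)
binned = [discretize(a, edges, labels) for a in filled]
print(filled)
print(binned)
```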

Before training, we select specific columns from the post-processed dataset and normalize their values. We then train two clustering models with different algorithms: one uses K-Means and the other uses HDBSCAN. Once they are trained, we document the insights discovered through a coordinate plot for each model.
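
To make the normalize-then-cluster idea concrete, here is a small plain-Python sketch of min-max scaling followed by Lloyd's algorithm for K-Means. It is only an illustration: the project uses the Scikit-Learn and Hdbscan implementations, the normalization method is not stated (min-max is assumed here), and the data points and first-k-points seeding are hypothetical.

```python
import math


def min_max_scale(rows):
    """Scale each column of a list of numeric rows to the [0, 1] range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical (orderAmount, transactionAmount) pairs, not project data.
rows = [[10.0, 12], [12.0, 11], [11.0, 13], [95.0, 90], [98.0, 97], [100.0, 94]]
scaled = min_max_scale(rows)
labels, centroids = k_means(scaled, k=2)
print(labels)
```

Scaling first matters because K-Means relies on Euclidean distance, so a column with a larger numeric range would otherwise dominate the cluster assignment.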

To conclude the modeling process, we create a classification model with Random Forest. Once the model is trained with specific parameters, we generate a confusion matrix to see the number of correct and incorrect predictions made by the model, organized by class.
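
For readers unfamiliar with confusion matrices, the sketch below shows how one is counted for the three prediction classes. The true and predicted labels are made up for the example, and the `confusion_matrix` helper here is a hand-written stand-in; the project itself visualizes the matrix with Scikit-Learn's ConfusionMatrixDisplay.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["No", "Sí", "Warning"]

# Hypothetical labels, not results from the trained model.
y_true = ["No", "No", "Sí", "Warning", "Sí", "No"]
y_pred = ["No", "Warning", "Sí", "Warning", "No", "No"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions per class; every off-diagonal cell counts one specific kind of mistake, such as a true "Sí" predicted as "No".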

For the API, we create a service that receives specific input data, transforms it (discretization, one-hot encoding), and returns the model's prediction for that input. We then containerize the API using Docker and host it on Microsoft Azure.
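
One detail worth illustrating is that the one-hot encoding applied inside the API has to produce the same columns the model saw during training. The sketch below shows one way to do that with a fixed category mapping. The `CATEGORIES` mapping, its values, and the `one_hot` helper are hypothetical; the actual API code and category values may differ.

```python
def one_hot(record, categories):
    """Encode `record` using a fixed {field: [known values]} mapping."""
    encoded = {}
    for field, known in categories.items():
        for value in known:
            encoded[f"{field}_{value}"] = int(record.get(field) == value)
    return encoded


# Hypothetical categories; in practice they come from the training data.
CATEGORIES = {
    "paymentMethodType": ["card", "paypal", "bitcoin"],
    "sameCity": ["yes", "no", "unknown"],
}

payload = {"paymentMethodType": "paypal", "sameCity": "no"}
features = one_hot(payload, CATEGORIES)
print(features)

# A value never seen in training yields all zeros for that field.
unseen = one_hot({"paymentMethodType": "cheque", "sameCity": "yes"}, CATEGORIES)
print(unseen)
```

Because the column order and names come from the fixed mapping, every request yields a feature vector of the same shape regardless of which values it contains.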

Finally, we create a Graphical User Interface with Gradio and deploy it on Hugging Face Spaces. The URL to test the interactive app is: https://huggingface.co/spaces/Itrs/Proyecto_Final

Methods Used

Data Cleaning and Transformation (Data Wrangling)
Exploratory Data Analysis (EDA)
Data Visualization
Data Preparation
Model Training (Clustering and Classification)
Model Evaluation
Model Deployment with an API, Docker, and Microsoft Azure
Graphical User Interface with Gradio

Technologies and Tools used

1. Cleaning, Transformation and Data Preparation

Pandas: Load and prepare data, efficiently manipulate datasets, transform a dataset from JSON format to CSV format, perform descriptive analysis.

2. Data Visualization / EDA

Funpymodeling: Observe data distribution, the number of unique values and their occurrences, standard deviation, percentage of missing values, correlation between variables, among other information.
Seaborn and Matplotlib: Tools for data visualization and creation of statistical plots (correlation, scatter plots, confusion matrix).
Minepy (MINE): Identify non-linear and complex relationships between variables in datasets using the MINE algorithm (Maximal Information-based Nonparametric Exploration).
YellowBrick (cluster.KElbowVisualizer): Visualization of the Elbow Method in clustering algorithms.
Plotly (Scatter3D): Visualization of interactive 3D scatter plots for clusters.

3. Modeling

Hdbscan: HDBSCAN algorithm (Clustering Model).
Scikit-Learn
- KMeans: K-Means algorithm (Clustering Model).
- RandomForestClassifier: Random Forest algorithm (Classification Model).
- ConfusionMatrixDisplay: Visualize Confusion Matrix (Classification Model).
MLflow: Management and registration of model experiments, allowing us to track parameters, metrics, and versions.

4. API

FastAPI: API creation in order to expose the trained model.
Uvicorn: Light and fast ASGI server that allows the asynchronous execution of the API.
Requests: Make HTTP requests to the API.
Pydantic: Validate the input data of the API.
Gradio: Graphical User Interface that allows the users to test the model.

5. Deployment

Docker: Containerize the API and simplify its deployment.
Hugging Face Spaces: Host and share the model in a centralized way.
Microsoft Azure: Deployment platform in order to host the API.

Programming Language

Python: Main language used for the development of the project, with broad support for data science and machine learning libraries.

Installation

Docker

Build image:
docker build -t proyecto_final .
Run container:
docker run -p 7860:7860 -e ID_USER=Iñaki proyecto_final

Deploy Docker Hub + Web App in Microsoft Azure

To keep in mind

To run mlflow ui, we must first change into the deployment/ folder, which is where the mlruns/ folder is located.
The insights written in the notebooks are in Spanish.
