The goal of this project is to build a machine learning model that detects financial fraud by identifying suspicious transaction activity.
Fraud Detection Dataset
The dataset is divided into two parts: train and test.
It contains information about financial transactions, with a column indicating whether each transaction was fraudulent (the target variable).
- index: Unique Identifier for each row
- trans_date_trans_time: Transaction DateTime
- cc_num: Credit Card Number of Customer
- merchant: Merchant Name
- category: Category of Merchant
- amt: Amount of Transaction
- first: First Name of Credit Card Holder
- last: Last Name of Credit Card Holder
- gender: Gender of Credit Card Holder
- street: Street Address of Credit Card Holder
- city: City of Credit Card Holder
- state: State of Credit Card Holder
- zip: ZIP Code of Credit Card Holder
- lat: Latitude Location of Credit Card Holder
- long: Longitude Location of Credit Card Holder
- city_pop: Credit Card Holder's City Population
- job: Job of Credit Card Holder
- dob: Date of Birth of Credit Card Holder
- trans_num: Transaction Number
- unix_time: UNIX Time of transaction
- merch_lat: Latitude Location of Merchant
- merch_long: Longitude Location of Merchant
- is_fraud: Fraud Flag <--- Target Class
- Multicollinearity between features - Solution: Addressed multicollinearity by combining correlated features and eliminating unnecessary ones (a sketch of one way to flag highly correlated pairs follows below).
- Feature Selection - Solution: Kept location features such as ZIP code, latitude, and longitude in the model, enriching the feature engineering process for better fraud detection.
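A minimal sketch of how such a multicollinearity check might look; the helper name and the 0.9 threshold are illustrative assumptions, not taken from the project code:

```python
import pandas as pd

def flag_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return pairs of numeric columns whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip self-correlations
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 3)))
    return pairs
```

Any pair returned is a candidate for combining into a single feature or dropping one side.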
- Created a class named DataLoader that loads the training and test datasets from CSV files.
- Combined both datasets into one DataFrame (combined_train_test_data).
- Created a class named DataExplorer that performs an initial exploration of the combined data, displaying the dataset's shape and columns, the data type of each column, basic information about the DataFrame, and the first five rows for a quick overview.
- Created a class named DataLoader that renames columns to lowercase, handles missing values and duplicates, and drops unnecessary columns (a combined loading-and-cleaning sketch follows this list).
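A minimal sketch of what the loading and cleaning steps could look like; the file paths, the `source` tag, and the specific dropped column are illustrative assumptions:

```python
import pandas as pd

class DataLoader:
    """Loads the train/test CSVs, combines them, and applies basic cleaning."""

    def __init__(self, train_path: str = "fraudTrain.csv", test_path: str = "fraudTest.csv"):
        self.train_path = train_path
        self.test_path = test_path

    def load(self) -> pd.DataFrame:
        train = pd.read_csv(self.train_path)
        test = pd.read_csv(self.test_path)
        train["source"], test["source"] = "train", "test"   # remember each row's origin
        return pd.concat([train, test], ignore_index=True)   # combined_train_test_data

    @staticmethod
    def clean(df: pd.DataFrame) -> pd.DataFrame:
        df = df.rename(columns=str.lower)       # lowercase all column names
        df = df.drop_duplicates().dropna()      # remove duplicates and missing values
        return df.drop(columns=["index"], errors="ignore")  # drop an unnecessary column
```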
- Created a class named EDA that provides an overview of the dataset, including descriptive statistics and data visualizations.
- Statistical summary: Computes and returns a statistical summary for numerical columns, including measures of central tendency (mean, median) and dispersion (variance, standard deviation, and interquartile range (IQR)), along with outlier counts.
- Data Distribution Visualizations: Histograms with KDE overlays for the numerical features (excluding 'is_fraud'), plus a count plot of the target variable 'is_fraud' comparing fraudulent and non-fraudulent transactions.
- Inferential Statistics: A Chi-square test was applied to each categorical feature against the target variable (is_fraud) to check for association; the categorical variables were then encoded as numeric and added to the correlation matrix (a sketch of the test follows this list).
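A minimal sketch of the Chi-square association test using scipy; the function name is illustrative:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_vs_target(df: pd.DataFrame, feature: str, target: str = "is_fraud") -> float:
    """Chi-square test of independence between one categorical feature and the fraud flag."""
    contingency = pd.crosstab(df[feature], df[target])
    _, p_value, _, _ = chi2_contingency(contingency)
    return p_value  # a small p-value (e.g. < 0.05) suggests an association with the target

# Run the test over every categorical column:
# for col in df.select_dtypes("object").columns:
#     print(col, chi_square_vs_target(df, col))
```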
- Created new features to help address multicollinearity between existing features: time-related features, transaction amount features, and the distance from the cardholder to the merchant.
- Additional features such as time differences, a small-city indicator, and combined categories were also generated, providing more diverse information, reducing the correlation between the original features, and improving model performance (a feature engineering sketch follows below).
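A minimal sketch of this feature engineering step, using the column names listed above; the exact features in the notebook may differ, and the haversine formula here is one common way to compute the cardholder-to-merchant distance:

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add time-related features and the cardholder-to-merchant distance in kilometers."""
    ts = pd.to_datetime(df["trans_date_trans_time"])
    df["hour"] = ts.dt.hour
    df["day_of_week"] = ts.dt.dayofweek

    # Haversine distance between cardholder (lat, long) and merchant (merch_lat, merch_long).
    lat1, lon1 = np.radians(df["lat"]), np.radians(df["long"])
    lat2, lon2 = np.radians(df["merch_lat"]), np.radians(df["merch_long"])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    df["distance_to_merchant"] = 6371 * 2 * np.arcsin(np.sqrt(a))  # Earth radius ≈ 6371 km
    return df
```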
- Train-Test Split: Divided the data into 70% for training and 30% for testing.
- Scaling: Applied MinMaxScaler to normalize the numeric features in the training and test sets, scaling values between 0 and 1.
- SMOTE (Synthetic Minority Over-sampling Technique): Balanced the dataset by generating synthetic samples for the minority class (fraud) in the training data (a pipeline sketch follows below).
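A minimal sketch of the split-scale-resample pipeline; `X` and `y` stand for the engineered feature matrix and the `is_fraud` target, and the `stratify` and `random_state` arguments are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

# 70/30 split; stratify keeps the original fraud ratio in both sets (an assumption here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Over-sample the minority (fraud) class in the training data only, never the test set.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
```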
Models used:
- Logistic Regression: Predicts non-fraud cases with perfect precision but misses 31% of actual non-fraud cases (recall 0.69). It struggles with fraud detection, showing extremely low precision (0.01) and F1-score (0.02), making it unreliable for detecting fraud.
- Random Forest: Performs well on non-fraud cases with near-perfect precision and recall (0.99). Fraud detection improves over Logistic Regression, with 79% recall and a higher F1-score (0.51), but its precision (0.38) remains low, limiting its reliability for fraud detection.
- XGBoost (model chosen): Detects non-fraudulent transactions with 100% precision and recall. However, it detects only 81% of fraudulent transactions, missing 19%, and has a 14% false positive rate. High accuracy alone is not a reliable metric here because the classes are imbalanced, with non-fraudulent transactions dominating (a training sketch for all three models follows below).
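A minimal sketch of how the three models could be trained and compared; the hyperparameters shown are illustrative defaults, not necessarily those used in the notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # Per-class precision, recall, and F1 -- the metrics quoted above.
    print(name, classification_report(y_test, preds), sep="\n")
```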
- Feature Importance Plot: Used to check each feature's impact on the model's decisions, guiding and refining feature engineering.
- Confusion Matrix plot: Important for evaluating the model's performance, particularly for identifying false positives.
- ROC Curve plot: Shows the model is good at distinguishing between fraudulent and non-fraudulent transactions, making very few mistakes and correctly classifying over 90% of cases (a plotting sketch follows this list).
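A minimal sketch of the three evaluation plots, assuming the `models` dictionary from the training sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from xgboost import plot_importance

model = models["XGBoost"]  # the chosen model

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
plot_importance(model, ax=ax1, max_num_features=10)                   # feature importance
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax2)  # false positives/negatives
RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax3)         # ROC curve with AUC
plt.tight_layout()
plt.show()
```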
As a next step, review the specific false positives and false negatives to understand the model's mistakes and improve data preprocessing.
- Jupyter Notebook with Full Code in Python: Contains all steps from data preprocessing to model evaluation.
- PowerPoint Presentation: A concise overview of the project, including methodology, results, and key findings, supported by visual plots.
- Power BI Dashboard: An interactive visualization of key metrics and model performance, built on a table of actual vs. predicted values to explore patterns, trends, and key insights.