Credit Card Fraud Detection Using Machine Learning on Highly Unbalanced Data

Introduction

Credit card fraud detection is critical for financial security. This project explores the use of machine learning techniques to identify fraudulent transactions in highly imbalanced datasets. The dataset, provided by Worldline and the Université Libre de Bruxelles, includes 284,807 transactions, with only 0.172% labeled as fraudulent. This extreme imbalance poses significant challenges for model training and evaluation.

This project addresses these challenges with cost-sensitive learning, resampling methods, and ensemble classifiers. Evaluation relies on F1 Macro, Precision, Recall, and the area under the Precision-Recall curve (PRAUC), because accuracy alone is uninformative when almost every transaction is legitimate.
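
To illustrate why these metrics matter under imbalance, here is a minimal sketch using scikit-learn. The labels are synthetic, built only to match the fraud rate quoted above; they are not the real dataset. Average precision stands in for PRAUC here, which is a common summary of the Precision-Recall curve but an assumption about how the project computes it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

# Synthetic labels at roughly the dataset's fraud rate (0.172%); not the real data.
n_total = 284_807
n_fraud = round(n_total * 0.00172)
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A useless "model" that labels every transaction as legitimate.
y_pred = np.zeros(n_total, dtype=int)
y_score = np.zeros(n_total)  # constant fraud score for every transaction

accuracy = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
pr_auc = average_precision_score(y_true, y_score)

print(f"accuracy={accuracy:.4f}  f1_macro={f1_macro:.4f}  pr_auc={pr_auc:.4f}")
```

The always-legitimate baseline scores above 99.8% accuracy, yet its F1 Macro is close to 0.5 and its average precision collapses to the fraud rate, which is why accuracy is not the headline metric.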

Features

Dataset preprocessing: Handles PCA-transformed numerical features (V1 to V28) while retaining raw Time and Amount features for analysis.
Techniques for handling imbalanced data:
- Undersampling: Reduces the majority class to balance the dataset, at the risk of discarding useful information. Implemented with RandomUnderSampler and ClusterCentroids.
- Oversampling: Augments the minority class using synthetic techniques like SMOTE and ADASYN to enhance balance.
- Combination methods: SMOTEENN combines oversampling with noise reduction from undersampling.
- Cost-sensitive learning: Penalizes misclassified minority-class (fraud) samples more heavily during training, so the model is not dominated by the majority class.
Evaluation: Uses F1 Macro, Precision-Recall Curves, and PRAUC to assess performance under imbalance.

Installation

Clone the repository:

```
git clone https://github.com/mserra0/Credit-Card-Fraud-Detection.git
cd Credit-Card-Fraud-Detection
```

Install the required dependencies:
```
pip install -r requirements.txt
```

Methodology

Exploratory Data Analysis (EDA)

The project begins with a comprehensive analysis of the dataset:

Temporal Analysis: Identifies patterns in transaction timing that correlate with fraud.
Transaction Amounts: Investigates whether fraudulent transactions are associated with specific ranges of transaction amounts.
Outlier Analysis: Retains and studies outliers due to their relevance in fraud detection.
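
As a rough illustration of the temporal and amount analyses above, the following pandas sketch computes the fraud rate per hour and summarizes amounts per class. It assumes the columns are named `Time` (seconds elapsed since the first transaction), `Amount`, and `Class` (1 for fraud); the helper names are hypothetical, and the data frame is a random stand-in, so the printed numbers carry no meaning about real fraud.

```python
import numpy as np
import pandas as pd


def fraud_rate_by_hour(df: pd.DataFrame) -> pd.Series:
    """Fraction of fraudulent transactions per hour bucket (0-23)."""
    # Time is assumed to be seconds since the first transaction, so the hour
    # is relative to that first transaction rather than to the wall clock.
    hour = (df["Time"] // 3600 % 24).astype(int)
    return df.groupby(hour)["Class"].mean()


def amount_summary_by_class(df: pd.DataFrame) -> pd.DataFrame:
    """Descriptive statistics of Amount for each class."""
    return df.groupby("Class")["Amount"].describe()


# Synthetic stand-in for the real table; values are random, not real results.
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "Time": rng.uniform(0, 2 * 24 * 3600, size=n),
    "Amount": rng.lognormal(mean=3.0, sigma=1.5, size=n).round(2),
    "Class": (rng.random(n) < 0.01).astype(int),
})

hourly = fraud_rate_by_hour(toy)
amounts = amount_summary_by_class(toy)
print(hourly.head())
print(amounts[["count", "mean", "50%", "max"]])
```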

Techniques for Imbalanced Data

Undersampling:
- Methods such as RandomUnderSampler and Cluster Centroids are used to reduce the size of the majority class.
- Limitations: Loss of valuable information leads to reduced performance (e.g., F1 Macro of 0.537 with KNN and RandomUnderSampler).
Oversampling:
- SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic examples by interpolating between a minority-class sample and one of its nearest minority-class neighbors.
- ADASYN (Adaptive Synthetic Sampling): Extends SMOTE by generating more synthetic samples around minority instances that are harder to learn, i.e. those with more majority-class neighbors.
- Results: Significant performance improvement (F1 Macro of 0.924 with Random Forest and SMOTE).
Combination Sampling:
- SMOTEENN: Applies SMOTE and then cleans the result with Edited Nearest Neighbors (ENN), which removes samples whose label disagrees with most of their nearest neighbors.
- Results: Moderate success (F1 Macro of 0.91), though key fraudulent outliers may be removed, limiting further gains.
Cost-Sensitive Learning:
- Models like Random Forest, Logistic Regression, KNN, and XGBoost are configured so that misclassifying a minority-class (fraud) sample costs more than misclassifying a legitimate one.
- A Voting Classifier, combining the best-performing models, yielded the highest performance:
  - F1 Macro: 0.938
  - PRAUC: 0.86
  - Accuracy: 99.9%
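
A minimal sketch of cost-sensitive learning combined with a Voting Classifier, using scikit-learn only. The choice of members (Random Forest plus Logistic Regression), `class_weight="balanced"`, and soft voting are assumptions made for illustration; the exact members, weights, and voting mode used in the project are not stated here. XGBoost exposes a comparable `scale_pos_weight` parameter but is left out to keep the example dependency-free. The data is synthetic, so the scores will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy imbalanced data (about 3% positives); not the real transactions.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.97, 0.03], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency,
# so a missed fraud case costs more than a misjudged legitimate one.
forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Soft voting averages the predicted probabilities of the members.
ensemble = VotingClassifier(
    estimators=[("rf", forest), ("lr", logreg)], voting="soft"
)
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
y_score = ensemble.predict_proba(X_test)[:, 1]
f1_macro = f1_score(y_test, y_pred, average="macro")
pr_auc = average_precision_score(y_test, y_score)
print(f"F1 Macro={f1_macro:.3f}  PRAUC (average precision)={pr_auc:.3f}")
```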

Experiments and Results

Undersampling: Reduced the training set size, but the information lost with the discarded majority-class samples hurt performance.
Oversampling: SMOTE and ADASYN demonstrated strong performance with Random Forest and XGBoost.
Combined Sampling: SMOTEENN showed promise but was less effective than standalone SMOTE in some cases.
Cost-Sensitive Learning: The Voting Classifier outperformed all other models, with a precision of 0.927 and a recall of 0.785.

Resampling the minority class before cross-validation leaks information: synthetic or duplicated samples derived from the same original transaction can end up in both the training and validation folds. This undermines the integrity of cross-validation and produces overly optimistic, misleading scores. To avoid this, resampling should be performed inside each cross-validation fold, so that synthetic data is generated from the training portion only and the validation portion keeps its original samples and class distribution.

Image source: Marco Altini

Conclusions

The project demonstrates the importance of balancing techniques in fraud detection for imbalanced datasets. The Voting Classifier with cost-sensitive learning proved to be the most effective approach, achieving:

High precision and accuracy in identifying fraudulent transactions.
Robust performance across various metrics (F1 Macro: 0.938).

However, recall remains an area for improvement: at 0.785, roughly one in five fraud cases goes undetected. This highlights the complexity of detecting sophisticated fraud patterns, which may closely mimic legitimate transactions.

The study underscores the need for a careful trade-off between precision and recall to ensure both high detection rates and minimal false positives.
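
One common way to manage this trade-off is to tune the decision threshold on the predicted fraud scores rather than using the default of 0.5. The sketch below picks the threshold with the highest recall that still meets a minimum precision. The helper `threshold_for_precision` is hypothetical and the scores are made up for illustration; this is not necessarily how the project selects its operating point.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_precision(y_true, y_score, min_precision):
    """Return (threshold, precision, recall) with the highest recall whose
    precision is at least min_precision, or None if no threshold qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return None
    best = int(np.argmax(np.where(ok, recall, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Made-up labels and fraud scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

result = threshold_for_precision(y_true, y_score, min_precision=0.9)
print(result)
```

Transactions are then flagged when `y_score >= threshold`. Raising `min_precision` reduces false alarms at the cost of missing more fraud, and lowering it does the opposite.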

Future Directions

Model Specialization:
- Develop models tailored to specific fraud types.
- Integrate semi-supervised or unsupervised methods, such as autoencoders, to detect anomalies.
Enhancing Interpretability:
- Improve feature transparency for clearer insights into fraud detection.
Robustness Improvements:
- Expand experiments with different data augmentation and preprocessing techniques.
- Investigate adversarial training to counter sophisticated fraud techniques.

Contributors

Marc Serra Ortega - Universitat Autònoma de Barcelona (UAB)

License

This project is licensed under the MIT License.
