A machine learning system to detect fraudulent credit card transactions using Random Forest and Neural Networks. Addresses class imbalance challenges with SMOTE and optimized for real-world application in financial security.

Credit Card Fraud Detection

๐Ÿ—‚๏ธ Table of Contents


📌 Project Overview

This project implements advanced data mining techniques to detect fraudulent credit card transactions.

With global fraud losses exceeding $28.65 billion annually, we aim to tackle key challenges such as imbalanced data, computational efficiency, and evolving fraud patterns.


๐Ÿ” Key Challenges Addressed

  • Severe Class Imbalance: Only 0.17% of transactions (492 out of 284,807) are fraudulent
  • Computational Complexity: 28 PCA-transformed features requiring efficient processing
  • Evolving Fraud Patterns: Adapting to changing fraud tactics over time

📊 Dataset

We utilized the Credit Card Fraud Detection Dataset from Kaggle, which contains:

✔ 284,807 transactions (492 fraudulent cases; see the quick check below)
✔ 28 principal components (from PCA transformation)
✔ Time-elapsed and transaction-amount features
✔ Anonymized data to protect user privacy
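
These figures are easy to verify directly from the CSV. A quick check, assuming the standard Kaggle column layout (Time, V1-V28, Amount, Class):

import pandas as pd

df = pd.read_csv("creditcard.csv")

print(df.shape)                    # expected (284807, 31): Time, V1-V28, Amount, Class
print(df["Class"].value_counts())  # expected 284315 legitimate vs 492 fraudulent
print(100 * df["Class"].mean())    # fraud rate of roughly 0.17%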


📊 Dataset Links

Original Dataset

  • creditcard.csv - The original credit card transaction dataset from Kaggle

Preprocessed Datasets

SMOTE Dataset

  • smote_processed_data.csv - Dataset after applying SMOTE for class balancing

Train/Test Split Files

  • train_preprocessed.csv - Preprocessed training dataset
  • test_preprocessed.csv - Preprocessed testing dataset

Final Presentation

Credit Card Fraud Detection final project presentation (download link)


โš™๏ธ Methodology

1๏ธโƒฃ Data Preprocessing

✅ Feature scaling with RobustScaler
✅ Correlation analysis and feature selection
✅ Class Imbalance Handling: Implemented SMOTE (Synthetic Minority Over-sampling Technique)
✅ Outlier Detection: Used Isolation Forest to identify and handle anomalies
✅ Stratified train-test splitting (the full pipeline is sketched below)
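
A minimal sketch of this pipeline is shown below. It assumes the standard Kaggle column names; the test size, contamination rate, and random seeds are illustrative choices rather than the exact values used in the notebooks:

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

df = pd.read_csv("creditcard.csv")

# Scale the heavy-tailed Amount and Time columns with RobustScaler
df[["Amount", "Time"]] = RobustScaler().fit_transform(df[["Amount", "Time"]])

X, y = df.drop(columns="Class"), df["Class"]

# Stratified split preserves the 0.17% fraud rate in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Flag and drop anomalies in the training data with Isolation Forest
inliers = IsolationForest(contamination=0.01, random_state=42).fit_predict(X_train) == 1
X_train, y_train = X_train[inliers], y_train[inliers]

# Balance only the training data with SMOTE; the test set keeps its natural imbalance
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)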

2๏ธโƒฃ Models Implemented

🔹 1. Random Forest

✔ Configuration: 100 trees, max depth of 10, balanced class weights
✔ ROC-AUC Score: 0.979
✔ Threshold Optimization: Improved F1 score from 0.5784 to 0.8021 (38.67% improvement)
✔ Feature importance analysis (see the sketch below)
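
A sketch of this configuration and the F1-based threshold search, reusing the preprocessed data from the previous step (the exact tuning procedure in the notebook may differ):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

# 100 trees, max depth 10, balanced class weights, as described above
rf = RandomForestClassifier(
    n_estimators=100, max_depth=10, class_weight="balanced",
    n_jobs=-1, random_state=42,
)
rf.fit(X_train_res, y_train_res)

# Threshold optimization: choose the probability cutoff that maximizes F1
# (ideally done on a validation split rather than the final test set)
proba = rf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]   # last curve point has no threshold
y_pred = (proba >= best_threshold).astype(int)

# Built-in impurity-based feature importance ranking
top_features = sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1])[:10]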

🔹 1b. Random Forest with SHAP analysis

✔ Configuration: For computational efficiency, we optimized the original Random Forest model (100 trees, max depth of 10) into a lighter version (50 trees, max depth of 8, increased min_samples_split, parallelized processing)
✔ ROC-AUC Score: 0.979
✔ Threshold Optimization: Improved F1 score from 0.5784 to 0.8021 (38.67% improvement)
✔ SHAP Analysis Enhancements: Unlike the aggregate feature importance measures in the base Random Forest, SHAP analysis provides per-prediction feature contribution values, revealing how specific features such as V14, V10, and V4 individually influence model decisions and making the fraud detection system more transparent and interpretable (see the sketch below)
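
A sketch of the SHAP workflow on the lighter model is shown below. The min_samples_split value and the size of the explained sample are illustrative assumptions, and the layout of shap_values differs between shap versions:

import shap
from sklearn.ensemble import RandomForestClassifier

# Lighter configuration: 50 trees, max depth 8, larger min_samples_split, parallelized
rf_light = RandomForestClassifier(
    n_estimators=50, max_depth=8, min_samples_split=10,
    class_weight="balanced", n_jobs=-1, random_state=42,
)
rf_light.fit(X_train_res, y_train_res)

# Explain a sample of test transactions; TreeExplainer gives per-prediction contributions
sample = X_test.sample(1000, random_state=42)
shap_values = shap.TreeExplainer(rf_light).shap_values(sample)

# Older shap versions return a list per class, newer ones a 3-D array; take the fraud class
fraud_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Summary plot highlights globally influential features such as V14, V10, and V4
shap.summary_plot(fraud_shap, sample)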

🔹 2. Neural Network

✔ Architecture: Three hidden layers (64, 32, 16 neurons) with batch normalization
✔ Activation: ReLU for hidden layers, Sigmoid for output
✔ Optimization: Adam with learning rate of 0.001
✔ Dropout layers (0.3) to prevent overfitting (see the Keras sketch below)
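
A Keras sketch of this architecture, trained on the SMOTE-balanced data from the preprocessing step; the batch size and epoch count are illustrative rather than the project's exact values:

import tensorflow as tf
from tensorflow.keras import layers, models

# Three hidden layers (64, 32, 16) with batch normalization and 0.3 dropout,
# ReLU activations, sigmoid output
model = models.Sequential([
    layers.Input(shape=(X_train_res.shape[1],)),
    layers.Dense(64, activation="relu"), layers.BatchNormalization(), layers.Dropout(0.3),
    layers.Dense(32, activation="relu"), layers.BatchNormalization(), layers.Dropout(0.3),
    layers.Dense(16, activation="relu"), layers.BatchNormalization(), layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])

# Adam optimizer with learning rate 0.001, binary cross-entropy loss
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

model.fit(X_train_res, y_train_res, validation_split=0.1, epochs=20, batch_size=2048)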


📈 Evaluation Metrics

  • Precision, Recall, and F1-Score
  • ROC Curve and AUC
  • Confusion Matrix Analysis
  • Probability Distribution Analysis (computed as sketched below)
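
Using the tuned predictions (y_pred) and raw fraud probabilities (proba) from the Random Forest sketch above, these metrics can be computed along the following lines:

import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

# Precision, recall, and F1 per class, plus the confusion matrix, at the tuned threshold
print(classification_report(y_test, y_pred, digits=4))
print(confusion_matrix(y_test, y_pred))

# ROC curve and AUC use the raw fraud probabilities rather than the thresholded labels
fpr, tpr, _ = roc_curve(y_test, proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, proba):.3f}")
plt.xlabel("False positive rate"); plt.ylabel("True positive rate"); plt.legend(); plt.show()

# Probability distribution analysis: compare score histograms for the two classes
plt.hist(proba[y_test == 0], bins=50, alpha=0.5, density=True, label="legitimate")
plt.hist(proba[y_test == 1], bins=50, alpha=0.5, density=True, label="fraud")
plt.xlabel("Predicted fraud probability"); plt.legend(); plt.show()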

🚀 Results

Model            Precision   Recall   F1-Score   AUC
Random Forest    0.96        0.83     0.89       0.97
Neural Network   0.89        0.91     0.90       0.96

📊 Performance Insights

  • Random Forest: Better interpretability, with clearer feature importance rankings and higher precision (fewer false positives)
  • Neural Network: Better at capturing complex fraud patterns, with higher recall (fewer missed frauds)
  • Threshold Optimization: Critical for improving model performance on imbalanced datasets

🔬 Key Findings

  1. Model Performance Enhancement: The Neural Network achieved both high recall and high precision on the held-out test set
  2. SMOTE's Role: Transformed the severe 0.17% fraud rate into a balanced class distribution for model training
  3. Comparative Insights: Random Forest offered better interpretability, while the Neural Network caught more fraud cases (higher recall)

🔮 Future Improvements

  • Implement real-time fraud detection capability
  • Explore ensemble methods for improved performance
  • Develop an adaptive learning approach for evolving fraud patterns
  • Create a web-based dashboard for monitoring fraud detection metrics
  • Extend explainability beyond the current SHAP analysis with additional XAI techniques (e.g., LIME)

💻 Installation and Usage

📋 Project Structure

credit-card-fraud-detection/
├── code/                           # Source code files
│   ├── models/                     # Model implementation
│   │   ├── part4-Random forest.ipynb   # Random Forest model
│   │   ├── part4b-Random forest(SHAP).ipynb  # Random Forest with SHAP analysis model
│   │   └── part5-Neural network.ipynb  # Neural Network model
│   └── preprocessing/              # Data preprocessing pipeline
│       ├── part1_credit_card_fraud_preprocessing.ipynb  # Feature scaling & initial preprocessing
│       ├── part2_credit_card_fraud_smote.ipynb          # SMOTE implementation
│       └── part3-outlier_detection_and_models.ipynb     # Outlier detection
├── image/                          # Visualization images
│   ├── SMOTE/                      # SMOTE-related visualizations
│   ├── neural network/             # Neural network results & visualizations
│   ├── outliter detection/         # Outlier detection visualizations
│   ├── preprocessed image/         # Data preprocessing visualizations
│   └── random foreast/             # Random Forest results & visualizations
├── tableau/                        # Tableau files and visualizations
│   └── tableau dashboard.png       # Dashboard visualization
├── README.md                       # Project documentation
├── banner.png                      # Project banner image
└── workflow.pdf                    # Workflow diagram

📌 Requirements

numpy==1.26.4
pandas==2.2.1
scikit-learn==1.4.2
tensorflow==2.15.0
matplotlib==3.8.3
seaborn==0.13.2
imbalanced-learn==0.12.0

🚀 Setup and Execution

# Clone the repository
git clone https://github.com/thisissophiawang/credit-card-fraud-detection.git
cd credit-card-fraud-detection

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter (pip install notebook if it is not already available), then run
# the preprocessing notebooks in code/preprocessing/ (part1 through part3) followed
# by the model notebooks in code/models/
jupyter notebook

🔗 References

  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
  • Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
  • Dal Pozzolo, A., et al. (2015). Calibrating Probability with Undersampling for Unbalanced Classification. IEEE Symposium Series on Computational Intelligence and Data Mining.
  • Fernández, A., et al. (2018). Learning from Imbalanced Data Sets. Springer.
