- Project Overview
- Key Challenges Addressed
- Dataset
- Dataset Link
- Final Presentation Link
- Methodology
- Evaluation Metrics
- Results
- Key Findings
- Future Improvements
- Installation and Usage
- Project Structure
- Requirements
- Setup and Execution
- References
This project implements advanced data mining techniques to detect fraudulent credit card transactions.
With global fraud losses exceeding $28.65 billion annually, we aim to tackle key challenges such as imbalanced data, computational efficiency, and evolving fraud patterns.
- Severe Class Imbalance: Only 0.17% of transactions (492 out of 284,807) are fraudulent
- Computational Complexity: 28 PCA-transformed features requiring efficient processing
- Evolving Fraud Patterns: Adapting to changing fraud tactics over time
We utilized the Credit Card Fraud Detection Dataset from Kaggle, which contains:
- 284,807 transactions (492 fraudulent cases)
- 28 principal components (from PCA transformation)
- Time-elapsed and transaction-amount features
- Anonymized data to protect user privacy
- creditcard.csv - The original credit card transaction dataset from Kaggle
- smote_processed_data.csv - Dataset after applying SMOTE for class balancing
- train_preprocessed.csv - Preprocessed training dataset
- test_preprocessed.csv - Preprocessed testing dataset
Credit Card Fraud Detection Final Project Presentation (download link)
- Feature scaling with RobustScaler
- Correlation analysis and feature selection
- Class Imbalance Handling: implemented SMOTE (Synthetic Minority Over-sampling Technique)
- Outlier Detection: used Isolation Forest to identify and handle anomalies
- Stratified train-test splitting
- Configuration: 100 trees, max depth of 10, balanced class weights
- ROC-AUC Score: 0.979
- Threshold Optimization: improved F1 score from 0.5784 to 0.8021 (a 38.67% improvement)
- Feature importance analysis
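A minimal sketch of this configuration and the threshold-optimization step, using synthetic data in place of the preprocessed dataset (exact scores will differ from those reported above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the preprocessed dataset.
X, y = make_classification(n_samples=4000, n_features=29, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Configuration from above: 100 trees, max depth 10, balanced class weights.
rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            class_weight="balanced", random_state=0)
rf.fit(X_tr, y_tr)

# Threshold optimization: instead of the default 0.5 cutoff, pick the
# probability threshold that maximizes F1 on the evaluation scores.
proba = rf.predict_proba(X_te)[:, 1]
prec, rec, thr = precision_recall_curve(y_te, proba)
f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
best = thr[np.argmax(f1[:-1])]  # f1[:-1] aligns with the thresholds array

f1_default = f1_score(y_te, proba >= 0.5)
f1_tuned = f1_score(y_te, proba >= best)
print(f"F1 at 0.5: {f1_default:.3f}, F1 at {best:.2f}: {f1_tuned:.3f}")
```

Because the tuned cutoff is chosen to maximize F1 over all candidate thresholds, it can never score below the default 0.5 cutoff on the same data, which is why this step helps so much on imbalanced problems.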
- Configuration: for computational efficiency, the original Random Forest model (100 trees, max depth of 10) was optimized to a lighter version (50 trees, max depth of 8, increased min_samples_split, parallelized processing)
- ROC-AUC Score: 0.979
- Threshold Optimization: improved F1 score from 0.5784 to 0.8021 (a 38.67% improvement)
- SHAP Analysis Enhancements: unlike the aggregate feature importances of the base Random Forest, SHAP provides per-prediction feature contribution values, revealing how individual features such as V14, V10, and V4 influence each decision and making the fraud detection system more transparent and interpretable
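The lighter configuration can be instantiated as below; `min_samples_split=10` is an assumed value, since the text only states that it was increased. The per-prediction explanations come from the separate `shap` package (via `shap.TreeExplainer`), which is not shown here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Lighter configuration: 50 trees, max depth 8, n_jobs=-1 parallelizes
# tree construction across cores. min_samples_split=10 is an assumption.
light_rf = RandomForestClassifier(n_estimators=50, max_depth=8,
                                  min_samples_split=10,
                                  class_weight="balanced",
                                  n_jobs=-1, random_state=0)

# Hypothetical imbalanced data in place of the preprocessed dataset.
X, y = make_classification(n_samples=1000, n_features=29, weights=[0.95],
                           random_state=0)
light_rf.fit(X, y)

# The base model only exposes aggregate importances; SHAP would add
# per-prediction contribution values on top of this.
print(light_rf.feature_importances_.argsort()[::-1][:3])  # top-3 features
```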
- Architecture: three hidden layers (64, 32, 16 neurons) with batch normalization
- Activation: ReLU for hidden layers, sigmoid for output
- Optimization: Adam with a learning rate of 0.001
- Dropout layers (0.3) to prevent overfitting
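The architecture above translates to a Keras model roughly as follows. The ordering of batch normalization and dropout around each dense layer is an assumption, as the text does not specify it:

```python
import tensorflow as tf

def build_model(n_features: int = 29) -> tf.keras.Model:
    """Three hidden layers (64, 32, 16), BatchNorm, Dropout 0.3, sigmoid out."""
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(n_features,))])
    for units in (64, 32, 16):
        model.add(tf.keras.layers.Dense(units, activation="relu"))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.Dropout(0.3))  # regularization against overfit
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model

model = build_model()
model.summary()
```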
- Precision, Recall, and F1-Score
- ROC Curve and AUC
- Confusion Matrix Analysis
- Probability Distribution Analysis
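The metric suite above can be computed with scikit-learn; here is a small illustration on hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix)

# Hypothetical ground-truth labels and model scores for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.7, 0.3, 0.9, 0.8, 0.4, 0.35])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 decision threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("roc_auc:  ", roc_auc_score(y_true, y_prob))  # uses scores, not labels
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```

Note that ROC-AUC is threshold-free (it ranks the raw scores), while precision, recall, F1, and the confusion matrix all depend on the chosen cutoff, which is why threshold optimization changes them.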
| Model | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|
| Random Forest | 0.96 | 0.83 | 0.89 | 0.97 |
| Neural Network | 0.89 | 0.91 | 0.90 | 0.96 |
- Random Forest: Better interpretability with clearer feature importance rankings
- Neural Network: Superior at capturing complex fraud patterns, with higher recall (fewer missed fraud cases)
- Threshold Optimization: Critical for improving model performance in imbalanced datasets
- Model Performance Enhancement: Neural Network achieved high recall (0.91) and precision (0.89)
- SMOTE's Role: Transformed dataset from 0.17% fraudulent transactions to a balanced distribution
- Comparative Insights: Random Forest provided better interpretability, while the Neural Network's higher recall meant fewer missed fraud cases
- Implement real-time fraud detection capability
- Explore ensemble methods for improved performance
- Develop an adaptive learning approach for evolving fraud patterns
- Create a web-based dashboard for monitoring fraud detection metrics
- Implement XAI techniques (SHAP, LIME) for better model interpretability
```
credit-card-fraud-detection/
├── code/                                    # Source code files
│   ├── models/                              # Model implementations
│   │   ├── part4-Random forest.ipynb        # Random Forest model
│   │   ├── part4b-Random forest(SHAP).ipynb # Random Forest with SHAP analysis
│   │   └── part5-Neural network.ipynb       # Neural Network model
│   └── preprocessing/                       # Data preprocessing pipeline
│       ├── part1_credit_card_fraud_preprocessing.ipynb # Feature scaling & initial preprocessing
│       ├── part2_credit_card_fraud_smote.ipynb         # SMOTE implementation
│       └── part3-outlier_detection_and_models.ipynb    # Outlier detection
├── image/                                   # Visualization images
│   ├── SMOTE/                               # SMOTE-related visualizations
│   ├── neural network/                      # Neural network results & visualizations
│   ├── outliter detection/                  # Outlier detection visualizations
│   ├── preprocessed image/                  # Data preprocessing visualizations
│   └── random foreast/                      # Random Forest results & visualizations
├── tableau/                                 # Tableau files and visualizations
│   └── tableau dashboard.png                # Dashboard visualization
├── README.md                                # Project documentation
├── banner.png                               # Project banner image
└── workflow.pdf                             # Workflow diagram
```
```
numpy==1.26.4
pandas==2.2.1
scikit-learn==1.4.2
tensorflow==2.15.0
matplotlib==3.8.3
seaborn==0.13.2
imbalanced-learn==0.12.0
```
```bash
# Clone the repository
git clone https://github.com/thisissophiawang/credit-card-fraud-detection.git
cd credit-card-fraud-detection

# Install dependencies
pip install -r requirements.txt

# Run the preprocessing pipeline
python src/preprocessing.py

# Train and evaluate models
python src/train_models.py
```
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
- Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
- Dal Pozzolo, A., et al. (2015). Calibrating Probability with Undersampling for Unbalanced Classification. IEEE Symposium Series on Computational Intelligence and Data Mining.
- Fernández, A., et al. (2018). Learning from Imbalanced Data Sets. Springer.