Credit Card Fraud Detection: A Comparative Study of Resampling and Modeling Strategies
An end-to-end machine learning pipeline for detecting fraudulent credit card transactions, leveraging advanced resampling techniques and ensemble modeling to tackle class imbalance and improve predictive performance.
This project addresses credit card fraud detection using supervised and unsupervised machine learning on a highly imbalanced dataset (~0.17% fraud). It builds a complete ML pipeline covering preprocessing, exploratory analysis, feature engineering, and advanced resampling.
To tackle class imbalance and improve performance, we applied techniques like random undersampling, SMOTE, and a custom KMeans-based method, paired with models such as logistic regression, random forest, KNN, neural networks, and an ensemble voting classifier.
The best-performing model was selected through threshold tuning and F1-score optimization, followed by targeted feature engineering to further boost performance. The pipeline is modular, reproducible, and CLI-configurable, making it ready for real-world deployment.
- Total Transactions: 284,807
- Fraudulent Cases: 492
- Features:
  - V1 to V28 (PCA-transformed)
  - Amount, Time
  - Class (target: 1 = fraud, 0 = legitimate)
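For orientation, the class imbalance is easy to confirm from the raw CSV. A minimal sketch, assuming the standard Kaggle `creditcard.csv` sits under a local `data/` directory (the path is an assumption):

```python
import pandas as pd

# Path is an assumption; point it at your local copy of the Kaggle dataset.
df = pd.read_csv("data/creditcard.csv")

print(f"Transactions: {len(df):,}")              # expected: 284,807
print(f"Fraud cases:  {df['Class'].sum():,}")    # expected: 492
print(f"Fraud ratio:  {df['Class'].mean():.3%}") # roughly 0.17%
```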
- Implemented six strategies to address class imbalance:
  - Baseline (cost-sensitive)
  - Random Undersampling
  - NearMiss
  - Custom KMeans-based undersampling
  - Random Oversampling
  - SMOTE
- Trained and compared five machine learning models:
  - Logistic Regression
  - Random Forest
  - K-Nearest Neighbors
  - Neural Network (MLPClassifier)
  - Ensemble Voting Classifier combining top performers
- Custom KMeans-based undersampler that condenses the majority class into a representative subset while preserving its structure (see the sketch after this list)
- Feature engineering to enhance model performance, supported by feature importance analysis to identify influential variables
- Applied threshold tuning based on F1-score to better balance precision and recall
- Modular, CLI-driven architecture for reproducible and configurable experimentation
- Professional final report and cleanly structured Jupyter notebooks for analysis, experimentation, and reproducibility
- Designed for easy extensibility to test new resampling or modeling approaches
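The custom undersampler lives in `scripts/kmeans_undersampler.py`. Its exact code is not reproduced here, but the general idea can be sketched as follows: cluster the majority class with KMeans and keep one representative sample per cluster, so the reduced set still covers the majority class's structure. Names and details below are illustrative, not the project's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_undersample(X, y, n_clusters=500, random_state=42):
    """Reduce the majority class (label 0) to one representative
    sample per KMeans cluster; keep every minority sample (label 1)."""
    X, y = np.asarray(X), np.asarray(y)
    majority, minority = X[y == 0], X[y == 1]

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(majority)

    # For each cluster, keep the real transaction closest to the centroid,
    # so the reduced set contains genuine samples rather than synthetic means.
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        dists = np.linalg.norm(majority[idx] - km.cluster_centers_[c], axis=1)
        keep.append(idx[dists.argmin()])

    X_res = np.vstack([majority[keep], minority])
    y_res = np.concatenate([np.zeros(len(keep)), np.ones(len(minority))])
    return X_res, y_res
```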
Model | F1 Score | Precision | Recall | PR-AUC |
---|---|---|---|---|
Neural Network (threshold=0.606) | 85.25% | 90.70% | 80.41% | 85.70% |
Random Forest (threshold=0.364) | 84.69% | 83.84% | 85.57% | 85.01% |
K-Nearest Neighbors (threshold=0.414) | 87.05% | 87.50% | 86.60% | 82.81% |
Voting Classifier (threshold=0.283) | 86.73% | 85.86% | 87.63% | 87.73% |
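The per-model thresholds in the table were obtained by maximizing F1 on held-out data. A minimal sketch of that kind of search using scikit-learn's `precision_recall_curve` (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return the decision threshold maximizing F1 on held-out data."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    best = f1[:-1].argmax()  # the final PR point has no threshold attached
    return thresholds[best], f1[best]

# e.g. thr, f1 = best_f1_threshold(y_val, model.predict_proba(X_val)[:, 1])
```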
The report includes:
- Motivation and problem framing
- Exploratory data analysis insights and class distribution
- Feature engineering strategies
- In-depth resampling method comparison
- Model evaluation and threshold tuning
- Metric interpretation and real-world considerations
```
Credit-Card-Fraud-Detection/
├── notebooks/
│   └── credit_fraud.ipynb            # Exploratory Data Analysis (EDA) notebook
│
├── summary/
│   └── Project_Report.pdf            # Final project report document
│
├── scripts/
│   ├── credit_fraud_train.py         # Script for training models
│   ├── credit_fraud_test.py          # Script for testing models
│   │
│   ├── credit_fraud_utils_data.py    # Data loading, cleaning, and splitting functions
│   ├── credit_fraud_utils_eval.py    # Evaluation metric functions
│   ├── kmeans_undersampler.py        # Custom KMeans undersampling implementation
│   └── cli_args.py                   # Command-line argument parser
│
├── saved_models/                     # Folder containing all saved models
│   ├── knn_final_model.pkl
│   ├── nn_final_model.pkl
│   ├── rf_feature_importance_model.pkl
│   ├── rf_final_model.pkl
│   └── vc_final_model.pkl
│
├── README.md                         # Project overview and usage instructions
└── requirements.txt                  # Python package dependencies list
```
Clone the repository:

```bash
git clone https://github.com/your-username/Credit-Card-Fraud-Detection.git
cd Credit-Card-Fraud-Detection
```

Install dependencies:

```bash
pip install -r requirements.txt
```
Python 3.9 or later is required for full compatibility.
This project supports flexible configuration via command-line arguments using `argparse`. Below is a breakdown of the available CLI options by category.
Argument | Description |
---|---|
`--train-dir` | Path to the training dataset CSV (default: `../data/split/trainval.csv`) |
`--test-dir` | Path to the test dataset CSV (default: `../data/split/test.csv`) |
`--save-dir` | Directory to save trained model artifacts (default: `../saved_models/`) |
`--load-dir` | Path to load a saved model for evaluation (default: `../saved_models/final_model.pkl`) |
`--model-name` | Name used when saving the model file (e.g., `model_v1.pkl`) |
Argument | Description |
---|---|
`--scaling` | Feature scaling method: `standard`, `minmax`, `robust`, or `raw` |
`--balance` | Class balancing method: `none`, `undersample`, `nearmiss`, `kmeans`, `cluster_centroids`, `oversample`, `smote` |
`--n-neighbors` | Number of neighbors for SMOTE or NearMiss (default: `5`) |
`--nearmiss-version` | Version of NearMiss to use (`1`, `2`, or `3`) |
`--n-clusters` | Number of clusters for KMeans undersampling (default: `500`) |
`--use-pca` | Flag to apply PCA before training (default: `False`) |
`--n-components` | Number of PCA components to retain (default: `18`) |
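The balancing options line up with samplers from `imbalanced-learn`, so the dispatch presumably looks something like the sketch below; the exact mapping in `credit_fraud_utils_data.py` may differ, and the helper function here is hypothetical:

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import ClusterCentroids, NearMiss, RandomUnderSampler

def make_sampler(balance, n_neighbors=5, nearmiss_version=1, seed=42):
    """Map a --balance value to an imbalanced-learn sampler (assumed mapping)."""
    if balance == "undersample":
        return RandomUnderSampler(random_state=seed)
    if balance == "nearmiss":
        return NearMiss(version=nearmiss_version, n_neighbors=n_neighbors)
    if balance == "cluster_centroids":
        return ClusterCentroids(random_state=seed)
    if balance == "oversample":
        return RandomOverSampler(random_state=seed)
    if balance == "smote":
        return SMOTE(k_neighbors=n_neighbors, random_state=seed)
    return None  # "none" keeps the data as-is; "kmeans" uses the custom undersampler

# Usage: X_res, y_res = make_sampler("smote").fit_resample(X_train, y_train)
```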
Argument | Description |
---|---|
`--model` | Model to train: `LR` (Logistic Regression), `NN` (MLP), `RF`, `VC`, or `KNN` |
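For the `VC` option, the ensemble combines the top performers (KNN, random forest, and the MLP). A minimal sketch of how such an ensemble could be assembled; soft voting is an assumption, inferred from the fact that the voting classifier's threshold was tuned on predicted probabilities, and the member settings simply reuse the defaults documented below:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Soft voting averages predicted probabilities, which keeps a continuous
# fraud score available for threshold tuning on the ensemble itself.
vc = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("nn", MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=3000,
                             random_state=42)),
    ],
    voting="soft",
)
```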
Argument | Description |
---|---|
`--knn-n-neighbors` | Number of neighbors to use (default: `7`) |
`--knn-weights` | Weight strategy: `uniform` or `distance` |
Argument | Description |
---|---|
`--lr-c` | Inverse of regularization strength (default: `1`) |
`--lr-max-iter` | Maximum iterations (default: `10000`) |
Argument | Description |
---|---|
`--rf-n-estimators` | Number of trees (default: `100`) |
`--rf-max-depth` | Maximum tree depth (default: `None`) |
Argument | Description |
---|---|
`--nn-hidden-layers` | Comma-separated hidden layer sizes (e.g., `128,64`) |
`--nn-max-iter` | Maximum training iterations (default: `3000`) |
`--nn-activation` | Activation function: `relu`, `tanh`, or `logistic` |
`--nn-lr` | Learning rate (default: `0.001`) |
`--nn-alpha` | L2 regularization parameter (default: `0.001`) |
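These flags map naturally onto scikit-learn's `MLPClassifier` parameters. A sketch of how the defaults above would translate (the training script's actual construction may differ):

```python
from sklearn.neural_network import MLPClassifier

# --nn-hidden-layers 128,64 -> hidden_layer_sizes=(128, 64)
clf = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    activation="relu",         # --nn-activation
    learning_rate_init=0.001,  # --nn-lr
    alpha=0.001,               # --nn-alpha (L2 penalty)
    max_iter=3000,             # --nn-max-iter
    random_state=42,
)
```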
Argument | Description |
---|---|
`--grid-search` | Enable GridSearchCV for hyperparameter tuning |
`--cost-sensitive` | Use cost-sensitive learning via class weighting |
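When `--grid-search` is enabled, tuning presumably runs scikit-learn's `GridSearchCV`; a minimal, illustrative sketch for the random forest, where the parameter grid is an assumption and `class_weight="balanced"` stands in for the `--cost-sensitive` option:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # F1 is the project's model-selection metric
    cv=5,
    n_jobs=-1,
)
# search.fit(X_train, y_train); print(search.best_params_)
```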
Train a neural network with SMOTE and PCA:

```bash
python scripts/credit_fraud_train.py \
    --train-dir data/split/trainval.csv \
    --model NN \
    --balance smote \
    --scaling standard \
    --use-pca \
    --model-name smote_nn_pca
```
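Evaluating a saved model presumably follows the same pattern via `credit_fraud_test.py` with the path flags documented above; a hypothetical invocation (the model filename is illustrative):

```bash
python scripts/credit_fraud_test.py \
    --test-dir data/split/test.csv \
    --load-dir saved_models/smote_nn_pca.pkl
```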
- Python 3.9+
- scikit-learn
- imbalanced-learn
- pandas, numpy
- matplotlib, seaborn
- argparse
- t-SNE (`sklearn.manifold`)
- Jupyter Notebook
Author: Taher Alabbar
Email: t.alabbar.ca@gmail.com
LinkedIn