Credit Card Fraud Detection: A Comparative Study of Resampling and Modeling Strategies
An end-to-end machine learning pipeline for detecting fraudulent credit card transactions, leveraging advanced resampling techniques and ensemble modeling to tackle class imbalance and improve predictive performance.
This project addresses credit card fraud detection using supervised and unsupervised machine learning on a highly imbalanced dataset (~0.17% fraud). It builds a complete ML pipeline covering preprocessing, exploratory analysis, feature engineering, and advanced resampling.
To tackle class imbalance and improve performance, we applied techniques like random undersampling, SMOTE, and a custom KMeans-based method, paired with models such as logistic regression, random forest, KNN, neural networks, and an ensemble voting classifier.
The best-performing model was selected through threshold tuning and F1-score optimization, followed by targeted feature engineering to further boost performance. The pipeline is modular, reproducible, and CLI-configurable, making it ready for real-world deployment.
- Total Transactions: 284,807
- Fraudulent Cases: 492
- Features:
  - V1 to V28 (PCA-transformed)
  - Amount, Time
  - Class (target: 1 = fraud, 0 = legitimate)
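For orientation, the class imbalance is easy to confirm from the raw CSV. A minimal sketch, assuming the standard Kaggle `creditcard.csv` sits under a local `data/` directory (the path is an assumption):

```python
import pandas as pd

# Path is an assumption; point it at your local copy of the Kaggle dataset.
df = pd.read_csv("data/creditcard.csv")

print(f"Transactions: {len(df):,}")              # expected: 284,807
print(f"Fraud cases:  {df['Class'].sum():,}")    # expected: 492
print(f"Fraud ratio:  {df['Class'].mean():.3%}") # roughly 0.17%
```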
- Implemented six strategies to address class imbalance:
  - Baseline (cost-sensitive)
  - Random Undersampling
  - NearMiss
  - Custom KMeans-based undersampling
  - Random Oversampling
  - SMOTE
- Trained and compared five machine learning models:
  - Logistic Regression
  - Random Forest
  - K-Nearest Neighbors
  - Neural Network (MLPClassifier)
  - Ensemble Voting Classifier combining top performers
- Custom KMeans-based undersampler that condenses the majority class into a representative subset while preserving its structure (see the sketch after this list)
- Feature engineering to enhance model performance, supported by feature importance analysis to identify influential variables
- Applied threshold tuning based on F1-score to better balance precision and recall
- Modular, CLI-driven architecture for reproducible and configurable experimentation
- Professional final report and cleanly structured Jupyter notebooks for analysis, experimentation, and reproducibility
- Designed for easy extensibility to test new resampling or modeling approaches
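The custom undersampler lives in `scripts/kmeans_undersampler.py`. Its exact code is not reproduced here, but the general idea can be sketched as follows: cluster the majority class with KMeans and keep one representative sample per cluster, so the reduced set still covers the majority class's structure. Names and details below are illustrative, not the project's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_undersample(X, y, n_clusters=500, random_state=42):
    """Reduce the majority class (label 0) to one representative
    sample per KMeans cluster; keep every minority sample (label 1)."""
    X, y = np.asarray(X), np.asarray(y)
    majority, minority = X[y == 0], X[y == 1]

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(majority)

    # For each cluster, keep the real transaction closest to the centroid,
    # so the reduced set contains genuine samples rather than synthetic means.
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        dists = np.linalg.norm(majority[idx] - km.cluster_centers_[c], axis=1)
        keep.append(idx[dists.argmin()])

    X_res = np.vstack([majority[keep], minority])
    y_res = np.concatenate([np.zeros(len(keep)), np.ones(len(minority))])
    return X_res, y_res
```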
Model | F1 Score | Precision | Recall | PR-AUC |
---|---|---|---|---|
Neural Network (threshold=0.606) | 85.25% | 90.70% | 80.41% | 85.70% |
Random Forest (threshold=0.364) | 84.69% | 83.84% | 85.57% | 85.01% |
K-Nearest Neighbors (threshold=0.414) | 87.05% | 87.50% | 86.60% | 82.81% |
Voting Classifier (threshold=0.283) | 86.73% | 85.86% | 87.63% | 87.73% |
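The per-model thresholds in the table were obtained by maximizing F1 on held-out data. A minimal sketch of that kind of search using scikit-learn's `precision_recall_curve` (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return the decision threshold maximizing F1 on held-out data."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    best = f1[:-1].argmax()  # the final PR point has no threshold attached
    return thresholds[best], f1[best]

# e.g. thr, f1 = best_f1_threshold(y_val, model.predict_proba(X_val)[:, 1])
```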
The report includes:
- Motivation and problem framing
- Exploratory data analysis insights and class distribution
- Feature engineering strategies
- In-depth resampling method comparison
- Model evaluation and threshold tuning
- Metric interpretation and real-world considerations
```
Credit-Card-Fraud-Detection/
├── notebooks/
│   └── credit_fraud.ipynb            # Exploratory Data Analysis (EDA) notebook
│
├── summary/
│   └── Project_Report.pdf            # Final project report document
│
├── scripts/
│   ├── credit_fraud_train.py         # Script for training models
│   ├── credit_fraud_test.py          # Script for testing models
│   │
│   ├── credit_fraud_utils_data.py    # Data loading, cleaning, and splitting functions
│   ├── credit_fraud_utils_eval.py    # Evaluation metric functions
│   ├── kmeans_undersampler.py        # Custom KMeans undersampling implementation
│   └── cli_args.py                   # Command-line argument parser
│
├── saved_models/                     # Folder containing all saved models
│   ├── knn_final_model.pkl
│   ├── nn_final_model.pkl
│   ├── rf_feature_importance_model.pkl
│   ├── rf_final_model.pkl
│   └── vc_final_model.pkl
│
├── README.md                         # Project overview and usage instructions
└── requirements.txt                  # Python package dependencies list
```
Clone the repository:

```bash
git clone https://github.com/your-username/Credit-Card-Fraud-Detection.git
cd Credit-Card-Fraud-Detection
```

Install dependencies:

```bash
pip install -r requirements.txt
```
Python 3.9 or later is required for full compatibility.
This project supports flexible configuration via command-line arguments using `argparse`. Below is a breakdown of the available CLI options by category.
Argument | Description |
---|---|
`--train-dir` | Path to the training dataset CSV (default: `../data/split/trainval.csv`) |
`--test-dir` | Path to the test dataset CSV (default: `../data/split/test.csv`) |
`--save-dir` | Directory to save trained model artifacts (default: `../saved_models/`) |
`--load-dir` | Path to load a saved model for evaluation (default: `../saved_models/final_model.pkl`) |
`--model-name` | Name used when saving the model file (e.g., `model_v1.pkl`) |
Argument | Description |
---|---|
`--scaling` | Feature scaling method: `standard`, `minmax`, `robust`, or `raw` |
`--balance` | Class balancing method: `none`, `undersample`, `nearmiss`, `kmeans`, `cluster_centroids`, `oversample`, `smote` |
`--n-neighbors` | Number of neighbors for SMOTE or NearMiss (default: `5`) |
`--nearmiss-version` | Version of NearMiss to use (`1`, `2`, or `3`) |
`--n-clusters` | Number of clusters for KMeans undersampling (default: `500`) |
`--use-pca` | Flag to apply PCA before training (default: `False`) |
`--n-components` | Number of PCA components to retain (default: `18`) |
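The balancing options line up with samplers from `imbalanced-learn`, so the dispatch presumably looks something like the sketch below; the exact mapping in `credit_fraud_utils_data.py` may differ, and the helper function here is hypothetical:

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import ClusterCentroids, NearMiss, RandomUnderSampler

def make_sampler(balance, n_neighbors=5, nearmiss_version=1, seed=42):
    """Map a --balance value to an imbalanced-learn sampler (assumed mapping)."""
    if balance == "undersample":
        return RandomUnderSampler(random_state=seed)
    if balance == "nearmiss":
        return NearMiss(version=nearmiss_version, n_neighbors=n_neighbors)
    if balance == "cluster_centroids":
        return ClusterCentroids(random_state=seed)
    if balance == "oversample":
        return RandomOverSampler(random_state=seed)
    if balance == "smote":
        return SMOTE(k_neighbors=n_neighbors, random_state=seed)
    return None  # "none" keeps the data as-is; "kmeans" uses the custom undersampler

# Usage: X_res, y_res = make_sampler("smote").fit_resample(X_train, y_train)
```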
Argument | Description |
---|---|
`--model` | Model to train: `LR` (Logistic Regression), `NN` (MLP), `RF`, `VC`, or `KNN` |
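For the `VC` option, the ensemble combines the top performers (KNN, random forest, and the MLP). A minimal sketch of how such an ensemble could be assembled; soft voting is an assumption, inferred from the fact that the voting classifier's threshold was tuned on predicted probabilities, and the member settings simply reuse the defaults documented below:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Soft voting averages predicted probabilities, which keeps a continuous
# fraud score available for threshold tuning on the ensemble itself.
vc = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("nn", MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=3000,
                             random_state=42)),
    ],
    voting="soft",
)
```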
Argument | Description |
---|---|
`--knn-n-neighbors` | Number of neighbors to use (default: `7`) |
`--knn-weights` | Weight strategy: `uniform` or `distance` |
Argument | Description |
---|---|
`--lr-c` | Inverse of regularization strength (default: `1`) |
`--lr-max-iter` | Maximum iterations (default: `10000`) |
Argument | Description |
---|---|
`--rf-n-estimators` | Number of trees (default: `100`) |
`--rf-max-depth` | Maximum tree depth (default: `None`) |
Argument | Description |
---|---|
`--nn-hidden-layers` | Comma-separated hidden layer sizes (e.g., `128,64`) |
`--nn-max-iter` | Maximum training iterations (default: `3000`) |
`--nn-activation` | Activation function: `relu`, `tanh`, or `logistic` |
`--nn-lr` | Learning rate (default: `0.001`) |
`--nn-alpha` | L2 regularization parameter (default: `0.001`) |
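These flags map naturally onto scikit-learn's `MLPClassifier` parameters. A sketch of how the defaults above would translate (the training script's actual construction may differ):

```python
from sklearn.neural_network import MLPClassifier

# --nn-hidden-layers 128,64 -> hidden_layer_sizes=(128, 64)
clf = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    activation="relu",         # --nn-activation
    learning_rate_init=0.001,  # --nn-lr
    alpha=0.001,               # --nn-alpha (L2 penalty)
    max_iter=3000,             # --nn-max-iter
    random_state=42,
)
```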
Argument | Description |
---|---|
`--grid-search` | Enable GridSearchCV for hyperparameter tuning |
`--cost-sensitive` | Use cost-sensitive learning via class weighting |
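When `--grid-search` is enabled, tuning presumably runs scikit-learn's `GridSearchCV`; a minimal, illustrative sketch for the random forest, where the parameter grid is an assumption and `class_weight="balanced"` stands in for the `--cost-sensitive` option:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # F1 is the project's model-selection metric
    cv=5,
    n_jobs=-1,
)
# search.fit(X_train, y_train); print(search.best_params_)
```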
Train a neural network with SMOTE and PCA:

```bash
python scripts/credit_fraud_train.py \
    --train-dir data/split/trainval.csv \
    --model NN \
    --balance smote \
    --scaling standard \
    --use-pca \
    --model-name smote_nn_pca
```
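Evaluating a saved model presumably follows the same pattern via `credit_fraud_test.py` with the path flags documented above; a hypothetical invocation (the model filename is illustrative):

```bash
python scripts/credit_fraud_test.py \
    --test-dir data/split/test.csv \
    --load-dir saved_models/smote_nn_pca.pkl
```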
- Python 3.9+
- scikit-learn
- imbalanced-learn
- pandas, numpy
- matplotlib, seaborn
- argparse
- t-SNE (`sklearn.manifold`)
- Jupyter Notebook
Author: Taher Alabbar
Email: t.alabbar.ca@gmail.com
LinkedIn