This project implements a machine learning system for detecting fraudulent credit card transactions. A Logistic Regression model is trained and
evaluated on the highly imbalanced creditcard.csv dataset from Kaggle to classify each transaction as legitimate or fraudulent.
- Source: Kaggle - Credit Card Fraud Detection
- Size: 284,807 transactions
- Fraud cases: 492 (≈0.17% of total)
Note: The dataset is highly imbalanced and contains anonymized features (V1–V28) plus Amount and Time.
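A quick way to confirm the imbalance after loading the data is to count rows per class. The sketch below is a minimal example, assuming the label column is named `Class` as in the Kaggle CSV (the column name is not stated in this README); `class_balance` and the small `demo` frame are hypothetical stand-ins so the snippet runs without the dataset.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, label_col: str = "Class") -> pd.DataFrame:
    """Return per-class row counts and percentages for a labelled DataFrame."""
    counts = df[label_col].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "percent": 100 * counts / len(df)})


# Tiny stand-in frame (assumption). With the real data, load it instead:
#   df = pd.read_csv("creditcard.csv")
demo = pd.DataFrame(
    {"Amount": [10.0, 25.5, 3.2, 99.0, 7.8], "Class": [0, 0, 0, 0, 1]}
)

balance = class_balance(demo)
print(balance)
```

On the full dataset the fraud row should come out at roughly 0.17%, matching the figures listed above.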
- Logistic Regression
A supervised learning algorithm used for binary classification (fraud vs. non-fraud).
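To make the classification step concrete, the sketch below shows how a fitted logistic regression turns features into a fraud probability and then a label: a weighted sum passed through the sigmoid function and compared with a threshold. The weights, bias, inputs, and the 0.5 threshold are made-up values for illustration, not parameters from this project.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_fraud(X, weights, bias, threshold=0.5):
    """Return fraud probabilities and 0/1 labels for each row of X."""
    probs = sigmoid(X @ weights + bias)
    return probs, (probs >= threshold).astype(int)


# Hypothetical parameters and inputs, chosen only for illustration.
weights = np.array([1.5, -2.0])
bias = -1.0
X = np.array([[0.0, 0.0], [3.0, 0.5]])

probs, labels = predict_fraud(X, weights, bias)
print(probs, labels)
```

Lowering the threshold would flag more transactions as fraud, which typically raises recall at the cost of precision.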
- Python 3.x
- Pandas – for data manipulation
- NumPy – for numerical operations
- Scikit-learn – model building, evaluation
- Matplotlib / Seaborn – visualization
- Imbalanced-learn (optional) – for resampling techniques like SMOTE
Because the classes are so imbalanced, accuracy alone is not a sufficient measure of performance. The following metrics are also used:
- Confusion Matrix
- Precision / Recall
- F1-Score
- ROC curve and ROC-AUC
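The sketch below illustrates why accuracy misleads here and how the metrics above can be computed with scikit-learn. The first part uses the class counts from the dataset summary with a trivial "always legitimate" predictor; the second part uses a ten-row toy example whose labels and scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Part 1: class counts from the dataset summary (284,807 rows, 492 frauds).
n_total, n_fraud = 284_807, 492
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A trivial "model" that labels every transaction as legitimate.
y_pred = np.zeros(n_total, dtype=int)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.4f}, recall={baseline_recall:.1f}")

# Part 2: toy labels and scores (invented values) to show each metric call.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score_toy = np.array([0.05, 0.10, 0.20, 0.30, 0.60, 0.40, 0.35, 0.70, 0.80, 0.90])
y_pred_toy = (y_score_toy >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
precision = precision_score(y_true_toy, y_pred_toy)
recall = recall_score(y_true_toy, y_pred_toy)
f1 = f1_score(y_true_toy, y_pred_toy)
auc = roc_auc_score(y_true_toy, y_score_toy)  # uses scores, not hard labels
print(tn, fp, fn, tp, precision, recall, f1, auc)
```

The baseline reaches about 99.8% accuracy while catching no fraud at all, which is why recall, precision, and F1 are reported alongside it.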
CreditCardFraudDetection/
│
├── creditcard.csv          # Dataset (download from Kaggle)
├── fraud_detection.ipynb   # Jupyter Notebook
└── README.md               # Project documentation
- Data preprocessing and normalization
- Handling imbalance (e.g., SMOTE or undersampling)
- Splitting into training and test sets
- Training a Logistic Regression model
- Evaluating with appropriate metrics
- Visualizing performance using ROC and confusion matrix
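The steps above can be sketched end to end as follows. This is a minimal illustration rather than the notebook's actual code, and it makes several assumptions: synthetic data from `make_classification` stands in for creditcard.csv, `class_weight="balanced"` is swapped in for SMOTE or undersampling so the example needs only scikit-learn, and the split is done before any imbalance handling so the test set keeps the original class distribution. Plotting is left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=5,
    weights=[0.98, 0.02],
    random_state=0,
)

# Stratified split so both sets keep the same fraud ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and the classifier are fit on the training data only.
# class_weight="balanced" reweights the rare class instead of resampling it.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the fraud class

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)
print(cm)
print(classification_report(y_test, y_pred, digits=3))
print(f"ROC-AUC: {auc:.3f}")
```

If a resampler such as SMOTE is used instead, it should likewise be fit on the training split only, since resampled rows in the test set would inflate the reported scores.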
| | Predicted No Fraud | Predicted Fraud |
|---|---|---|
| Actual No Fraud | 56,000+ | 30 |
| Actual Fraud | 40 | 400+ |
Note: Your results may vary based on train/test split and resampling.
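As a rough reading of the example matrix, precision, recall, and F1 can be derived directly from its cells. The sketch below assumes the "400+" and "56,000+" entries are exactly 400 and 56,000, which is a simplification for the arithmetic, so the resulting figures are approximate.

```python
# Cell values from the example confusion matrix; the "+" entries are
# rounded down to 400 and 56,000 (assumption made for this calculation).
tn, fp = 56_000, 30
fn, tp = 40, 400

precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of actual frauds that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.4f}")
```

Under these assumptions precision is about 0.93 and recall about 0.91, while accuracy is above 0.99, which shows again how much accuracy hides on imbalanced data.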