In this repository, I employ multiple machine learning models to detect fraudulent financial transactions and then use the best model to construct a risk score for each transaction. The notebook covers data preprocessing, feature engineering, model training with class imbalance handling, and risk scoring.
- Multiple Model Evaluation: Tests Logistic Regression, Random Forest, and Deep Neural Networks
- Class Imbalance Handling: Implements SMOTE (Synthetic Minority Over-sampling Technique)
- Comprehensive Feature Engineering: Creates meaningful transaction features
- Risk Scoring: Generates interpretable risk categories for transactions
- Detailed Evaluation: Provides multiple performance metrics and visualizations
For the analysis, I use the Online Payments Fraud Detection Dataset from Kaggle, which contains 6,362,620 financial transaction records with fraud labels. The target class `isFraud` is heavily imbalanced, with only 0.13% of transactions labeled as fraudulent, so I use SMOTE (Synthetic Minority Over-sampling Technique) to balance the class distribution.
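As a rough illustration, the oversampling step might look like the sketch below, using `imblearn`'s `SMOTE`. The file path, dropped ID columns, and the one-hot encoding step are assumptions for this example (SMOTE requires purely numeric inputs), not necessarily the exact code in the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Assumed file location and column names for the Kaggle dataset
df = pd.read_csv("data/onlinefraud.csv")
X = pd.get_dummies(df.drop(columns=["isFraud", "nameOrig", "nameDest"]),
                   columns=["type"])          # SMOTE needs numeric features
y = df["isFraud"]

# Stratified split keeps the 0.13% fraud rate in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training split; the test set stays untouched
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(pd.Series(y_res).value_counts())        # classes are now balanced
```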
- Clone this repository
- Install required packages: `pip install -r requirements.txt`
- Place your transaction data in the `data/` directory
- Run the script: `python fraud_risk_scoring.py`
- Data Preprocessing:
  - Feature engineering (time-based features, transaction ratios, etc.)
  - One-hot encoding of categorical variables
  - Robust scaling of numerical features
- Model Training:
  - Logistic Regression
  - Random Forest
  - Deep Neural Network
- Evaluation Metrics:
  - Accuracy, Precision, Recall, F1 score, and ROC AUC
  - Confusion matrices (a sketch of this end-to-end flow follows this list)
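A minimal sketch of this flow using scikit-learn only; the Kaggle column names, the dropped ID columns, and the hyperparameters are assumptions, and the deep neural network and SMOTE steps are omitted here for brevity:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

df = pd.read_csv("data/onlinefraud.csv")               # assumed file location
X = df.drop(columns=["isFraud", "nameOrig", "nameDest"])
y = df["isFraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One-hot encode the categorical column, robust-scale the numeric ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),
    ("num", RobustScaler(), ["step", "amount", "oldbalanceOrg", "newbalanceOrig",
                             "oldbalanceDest", "newbalanceDest"]),
])

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200,
                                            class_weight="balanced_subsample"),
}

for name, clf in models.items():
    pipe = Pipeline([("prep", preprocess), ("clf", clf)])
    pipe.fit(X_train, y_train)
    proba = pipe.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)
    print(name,
          f"acc={accuracy_score(y_test, pred):.4f}",
          f"prec={precision_score(y_test, pred):.4f}",
          f"rec={recall_score(y_test, pred):.4f}",
          f"f1={f1_score(y_test, pred):.4f}",
          f"auc={roc_auc_score(y_test, proba):.4f}")
    print(confusion_matrix(y_test, pred))
```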
- Dataset shows extreme class imbalance (0.13% fraud)
- Random Forest with SMOTE achieved the highest recall (99.67%)
- `fraud_scoring_models.pkl`: Saved best model (Random Forest; see the loading sketch after this list)
- `model_metadata.json`: Training metadata and performance metrics
- `scored_transactions.csv`: All transactions with risk scores
1. `Top_features.pdf`: Visualization of feature importances
2. `Risk_score_comparison.pdf`: Risk score comparison visualization
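A hypothetical example of reusing these artifacts, assuming the model was saved with `pickle` and that incoming data has the same feature columns as the training set (`data/new_transactions.csv` is an invented input path):

```python
import json
import pickle
import pandas as pd

# Load the saved best model and its training metadata
with open("fraud_scoring_models.pkl", "rb") as f:
    model = pickle.load(f)
with open("model_metadata.json") as f:
    metadata = json.load(f)

# Score new transactions and write them out with fraud probabilities
new_tx = pd.read_csv("data/new_transactions.csv")     # hypothetical input file
new_tx["fraud_probability"] = model.predict_proba(new_tx)[:, 1]
new_tx.to_csv("scored_transactions.csv", index=False)
```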
- SMOTE is skipped for the neural networks, which handle class imbalance internally
- Random Forest uses `balanced_subsample` for built-in class weighting (illustrated in the sketch after this list)
- Early stopping is implemented for the neural networks
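A sketch of those two settings, assuming the Random Forest is a scikit-learn `RandomForestClassifier` and the neural network is a Keras model; the hyperparameter values are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.callbacks import EarlyStopping

# Per-bootstrap class weights instead of oversampling
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced_subsample")

# Stop training when validation loss stops improving and keep the best weights
early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
# dnn.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```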
Transactions are classified into 5 risk levels based on predicted fraud probability (see the mapping sketch after this list):
- Very Low (0-10%)
- Low (10-30%)
- Medium (30-70%)
- High (70-90%)
- Very High (90-100%)
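A minimal sketch of this mapping; the bin edges follow the percentages above, and the function name is illustrative:

```python
import pandas as pd

def risk_category(p: float) -> str:
    """Map a predicted fraud probability in [0, 1] to a risk level."""
    if p < 0.10:
        return "Very Low"
    if p < 0.30:
        return "Low"
    if p < 0.70:
        return "Medium"
    if p < 0.90:
        return "High"
    return "Very High"

probabilities = pd.Series([0.02, 0.25, 0.55, 0.93])
print(probabilities.apply(risk_category))
```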