This project implements an end-to-end ETL and machine learning pipeline for detecting credit card fraud. The pipeline is orchestrated with Apache Airflow, containerized with Docker, and written in Python. It serves as a practical demonstration of building and deploying a machine learning system.
- Automated ETL Process: Uses Apache Airflow to define, schedule, and monitor the data pipeline.
- Data Extraction: Fetches data from a PostgreSQL database using a SQL query.
- Data Cleaning & Preprocessing: Includes steps for cleaning raw data (`cleaner.py`).
- Feature Engineering:
  - Categorical Encoding: Transforms categorical features into a numerical format using Ordinal Encoding (`encoder.py`). Allows saving the trained encoder or reusing a pre-existing one.
  - Numerical Scaling: Scales numerical features using RobustScaler to handle outliers effectively (`robust_scaler.py`). Allows saving the trained scaler or reusing a pre-existing one.
- Advanced Model Training:
  - Uses a `DecisionTreeClassifier` as the base model (`model2.py`).
  - Employs a Scikit-learn `Pipeline` to chain preprocessing and modeling steps.
  - Performs hyperparameter optimization with `GridSearchCV` to find the best model configuration.
  - Handles imbalanced datasets by optimizing for metrics such as PR AUC (see the training sketch after this list).
- Model Evaluation: Evaluates the trained model using a comprehensive set of metrics suited to imbalanced classification (Precision, Recall, F1-score, ROC AUC, PR AUC, Confusion Matrix).
- Artifact Management: Saves the trained model (pipeline), evaluation metrics (JSON), and performance plots (ROC curve, Precision-Recall curve). A sketch of this evaluation and saving step follows this list.
- Configuration Driven:
  - Model parameters, hyperparameter grids, and output paths are managed via a JSON configuration file (`dags/src/utils/model_config.json`).
  - Database credentials and other sensitive information are managed using environment variables in Airflow.
  - Logging is configured using a YAML file (`dags/config/log_config.yml`) to improve Airflow logs.
- Containerized Environment: Utilizes Docker and `docker-compose.yaml` to set up a reproducible environment with Airflow.
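
As a rough illustration of how the feature-engineering and training steps fit together, here is a minimal sketch combining Ordinal Encoding, RobustScaler, a `DecisionTreeClassifier`, and `GridSearchCV` scored by PR AUC. The column names, parameter grid, and step names are placeholders, not taken from the repository.

```python
# Hypothetical sketch of the preprocessing + training pipeline.
# Column names and the parameter grid are illustrative, not the project's actual config.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

categorical_cols = ["merchant_category", "transaction_type"]  # assumed feature names
numerical_cols = ["amount", "account_age_days"]               # assumed feature names

preprocessor = ColumnTransformer(
    transformers=[
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         categorical_cols),
        ("scale", RobustScaler(), numerical_cols),
    ]
)

pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("model", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
    ]
)

# Optimize for PR AUC (average precision), which suits imbalanced data better than accuracy.
param_grid = {
    "model__max_depth": [4, 8, 12],
    "model__min_samples_leaf": [1, 10, 50],
}
search = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=5, n_jobs=-1)
# search.fit(X_train, y_train)  # X_train / y_train come from the upstream ETL tasks
```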
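Likewise, a hedged sketch of the evaluation and artifact-saving step: the metric set and the model/metrics paths follow this README, while the function name and the plot file names are assumptions.

```python
# Illustrative evaluation / artifact-saving step; not the project's exact code.
import json
import joblib
import matplotlib.pyplot as plt
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score,
                             confusion_matrix, RocCurveDisplay,
                             PrecisionRecallDisplay)

def evaluate_and_save(pipeline, X_test, y_test):
    y_pred = pipeline.predict(X_test)
    y_score = pipeline.predict_proba(X_test)[:, 1]

    metrics = {
        "precision": float(precision_score(y_test, y_pred)),
        "recall": float(recall_score(y_test, y_pred)),
        "f1": float(f1_score(y_test, y_pred)),
        "roc_auc": float(roc_auc_score(y_test, y_score)),
        "pr_auc": float(average_precision_score(y_test, y_score)),
        "confusion_matrix": confusion_matrix(y_test, y_pred).tolist(),
    }

    # Persist model and metrics to the paths described under "View Results".
    joblib.dump(pipeline, "dags/models/trained_models/first_model.joblib")
    with open("dags/models/trained_models/first_model_metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)

    # Plot file names below are assumed, not taken from the repository.
    RocCurveDisplay.from_predictions(y_test, y_score)
    plt.savefig("dags/plots/roc_curve.png")
    PrecisionRecallDisplay.from_predictions(y_test, y_score)
    plt.savefig("dags/plots/precision_recall_curve.png")
    return metrics
```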
- Orchestration: Apache Airflow
- Containerization: Docker, Docker Compose
- Programming Language: Python
- Data Handling & ML: Pandas, NumPy, Scikit-learn
- Database Interaction: SQLAlchemy (used by `db_helpers.py`; see the extraction sketch after this list)
- Plotting: Matplotlib
- Configuration: JSON, YAML
- Environment Management: Airflow Variables
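
For context, data extraction with SQLAlchemy and Pandas might look like the sketch below; the connection URI, table, and query are placeholders rather than the actual contents of `db_helpers.py`.

```python
# Illustrative extraction helper; connection details and query are placeholders.
import pandas as pd
from sqlalchemy import create_engine

def extract_transactions(connection_uri: str) -> pd.DataFrame:
    """Run a SQL query against PostgreSQL and return the result as a DataFrame."""
    engine = create_engine(connection_uri)  # e.g. postgresql+psycopg2://user:pass@host:5432/db
    query = "SELECT * FROM transactions;"   # assumed table name
    with engine.connect() as conn:
        return pd.read_sql(query, conn)
```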
- Prerequisites:
  - Docker: Install Docker
- Clone the Repository:

  ```bash
  git clone <repository-url>
  cd Machine_Learning_Pipeline
  ```
- Configure Environment Variables:
  - The pipeline expects database credentials, which are loaded via `Variable.get()` in Airflow (see the sketch after this list).
  - Create the following Airflow Variables (key-value pairs):
    - `HOST`: Your database host.
    - `PORT`: Your database port.
    - `DATABASE`: Your database name.
    - `USER`: Your database username.
    - `PASSWORD`: Your database password.
- Build and Run Airflow Environment:
  - From the project root directory (where `docker-compose.yaml` is located):

    ```bash
    docker-compose build
    docker-compose up
    ```
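
Inside the DAG code, these variables can be assembled into a connection URI roughly as follows; the helper function and driver suffix are illustrative, and only the variable keys come from the list above. Variables can also be created in the Airflow UI under Admin > Variables.

```python
# Illustrative helper: build a PostgreSQL connection URI from the Airflow Variables above.
from airflow.models import Variable

def get_connection_uri() -> str:
    host = Variable.get("HOST")
    port = Variable.get("PORT")
    database = Variable.get("DATABASE")
    user = Variable.get("USER")
    password = Variable.get("PASSWORD")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
```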
- Navigate to Airflow UI: Open `http://localhost:8080`.
- Locate the DAG: Find the `fraud_detection_etl_and_training_pipeline` DAG in the list; a minimal skeleton of this DAG is sketched after these steps.
- Unpause the DAG: Click the toggle switch to unpause it.
- Trigger the DAG:
  - Click on the DAG name.
  - Click the "Play" button.
- Monitor Execution: Observe the pipeline tasks progressing through the Airflow UI.
- View Results (a loading sketch follows these steps):
  - Trained Model: Saved to `dags/models/trained_models/first_model.joblib`.
  - Metrics: Saved to `dags/models/trained_models/first_model_metrics.json`.
  - Plots: Saved in `dags/plots/`.
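
For orientation, a minimal, hypothetical skeleton of the DAG is shown below. Only the `dag_id` and the `extract_data_from_db` task name come from this README; the remaining task names, callables, and scheduling settings are placeholders.

```python
# Illustrative DAG skeleton; task bodies are placeholders for the real project modules.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_data_from_db(**context):
    """Placeholder: fetch raw transactions from PostgreSQL."""

def clean_and_engineer_features(**context):
    """Placeholder: cleaning, encoding, and scaling steps."""

def train_and_evaluate_model(**context):
    """Placeholder: grid search, evaluation, artifact saving."""

with DAG(
    dag_id="fraud_detection_etl_and_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually from the Airflow UI
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_data_from_db",
                             python_callable=extract_data_from_db)
    preprocess = PythonOperator(task_id="clean_and_engineer_features",
                                python_callable=clean_and_engineer_features)
    train = PythonOperator(task_id="train_and_evaluate_model",
                           python_callable=train_and_evaluate_model)

    extract >> preprocess >> train
```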
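To inspect the outputs outside Airflow, the saved pipeline and metrics can be loaded like this (paths as listed above; the snippet itself is illustrative):

```python
# Illustrative: load the persisted pipeline and metrics for inspection.
import json
import joblib

pipeline = joblib.load("dags/models/trained_models/first_model.joblib")
with open("dags/models/trained_models/first_model_metrics.json") as f:
    metrics = json.load(f)
print(metrics)
# pipeline.predict(new_transactions_df)  # score new data with the full preprocessing + model pipeline
```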
- Pipeline Configuration: Key parameters for model training, hyperparameter grids, and output paths are defined in `dags/src/utils/model_config.json`.
- Logging: Configured via `dags/config/log_config.yml`. Logs are written to stdout and a rotating file handler (e.g., `dags/logs/ML_Pipeline.log`). See the sketch below.
- Database for Data Source: The `extract_data_from_db` task connects to a PostgreSQL database. Connection parameters must be set as Airflow Variables (see Setup).
- Airflow Backend: The `docker-compose.yaml` sets up PostgreSQL as the metadata database for Airflow itself.
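
Applying a YAML logging configuration typically looks like the sketch below; the file path matches the one above, but the logger name and loading code are assumptions about how the project wires this up.

```python
# Illustrative: load the YAML logging configuration and apply it.
import logging
import logging.config
import yaml

with open("dags/config/log_config.yml") as f:
    logging.config.dictConfig(yaml.safe_load(f))

logger = logging.getLogger("ML_Pipeline")  # assumed logger name
logger.info("Logging configured from dags/config/log_config.yml")
```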