# MLOps-Forge

A production-ready MLOps framework with built-in distributed training, monitoring, and CI/CD. MLOps-Forge implements an end-to-end ML pipeline that follows industry best practices for developing, deploying, and maintaining machine learning models in production at scale.
## Table of Contents

- Features
- Architecture
- Getting Started
- Usage
- CI/CD Pipeline
- Development
- Advanced Usage
- Security
- License
## Features

- Automated Data Pipeline: Robust data validation, cleaning, and feature engineering
- Experiment Tracking: Comprehensive version control for models, datasets, and hyperparameters with MLflow
- Distributed Training: GPU-accelerated training across multiple nodes for large models
- Model Registry: Centralized model storage and versioning with lifecycle management
- Continuous Integration/Deployment: Automated testing, validation, and deployment pipelines
- Model Serving API: Fast and scalable REST API with input validation and automatic documentation
- Model Monitoring: Performance tracking, drift detection, and automated retraining triggers
- A/B Testing: Framework for model experimentation and controlled rollouts
- Infrastructure as Code: Docker containers and Kubernetes configurations for reliable deployments
## Architecture

The system follows a modular microservice architecture with the following components:
```mermaid
graph TD
%% Main title and styles
classDef pipeline fill:#f0f6ff,stroke:#3273dc,color:#3273dc,stroke-width:2px
classDef component fill:#ffffff,stroke:#209cee,color:#209cee,stroke-width:1.5px
classDef note fill:#fffaeb,stroke:#ffdd57,color:#946c00,stroke-width:1px,stroke-dasharray:5 5
classDef infra fill:#e3fcf7,stroke:#00d1b2,color:#00d1b2,stroke-width:1.5px,stroke-dasharray:5 5
%% Infrastructure
subgraph K8S["Kubernetes Cluster"]
%% Data Pipeline
subgraph DP["Data Pipeline"]
DI[Data Ingestion]:::component
DV[Data Validation]:::component
FE[Feature Engineering]:::component
FSN[Feature Store Integration]:::note
DI --> DV
DV --> FE
end
%% Model Training
subgraph MT["Model Training"]
ET[Experiment Tracking - MLflow]:::component
DT[Distributed Training]:::component
ME[Model Evaluation]:::component
ABN[A/B Testing Framework]:::note
ET --> DT
DT --> ME
end
%% Model Registry
subgraph MR["Model Registry"]
MV[Model Versioning]:::component
MS[Metadata Storage]:::component
MCI[CI/CD Integration]:::note
MV --> MS
end
%% API Layer
subgraph API["API Layer"]
FA[FastAPI Application]:::component
PE[Prediction Endpoints]:::component
HM[Health & Metadata APIs]:::component
HPA[Horizontal Pod Autoscaling]:::note
FA --> PE
FA --> HM
end
%% Monitoring
subgraph MON["Monitoring"]
PM[Prometheus Metrics]:::component
GD[Grafana Dashboards]:::component
DD[Feature-level Drift Detection]:::component
RT[Automated Retraining Triggers]:::component
AM[Alert Manager Integration]:::note
MPT[Model Performance Tracking]:::component
DQM[Data Quality Monitoring]:::component
ABT[A/B Testing Analytics]:::component
LA[Log Aggregation]:::component
DT2[Distributed Tracing]:::note
PM --> GD
PM --> DD
DD --> RT
MPT --> DQM
DQM --> ABT
ABT --> LA
end
%% Component relationships
DP -->|Training Data| MT
DP -->|Metadata| MR
MT -->|Model Artifacts| MR
MR -->|Latest Model| API
API -->|Metrics| MON
MT -->|Performance Metrics| MON
end
%% CI/CD Pipeline
CICD[CI/CD Pipeline: GitHub Actions]:::infra
CICD -->|Deploy| K8S
%% Apply classes
class DP,MT,MR,API,MON pipeline
```
### Data Pipeline

- Data Ingestion: Connectors for various data sources (databases, object storage, streaming)
- Data Validation: Schema validation, data quality checks, and anomaly detection (see the sketch below)
- Feature Engineering: Feature transformation, normalization, and feature store integration
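As a rough illustration of the validation step, the sketch below performs schema and null-rate checks with pandas; the expected columns and threshold are placeholders rather than the project's actual configuration.

```python
import pandas as pd

# Hypothetical schema: column name -> expected dtype (placeholder values)
EXPECTED_SCHEMA = {"user_id": "int64", "amount": "float64", "country": "object"}

def validate_batch(df: pd.DataFrame, max_null_fraction: float = 0.01) -> list[str]:
    """Return a list of human-readable data-quality issues for a batch."""
    issues = []
    # Schema check: every expected column present with the expected dtype
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            issues.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            issues.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    # Quality check: null fraction per column stays below the threshold
    for column, fraction in df.isna().mean().items():
        if fraction > max_null_fraction:
            issues.append(f"{column}: {fraction:.1%} nulls exceeds {max_null_fraction:.1%}")
    return issues

issues = validate_batch(pd.read_csv("data/raw/training_data.csv"))
if issues:
    raise ValueError("Data validation failed: " + "; ".join(issues))
```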
### Model Training

- Experiment Tracking: MLflow integration for tracking parameters, metrics, and artifacts (see the sketch below)
- Distributed Training: PyTorch distributed training for efficient model training
- Model Evaluation: Comprehensive metrics calculation and validation
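For orientation, the experiment-tracking bullet boils down to the standard MLflow pattern sketched below; the experiment name, hyperparameters, and toy dataset are illustrative only, not the project's actual training code.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://mlflow:5000")   # matches MLFLOW_TRACKING_URI in .env
mlflow.set_experiment("example-experiment")     # placeholder experiment name

# Toy data so the sketch runs end to end
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}   # illustrative hyperparameters
    mlflow.log_params(params)

    model = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Store the fitted model as a run artifact so it can later be registered
    mlflow.sklearn.log_model(model, artifact_path="model")
```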
### Model Registry

- Model Versioning: Storage and versioning of models with metadata (see the sketch below)
- Artifact Management: Efficient storage of model artifacts and associated files
- Deployment Management: Tracking of model deployment status
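To make the versioning and promotion flow concrete, here is a minimal sketch using MLflow's registry client; the run ID, model name, and tag are placeholders, and the real system wraps this behind its own registry module.

```python
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow:5000")
client = MlflowClient()

# Register the model logged by a finished run (run_id is a placeholder)
run_id = "abc123"
version = mlflow.register_model(f"runs:/{run_id}/model", name="my-model")

# Attach metadata and promote the new version to Production
client.set_model_version_tag("my-model", version.version, "validated_by", "ci-pipeline")
client.transition_model_version_stage(
    name="my-model", version=version.version, stage="Production"
)
```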
### API Layer

- FastAPI Application: High-performance API with automatic OpenAPI documentation (see the sketch below)
- Prediction Endpoints: RESTful endpoints for model inference
- Health & Metadata: Endpoints for system health checks and model metadata
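A stripped-down sketch of the serving layer is shown below, with a stubbed model and hypothetical feature names; the real application lives under src/mlops_production_system/api/ and loads models from the registry.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="MLOps-Forge Model API")

class PredictionRequest(BaseModel):
    # Hypothetical feature names used only for this sketch
    feature1: float
    feature2: float

class PredictionResponse(BaseModel):
    prediction: float
    model_version: str

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    # In the real service the model comes from the registry; here it is a stub
    score = 0.5 * request.feature1 + 0.5 * request.feature2
    return PredictionResponse(prediction=score, model_version="1")
```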
### Monitoring System

- Metrics Collection: Prometheus integration for metrics collection (see the sketch below)
- Drift Detection: Statistical methods to detect data and concept drift
- Performance Tracking: Continuous monitoring of model performance metrics
- Automated Retraining: Triggers for retraining based on drift detection
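To illustrate the metrics-collection side, the sketch below exposes a prediction counter and latency histogram with prometheus_client; the metric names and the toy inference loop are placeholders for the real instrumentation inside the API service.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Placeholder metric names; the production service defines its own
PREDICTIONS = Counter("model_predictions_total", "Number of predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency in seconds")

def serve_prediction() -> float:
    with LATENCY.time():                            # records how long the block takes
        time.sleep(random.uniform(0.01, 0.05))      # stand-in for model inference
        PREDICTIONS.inc()
        return random.random()

if __name__ == "__main__":
    start_http_server(8001)   # Prometheus scrapes http://localhost:8001/metrics
    while True:
        serve_prediction()
```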
### Development Workflow

```mermaid
flowchart LR
    DS[Data Scientist] -->|Develops Model| DEV[Development Environment]
    DEV -->|Commits Code| GIT[Git Repository]
    GIT -->|Triggers| CI[CI/CD Pipeline]
    CI -->|Runs Tests| TEST[Test Suite]
    TEST -->|Validates Model| VAL[Model Validation]
    VAL -->|Performance Testing| PERF[Performance Tests]
    PERF -->|Builds| BUILD[Docker Image]
    BUILD -->|Deploys| DEPLOY[Kubernetes Cluster]
```
### Production Data Flow

```mermaid
flowchart LR
    DATA[Data Sources] -->|Ingestion| PIPE[Data Pipeline]
    PIPE -->|Validated Data| TRAIN[Training Pipeline]
    TRAIN -->|Trained Model| REG[Model Registry]
    REG -->|Latest Model| API[API Service]
    API -->|Predictions| USERS[End Users]
    API -->|Metrics| MON[Monitoring]
    MON -->|Drift Detected| RETRAIN[Retraining Trigger]
    RETRAIN --> TRAIN
```
## Getting Started

### Installation

Install the latest stable version from PyPI:

```bash
pip install mlops-forge
```
For development, install from source:

```bash
# Clone the repository
git clone https://github.com/TaimoorKhan10/MLOps-Forge.git
cd MLOps-Forge

# Create and activate virtual environment
python -m venv venv
# On Windows: .\venv\Scripts\activate
# On macOS/Linux: source venv/bin/activate

# Install in development mode with all dependencies
pip install -e ".[dev]"
```
### Prerequisites

- Python 3.9 or 3.10
- Docker and Docker Compose (for containerization)
- Kubernetes (for production deployment)
- Cloud provider account (AWS/GCP/Azure) for cloud deployments
### Configuration

1. **Environment Variables**: Create a `.env` file based on the provided `.env.example`:

   ```
   # MLflow Configuration
   MLFLOW_TRACKING_URI=http://mlflow:5000
   MLFLOW_S3_ENDPOINT_URL=http://minio:9000

   # AWS Configuration for Deployment
   AWS_ACCESS_KEY_ID=your-access-key
   AWS_SECRET_ACCESS_KEY=your-secret-key
   AWS_REGION=us-west-2

   # Kubernetes Configuration
   K8S_NAMESPACE=mlops-production
   ```
2. **Infrastructure Setup**:

   ```bash
   # For local development with Docker Compose
   docker-compose up -d

   # For Kubernetes deployment
   kubectl apply -f infrastructure/kubernetes/
   ```
## Usage

### Data Pipeline

```python
from mlops_production_system.pipeline import DataPipeline

# Initialize the pipeline
pipeline = DataPipeline(config_path="config/pipeline_config.yaml")

# Run the pipeline
processed_data = pipeline.run(input_data_path="data/raw/training_data.csv")
```
### Model Training

```python
from mlops_production_system.models import ModelTrainer
from mlops_production_system.training import distributed_trainer

# For single-node training
trainer = ModelTrainer(model_config="config/model_config.yaml")
model = trainer.train(X_train, y_train)
metrics = trainer.evaluate(X_test, y_test)

# For distributed training
distributed_trainer.run(
    model_class="mlops_production_system.models.CustomModel",
    data_path="data/processed/training_data.parquet",
    num_nodes=4
)
```
### Model Deployment

```bash
# Deploy model using CLI
mlops deploy --model-name="my-model" --model-version=1 --environment=production
```

```python
# Or using the Python API
from mlops_production_system.deployment import ModelDeployer

deployer = ModelDeployer()
deployer.deploy(model_name="my-model", model_version=1, environment="production")
```
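Once a model is deployed, the prediction service can be smoke-tested over HTTP. The sketch below assumes a locally reachable service at http://localhost:8000 and a hypothetical two-feature payload; adjust the URL and schema to your deployment.

```python
import requests

# Placeholder URL; substitute the service's actual host or ingress address
API_URL = "http://localhost:8000"

payload = {"feature1": 0.42, "feature2": 1.7}   # hypothetical feature values

response = requests.post(f"{API_URL}/predict", json=payload, timeout=5)
response.raise_for_status()
print(response.json())

# Health and metadata endpoints are useful for smoke tests after a rollout
print(requests.get(f"{API_URL}/health", timeout=5).json())
```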
### Monitoring

```python
from mlops_production_system.monitoring import DriftDetector, PerformanceMonitor

# Monitor for drift
drift_detector = DriftDetector(reference_data="data/reference.parquet")
drift_results = drift_detector.detect(new_data="data/production_data.parquet")

# Monitor model performance
performance_monitor = PerformanceMonitor(model_name="my-model", model_version=1)
performance_metrics = performance_monitor.get_metrics(timeframe="last_24h")
```
## CI/CD Pipeline

The system uses GitHub Actions for CI/CD, configured in `.github/workflows/main.yml`. The pipeline includes:
- **Code Quality**:
  - Linting with flake8
  - Type checking with mypy
  - Security scanning with bandit
- **Testing**:
  - Unit tests with pytest
  - Integration tests
  - Code coverage reporting
- **Model Validation**:
  - Performance benchmarking
  - Model quality checks
  - Validation against baseline metrics
- **Deployment**:
  - Docker image building
  - Image pushing to container registry
  - Kubernetes deployment updates
All secrets and credentials are stored securely in GitHub Secrets and only accessed during workflow execution.
## Development

### Project Structure

```
MLOps-Production-System/
├── .github/                  # GitHub Actions workflows
├── config/                   # Configuration files
├── data/                     # Data directories (gitignored)
├── docs/                     # Documentation
├── infrastructure/           # Infrastructure as code
│   ├── docker/               # Docker configurations
│   ├── kubernetes/           # Kubernetes manifests
│   └── terraform/            # Terraform for cloud resources
├── notebooks/                # Jupyter notebooks
├── scripts/                  # Utility scripts
├── src/                      # Source code
│   └── mlops_production_system/
│       ├── api/              # FastAPI application
│       ├── models/           # ML models
│       ├── pipeline/         # Data pipeline
│       ├── training/         # Training code
│       ├── monitoring/       # Monitoring tools
│       └── utils/            # Utilities
├── tests/                    # Test suite
├── .env.example              # Example environment variables
├── Dockerfile                # Main Dockerfile
├── pyproject.toml            # Project metadata
└── README.md                 # This file
```
### Branching Workflow

We follow the GitFlow branching model:

1. Create a feature branch from `develop`:

   ```bash
   git checkout -b feature/your-feature
   ```

2. Make your changes and commit:

   ```bash
   git commit -m "Add feature"
   ```

3. Push your branch:

   ```bash
   git push origin feature/your-feature
   ```

4. Open a Pull Request against the `develop` branch.

All PRs must pass CI checks and code review before being merged.
## Advanced Usage

### Distributed Training

The system supports distributed training using PyTorch's DistributedDataParallel for efficient multi-node training:

```yaml
# Example Kubernetes configuration in infrastructure/kubernetes/distributed-training.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 4
  template:
    spec:
      containers:
        - name: trainer
          image: your-registry/mlops-trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: WORLD_SIZE
              value: "4"
```
### A/B Testing

The A/B testing framework allows comparing multiple models in production:

```python
from mlops_production_system.monitoring import ABTestingFramework

# Set up A/B test between two models
ab_test = ABTestingFramework()
ab_test.create_experiment(
    name="pricing_model_comparison",
    models=["pricing_model_v1", "pricing_model_v2"],
    traffic_split=[0.5, 0.5],
    evaluation_metric="conversion_rate"
)

# Get results
results = ab_test.get_results(experiment_name="pricing_model_comparison")
```
### Drift Detection

Detect data drift to trigger model retraining:

```python
from mlops_production_system.monitoring import DriftDetector

# Initialize with reference data distribution
detector = DriftDetector(
    reference_data="s3://bucket/reference_data.parquet",
    features=["feature1", "feature2", "feature3"],
    drift_method="wasserstein",
    threshold=0.1
)

# Check for drift in new data
drift_detected, drift_metrics = detector.detect(
    current_data="s3://bucket/production_data.parquet"
)

if drift_detected:
    # Trigger retraining
    from mlops_production_system.training import trigger_retraining
    trigger_retraining(model_name="my-model")
```
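For intuition on the wasserstein drift method referenced above, the sketch below computes a per-feature Wasserstein distance with SciPy on synthetic data; the threshold and distributions are illustrative only.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Stand-ins for the reference and production feature distributions
reference = {"feature1": rng.normal(0.0, 1.0, 5_000)}
production = {"feature1": rng.normal(0.3, 1.0, 5_000)}   # shifted mean -> drift

THRESHOLD = 0.1   # illustrative; tune per feature in practice

for name in reference:
    distance = wasserstein_distance(reference[name], production[name])
    drifted = distance > THRESHOLD
    print(f"{name}: wasserstein={distance:.3f} drift={'yes' if drifted else 'no'}")
```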
## Security

This project follows security best practices:

- Secrets management via environment variables and Kubernetes secrets (see the sketch below)
- Regular dependency scanning for vulnerabilities
- Least privilege principle for all service accounts
- Network policies to restrict pod-to-pod communication
- Encryption of data at rest and in transit
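As a concrete pattern for the first point, the sketch below reads configuration strictly from the environment and fails fast when a credential is missing; the Settings class and load_settings helper are hypothetical, not part of the package, and the variable names mirror `.env.example`.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    mlflow_tracking_uri: str
    aws_access_key_id: str
    aws_secret_access_key: str

def load_settings() -> Settings:
    """Read required settings from the environment and fail fast if any are missing."""
    try:
        return Settings(
            mlflow_tracking_uri=os.environ["MLFLOW_TRACKING_URI"],
            aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
            aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        )
    except KeyError as missing:
        raise RuntimeError(f"Missing required environment variable: {missing}") from None

# In Kubernetes these variables come from Secrets mounted as env vars,
# never from files checked into the repository.
settings = load_settings()
```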
## License

This project is licensed under the MIT License - see the LICENSE file for details.
MLOps-Forge was created to demonstrate end-to-end machine learning operations and follows industry best practices for deploying ML models in production environments. Star us on GitHub if you find this project useful!
## Tech Stack

- ML Framework: scikit-learn, PyTorch
- Feature Store: Feast
- Experiment Tracking: MLflow
- API: FastAPI
- Containerization: Docker
- Orchestration: Kubernetes
- CI/CD: GitHub Actions
- Infrastructure as Code: Terraform
- Monitoring: Prometheus, Grafana
## Quick Start

Prerequisites:

- Python 3.9+
- Docker and Docker Compose
- Kubernetes (optional for local development)
1. Clone the repository

   ```bash
   git clone https://github.com/TaimoorKhan10/MLOps-Production-System.git
   cd MLOps-Production-System
   ```

2. Create a virtual environment and install dependencies

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Set up environment variables

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

4. Start the development environment

   ```bash
   docker-compose up -d
   ```
Access the demo application at http://localhost:8000 after starting the containers.
The demo includes:
- Model training dashboard
- Real-time inference API
- Performance monitoring
Comprehensive documentation is available in the `/docs` directory.
## Testing & Deployment

Run the test suite:

```bash
pytest
```

Start the stack locally with Docker Compose, or provision cloud infrastructure with Terraform:

```bash
docker-compose up -d

cd infrastructure/terraform
terraform init
terraform apply
```

Access the monitoring dashboard at http://localhost:3000 after deployment.
## Contributing

Contributions are welcome! Please check out our contribution guidelines.